HCA Data Portal

Submitting Count Matrices and Tier 1 Metadata to CELLxGENE Discover

File format

Data contributors who wish to submit datasets to HCA for integration into atlases must submit an AnnData file. An AnnData file combines matrices and metadata in a single file, so metadata should be captured on a per-cell basis.
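Capturing metadata per cell usually means repeating sample-level values for every cell that came from that sample. The following is a minimal sketch of that expansion using only the Python standard library; the barcodes, sample names, and metadata values are made-up placeholders, and `expand_to_per_cell` is a hypothetical helper, not part of any HCA or CELLxGENE tool.

```python
# Sample-level metadata keyed by sample_id (all values are made-up placeholders).
sample_metadata = {
    "sample_A": {"donor_id": "donor_1", "institute": "Example Institute"},
    "sample_B": {"donor_id": "donor_2", "institute": "Example Institute"},
}

# Which sample each cell (barcode) came from (placeholder barcodes).
cell_to_sample = {
    "AAACCTG-1": "sample_A",
    "AAACGGG-1": "sample_A",
    "TTTGTCA-1": "sample_B",
}


def expand_to_per_cell(cell_to_sample, sample_metadata):
    """Return one metadata row per cell by repeating its sample's values."""
    rows = {}
    for barcode, sample_id in cell_to_sample.items():
        if sample_id not in sample_metadata:
            raise KeyError(f"no metadata for sample {sample_id!r} (cell {barcode!r})")
        # Every cell carries its own copy of the sample-level fields.
        rows[barcode] = {"sample_id": sample_id, **sample_metadata[sample_id]}
    return rows


per_cell = expand_to_per_cell(cell_to_sample, sample_metadata)
for barcode, row in per_cell.items():
    print(barcode, row)
```

Rows shaped like these could then be turned into the per-cell table of an AnnData object, for example with `pandas.DataFrame.from_dict(per_cell, orient="index")`.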

An AnnData file has several components, including:

  • uns (Dataset metadata), which describes the dataset as a whole
  • X (Matrix layers), which describes the data required for different assays
  • obs (Cell metadata), which describes each cell in the dataset
  • obsm (Embeddings), which describes each embedding in the dataset
  • obsp, which describes pairwise annotations of observations
  • var and raw.var (Gene metadata), which describe each gene in the dataset
  • varm, which describes multidimensional annotations of variables/features
  • varp, which describes pairwise annotations of variables/features

File format specifications

  • Dataset-level metadata in uns:

    • title
    • study_pi
    • batch_condition
    • default_embedding
    • comments
  • Data in .X and raw.X:

    • raw counts are required
    • normalized counts are strongly recommended
    • raw counts should be in raw.X if normalized counts are in .X
    • if there is no normalized matrix, raw counts should be in .X
  • Cell metadata in obs: Metadata fields are to be captured on a per-cell basis.

    • protocol_url
    • manner_of_death
    • donor_id
    • sample_source
    • sample_id
    • sex_ontology_term_id
    • institute
    • sample_collection_method
    • sample_collection_site
    • tissue_type
    • sample_collection_relative_time_point
    • sampled_site_condition
    • library_id
    • tissue_ontology_term_id
    • library_id_repository
    • tissue_free_text
    • author_batch_notes
    • sample_preservation_method
    • organism_ontology_term_id
    • suspension_type
    • cell_enrichment
    • cell_viability_percentage
    • cell_number_loaded
    • sample_collection_year
    • assay_ontology_term_id
    • library_preparation_batch
    • library_sequencing_run
    • sequenced_fragment
    • sequencing_platform
    • is_primary_data
    • reference_genome
    • gene_annotation_version
    • alignment_software
    • intron_inclusion
    • author_cell_type
    • cell_type_ontology_term
  • In addition to the HCA obs fields above, three further fields are required for submission to CELLxGENE. These fields are not part of the HCA Tier 1 metadata fields.
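The specifications above can be turned into a simple pre-submission check. The sketch below uses only the Python standard library and copies the field names from the lists above; `missing_fields` and `matrix_layout` are hypothetical helpers, and the sketch makes no claim about which fields are mandatory, since that is defined by the full Tier 1 field list.

```python
# Field names copied from the lists above. Which of them are mandatory is
# defined by the full Tier 1 field list, not by this sketch.
UNS_FIELDS = ["title", "study_pi", "batch_condition", "default_embedding", "comments"]
OBS_FIELDS = [
    "protocol_url", "manner_of_death", "donor_id", "sample_source", "sample_id",
    "sex_ontology_term_id", "institute", "sample_collection_method",
    "sample_collection_site", "tissue_type",
    "sample_collection_relative_time_point", "sampled_site_condition",
    "library_id", "tissue_ontology_term_id", "library_id_repository",
    "tissue_free_text", "author_batch_notes", "sample_preservation_method",
    "organism_ontology_term_id", "suspension_type", "cell_enrichment",
    "cell_viability_percentage", "cell_number_loaded", "sample_collection_year",
    "assay_ontology_term_id", "library_preparation_batch",
    "library_sequencing_run", "sequenced_fragment", "sequencing_platform",
    "is_primary_data", "reference_genome", "gene_annotation_version",
    "alignment_software", "intron_inclusion", "author_cell_type",
    "cell_type_ontology_term",
]


def missing_fields(present, expected):
    """Return the expected field names that are absent, in list order."""
    present = set(present)
    return [name for name in expected if name not in present]


def matrix_layout(has_raw, has_normalized):
    """Map each available matrix to its slot, following the rules above."""
    if not has_raw:
        raise ValueError("raw counts are required")
    if has_normalized:
        # Normalized counts take .X, so raw counts move to raw.X.
        return {"raw.X": "raw counts", "X": "normalized counts"}
    return {"X": "raw counts"}


# Hypothetical column names; with a real file these might come from
# list(adata.obs.columns) and list(adata.uns.keys()).
obs_columns = ["donor_id", "sample_id", "library_id"]
print("missing obs fields:", missing_fields(obs_columns, OBS_FIELDS))
print("matrix layout:", matrix_layout(has_raw=True, has_normalized=False))
```

Reporting the missing names in list order keeps the output easy to compare against the specification; the three additional CELLxGENE fields are deliberately left out because they are not named here.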

Tier 1 metadata fields

Tier 1 metadata fields provide the foundational information used to build tissue and organ atlases. Specifically, they help identify factors that can cause ‘batch effects’ when datasets are combined into an atlas.

The full list of fields, with detailed information on each field, can be found here. Some fields are mandatory (labelled ‘must’), whereas others are ‘recommended’.

Submission process

AnnData files are stored on, and accessible from, CELLxGENE Discover. To submit files, follow these steps:

1. Email the curation team at cellxgene@chanzuckerberg.com with the following information:

  • Title
  • Description
  • Contact: name and email
  • Publication/preprint DOI: The publication digital object identifier (DOI) for the protocol. If neither a preprint nor a publication exists, please write 'not applicable'.
  • URLs: any additional URLs for related data or resources, such as GEO or protocols.io. These can be added later.
  • Consortia (e.g. HCA)
2. The team confirms acceptance of your data.

3. You prepare your AnnData file and send it to cellxgene@chanzuckerberg.com.

4. The team uploads your dataset to a private collection where you can review it.

5. The team makes your dataset either public or private. Private datasets need to be shared with the integration team for inclusion in the integrated object.
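The information requested in step 1 can be assembled programmatically so that nothing is left out. This is a minimal sketch under stated assumptions: `build_request` is a hypothetical helper, the example title, description, and contact details are placeholders, and the plain-text layout is only one possible way to present the fields to the curation team.

```python
CURATION_EMAIL = "cellxgene@chanzuckerberg.com"  # address given in step 1


def build_request(title, description, contact_name, contact_email,
                  doi=None, urls=(), consortia=("HCA",)):
    """Return a plain-text body for the initial submission request."""
    required = [("title", title), ("description", description),
                ("contact name", contact_name), ("contact email", contact_email)]
    for label, value in required:
        if not value or not value.strip():
            raise ValueError(f"{label} is required")
    lines = [
        f"Title: {title}",
        f"Description: {description}",
        f"Contact: {contact_name} <{contact_email}>",
        # The instructions ask for 'not applicable' when no DOI exists.
        f"Publication/preprint DOI: {doi or 'not applicable'}",
        # URLs are optional at this stage and can be added later.
        f"URLs: {', '.join(urls) if urls else 'to be added later'}",
        f"Consortia: {', '.join(consortia)}",
    ]
    return "\n".join(lines)


body = build_request(
    title="Example dataset title",
    description="Short description of the study.",
    contact_name="Jane Doe",
    contact_email="jane.doe@example.org",
)
print(f"To: {CURATION_EMAIL}\n\n{body}")
```

The sketch only builds the text; sending it is left to your usual email client.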