Submitting Count Matrices and Tier 1 Metadata to CELLxGENE Discover

File format

Data contributors who wish to submit datasets to HCA for integration into atlases must submit an AnnData file. Because an AnnData file combines matrices and metadata in a single file, metadata should be captured on a per-cell basis.

An AnnData file has several components, including:

uns (Dataset metadata), which describes the dataset as a whole
X (Matrix layers), which describes the data required for different assays
obs (Cell metadata), which describes each cell in the dataset
obsm (Embeddings), which describes each embedding in the dataset
obsp, which describes pairwise annotation of observations
var and raw.var (Gene metadata), which describe each gene in the dataset
varm, which describes multidimensional annotation of variables/features
varp, which describes pairwise annotation of variables/features

File format specifications

Dataset-level metadata in uns:
- title
- study_pi
- batch_condition
- default_embedding
- comments
Data in .X and raw.X:
- raw counts are required
- normalized counts are strongly recommended
- raw counts should be in raw.X if normalized counts are in .X
- if there is no normalized matrix, raw counts should be in .X
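
The placement rules for raw and normalized counts can be sketched as a small helper. This is a minimal illustration only: the function name `place_matrices` and the returned dictionary layout are hypothetical, and the matrices are treated as opaque objects rather than real AnnData slots.

```python
def place_matrices(raw_counts, normalized=None):
    """Map slot names to matrices following the rules listed above.

    Hypothetical helper: the keys "X" and "raw.X" only name the AnnData
    slots in which each matrix would be stored.
    """
    if raw_counts is None:
        # Raw counts are required in every submission.
        raise ValueError("raw counts are required")
    if normalized is None:
        # No normalized matrix: raw counts go in .X.
        return {"X": raw_counts}
    # Normalized counts in .X, raw counts in raw.X.
    return {"X": normalized, "raw.X": raw_counts}


raw = [[3, 0], [1, 7]]            # toy raw count matrix (cells x genes)
norm = [[1.4, 0.0], [0.7, 2.1]]   # toy normalized matrix

only_raw = place_matrices(raw)
both = place_matrices(raw, norm)
```

With the anndata package, the same decision would determine whether the raw matrix is assigned to `adata.X` or kept under `adata.raw`.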
Cell metadata in obs: Metadata fields are to be captured on a per-cell basis.


protocol_url	manner_of_death
donor_id	sample_source
sample_id	sex_ontology_term_id
institute	sample_collection_method
sample_collection_site	tissue_type
sample_collection_relative_time_point	sampled_site_condition
library_id	tissue_ontology_term_id
library_id_repository	tissue_free_text
author_batch_notes	sample_preservation_method
organism_ontology_term_id	suspension_type
cell_enrichment	cell_viability_percentage
cell_number_loaded	sample_collection_year
assay_ontology_term_id	library_preparation_batch
library_sequencing_run	sequenced_fragment
sequencing_platform	is_primary_data
reference_genome	gene_annotation_version
alignment_software	intron_inclusion
author_cell_type	cell_type_ontology_term

In addition to the HCA obs fields above, three further fields are required for submission to CELLxGENE. These fields are not part of the HCA Tier 1 metadata fields.

disease_ontology_term_id
development_stage_ontology_term_id
self_reported_ethnicity_ontology_term_id
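
Before submission, it can help to check which of the obs fields listed above are absent from a file. The following is a minimal sketch under stated assumptions: the field names are copied from the lists above, while the function and constant names are hypothetical. It does not decide which Tier 1 fields are mandatory; that is defined in the full field list.

```python
# Field names copied from the obs lists above.
HCA_TIER1_OBS_FIELDS = [
    "protocol_url", "manner_of_death", "donor_id", "sample_source",
    "sample_id", "sex_ontology_term_id", "institute",
    "sample_collection_method", "sample_collection_site", "tissue_type",
    "sample_collection_relative_time_point", "sampled_site_condition",
    "library_id", "tissue_ontology_term_id", "library_id_repository",
    "tissue_free_text", "author_batch_notes", "sample_preservation_method",
    "organism_ontology_term_id", "suspension_type", "cell_enrichment",
    "cell_viability_percentage", "cell_number_loaded",
    "sample_collection_year", "assay_ontology_term_id",
    "library_preparation_batch", "library_sequencing_run",
    "sequenced_fragment", "sequencing_platform", "is_primary_data",
    "reference_genome", "gene_annotation_version", "alignment_software",
    "intron_inclusion", "author_cell_type", "cell_type_ontology_term",
]

CELLXGENE_OBS_FIELDS = [
    "disease_ontology_term_id",
    "development_stage_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
]


def missing_obs_fields(obs_columns):
    """Return the listed fields that are absent, grouped by source."""
    present = set(obs_columns)
    return {
        "hca_tier1": [f for f in HCA_TIER1_OBS_FIELDS if f not in present],
        "cellxgene": [f for f in CELLXGENE_OBS_FIELDS if f not in present],
    }


# With the anndata package, the columns would come from adata.obs.columns.
example_columns = ["donor_id", "sample_id", "disease_ontology_term_id"]
report = missing_obs_fields(example_columns)
```

Here `report["cellxgene"]` lists the CELLxGENE fields still to be added, and `report["hca_tier1"]` lists the Tier 1 fields that should be reviewed against the full field list.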
Embeddings in obsm:
- One or more two-dimensional embeddings, prefixed with 'X_'
Features in var & raw.var (if present):
- index is Ensembl ID
- the preference is that genes have not been filtered, in order to maximise future data integration efforts
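
The embedding and feature rules above lend themselves to a quick pre-submission check. This is a rough sketch, not an official validator: the function names are hypothetical, and the regular expression is an assumption about what an Ensembl gene ID looks like (for example ENSG00000139618), without a version suffix.

```python
import re

# Assumption: 'ENS', optional species letters, 'G', then 11 digits.
ENSEMBL_GENE_ID = re.compile(r"ENS[A-Z]*G\d{11}")


def embedding_keys(obsm_keys):
    """Return the obsm keys that carry the required 'X_' prefix."""
    return [key for key in obsm_keys if key.startswith("X_")]


def non_ensembl_ids(var_index):
    """Return var index entries that do not look like Ensembl gene IDs."""
    return [g for g in var_index if not ENSEMBL_GENE_ID.fullmatch(g)]


# With the anndata package, these inputs would come from
# adata.obsm.keys() and adata.var.index.
obsm_keys = ["X_umap", "spatial"]
var_index = ["ENSG00000139618", "ENSMUSG00000064341", "TP53"]

found_embeddings = embedding_keys(obsm_keys)
suspect_ids = non_ensembl_ids(var_index)
```

An empty result from `embedding_keys` would mean that no embedding with the 'X_' prefix is present, and any entries in `suspect_ids` (here, a gene symbol) would need to be mapped to Ensembl IDs.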

Should you require support creating an AnnData file or converting your file from another single-cell file format, click here or reach out to cellxgene@chanzuckerberg.com and mention that you are contributing to the HCA.

Tier 1 metadata fields

Tier 1 metadata fields provide the foundational information used to build tissue and organ atlases. Specifically, the fields help to identify factors that can cause the ‘batch effects’ that arise when datasets are combined into an atlas.

The full list of fields with detailed information on each field can be found here. Some fields are mandatory (labelled ‘must’) whereas others are ‘recommended’.

Submission process

AnnData files are stored and made accessible on CELLxGENE Discover. To submit files, please follow the process below:

Please email the curation team at cellxgene@chanzuckerberg.com with the following information:
- Title
- Description
- Contact: name and email
- Publication/preprint DOI: the digital object identifier (DOI) of the publication or preprint. If neither a preprint nor a publication exists, please write 'not applicable'.
- URLs: any additional URLs for related data or resources, such as GEO or protocols.io (these can be added later)
- Consortia (e.g. HCA)
The team confirms acceptance of your data.
You prepare your AnnData file and send the file to cellxgene@chanzuckerberg.com.
The team will upload your dataset to a private collection where you can review it.
The team will make your dataset either public or private. Private datasets need to be shared with the integration team for inclusion in the integrated object.
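
The information requested in the initial email can be assembled as plain text before sending. The sketch below is illustrative only: the function name, parameter names, and example values are hypothetical, and the curation team does not prescribe this layout.

```python
def build_submission_email(title, description, contact_name, contact_email,
                           doi=None, urls=(), consortia=("HCA",)):
    """Assemble the requested submission details as an email body."""
    lines = [
        f"Title: {title}",
        f"Description: {description}",
        f"Contact: {contact_name} <{contact_email}>",
        # Write 'not applicable' when no publication or preprint exists.
        f"Publication/preprint DOI: {doi or 'not applicable'}",
        # URLs are optional at this stage and can be added later.
        f"URLs: {', '.join(urls) if urls else 'to be added later'}",
        f"Consortia: {', '.join(consortia)}",
    ]
    return "\n".join(lines)


# Placeholder values for illustration only.
body = build_submission_email(
    title="Example lung dataset",
    description="Single-cell RNA-seq of healthy lung tissue.",
    contact_name="Jane Doe",
    contact_email="jane.doe@example.org",
)
```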