Submitting Count Matrices and Tier 1 Metadata to CELLxGENE Discover
File format
Data contributors wishing to submit datasets to HCA for integration into atlases are required to submit an AnnData file. AnnData files combine matrices and metadata into a single file, so metadata should be captured on a per-cell basis.
An AnnData file has several components, including:
- uns (Dataset metadata), which describes the dataset as a whole
- X (Matrix layers), which describes the data required for different assays
- obs (Cell metadata), which describes each cell in the dataset
- obsm (Embeddings), which describes each embedding in the dataset
- obsp, which describes pairwise annotation of observations
- var and raw.var (Gene metadata), which describe each gene in the dataset
- varm, which describes multidimensional annotation of variables/features
- varp, which describes pairwise annotation of variables/features
File format specifications
- Dataset-level metadata in uns:
  - title
  - study_pi
  - batch_condition
  - default_embedding
  - comments
- Data in .X and raw.X:
  - raw counts are required
  - normalized counts are strongly recommended
  - raw counts should be in raw.X if normalized counts are in .X
  - if there is no normalized matrix, raw counts should be in .X
- Cell metadata in obs: Metadata fields are to be captured on a per-cell basis.
  In addition to the HCA obs fields, three further fields are required for submission to CELLxGENE. These fields are not part of the HCA Tier 1 metadata fields.
- Embeddings in obsm:
  - One or more two-dimensional embeddings, prefixed with 'X_'
- Features in var & raw.var (if present):
  - index is Ensembl ID
  - preference is that genes have not been filtered, in order to maximise future data integration efforts
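The checks implied by the specification above can be sketched in plain Python. This is a minimal illustration, not an official validator: the function names are hypothetical, the Ensembl pattern is an assumption about how gene IDs are usually formatted, and the uns fields are only checked for presence, since this section does not state which of them are mandatory.

```python
import re

# Dataset-level fields listed in the specification above.
UNS_FIELDS = ["title", "study_pi", "batch_condition", "default_embedding", "comments"]

# Assumed shape of an Ensembl gene ID: "ENS", optional species letters, "G",
# 11 digits, optional ".version" suffix (e.g. ENSG00000139618).
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def missing_uns_fields(uns):
    """Return the listed dataset-level fields that are absent from `uns`."""
    return [field for field in UNS_FIELDS if field not in uns]


def matrix_layout(has_normalized):
    """Return where each matrix belongs, following the rules for .X and raw.X."""
    if has_normalized:
        return {"X": "normalized counts", "raw.X": "raw counts"}
    return {"X": "raw counts"}


def two_dimensional_embeddings(obsm):
    """Return the obsm keys that start with 'X_' and hold two values per cell.

    Assumes each embedding is array-like with one row per cell.
    """
    return [
        key
        for key, coords in obsm.items()
        if key.startswith("X_") and len(coords) > 0 and len(coords[0]) == 2
    ]


def non_ensembl_ids(var_index):
    """Return the feature identifiers that do not look like Ensembl gene IDs."""
    return [gene for gene in var_index if not ENSEMBL_GENE_ID.match(str(gene))]


# Toy inputs standing in for adata.uns, adata.obsm and adata.var_names.
example_uns = {"title": "Example dataset", "study_pi": "Doe,Jane"}
example_obsm = {
    "X_umap": [[0.1, 1.2], [0.3, 0.9]],
    "X_pca": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
}
example_var = ["ENSG00000139618", "ENSMUSG00000064341.1", "BRCA2"]

print("uns fields not yet filled in:", missing_uns_fields(example_uns))
print("matrix placement:", matrix_layout(has_normalized=True))
print("usable 2D embeddings:", two_dimensional_embeddings(example_obsm))
print("identifiers that are not Ensembl IDs:", non_ensembl_ids(example_var))
```

With an AnnData object loaded as `adata`, the same functions could be called with `adata.uns`, `adata.obsm` and `adata.var_names`, since the first two behave like mappings and the last is an iterable index.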
Tier 1 metadata fields
Tier 1 metadata fields provide the foundational information used to build tissue and organ atlases. Specifically, the fields help to identify factors that can cause ‘batch effects’ that may arise when combining datasets into an atlas.
The full list of fields with detailed information on each field can be found here. Some fields are mandatory (labelled ‘must’) whereas others are ‘recommended’.
Submission process
AnnData files are stored and made accessible on CELLxGENE Discover. To submit files, please follow these steps:
- Please reach out to the curation team at cellxgene@chanzuckerberg.com with an email containing the following information:
  - Title
  - Description
  - Contact: name and email
  - Publication/preprint DOI: The publication digital object identifier (DOI) for the protocol. If neither a preprint nor a publication exists, please write 'not applicable'.
  - URLs: any additional URLs for related data or resources, such as GEO or protocols.io (these can be added later)
  - Consortia (e.g. HCA)
- The team confirms acceptance of your data.
- You prepare your AnnData file and send the file to cellxgene@chanzuckerberg.com.
- The team will upload your dataset to a private collection where you can review it.
- The team will make your dataset either public or private. Private datasets need to be shared with the integration team for inclusion in the integrated object.
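The information requested in the first step can be gathered into a message before writing to the curation team. The sketch below uses Python's standard `email.message.EmailMessage`; the function name, its parameters and the subject line are hypothetical, and the curation team does not prescribe any particular layout. It only builds the message and does not send it.

```python
from email.message import EmailMessage

# Address of the curation team, as given in the submission process above.
CURATION_ADDRESS = "cellxgene@chanzuckerberg.com"


def build_submission_request(title, description, contact_name, contact_email,
                             doi=None, urls=(), consortia=()):
    """Assemble the initial request email; all parameter names are illustrative."""
    msg = EmailMessage()
    msg["To"] = CURATION_ADDRESS
    msg["From"] = contact_email
    msg["Subject"] = f"Dataset submission request: {title}"  # hypothetical subject
    lines = [
        f"Title: {title}",
        f"Description: {description}",
        f"Contact: {contact_name} <{contact_email}>",
        # The process asks for 'not applicable' when no DOI exists.
        f"Publication/preprint DOI: {doi or 'not applicable'}",
        # Related URLs are optional at this stage and can be added later.
        f"URLs: {', '.join(urls) if urls else 'to be added later'}",
        f"Consortia: {', '.join(consortia)}",
    ]
    msg.set_content("\n".join(lines))
    return msg


request = build_submission_request(
    title="Example lung dataset",
    description="Single-cell RNA-seq of healthy adult lung tissue.",
    contact_name="Jane Doe",
    contact_email="jane.doe@example.org",
    consortia=["HCA"],
)
print(request)
```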