Cell-by-gene matrices (commonly referred to as "count matrices" or "expression matrices") are files that contain a measure of gene expression for every gene in every cell in your single-cell sample(s). These matrices can be used for downstream analyses like filtering, clustering, differential expression testing, and annotating cell types.
This overview describes the HCA Data Portal matrix types, how to download them, and how to link them back to the HCA metadata.
Overall, three types of matrices are currently available for HCA Data Portal data: project-level matrices, library-level matrices, and contributor-generated matrices.
Each HCA Data Portal project that is processed with uniform pipelines has two types of HCA Data Portal-generated matrices available for download: project-level matrices and library-level matrices.
Both matrix types are in Loom file format, and contain standard metrics and counts that are specific to the data processing pipeline used to generate the file.
HCA Data Portal-generated Loom matrices have three types of attributes containing metadata and metrics: global attributes, row (gene) attributes, and column (cell) attributes.
For more information on working with Loom attributes and format, see the Loom documentation.
Step-by-step Jupyter Notebook tutorials for analyzing Loom matrices with community tools are available in the cloud-based platform Terra. After registering, get started by navigating to the Intro-to-HCA-data-on-Terra workspace.
Both project matrices and library-level matrices have unique filenames.
Project matrices have filenames in the format
Library-level matrices have filenames matching the numerical ID in the HCA metadata
Project-level matrices are Loom files that contain standardized cell-by-gene measures and metrics for all the data in a project that are of the same species, organ, and sequencing method.
The gene measures in project matrices vary based on the pipeline used for analysis.
Each project matrix also has metadata stored in the Loom file's global attributes, described in the table below. In contrast to the other metadata, the input_id is stored in both the global and column attributes, as this is the ID used to link individual library preparations back to the Data Manifest.
Read more about each metadata field in the Metadata Dictionary.
|Metadata Attribute Name in HCA Data Portal-Generated Matrix||Metadata Description|
|Species information; human or mouse|
|Technology used for library preparation, e.g., 10x or Smart-seq2|
|Metadata values for |
|Metadata values for |
Library-level matrices (also Loom files) are cell-by-gene matrices for each individual library preparation in a project. Overall, library-level matrices:
sequencing_process.provenance.document_id, allowing you to only use a sub-sampling of all the project's data.
input_name (described in the table above).
Contributor-generated matrices are optionally provided by the data contributors and can be useful when trying to annotate cell types or when comparing results back to a contributor’s published results.
When these contributor-generated matrices are available, you can download them from the individual Project page. They will vary in file format and content across projects. For questions about the Contributor-generated matrix, reach out to the contributors listed in the Project page Contact section.
HCA Data Portal-generated project-level matrices and contributor-generated matrices can be downloaded from the individual Project page.
You can also download all matrices (including library-level matrices) using a curl command as described in the Accessing HCA Data and Metadata guide, or export matrices to Terra, a cloud-based platform for bioinformatic analysis (see the Exporting to Terra guide).
HCA Data Portal project-level matrices only contain some of the available project metadata (species, organs, library methods, etc.). However, there are several metadata facets in the Metadata Manifest, such as disease state or donor information, that you might want to link back to the HCA Data Portal-generated cell-by-gene matrix.
To link a metadata field in the Metadata Manifest back to an individual sample in an HCA Data Portal-generated matrix, use the matrix input_id field. This field holds the values of sequencing_process.provenance.document_id, the ID used to demarcate each library preparation.
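The linking step can be sketched in a few lines of Python. This is a minimal illustration, not the Data Portal's own tooling: the manifest text, the library IDs, and the donor_disease column name are made-up placeholders, and the per-cell input_id values are assumed to have already been read from the matrix's column attributes. It also assumes one manifest row per library preparation; if a manifest repeats an ID across rows, the last row wins.

```python
import csv
import io

# Toy stand-in for the Metadata Manifest (tab-separated). The IDs and the
# "donor_disease" column name are hypothetical placeholders.
MANIFEST_TSV = (
    "sequencing_process.provenance.document_id\tdonor_disease\n"
    "lib-001\tnormal\n"
    "lib-002\ttype 2 diabetes\n"
)

# Toy stand-in for the matrix's per-cell input_id column attribute.
cell_input_ids = ["lib-001", "lib-001", "lib-002"]


def manifest_lookup(tsv_text, key_field, value_field):
    """Map each library-preparation ID in the manifest to one metadata value."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[key_field]: row[value_field] for row in reader}


def annotate_cells(input_ids, lookup, missing=None):
    """Return one metadata value per cell, in the same order as input_ids."""
    return [lookup.get(input_id, missing) for input_id in input_ids]


disease_by_library = manifest_lookup(
    MANIFEST_TSV, "sequencing_process.provenance.document_id", "donor_disease"
)
cell_disease = annotate_cells(cell_input_ids, disease_by_library)
print(cell_disease)
```

The resulting list is aligned with the cells of the matrix, so it can be attached as an additional per-cell annotation in whichever analysis tool you use.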
Data normalization and batch correction account for technical noise introduced during sample processing, as well as differences between datasets generated from different contributors or at different times. Both techniques are crucial for identifying differentially expressed genes.
Normalization and batch correction techniques vary between processing methods and individual data contributors, and may not be consistent across the matrices available from the Data Portal.
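To make the idea of normalization concrete, the sketch below scales each cell's raw counts to counts per million and applies a log transform. This is one common, simple approach chosen for illustration; it is an assumption, not the method used by any particular HCA pipeline or contributor, and it does not perform batch correction. The function name and toy counts are hypothetical.

```python
import math


def log_cpm(counts_per_cell):
    """Scale each cell to counts per million, then apply log(1 + x).

    counts_per_cell: list of cells, each a list of raw gene counts.
    Cells with zero total counts are returned as all zeros.
    """
    normalized = []
    for counts in counts_per_cell:
        total = sum(counts)
        if total == 0:
            normalized.append([0.0] * len(counts))
            continue
        normalized.append([math.log1p(c / total * 1e6) for c in counts])
    return normalized


# Two toy cells with the same expression proportions but a tenfold
# difference in sequencing depth.
toy_counts = [[10, 0, 90], [1, 0, 9]]
result = log_cpm(toy_counts)
```

After scaling, the two toy cells have matching values, which shows how library-size normalization removes differences that come only from sequencing depth.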