This document describes the structure of the HCA metadata standard as implemented and used by the Data Coordination Platform (DCP). A more detailed specification of the format and syntax of the metadata schemas and their instantiation can be found in the metadata schema structure specification on Google Drive.
What is in this document?
Who should be reading this document?
What isn't in this document?
The primary motivations for having a structured HCA metadata standard are to:
The primary motivations for the entity model chosen to describe the HCA metadata standard are to:
There are five major entities supported by the HCA metadata standard: Projects, Biomaterials (biological samples), Protocols, Processes, and Files.
The entities are arranged in units that represent different parts of an experiment. For example, the diagram below is an abstract illustration of an input biomaterial (e.g., a tissue sample) undergoing a process (e.g., dissociation) to produce another biomaterial (e.g., a sample of dissociated cells). The process that was executed followed a specific protocol - or intended plan - to produce the output biomaterial.
The metadata entity model supports units that can have one or more biomaterials or files as inputs or outputs. If both the input and output of a process are biomaterials, the unit represents some form of sample preparation. If the input is a biomaterial and the output is a file, the unit represents what many scientists refer to as an "assay". If both the input and output of a process are files, the unit represents an analysis. This flexible model also leaves room for modeling synthetic biology experiments in the future, for example one in which a file is used as an input to produce a custom biomaterial.
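The input/output rules above can be sketched as a small lookup. This is only an illustration: the function name `classify_unit` and the string labels are hypothetical and are not part of the HCA metadata standard.

```python
VALID_KINDS = {"biomaterial", "file"}


def classify_unit(input_kind: str, output_kind: str) -> str:
    """Name the kind of experimental unit implied by a process's input and output.

    Hypothetical helper: the labels mirror the prose, not any HCA schema field.
    """
    if input_kind not in VALID_KINDS or output_kind not in VALID_KINDS:
        raise ValueError("inputs and outputs must be 'biomaterial' or 'file'")
    if input_kind == "biomaterial" and output_kind == "biomaterial":
        return "sample preparation"
    if input_kind == "biomaterial" and output_kind == "file":
        return "assay"
    if input_kind == "file" and output_kind == "file":
        return "analysis"
    # file -> biomaterial: described only as a possible future use case
    return "unspecified"


if __name__ == "__main__":
    print(classify_unit("biomaterial", "file"))
```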
Below is an example single cell sequencing experiment modeled using the HCA metadata entity model.
Each instance of a project, biomaterial, protocol, process, and file entity is represented by a specific Type metadata schema. For example, in the HCA metadata standard each donor biomaterial is represented and described by the donor schema. Each Type schema includes a set of fields specific to describing that particular entity, and also inherits a set of core fields for that entity type - represented in a Core entity schema - and an optional set of thematically related fields - each set represented in a Module entity schema.
Most metadata fields in Core, Type, and Module schemas are provided by data contributors during the submission process. A subset of fields in the metadata standard, however, are provided by the Ingestion Service component of the DCP. These ingest-supplied fields include the following:
- provenance field in all Type schemas
- describedBy, schema_version, and schema_type fields in all schemas
- links.json and provenance.json schemas

Ingest-supplied fields contain metadata about the submission (e.g. when metadata was uploaded to the DCP) and about the schema being used. Data contributors will never submit values to these fields directly.
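As a rough illustration of the rule that contributors never supply these fields, the sketch below flags ingest-supplied field names found in a contributor's document. The helper name `find_ingest_supplied_fields` is hypothetical, and checking only top-level keys is an assumption made for brevity; the Ingestion Service may handle this differently.

```python
# Field names taken from the list of ingest-supplied fields above.
INGEST_SUPPLIED_FIELDS = {"provenance", "describedBy", "schema_version", "schema_type"}


def find_ingest_supplied_fields(contributor_doc: dict) -> list:
    """Return the ingest-supplied field names present at the top level of a document.

    Hypothetical check: nested objects are not inspected.
    """
    return sorted(INGEST_SUPPLIED_FIELDS.intersection(contributor_doc))


if __name__ == "__main__":
    # Example document with made-up contributor content.
    draft = {"schema_type": "biomaterial", "biomaterial_core": {}}
    print(find_ingest_supplied_fields(draft))
```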
In addition to ingest-supplied metadata fields, the ontology and ontology_label fields present in all ontology schemas will be populated by a look-up service based on the value supplied by data contributors in the corresponding text field in each ontology schema.
The metadata standard is stored as a series of individual schemas which represent the entities and the fields associated with them (e.g., project.json, biomaterial_core.json, sequencing_protocol.json). The schemas are stored in a single version-controlled GitHub repository alongside documentation about the schemas, the meaning of their content, and the update process. Anyone can propose changes to the schemas through GitHub pull requests and issues. Only a specific list of committers is allowed to approve pull requests and release new versions of the metadata schemas.
Each JSON metadata schema is self-describing using the $id field with a URL to the location of that specific version of the schema.
In the donor_organism.json schema, the $id field looks like:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://schema.humancellatlas.org/type/biomaterial/10.1.1/donor_organism",
...
}
Each metadata document for a Type entity requires a property, describedBy, that explicitly records the URI of the metadata schema which represents it.
In the donor_organism.json schema, the describedBy field is specified as:
"describedBy": {
"description": "The URL reference to the schema.",
"type": "string",
"pattern": "^(http|https)://schema.(.*?)humancellatlas.org/type/biomaterial/(([0-9]{1,}.[0-9]{1,}.[0-9]{1,})|([a-zA-Z]*?))/donor_organism"
}
In a metadata document representing a Donor organism entity, the describedBy field is expressed as:
{
"describedBy": "http://schema.staging.data.humancellatlas.org/type/biomaterial/10.1.1/donor_organism",
...
}
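To see how the describedBy pattern and a document value fit together, the sketch below checks a value against the pattern using Python's standard re module. The pattern string is copied verbatim from the schema excerpt above; the helper name `matches_described_by` is hypothetical, and re.search is used on the assumption that the pattern is applied unanchored, as JSON Schema's pattern keyword specifies.

```python
import re

# Pattern copied verbatim from the describedBy property in donor_organism.json.
DESCRIBED_BY_PATTERN = (
    r"^(http|https)://schema.(.*?)humancellatlas.org/type/biomaterial/"
    r"(([0-9]{1,}.[0-9]{1,}.[0-9]{1,})|([a-zA-Z]*?))/donor_organism"
)


def matches_described_by(value: str) -> bool:
    """Return True if the value satisfies the donor_organism describedBy pattern."""
    # JSON Schema's "pattern" keyword is not implicitly anchored, so search is used.
    return re.search(DESCRIBED_BY_PATTERN, value) is not None


if __name__ == "__main__":
    document = {
        "describedBy": "http://schema.staging.data.humancellatlas.org/type/biomaterial/10.1.1/donor_organism"
    }
    print(matches_described_by(document["describedBy"]))
```

Note that the dots in the pattern are unescaped, so each one matches any single character rather than only a literal dot. A stricter check would escape them, but the sketch keeps the pattern exactly as it appears in the schema.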
The structure of metadata schema URIs follows the convention:
http://schema.humancellatlas.org/{primary_directory}/{secondary/directory/structure}/{version}/{unqualified_schema_name}
where
- {primary_directory} is one of [core, type, module]
- {secondary/directory/structure} describes the path to the schema, e.g. biomaterial, process/sequencing
- {version} is the version number of the schema, e.g. 10.1.1
- {unqualified_schema_name} is the unqualified name of the schema, e.g. donor_organism
Some example URIs include:
http://schema.humancellatlas.org/core/biomaterial/5.0.1/biomaterial_core
http://schema.humancellatlas.org/type/biomaterial/5.0.0/cell_line
http://schema.humancellatlas.org/type/protocol/sequencing/5.0.0/library_preparation_protocol
http://schema.humancellatlas.org/module/ontology/5.0.0/cell_type_ontology
http://schema.humancellatlas.org/module/process/sequencing/5.2.0/barcode
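The URI convention can be illustrated by splitting a schema URI back into its four components. This is a minimal sketch using only the Python standard library; the names `SchemaUri` and `parse_schema_uri` are hypothetical, and the requirement that the version be three dot-separated numbers is an assumption based on the examples above.

```python
import re
from typing import NamedTuple
from urllib.parse import urlparse


class SchemaUri(NamedTuple):
    primary_directory: str
    secondary_path: str
    version: str
    schema_name: str


PRIMARY_DIRECTORIES = {"core", "type", "module"}
# Assumption: versions look like 10.1.1 (three numeric parts).
VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def parse_schema_uri(uri: str) -> SchemaUri:
    """Split a schema URI into the components named in the convention."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) < 4:
        raise ValueError(f"too few path segments in {uri!r}")
    # First segment, any number of middle segments, then version and name.
    primary, *secondary, version, name = parts
    if primary not in PRIMARY_DIRECTORIES:
        raise ValueError(f"unknown primary directory {primary!r}")
    if not VERSION_RE.fullmatch(version):
        raise ValueError(f"unexpected version {version!r}")
    return SchemaUri(primary, "/".join(secondary), version, name)


if __name__ == "__main__":
    print(parse_schema_uri(
        "http://schema.humancellatlas.org/module/process/sequencing/5.2.0/barcode"
    ))
```

Taking the version and schema name from the end of the path lets the secondary directory structure span any number of segments, as in process/sequencing.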