The long-term goal of the Optimus workflow is to support any 3’ single-cell or single-nucleus transcriptomics assay selected by the HCA project. By keeping the pipeline modular, we hope to grow a generic pipeline with assay-specific modules that address differences between assays, while sharing common code wherever the steps of the assays are the same. We offer this as a community resource for community development and improvement.
The introduction of droplet-based technologies such as inDrop (Klein, et al., 2015) and Drop-seq (Macosko, et al., 2015) moved the throughput of a single-cell RNA sequencing experiment from hundreds to thousands of cells. Technology developed by 10x Genomics further increased throughput to hundreds of thousands of cells and has opened up the possibility of creating datasets for millions of cells. Common among many of the single-cell transcriptomics high-throughput technologies is the use of:
The bead-specific barcodes and UMIs are encoded on sequencing primers that also contain polyT tracts to enable binding of the primers to polyA+ mRNA transcripts. After lysing cells, mRNA transcripts bind to the polyT tracts in the primer and transcripts are reverse transcribed to generate barcoded cDNA. Note that all cDNA molecules from a single cell have the same barcode, but they have different UMIs. Thus every transcript that is captured from an individual cell can be mapped to its cognate cell and also counted as a single transcript, correcting for PCR bias. cDNAs are pooled for amplification and construction of libraries to facilitate 3’ DNA sequencing.
|Pipeline Feature|Description|Source|
|---|---|---|
|Assay Type|10x Single Cell/Nucleus Expression (v2 and v3)|10x Genomics|
|Overall Workflow|Quality control module and transcriptome quantification module|Code available from GitHub|
|Genomic Reference Sequence|GRCh38 human genome primary sequence and M21 (GRCm38.p6) mouse genome primary sequence|GENCODE Human and Mouse|
|Transcriptomic Reference Annotation|GENCODE V27 human transcriptome and M21 mouse transcriptome|GENCODE Human and Mouse|
|Aligner|STAR|Dobin, et al., 2013|
|Transcript Quantification|Utilities for processing large-scale single cell datasets|sctools|
|Data Input File Format|File format in which sequencing data is provided|FASTQ|
|Data Output File Format|File formats in which Optimus output is provided|BAM, Loom version 3|
The workflow runs in two modes: single-cell (sc_rna) or single-nucleus (sn_rna). When appropriate, differences between the modes are noted.
Overall, the workflow:
Special care is taken to avoid the removal of reads that are not aligned or that do not contain recognizable barcodes. This design (which differs from many pipelines currently available) allows the use of the entire dataset by those who may want to use alternative filtering or leverage the data for methodological development associated with the data processing.
A general overview of the pipeline is shown below, followed by more detailed descriptions of the steps.
Each 10x v2 and v3 3’ sequencing experiment generates triplets of FASTQ files:
Because the pipeline processing steps require a BAM file format, the first step of Optimus is to convert the R2 FASTQ files, which contain the alignable genomic information, to BAM files. Next, the FastqProcessing step appends the UMI and Cell Barcode sequences from R1 to the corresponding R2 sequence as tags, so that the genomic information is properly labeled for alignment.
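The tagging step described above can be illustrated with a minimal Python sketch. It is not the FastqProcessing implementation: the function and type names are hypothetical, and the `UR`/`UY` tag names for the raw UMI and its qualities are an assumption borrowed from the common 10x convention. The barcode layout (a 16 bp cell barcode followed by a 10 bp UMI for v2 or a 12 bp UMI for v3) reflects the standard 10x 3’ read structure.

```python
from typing import Dict, NamedTuple

# 10x 3' R1 layout: (cell barcode length, UMI length) per chemistry.
CHEMISTRY_LAYOUT = {"v2": (16, 10), "v3": (16, 12)}


class TaggedRead(NamedTuple):
    """Hypothetical container for an R2 read plus its BAM-style tags."""
    name: str
    sequence: str
    tags: Dict[str, str]


def tag_read(name, r1_seq, r1_qual, r2_seq, chemistry="v2"):
    """Attach the barcode and UMI found in R1 to the R2 read as tags."""
    cb_len, umi_len = CHEMISTRY_LAYOUT[chemistry]
    if len(r1_seq) < cb_len + umi_len:
        raise ValueError("R1 is shorter than cell barcode + UMI")
    tags = {
        "CR": r1_seq[:cb_len],                    # raw cell barcode
        "CY": r1_qual[:cb_len],                   # cell barcode qualities
        "UR": r1_seq[cb_len:cb_len + umi_len],    # raw UMI (assumed tag name)
        "UY": r1_qual[cb_len:cb_len + umi_len],   # UMI qualities (assumed tag name)
    }
    return TaggedRead(name, r2_seq, tags)


read = tag_read("read1", "A" * 16 + "C" * 10, "I" * 26, "GATTACA", "v2")
print(read.tags["CR"], read.tags["UR"])
```

Only R2 carries sequence to be aligned; R1 contributes nothing but the tags, which is why the pair can be collapsed into a single tagged record.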
Although the function of the cell barcodes is to identify unique cells, barcode errors can arise during sequencing (such as the incorporation of the barcode into contaminating DNA or sequencing and PCR errors), making it difficult to distinguish unique cells from artifactual appearances of the barcode. Barcode errors are evaluated in the FastqProcessing step mentioned above, which compares the sequences against a whitelist of known barcode sequences.
The output BAM files contain the reads with correct barcodes, including barcodes that were within an edit distance (Levenshtein distance) of one from a sequence in the barcode whitelist and were corrected by this tool. Correct barcodes are assigned a “CB” tag. Uncorrected barcodes (with more than one error) are preserved and given a “CR” (Cell barcode Raw) tag. Cell barcode quality scores are also preserved in the file under the “CY” tag.
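As a rough illustration of whitelist correction, the sketch below assigns a “CB” tag only when the raw barcode matches the whitelist exactly or lies within a Levenshtein distance of one from exactly one whitelist entry. The function names are hypothetical, and two choices are assumptions of this sketch rather than documented Optimus behavior: ambiguous barcodes are left uncorrected, and the raw sequence is kept in “CR” on every read. A linear scan of the whitelist is used for clarity, not speed.

```python
def within_one_edit(a: str, b: str) -> bool:
    """Return True if the Levenshtein distance between a and b is at most 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    # Advance past the shared prefix to the first mismatch.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]   # one substitution
    if len(a) > len(b):
        return a[i + 1:] == b[i:]       # one extra base in a
    return a[i:] == b[i + 1:]           # one extra base in b


def correct_barcode(raw, whitelist):
    """Return the corrected barcode, or None if it cannot be corrected."""
    if raw in whitelist:
        return raw
    candidates = [w for w in whitelist if within_one_edit(raw, w)]
    # Assumption: skip correction when more than one entry is equally close.
    return candidates[0] if len(candidates) == 1 else None


def barcode_tags(raw, qual, whitelist):
    """Build the CR/CY tags, adding CB only for correct or corrected barcodes."""
    tags = {"CR": raw, "CY": qual}
    corrected = correct_barcode(raw, whitelist)
    if corrected is not None:
        tags["CB"] = corrected
    return tags


whitelist = {"AAAA", "CCCC", "GGGG"}
print(barcode_tags("AAAT", "IIII", whitelist))
print(barcode_tags("ACGT", "IIII", whitelist))
```

Reads without a “CB” tag stay in the output, which matches the pipeline's stated goal of not discarding reads with unrecognized barcodes.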
The STAR alignment software (Dobin, et al., 2013) is used to map barcoded reads in the BAM files to the human genome primary assembly reference. STAR (Spliced Transcripts Alignment to a Reference) is widely used for RNA-seq alignment and identifies the best matching location(s) on the reference for each sequencing read.
The TagGeneExon task then annotates each read with the type of sequence to which it aligns. These annotations differ between single-cell and single-nucleus modes.
In single-cell mode, annotations include INTERGENIC, INTRONIC, UTR and CODING (EXONIC), and are stored using the 'XF' BAM tag. In cases where the alignment corresponds to an exon or UTR, the name of the gene that overlaps the alignment is associated with the read and stored using the GE BAM tag.
In single-nucleus mode, annotations include INTERGENIC, INTRONIC, UTR and CODING (EXONIC), and are stored using the 'XF' BAM tag. In cases where the alignment corresponds to an exon, UTR, or intron, the name of the gene that overlaps the alignment is associated with the read and stored using the GE BAM tag.
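The difference between the two modes comes down to which annotation types receive a gene name. A minimal sketch, assuming the annotation type and the overlapping gene have already been determined for each read (the function and dictionary names are hypothetical, not part of TagGeneExon):

```python
# Annotation types that receive a GE tag in each mode, per the text above.
GENE_BEARING_FEATURES = {
    "sc_rna": {"CODING", "UTR"},
    "sn_rna": {"CODING", "UTR", "INTRONIC"},
}


def annotation_tags(feature, gene, mode):
    """Return the XF tag, plus GE when the mode assigns a gene to this feature."""
    tags = {"XF": feature}
    if gene is not None and feature in GENE_BEARING_FEATURES[mode]:
        tags["GE"] = gene
    return tags


# The same intronic alignment is gene-tagged only in single-nucleus mode.
print(annotation_tags("INTRONIC", "ACTB", "sc_rna"))
print(annotation_tags("INTRONIC", "ACTB", "sn_rna"))
```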
UMIs are designed to distinguish unique transcripts present in the cell at lysis from those arising from PCR amplification of these same transcripts. But, like cell barcodes, UMIs can also be incorrectly sequenced or amplified. Optimus uses the UMI-tools software package, which applies a network-based method to account for such errors (Smith, et al., 2017). Optimus uses the “directional” method.
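To make the “directional” method concrete, here is a small self-contained sketch of the idea as described by Smith, et al. (2017). It does not call UMI-tools and is not the code Optimus runs. UMIs observed for one cell and gene are nodes; a directed edge runs from UMI A to UMI B when they differ at one position and count(A) ≥ 2 × count(B) − 1. Each group reachable from the most abundant remaining UMI is treated as a single original molecule. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(umi_counts):
    """Group UMIs that plausibly derive from the same molecule.

    umi_counts maps each UMI sequence to its read count.
    """
    # Directed edge u -> v: one mismatch and u is sufficiently more abundant.
    edges = {
        u: [
            v for v in umi_counts
            if v != u
            and len(v) == len(u)
            and hamming(u, v) == 1
            and umi_counts[u] >= 2 * umi_counts[v] - 1
        ]
        for u in umi_counts
    }
    groups, assigned = [], set()
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    for umi in sorted(umi_counts, key=lambda u: (-umi_counts[u], u)):
        if umi in assigned:
            continue
        group, stack = [umi], [umi]
        assigned.add(umi)
        while stack:
            node = stack.pop()
            for neighbor in edges[node]:
                if neighbor not in assigned:
                    assigned.add(neighbor)
                    group.append(neighbor)
                    stack.append(neighbor)
        groups.append(group)
    return groups


counts = {"ACGT": 456, "TCGT": 2, "CCGT": 2, "ACAT": 72, "ACAG": 1, "AAAT": 90}
groups = directional_groups(counts)
print(len(groups))  # estimated number of distinct molecules
```

In this example, AAAT stays separate from ACAT even though the two differ by one base, because their counts are too similar for one to be a likely error copy of the other.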
A number of quality control tools are used to assess the quality of the data output each time this pipeline is run. For a list of the tools and information about each one please see our QC Metrics page.
For the single-cell mode, the pipeline runs the emptyDrops function from the DropletUtils R package to identify cell barcodes that correspond to empty droplets. Empty droplets are those that did not encapsulate a cell but instead acquired cell-free RNA from the solution in which the cells resided, such as secreted RNA or RNA released when some cells lysed in solution (Lun, et al., 2018). This ambient RNA can serve as a substrate for reverse transcription, leading to a small number of background reads. Cell barcodes that are not believed to represent cells are identified in the metrics, and the raw output from DropletUtils is provided to the user.
The pipeline outputs a count matrix that contains, for each cell barcode and for each gene, the number of molecules that were observed. The script that generates this matrix evaluates every read. It discards any read that maps to more than one gene, and counts any remaining reads provided the triplet of cell barcode, molecule barcode, and gene name is unique, indicating the read originates from a single transcript present at the time of lysis of the cell represented by that respective barcode.
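The counting rule above can be sketched in a few lines of Python. This is one plausible reading of the rule, not the sctools implementation. It assumes each read has been reduced to a hypothetical `(cell_barcode, umi, genes)` record, where `genes` is a tuple of the gene names the read maps to, and it assumes that reads lacking a cell barcode or UMI are skipped during counting.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count unique (cell barcode, UMI, gene) triplets per cell and gene.

    reads: iterable of (cell_barcode, umi, genes) tuples.
    Returns a dict mapping (cell_barcode, gene) to a molecule count.
    """
    seen = set()
    counts = defaultdict(int)
    for cell_barcode, umi, genes in reads:
        # Assumption: reads without a barcode or UMI cannot be counted.
        if cell_barcode is None or umi is None:
            continue
        # Discard reads that map to no gene or to more than one gene.
        if len(genes) != 1:
            continue
        triplet = (cell_barcode, umi, genes[0])
        if triplet in seen:
            continue  # same molecule seen again, e.g. a PCR duplicate
        seen.add(triplet)
        counts[(cell_barcode, genes[0])] += 1
    return dict(counts)


reads = [
    ("AAAC", "UMI1", ("GeneA",)),
    ("AAAC", "UMI1", ("GeneA",)),          # duplicate of the first read
    ("AAAC", "UMI2", ("GeneA",)),
    ("AAAC", "UMI3", ("GeneA", "GeneB")),  # multi-gene read, discarded
    ("TTTG", "UMI1", ("GeneA",)),
]
print(count_molecules(reads))
```

The resulting dictionary is a sparse form of the count matrix, with one entry per cell barcode and gene pair that has at least one counted molecule.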
Outputs of the pipeline include:
In addition to viewing the Optimus code in WARP or on Dockstore, you can try the Optimus Pipeline on the cloud-based platform Terra. After registering on Terra, navigate to the Optimus Featured Workspace which is preloaded with instructions and sample data.
Additionally, you can use the public Intro-to-HCA-data-on-Terra workspace to analyze an example Optimus cell-by-gene count matrix (Loom file) with multiple downstream community tools, such as Seurat, Scanpy, Cumulus, and Pegasus.
For more information on using the Terra platform, please view the Support Center.
All Optimus workflow versions are detailed in the Optimus Changelog in GitHub.
This documentation applies to Optimus v4.1.7 and later. If you are working with data processed with a previous version, please check the Optimus changelog for any data processing changes that may be applicable to your data.
For more detailed information about the Optimus pipeline, please see the Optimus Overview in the WARP repository documentation.