There are several ways to access the data in the Data Store. This section briefly reviews how to find and download data and associated metadata using the two most common methods, the Data Browser and the CLI. It then points to some example programs that demonstrate programmatic access patterns.
Both the Data Browser and the CLI download methods require the HCA CLI to be installed first.
The Explore section of the data portal provides an interactive data browser. Select a subset of data by checking boxes in the Organ, Method, Donor, and Specimen sections. The Specimens tab shows how many specimens have been selected, along with an estimate of the size of the data set if the entire list were downloaded.
Once you have selected data through the user interface, click the Download button on the right-hand side of the page to download the list of data sets (note that the actual data specified by the list is NOT downloaded in this step). We call this list a manifest. The Download dialog box lets you further refine which types of files are included in the manifest. Note that the sizes listed are for the actual files, not for the manifest itself.
Press the Download Manifest button and a file called <uuid>.tsv will be saved to your local file system. Note that the file name uses a UUID to avoid overwriting previous downloads.
The manifest is a simple tab-separated text file whose first line contains the header for each column. It is OK to remove rows for unwanted files, but the header row must remain and the columns must stay the same.
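The rule above (drop unwanted rows, keep the header and columns intact) can be sketched in Python. This is a minimal illustration, not part of the HCA tooling: the function name `filter_manifest` and the column names `file_name` and `file_size` in the sample are hypothetical, and a real manifest will have its own columns.

```python
import csv
import io


def filter_manifest(manifest_text, keep_row):
    """Return manifest text with the header plus only the rows where keep_row(row) is true."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    out = io.StringIO()
    # Reuse the original field names so the header and column order are unchanged.
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if keep_row(row):
            writer.writerow(row)
    return out.getvalue()


# Hypothetical two-column manifest used only for illustration.
sample = "file_name\tfile_size\nreads_1.fastq.gz\t1024\nreport.pdf\t2048\n"
trimmed = filter_manifest(sample, lambda row: row["file_name"].endswith(".fastq.gz"))
print(trimmed, end="")
```

To apply this to a downloaded manifest, read the <uuid>.tsv file into a string, filter it, and write the result to a new file that is then passed to the CLI.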
The CLI is a powerful tool for finding and downloading data from the Data Store. The hca tool is divided into several subsections; data search, inspection, and download are all available under the hca dss section. Help text is available by typing:
hca dss --help
Help is also available for each command under the dss section. For example, to see help for the get-bundle command, enter:
hca dss get-bundle --help
Now let's try using the manifest file to download files.
First, print the help for the download-manifest command:
hca dss download-manifest --help
Next, run the command to download the files listed in the <uuid>.tsv file:
hca dss download-manifest --manifest <uuid>.tsv --replica aws
Note that the download could take a long time depending on the number and size of files included in the manifest file.
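If you prefer to start the manifest download from a script, the same command can be launched from Python. The sketch below only wraps the CLI invocation shown above; the helper names and the file name `manifest.tsv` are hypothetical, and it assumes the hca executable is on your PATH.

```python
import shutil
import subprocess


def download_manifest_command(manifest_path, replica="aws"):
    """Build the argument list for the download-manifest command shown above."""
    return [
        "hca", "dss", "download-manifest",
        "--manifest", manifest_path,
        "--replica", replica,
    ]


def run_download(manifest_path, replica="aws"):
    """Run the download; raises if the CLI is missing or exits with an error."""
    if shutil.which("hca") is None:
        raise RuntimeError("hca CLI not found on PATH; install it first")
    # check=True turns a non-zero exit status into CalledProcessError.
    subprocess.run(download_manifest_command(manifest_path, replica), check=True)


# Only build and show the command here; call run_download() to actually start it.
command = download_manifest_command("manifest.tsv")  # hypothetical file name
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.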
You can get a list of bundles in the Data Store by using the following Elasticsearch query:
hca dss post-search --es-query "{}" --replica=aws | less
Note that this returns only the first page of results. Listing every bundle in the Data Store is rarely useful, though; a more targeted query narrows the results, for example:
hca dss post-search --replica aws --es-query '{"query": {"bool": {"must": {"match": {"files.project_json.insdc_project.text": "SRP075496"}} , "must_not": { "exists": { "field": "files.analysis_file_json" }} } } }'
This command finds the IDs of the bundles that contain the original files associated with the project SRP075496, excluding bundles that contain analysis files.
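Hand-writing nested JSON on the command line is error-prone, so it can help to build the query as a data structure and serialize it. The following sketch reproduces the query above; the function name `project_query` and its `exclude_analysis` parameter are illustrative assumptions, while the field names are taken from the command above.

```python
import json
import shlex


def project_query(project_id, exclude_analysis=True):
    """Build an Elasticsearch bool query matching bundles for one INSDC project."""
    bool_clause = {
        "must": {"match": {"files.project_json.insdc_project.text": project_id}}
    }
    if exclude_analysis:
        # Skip bundles that carry an analysis_file_json metadata document.
        bool_clause["must_not"] = {"exists": {"field": "files.analysis_file_json"}}
    return {"query": {"bool": bool_clause}}


query = project_query("SRP075496")
es_query_arg = json.dumps(query)

# shlex.quote wraps the JSON so it can be pasted into a POSIX shell safely.
print("hca dss post-search --replica aws --es-query " + shlex.quote(es_query_arg))
```

Building the query this way makes it easy to change the project ID or drop the `must_not` clause without rebalancing braces by hand.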
Once you find bundles that you would like to download, use the hca dss download command as shown below. (Note that this command can be run over a list of bundle IDs with some basic shell scripting.)
For example, if your search returned the following bundle ID information:
{
  "bundle_fqid": "2f08b7cd-2e39-44f2-b7fa-d4a373266104.2018-08-28T213422.136870Z",
  "bundle_url": "https://dss.data.humancellatlas.org/v1/bundles/2f08b7cd-2e39-44f2-b7fa-d4a373266104?version=2018-08-28T213422.136870Z&replica=aws",
  "search_score": null
}
then to download that bundle from the AWS replica you would use this command:
hca dss download --bundle-uuid 2f08b7cd-2e39-44f2-b7fa-d4a373266104 --version 2018-08-28T213422.136870Z --replica aws
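In the example above, the bundle_fqid value is the bundle UUID and the version joined by a dot. A short Python sketch can split that value and generate one download command per bundle. The helper names here are hypothetical, and the sketch assumes every fqid follows the `<uuid>.<version>` shape seen in the example.

```python
def split_fqid(bundle_fqid):
    """Split '<uuid>.<version>' at the first dot; the version itself contains a dot."""
    # Raises ValueError if the string contains no dot at all.
    uuid, version = bundle_fqid.split(".", 1)
    return uuid, version


def download_command(bundle_fqid, replica="aws"):
    """Build the hca dss download argument list for one bundle."""
    uuid, version = split_fqid(bundle_fqid)
    return [
        "hca", "dss", "download",
        "--bundle-uuid", uuid,
        "--version", version,
        "--replica", replica,
    ]


# A list of fqids, for example collected from post-search results.
fqids = ["2f08b7cd-2e39-44f2-b7fa-d4a373266104.2018-08-28T213422.136870Z"]
for fqid in fqids:
    print(" ".join(download_command(fqid)))
```

For the single fqid above, the loop prints the same download command as shown in the text; the printed lines can be saved to a file and run as a shell script.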
The Data Coordination Platform (DCP) offers several programmatic ways to access the data. The Application Programming Interfaces (APIs) that we provide are described in the API documentation. The developers of the DCP have also created a number of example programs demonstrating how to use the APIs; these can be found in the consumer vignettes. The examples are designed to demonstrate basic access patterns and are not intended to demonstrate any type of analysis.