NCBI download GEO/SRA data¶
usage: sra_download.py [-h] [-j JID] [--pipeline_type PIPELINE_TYPE] -f
DATA_LIST
given a list of SRA ID, download them
optional arguments:
-h, --help show this help message and exit
-j JID, --jid JID enter a job ID, which is used to make a new directory.
Every output will be moved into this folder. (default:
sra_download_yli11_2019-07-11)
--pipeline_type PIPELINE_TYPE
Not for end-user. (default: sra_download)
-f DATA_LIST, --data_list DATA_LIST
a list of SRA IDs, one per line (default: None)
Summary¶
This is an automate pipeline to download SRA fastq data given a list of SRA IDs. The specific commands and one bad thing about SRA database is discussed in sra_tools.
Input¶
The following screenshots show you how to get a list of SRA IDs.
The first step is to have a GEO ID, usually it can be found in the paper (e.g., search for GSE
). If you have an SRA ID, you can directly go to step 3.
Google the GEO ID and click the first hit, you will see the GEO webpage below.
Scroll down to the bottom, click on the SRA RUN selector
.
In the SRA webpage, if you want to download all the data, click on the accession list
in the Total row. You can also select the data you need by clicking on the checkbox on the leftside and then click on the accession list
in the Selected row. Once you have a list of accession numbers, copy them to HPC. Suppose I name this file as data.list
, then I will use the following command to download the data.
Usage¶
Go to your data directory and type the following.
Step 0: Load python version 2.7.13.
module load python/2.7.13
Step 1: Prepare input file, which is data.list
sra_download.py -f data.list
Output¶
Once the job is finished, you will receive a notification email. Data is downloaded in the Job ID folder.
Download large collection of data¶
Search NCBI SRA databases, find all the data you need
They are probably belong to different SRA project. In that case, I will download all info table and accession list, rename them has PRJNAxxxxxx.list
and PRJNAxxxxxx.info
. For example:
PRJNA396940.info
PRJNA401837.list
PRJNA413473.info
PRJNA396940.list
PRJNA401837.info
PRJNA413473.list
Submit multiple jobs
hpcf_interactive
module load python/2.7.13
for i in *.list ; do sra_download.py -f $i -j ${i%.list};done
Check downloaded data
For this latest sra-tools version, it should have no problem downloading files. However, if you see something like fasterq.tmp.nodecn002.23272
in your result folder, then it means a corrupted data. And you have to download this particular SRR data again.
A simple way to get a list of failed SRR ids:
cd log_files
grep gzip *err | cut -d " " -f 2 | cut -d "*" -f 1 > failed.list
Comments¶
code @ github.