Downloading data from NCBI via the command line

Description

The National Center for Biotechnology Information (NCBI) offers a wealth of databases, analysis tools and reports for use in research by the medical and scientific community.

These resources are freely available to download from the NCBI website. Because most of the datasets are large (gigabytes to terabytes), the recommended transfer method is the Aspera Connect browser plugin.

You can use Aspera Connect directly on the NCBI website in your browser by clicking and downloading the datasets of your choice.

Alternatively, you can download data from NCBI through the command line with ascp, Aspera’s transfer tool, which comes bundled with your Connect installation.

Usage

The general syntax for downloading data from NCBI is the following:

/path/to/ascp -T -k 1 -i /path/to/private/key anonftp@ftp.ncbi.nlm.nih.gov:/path/to/data /local/location

The components of the command can be broken down as follows:

  • /path/to/ascp is the full path to the ascp program. Typical locations are listed in the next section.
  • -T disables encryption for the transfer, as NCBI’s download server doesn’t offer encryption.
  • -k 1 enables resume: if the transfer stops because of connection loss or other issues, it picks up where it left off instead of starting over. This matters because most NCBI datasets are large. The 1 specifies that a sparse checksum is performed before resuming, which is the best choice for NCBI data because a full checksum on large files may be slow. For more information on the resume option, see this Knowledge Base article.
  • -i /path/to/private/key specifies the path to the private key used to authenticate the transfer. Specify the FULL path to the key (in other words, ~/path/to/key or similar shortcuts will not work).
  • anonftp is the transfer user configured on NCBI’s Aspera server.
  • ftp.ncbi.nlm.nih.gov is the hostname of NCBI’s Aspera server.
  • /path/to/data is the path to the data you are downloading. You can find a reference of these paths here.
  • /local/location is the path to the folder on your own machine that you want the NCBI files to be downloaded to.
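
As an illustration of how these components fit together, the following shell sketch assembles the command from its four variable parts and prints it without running it. The function name build_ascp_cmd is hypothetical and not part of ascp; the paths in the sample call are the Linux locations used later in this article. The check on the key path only enforces the full-path rule described above.

```shell
#!/bin/sh
# Sketch: assemble the ascp download command from its parts and print it.
# build_ascp_cmd is an illustrative helper, not something shipped with Connect.

build_ascp_cmd() {
  ascp_bin="$1"   # full path to the ascp executable
  key="$2"        # full path to the private key
  remote="$3"     # path to the data on the NCBI server
  dest="$4"       # local download folder

  # The key must be given as a full path, so reject anything
  # that does not start with a slash (for example ~/... or relative paths).
  case "$key" in
    /*) ;;
    *) echo "key path must be a full path: $key" >&2; return 1 ;;
  esac

  # Paths are wrapped in double quotes so that spaces
  # (as in "Aspera Connect.app") survive copy and paste.
  printf '"%s" -T -k 1 -i "%s" anonftp@ftp.ncbi.nlm.nih.gov:%s "%s"\n' \
    "$ascp_bin" "$key" "$remote" "$dest"
}

# Print (but do not run) the command for the /epigenomics folder.
build_ascp_cmd /opt/aspera/bin/ascp \
  /home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh \
  /epigenomics /home/janedoe/NCBI_data
```

The printed line can be reviewed and then pasted into a terminal to start the transfer.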

Private key and ascp locations

The private key you will use is asperaweb_id_dsa.openssh, which comes with your Connect installation.

Below are the locations where you can generally find the private key and the ascp executable. Where applicable, replace username with the name of the user you're logged in as.

Mac

Private key

  • Local installation of Connect - /Users/username/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh
  • System-wide installation of Connect - /Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh

ascp

  • Local installation of Connect - /Users/username/Applications/Aspera\ Connect.app/Contents/Resources/ascp
  • System-wide installation of Connect - /Applications/Aspera\ Connect.app/Contents/Resources/ascp

Linux

Private key

  • /home/username/.aspera/connect/etc/asperaweb_id_dsa.openssh
  • /opt/aspera/etc/asperaweb_id_dsa.openssh

ascp

  • /opt/aspera/bin/ascp

Windows

Private key

  • "C:\Program Files (x86)\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh"
  • "C:\Users\username\AppData\Local\Programs\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh"

ascp

  • "C:\Program Files\Aspera\Aspera Connect\bin\ascp.exe"
  • "C:\Users\username\AppData\Local\Programs\Aspera\Aspera Connect\bin\ascp.exe"
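
On Mac and Linux, the candidate locations above can be probed with a short script rather than checked by hand. The sketch below is one way to do that; first_existing is a hypothetical helper, and $HOME is assumed to expand to /Users/username on Mac or /home/username on Linux.

```shell
#!/bin/sh
# Sketch: pick the first file that exists from a list of candidate paths.
# first_existing is an illustrative helper, not part of Aspera Connect.

first_existing() {
  for candidate in "$@"; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Candidate ascp locations (Mac local, Mac system-wide, Linux).
ASCP=$(first_existing \
  "$HOME/Applications/Aspera Connect.app/Contents/Resources/ascp" \
  "/Applications/Aspera Connect.app/Contents/Resources/ascp" \
  "/opt/aspera/bin/ascp") || echo "ascp not found in the usual locations" >&2

# Candidate private key locations (Mac local, Mac system-wide, Linux).
KEY=$(first_existing \
  "$HOME/Applications/Aspera Connect.app/Contents/Resources/asperaweb_id_dsa.openssh" \
  "/Applications/Aspera Connect.app/Contents/Resources/asperaweb_id_dsa.openssh" \
  "$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" \
  "/opt/aspera/etc/asperaweb_id_dsa.openssh") || echo "private key not found in the usual locations" >&2

echo "ascp: ${ASCP:-not found}"
echo "key:  ${KEY:-not found}"
```

If either value is reported as not found, Connect may be installed in a non-default location and the paths need to be located manually.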

Examples

The following examples use ascp to download real data from NCBI. Commands are shown for Mac, Linux and Windows, assuming a user account named janedoe and a download folder named NCBI_data in janedoe’s home directory. The dataset paths are those listed in NCBI's public download directory.

1. Say you need to download all the data NCBI offers on epigenomics. The epigenomics folder is 223.79 GB and contains 5 subfolders of data. To download the entire folder via ascp, you would use the following command:

On a Mac:

$ /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/ascp -T -k 1 -i /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/epigenomics /Users/janedoe/NCBI_data

On Windows:

> "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\bin\ascp.exe" -T -k 1 -i "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh" anonftp@ftp.ncbi.nlm.nih.gov:/epigenomics C:\Users\janedoe\NCBI_data

On Linux:

$ /opt/aspera/bin/ascp -T -k 1 -i /home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/epigenomics /home/janedoe/NCBI_data

2. Perhaps you are conducting a study on tree-dwelling lizards and want to examine the genome data NCBI offers for the Anolis carolinensis species. To download the genome data for this species, you would use the following command:

On a Mac:

$ /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/ascp -T -k 1 -i /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/genomes/anolis_carolinensis /Users/janedoe/NCBI_data

On Windows:

> "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\bin\ascp.exe" -T -k 1 -i "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh" anonftp@ftp.ncbi.nlm.nih.gov:/genomes/anolis_carolinensis C:\Users\janedoe\NCBI_data

On Linux:

$ /opt/aspera/bin/ascp -T -k 1 -i /home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/genomes/anolis_carolinensis /home/janedoe/NCBI_data

3. As part of a research paper you’re writing, you need to look at NCBI’s RefSeq project data on human mRNA and protein sequences. You know there is 1.69 GB of data available on NCBI, and you download it with the following command:

On a Mac:

$ /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/ascp -T -k 1 -i /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/refseq/H_sapiens/mRNA_prot /Users/janedoe/NCBI_data

On Windows:

> "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\bin\ascp.exe" -T -k 1 -i "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh" anonftp@ftp.ncbi.nlm.nih.gov:/refseq/H_sapiens/mRNA_prot C:\Users\janedoe\NCBI_data

On Linux:

$ /opt/aspera/bin/ascp -T -k 1 -i /home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/refseq/H_sapiens/mRNA_prot /home/janedoe/NCBI_data
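
Because -k 1 lets an interrupted transfer resume, large downloads such as those above can be wrapped in a simple retry loop on Mac or Linux. The following is a minimal sketch of that idea, not an Aspera feature: retry_transfer, MAX_ATTEMPTS and RETRY_DELAY are hypothetical names, and it assumes that re-running the same ascp command with -k 1 resumes the earlier partial transfer, as described in the Usage section.

```shell
#!/bin/sh
# Sketch: re-run a transfer command until it succeeds or attempts run out.
# retry_transfer, MAX_ATTEMPTS and RETRY_DELAY are illustrative names only.

MAX_ATTEMPTS=${MAX_ATTEMPTS:-5}
RETRY_DELAY=${RETRY_DELAY:-30}   # seconds to wait between attempts

retry_transfer() {
  attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    # Run the command exactly as given. With -k 1 among the arguments,
    # a repeated run is expected to resume rather than start over.
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt failed" >&2
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$MAX_ATTEMPTS" ]; then
      sleep "$RETRY_DELAY"
    fi
  done
  return 1
}

# Usage with the Linux command from example 1; uncomment to run a real transfer:
# retry_transfer /opt/aspera/bin/ascp -T -k 1 \
#   -i /home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh \
#   anonftp@ftp.ncbi.nlm.nih.gov:/epigenomics /home/janedoe/NCBI_data
```

The wrapper passes its arguments through unchanged, so any of the Mac or Linux commands in this section can be placed after retry_transfer.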