Marvelous Misadventures in Bioinformatics

A blog on some snippets of my work in bioinformatics. Hopefully you find something useful here and avoid the stupid mistakes I made.

Getting genomes from NCBI

This is a short tutorial on getting genomes from NCBI using the datasets CLI tool. I will be using E. coli O157:H7 (Tax id: 83334) as an example.

Prerequisite

Installation

conda install -c conda-forge ncbi-datasets-cli
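Before moving on, it can help to confirm that the tool actually landed on your PATH (forgetting to activate the conda environment is an easy mistake). Below is a minimal sketch of such a check; `require_tool` is a hypothetical helper name I made up for this example, not something shipped with the datasets CLI.

```shell
# Sketch: check that a command is available before using it.
# require_tool is a hypothetical helper, not part of ncbi-datasets-cli.
require_tool() {
    tool="$1"
    # command -v prints the location of the command and returns 0 if it exists
    if command -v "$tool" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$tool' not found on PATH; is the conda environment active?" >&2
    return 1
}
```

After defining the function, `require_tool datasets` should return quietly if the install worked, or print an error if it did not.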

Usage

  1. Call the datasets command and specify the tax id as 83334 (E. coli O157:H7)

     datasets download genome taxon 83334 --dehydrated
    

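    The --dehydrated flag downloads a “dehydrated” package: a small zip file with metadata and an index of the data files, but not the sequence data itself. This is generally good practice, especially for larger datasets, because the data files are then fetched separately in the rehydrate step rather than in one large download that is prone to corruption.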

  2. Unzip the downloaded dehydrated file

     unzip ncbi_dataset.zip
    
  3. Rehydrate the unzipped contents. Specify the path to the directory where the file was unzipped.

     datasets rehydrate --directory .
    

    In this example, you should see that 1603 files will be downloaded.

    Note: if for any reason the connection is severed or the job is prematurely terminated, invoke the same command to resume the job. It will automatically pick up where the job left off.

    Note: This also extends to missing files. If for any reason files are removed or moved away from the original download directory, invoking the command again will download the missing files. Neat.

    Your genomes will be in subdirectories of ./ncbi_dataset/data/.
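Since an interrupted rehydrate can simply be re-run, the resume step can be automated on a flaky connection. Here is a rough sketch of a retry loop. `rehydrate_with_retry` and its arguments are hypothetical names for this example, and it assumes that `datasets rehydrate` exits with a non-zero status when the job fails, which I have not verified for every failure mode.

```shell
# Sketch: re-run `datasets rehydrate` until it succeeds or attempts run out.
# rehydrate_with_retry, max_attempts and wait_seconds are hypothetical names
# introduced for this example; they are not part of the datasets CLI.
# Assumption: `datasets rehydrate` returns non-zero when the job fails.
rehydrate_with_retry() {
    dir="${1:-.}"            # directory containing the unzipped package
    max_attempts="${2:-5}"   # give up after this many tries
    wait_seconds="${3:-30}"  # pause between tries
    attempt=1

    while [ "$attempt" -le "$max_attempts" ]; do
        echo "rehydrate attempt ${attempt}/${max_attempts}" >&2
        if datasets rehydrate --directory "$dir"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Only sleep when another attempt will follow
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$wait_seconds"
        fi
    done

    echo "rehydrate still failing after ${max_attempts} attempts" >&2
    return 1
}
```

For the example in this post, that would be called as `rehydrate_with_retry . 5 30` from the directory where ncbi_dataset.zip was unzipped. The attempt cap is there so a genuinely broken job does not loop forever.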
