Database requirements

You can automatically download and create all required data by running DaisySuite_setup <dir>. This puts the NCBI database and the corresponding indices into the directory <dir>.
Alternatively, this section explains how to satisfy each requirement individually.

DaisySuite requires the following data:

  • NCBI RefSeq
  • bwa index of the NCBI RefSeq
  • yara index of the NCBI RefSeq
  • MicrobeGPS taxonomy files

The sections below provide example code snippets for each of these requirements.

NCBI RefSeq

First, download all NCBI entries whose assembly level is “Complete Genome” into a directory of your choice, e.g.

### Download NCBI database
mkdir ncbi
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
wget --directory-prefix ncbi -q ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/367/745/GCF_000367745.1_ASM36774v1/GCF_000367745.1_ASM36774v1_genomic.fna.gz
awk -F '\t' '!/^#/ && $12 == "Complete Genome" {print $20}' assembly_summary.txt | \
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/[0-9]*/[0-9]*/[0-9]*/)(GCF_.+)|\1\2/\2_genomic.fna.gz|' | \
xargs -n 1 -P 12 wget --directory-prefix ncbi -q
rm assembly_summary.txt

Note that we additionally download a single assembly by hand (the second wget). It contains the acceptor reference of one of our datasets and is wrongly labeled as not complete, hence the manual download.
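The filtering step above can be sketched in Python. This is only a rough equivalent of the awk/sed pipeline, not part of DaisySuite; the column positions follow the assembly_summary.txt format, where field 12 is the assembly level and field 20 the FTP path:

```python
def genome_urls(summary_lines, level_col=11, ftp_col=19):
    """Yield download URLs for complete genomes.

    The 0-based column indices correspond to $12 and $20 in the awk call
    and are taken from the assembly_summary.txt column layout.
    """
    for line in summary_lines:
        if line.startswith('#'):
            continue  # skip comment/header lines
        fields = line.rstrip('\n').split('\t')
        if fields[level_col] == 'Complete Genome':
            ftp_path = fields[ftp_col]
            # last path component, e.g. GCF_000367745.1_ASM36774v1
            basename = ftp_path.rsplit('/', 1)[1]
            yield '{}/{}_genomic.fna.gz'.format(ftp_path, basename)
```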

Preprocessing NCBI

Next, we want to

  • extract the downloaded references
  • split references containing multiple sequences
  • remove plasmid references
  • rename the fasta files to <accession.version>.fasta
  • get a list of all accessions for future use
  • merge the fasta files (while keeping the originals)
  • gzip all fasta files

First, we need to extract the references.

### Extract NCBI database
gzip -d ncbi/*_genomic.fna.gz

The pipeline requires each fasta file to contain only a single reference. The splitting can be done, for example, with the Python script split_fasta.py in the data/scripts/src directory of your DaisySuite installation.

### Split fasta files
for f in ncbi/*.fna;
do
    data/scripts/src/split_fasta.py "$f"
done
The script itself looks as follows:

#!/usr/bin/python3
import os
import sys

def split_multifasta(fname):
    """Split a multi-fasta file into one file per sequence and remove the original."""
    count = 0
    g = None
    with open(fname, 'r') as f:
        comp = fname.split('.')
        for line in f:
            if line.startswith('>'):
                if g is not None:
                    g.close()
                # e.g. foo.fna -> foo_split0.fna, foo_split1.fna, ...
                g = open('{}_split{}.{}'.format('.'.join(comp[:-1]), count, comp[-1]), 'w')
                count += 1
            g.write(line)
    if g is not None:
        g.close()
    os.remove(fname)

if __name__ == '__main__':
    split_multifasta(os.path.realpath(sys.argv[1]))

The DaisyGPS part of DaisySuite currently cannot detect plasmids as potential HGT donors. The plasmids therefore need to be excluded from the database, and the remaining references should be named <accession.version>.fasta.

### Rename fasta files and exclude plasmids
for f in ncbi/*.fna;
do
    if [[ $(head -n 1 "$f") != *"Plasmid"* ]] && [[ $(head -n 1 "$f") != *"plasmid"* ]]
    then
        accession=$(sed 's/>\([A-Za-z0-9\._]*\) .*/\1/' <(head -n 1 "$f"))
        new=$(dirname "$f")/$accession.fasta
        mv "$f" "$new"
    else
        rm -f "$f"
    fi
done
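The header handling in the loop above corresponds to roughly the following Python. The helper names accession_from_header and is_plasmid are hypothetical, shown only to make the parsing logic explicit; is_plasmid uses a case-insensitive match, which covers both spellings the shell test checks:

```python
def accession_from_header(header):
    """'>NC_000913.3 Escherichia coli ...' -> 'NC_000913.3'."""
    return header.lstrip('>').split()[0]

def is_plasmid(header):
    """True if the fasta description mentions a plasmid (case-insensitive)."""
    return 'plasmid' in header.lower()
```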

For the creation of a taxonomic file for MicrobeGPS we need a list of all references included in our database:

### Get accession list
for i in ncbi/*.fasta*;
do
    n=$(sed 's/\(.*\)\.fasta.*/\1/' <(basename "$i"))
    echo "$n" >> acc.txt
done

Most mapper indexers accept only one fasta file containing all references, so we merge our sequences. We also compress the resulting fasta to save space.

### Merge fasta files
cat ncbi/*.fasta > ncbi/reference.fasta

### Gzip fasta files
gzip ncbi/*.fasta

Taxonomy

MicrobeGPS is used in DaisyGPS and leverages taxonomic information for the profiling of the sample. Therefore, several taxonomic files are needed:

  • NCBI’s nodes.dmp
  • NCBI’s names.dmp
  • a catalog (bact.catalog) containing the accession.version, taxid and name of each reference present in the NCBI database

The nodes.dmp, names.dmp and bact.catalog must be saved in the data/microbeGPS/data folder of your DaisySuite installation.

### Get names.dmp and nodes.dmp
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xfz taxdump.tar.gz names.dmp nodes.dmp
mv names.dmp data/microbeGPS/data/names.dmp
mv nodes.dmp data/microbeGPS/data/nodes.dmp
rm taxdump.tar.gz

To create the catalog file, you can use the createMinimalMGPSCatalog.py in the data/scripts/src directory of your DaisySuite installation.
It requires the NCBI nucl_gb.accession2taxid file, names.dmp and the list of all accessions in the NCBI database generated in the preprocessing step.

### Get catalog
wget -T 1800 -qO- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz | gzip -dc > nucl_gb.accession2taxid
data/scripts/src/createMinimalMGPSCatalog.py nucl_gb.accession2taxid data/microbeGPS/data/names.dmp data/microbeGPS/data/bact.catalog -ref acc.txt
rm nucl_gb.accession2taxid
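As a rough sketch of what such a catalog build does (the exact columns and behavior of createMinimalMGPSCatalog.py may differ; the three-column output below is an assumption based on the catalog description above):

```python
def load_scientific_names(names_lines):
    """Map taxid -> scientific name from NCBI names.dmp lines.

    names.dmp fields are separated by '\t|\t'; we keep only entries
    whose name class is 'scientific name'.
    """
    names = {}
    for line in names_lines:
        fields = [f.strip() for f in line.split('|')]
        if len(fields) >= 4 and fields[3] == 'scientific name':
            names[fields[0]] = fields[1]
    return names

def build_catalog(acc2taxid_lines, names, wanted):
    """Yield 'accession.version<TAB>taxid<TAB>name' for accessions in 'wanted'.

    acc2taxid_lines follows the nucl_gb.accession2taxid layout:
    accession, accession.version, taxid, gi (tab-separated, header first).
    """
    it = iter(acc2taxid_lines)
    next(it, None)  # skip the header line
    for line in it:
        acc, acc_ver, taxid = line.rstrip('\n').split('\t')[:3]
        if acc_ver in wanted:
            yield '{}\t{}\t{}'.format(acc_ver, taxid, names.get(taxid, 'unknown'))
```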

The accession list is now no longer needed and can be deleted.

### Remove accession list
rm acc.txt

Indices

In a last step, we need to generate the yara or the bwa index (one of the two is sufficient to run DaisySuite). During development we found yara to be more sensitive in the mapping, but it needs up to 1 TB of temporary disk space to build its index.

### Create bwa index
bwa index -p bwa/bwa_index ncbi/reference.fasta.gz

### Create yara index
yara_indexer ncbi/reference.fasta.gz -o yara/yara_index

After this, we can delete the merged reference file.

### Remove merged fasta file
rm ncbi/reference.fasta.gz