Database requirements¶
You can automatically download and create all required data by running DaisySuite_setup <dir>. This will put the NCBI database and corresponding indices into the directory <dir>.
Alternatively, the individual requirements are explained in this section.
DaisySuite requires the following data:
- NCBI RefSeq
- bwa index of the NCBI RefSeq
- yara index of the NCBI RefSeq
- MicrobeGPS taxonomy files
We provide example code snippets for solving each of these requirements.
NCBI RefSeq¶
First, download all NCBI entries that are tagged as “Complete Genome” into a directory of your choice, e.g. ncbi:
### Download NCBI database
mkdir ncbi
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
wget --directory-prefix ncbi -q ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/367/745/GCF_000367745.1_ASM36774v1/GCF_000367745.1_ASM36774v1_genomic.fna.gz
awk -F '\t' '!/^#/ && $12 == "Complete Genome" {print $20}' assembly_summary.txt | \
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/[0-9]*/[0-9]*/[0-9]*/)(GCF_.+)|\1\2/\2_genomic.fna.gz|' | \
xargs -n 1 -P 12 wget --directory-prefix ncbi -q
rm assembly_summary.txt
Note that we additionally download GCF_000367745.1 by hand. It contains the acceptor reference of one of our datasets and is wrongly labeled as not complete, hence the manual download.
Preprocessing NCBI¶
Next, we want to
- extract the downloaded references
- split references containing multiple sequences
- remove plasmid references
- rename the fasta files to <accession.version>.fasta
- get a list of all accessions for future use
- merge the fasta files (while keeping the originals)
- gzip all fasta files
First, we need to extract the references.
### Extract NCBI database
gzip -d ncbi/*_genomic.fna.gz
The pipeline requires each fasta file to contain only a single reference. The files can be split, for example, with the Python script split_fasta.py in the data/scripts/src directory of your DaisySuite installation.
### Split fasta files
for f in ncbi/*.fna;
do
data/scripts/src/split_fasta.py $f
done
#!/usr/bin/python3
import os

def split_multifasta(fname):
    count = 0
    g = None
    with open(fname, 'r') as f:
        comp = fname.split('.')
        for line in f:
            if '>' in line:
                if g is not None:
                    g.close()
                g = open('{}_split{}.{}'.format('.'.join(comp[:-1]), count, comp[-1]), 'w')
                count += 1
                g.write(line)
            else:
                g.write(line)
    os.remove(fname)

if __name__ == '__main__':
    import sys
    import os
    split_multifasta(os.path.realpath(sys.argv[1]))
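After splitting, you can optionally verify the result with a quick sanity check. This snippet is our own sketch, not part of DaisySuite, and assumes the ncbi directory used throughout this guide:

```python
# Sanity check (not part of DaisySuite): after splitting, every fasta
# file in the ncbi directory should contain exactly one '>' header.
import glob

def count_records(path):
    """Count fasta header lines in a file."""
    with open(path) as f:
        return sum(1 for line in f if line.startswith('>'))

for path in glob.glob('ncbi/*.fna'):
    assert count_records(path) == 1, '{} still holds multiple records'.format(path)
```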
The DaisyGPS part of DaisySuite currently cannot detect plasmids as potential HGT donors. The plasmids therefore need to be excluded from the database, and the remaining references should be named <accession.version>.fasta.
### Rename fasta files and exclude plasmids
for f in ncbi/*.fna;
do
if [[ $(head -n 1 $f) != *"Plasmid"* ]] && [[ $(head -n 1 $f) != *"plasmid"* ]]
then
accession=$(sed 's/>\([A-Za-z0-9\._]*\) .*/\1/' <(head -n 1 $f))
new=$(dirname $f)/$accession.fasta
mv $f $new
else
rm -f $f
fi
done
For the creation of a taxonomic file for MicrobeGPS we need a list of all references included in our database:
### Get accession list
for i in ncbi/*.fasta;
do
n=$(basename $i .fasta)
echo $n >> acc.txt
done
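As an optional consistency check (our own addition, not required by DaisySuite, assuming the acc.txt and ncbi paths from above), the list should end up with one accession per fasta file:

```python
# Optional check (not part of DaisySuite): the accession list should
# contain exactly one non-empty line per reference fasta file.
import glob

def count_lines(list_path):
    """Count non-empty lines in the accession list."""
    with open(list_path) as f:
        return sum(1 for line in f if line.strip())

# After the loop above has run (paths as used throughout this guide):
# assert count_lines('acc.txt') == len(glob.glob('ncbi/*.fasta'))
```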
Most mapper indexers accept only a single fasta file containing all references, so we merge our sequences. We also compress the resulting fasta to save space.
### Merge fasta files
cat ncbi/*.fasta > ncbi/reference.fasta
### Gzip fasta files
gzip ncbi/*.fasta
Taxonomy¶
MicrobeGPS is used in DaisyGPS and leverages taxonomic information for the profiling of the sample. Therefore, several taxonomic files are needed:
- NCBI’s nodes.dmp
- NCBI’s names.dmp
- a catalog (bact.catalog) containing the accession.version, taxid and name of each reference present in the NCBI database
The nodes.dmp, names.dmp and bact.catalog files must be saved in the data/microbeGPS/data folder of your DaisySuite installation.
### Get names.dmp and nodes.dmp
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xfz taxdump.tar.gz names.dmp nodes.dmp
mv names.dmp data/microbeGPS/data/names.dmp
mv nodes.dmp data/microbeGPS/data/nodes.dmp
rm taxdump.tar.gz
To create the catalog file, you can use the script createMinimalMGPSCatalog.py in the data/scripts/src directory of your DaisySuite installation.
It requires the NCBI nucl_gb.accession2taxid file, names.dmp and the list of all accessions in the NCBI database generated in the preprocessing step.
### Get catalog
wget -T 1800 -qO- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz | gzip -dc > nucl_gb.accession2taxid
data/scripts/src/createMinimalMGPSCatalog.py nucl_gb.accession2taxid data/microbeGPS/data/names.dmp data/microbeGPS/data/bact.catalog -ref acc.txt
rm nucl_gb.accession2taxid
The accession list is now no longer needed and can be deleted.
### Remove accession list
rm acc.txt
Indices¶
As a last step, we need to generate the yara or bwa index (one of the two is sufficient to run DaisySuite). During development we found yara to be more sensitive regarding the mapping, but it needs up to 1 TB of temporary disk space to build its index.
### Create bwa index
mkdir -p bwa
bwa index -p bwa/bwa_index ncbi/reference.fasta.gz
### Create yara index
mkdir -p yara
yara_indexer ncbi/reference.fasta.gz -o yara/yara_index
After this, we can delete the merged reference file.
### Remove merged fasta file
rm ncbi/reference.fasta.gz