Using DaisySuite

Preparation

If you want to simulate reads and then analyze them, you only need to specify the outputdir and your database directories in the config.
If you already have (gzipped) fastq reads, these files need to be present within the `reads` folder in the outputdir.
Furthermore, the reads need to follow a specific naming scheme:
- `{sample}.1.fq` for single end reads
- `{sample}.1.fq` and `{sample}.2.fq` for paired end reads

Here, `{sample}` is an arbitrary name for the read data set.
The reads can be either in plain fastq format or gzipped (`fq.gz`).
Please note that Laser cannot be used with single end reads.

The read names within the fastq file cannot contain a slash (‘/’) symbol.
For example, `simulated.1/1` will not work. Such suffixes can be removed, e.g., with `sed -i 's/\(@simulated\.[0-9]*\)\/[0-9]*/\1/' <fq-file>`.
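If you prefer to stay in Python, a small sketch that does the same cleanup (assuming the suffix always has the form `/1` or `/2` on the header line) could look like this:

```python
import gzip
import re
import sys

# Strip a trailing "/1" or "/2" from fastq read names, e.g.
# "@simulated.1/1" becomes "@simulated.1" (equivalent to the sed call above).
# Usage: python strip_slash.py reads.1.fq.gz > reads.clean.1.fq
def clean(path):
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt') as fq:
        for i, line in enumerate(fq):
            if i % 4 == 0:  # every 4th line is the @name header
                line = re.sub(r'^(@\S+)/[12]\b', r'\1', line)
            sys.stdout.write(line)

if __name__ == '__main__':
    clean(sys.argv[1])
```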

Running with known acceptors and donors

If you have a read dataset and know possible acceptor and donor organisms, you can skip DaisyGPS and use Daisy directly to detect HGT events.
To do this, you need to set `sim` and `daisygps` in the config to `false` and `daisy` to `true`.
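For illustration, the switches could also be flipped programmatically; this is a minimal sketch assuming the three keys sit at the top level of the YAML file (check your generated template for the actual layout):

```python
import yaml  # PyYAML

# Flip the pipeline stage switches in an existing DaisySuite config.
# Assumes the keys sit at the top level of the YAML file; check your
# generated template for the actual layout.
with open('config/example.yaml') as fh:
    cfg = yaml.safe_load(fh)

cfg['sim'] = False       # no read simulation
cfg['daisygps'] = False  # skip candidate detection
cfg['daisy'] = True      # run HGT detection only

with open('config/example.yaml', 'w') as fh:
    yaml.safe_dump(cfg, fh, default_flow_style=False)
```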
Now create the following directory structure:

.
├──  outputdir
│    ├──  reads
│    │    ├──  {sample}.1.fq.gz                   read file 1
│    │    └──  {sample}.2.fq.gz                   read file 2
│    └──  candidates
│         ├──  {sample}_acceptors.fa              Sequences for all acceptor candidates
│         ├──  {sample}_acceptors.txt             Accession.Version for all acceptor candidates
│         ├──  {sample}_donors.fa                 Sequences for all donor candidates
│         └──  {sample}_donors.txt                Accession.Version for all donor candidates

The `reads` folder contains your read dataset (paired or single end), with file names prefixed by the sample name.
`candidates/{sample}_acceptors.fa` is the concatenation of the sequences of all acceptor candidates (a (multi-)FASTA file).
`candidates/{sample}_acceptors.txt` lists all accessions that appear in `candidates/{sample}_acceptors.fa` (one accession per line).
Analogously, you have to provide `{sample}_donors.fa` and `{sample}_donors.txt`.
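If your candidate sequences live in separate FASTA files, a small helper along these lines can produce both files; it assumes the Accession.Version is the first token of each FASTA header, which is the usual RefSeq convention:

```python
import sys

# Concatenate candidate FASTA files into <prefix>.fa and collect one
# Accession.Version per line into <prefix>.txt.
# Assumes header lines look like ">NC_000913.3 optional description".
def build_candidates(fasta_paths, out_prefix):
    accessions = []
    with open(out_prefix + '.fa', 'w') as fa:
        for path in fasta_paths:
            with open(path) as fh:
                for line in fh:
                    if line.startswith('>'):
                        accessions.append(line[1:].split()[0])
                    fa.write(line)
    with open(out_prefix + '.txt', 'w') as txt:
        txt.write('\n'.join(accessions) + '\n')

if __name__ == '__main__':
    # e.g. python build_candidates.py outputdir/candidates/sample1_acceptors ref1.fa ref2.fa
    build_candidates(sys.argv[2:], sys.argv[1])
```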

Example directory

.
├──  outputdir
│    └──  reads
│         ├──  {sample1}.1.fq.gz
│         ├──  {sample1}.2.fq.gz
│         ├──  {sample2}.1.fq.gz
│         ├──  {sample2}.2.fq.gz
│         └──  {sample3}.1.fq.gz

All samples inside the outputdir will be processed.

Config file

All necessary parameters are set in a config file. A template can be generated by running `DaisySuite_template`.
For further information, visit the Configuration section.

Running a job

This pipeline uses Snakemake and therefore supports all parameters implemented by Snakemake.

DaisySuite --configfile config/example.yaml

See `DaisySuite -h` or the Command line help section for additional options.
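Because Snakemake options are passed through, you can, for example, combine the config with Snakemake's `--cores` and `-n` (dry run) flags (an illustrative invocation):

DaisySuite --configfile config/example.yaml --cores 8 -n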

Output

The results of DaisyGPS are in the `candidates` directory.
The results of Daisy are in the `hgt_eval` directory.

.
├──  outputdir
│    ├──  log-{timestamp}.txt                     logfile containing information about the run
│    ├──  reads
│    │    ├──  {sample1}.1.fq.gz                  read file 1
│    │    └──  {sample1}.2.fq.gz                  read file 2
│    ├──  candidates
│    │    ├──  {sample1}_acceptors.fa             Sequences for all acceptor candidates
│    │    ├──  {sample1}_acceptors.txt            Accession.Version for all acceptor candidates
│    │    ├──  {sample1}_candidates.tsv           All reported candidates including several metrics
│    │    ├──  {sample1}_complete.tsv             All mapped references including several metrics
│    │    ├──  {sample1}_donors.fa                Sequences for all donor candidates
│    │    └──  {sample1}_donors.txt               Accession.Version for all donor candidates
│    └──  hgt_eval
│         ├──  {sample1}.fasta                    Sequences of filtered horizontally transferred regions
│         ├──  {sample1}.tsv                      All hgt regions
│         └──  {sample1}.vcf                      All filtered hgt regions

Additionally, there are many intermediate directories, which can be kept with `--nt`.

| Directory         | Description                                                           |
|-------------------|-----------------------------------------------------------------------|
| bwa/yara          | [Daisy] contains index for acceptor/donor candidates                  |
| candidate_mapping | [Daisy] contains mapping of candidates against NCBI                   |
| fasta_npa         | [Daisy] contains not properly aligned reads in fasta format           |
| fastq_npa         | [Daisy] contains not properly aligned reads in fastq format           |
| gustaf            | [Daisy] contains results of Gustaf                                    |
| hgt_candidates    | [Daisy] contains hgt candidates generated in the hgt evaluation step  |
| hgt_eval_pair     | [Daisy] single result files; merged files are found in hgt_eval       |
| joined            | [Daisy] contains joined sequences of acceptor and donor candidates    |
| mapping           | [DaisyGPS] contains mapping of reads against NCBI                     |
| mgps              | [DaisyGPS] contains results of MicrobeGPS                             |
| npa_joined        | [Daisy] contains merged fasta_npa files                               |
| process           | [DaisyGPS] contains intermediate results of candidate detection      |
| sort_mapping      | [Daisy] contains sorted alignment file from candidate_mapping         |
| sort_npa          | [Daisy] contains sorted fasta file from npa_joined                    |
| stellar           | [Daisy] contains results of Stellar                                   |

Explanation

The objective of the DaisyGPS workflow is to identify possible acceptor (organism that receives genetic information) and donor (organism from which the information is transferred) candidates given reads of a potential HGT organism. The HGT organism's genome consists mainly of the acceptor genome. When the reads of the HGT organism are mapped against the acceptor reference, most reads should map properly. Therefore, a high and continuous mapping coverage pattern of the acceptor genome can be expected.

In contrast to that, only a small part of the donor genome is present within the HGT organism's genome, hence only a small fraction of the reads should map against the donor reference, and then only within a confined region (i.e. the part that has been transferred). This results in a discontinuous mapping coverage pattern where only a small part of the reference shows a high mapping coverage.

Given only the reads of the HGT organism, the acceptor and donor candidate identification problem is similar to metagenomic profiling. We hence use the metagenomic profiling tool MicrobeGPS to obtain a profile of the given HGT organism together with mapping coverage metrics. MicrobeGPS fits our requirements as it uses metrics that can be adapted to represent acceptor and donor attributes. The candidates are ranked by a property score derived from these metrics (defined below), and a list with putative acceptor and donor candidates is generated. These acceptor and donor candidates can then be further analysed with tools like Daisy.

MicrobeGPS

MicrobeGPS is a metagenomic profiling tool. Given a set of references and sequencing reads from a sample, it uses the references to describe the potential (new) organisms in the sample. That is, it models the composition of the organisms present in the sample with available references instead of directly assigning reference organisms to characterize the sample. To do this, MicrobeGPS introduces metrics that compare the genomes in the sample with the reference genomes: it computes a genome coverage and a metric that captures whether a mapping pattern is continuous or not. This makes it possible to use MicrobeGPS to identify not only the acceptor but also the donor by leveraging those metrics. We generate the alignment file required for MicrobeGPS by mapping the HGT organism's reads against the NCBI RefSeq using bwa and yara.
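For illustration, the bwa mapping step corresponds to an invocation along these lines (file names are placeholders; the pipeline performs this step internally):

bwa mem refseq.fa {sample}.1.fq.gz {sample}.2.fq.gz > mapping/{sample}.sam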

To describe the similarity between an organism found in the sample and the corresponding reference genome based on read evidence, MicrobeGPS uses a genomic distance measurement, namely the Genome Dataset Validity (GDV, validity for short). It describes the fraction of the reference genome for which there is read evidence. A taxonomic position can then be estimated by using the GDV as a measure of the distance between an organism found in the sample and the closest reference genome. To account for possible biases arising from the use of short NGS reads, the read mapping is modelled as a mixture distribution:

\(f(x|\overline{\alpha},\lambda) = \alpha_{1}\cdot z(x)+\alpha_{2}\cdot P(x|\lambda)+\alpha_{3}\cdot T_{\alpha}(x)\)

where \(z(x)\) (zero distribution) and \(P(x|\lambda)\) (Poisson distribution) form a zero-inflated Poisson distribution that accounts for zero coverage regions. The tail distribution \(T_{\alpha}(x)\) accounts for differences between the reads and the reference genome. \(\alpha_{1}\), \(\alpha_{2}\) and \(\alpha_{3}\) are the mixture coefficients of the respective distributions. Since \(z(x)\) represents the zero coverage regions, the validity is simply \(1-\alpha_{1}\).
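To make the idea concrete (this is not MicrobeGPS's actual EM fit of the mixture, just a naive stand-in that equates \(\alpha_{1}\) with the zero-coverage fraction):

```python
import numpy as np

def validity_estimate(coverage):
    """Naive validity estimate from a per-base coverage vector.

    MicrobeGPS fits the full zero-inflated Poisson + tail mixture; here we
    simply take the zero-coverage fraction as a stand-in for alpha_1.
    """
    coverage = np.asarray(coverage)
    alpha_1 = np.mean(coverage == 0)   # weight of the zero distribution
    return 1.0 - alpha_1               # validity = 1 - alpha_1

# e.g. an acceptor-like pattern: almost every position covered
print(validity_estimate([3, 4, 2, 5, 0, 3, 4]))  # ~0.857
```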

Secondly, another key point of interest is how evenly the reads are distributed. In MicrobeGPS, a Kolmogorov-Smirnov (KS) test is used to compare the actual read distribution with a theoretical read distribution generated by placing the reads uniformly across the genome, i.e. after having seen \(x\%\) of the genome, \(x\%\) of the reads should also have been seen. The heterogeneity (note: called homogeneity in the original work) is then the KS statistic:

\(D_{n}=\sup_{x}\mid F_{n}(x) - F(x)\mid\)

\(D_{n}\) is the KS statistic, \(F_{n}(x)\) the empirical distribution function (what we have, i.e. the observed read positions), and \(F(x)\) the given cumulative distribution function (what we expect under uniform read placement), where

\(F_{n}(x) = \frac{1}{n}\sum_{i=1}^{n} I_{[-\infty,x]}(X_{i}), \qquad I_{[-\infty,x]}(X_{i})=\begin{cases}1 & X_{i} \le x\\0 & \text{else}\end{cases}\)
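A sketch of the heterogeneity computation using SciPy's KS test against a uniform placement (read positions and genome length are made-up inputs):

```python
import numpy as np
from scipy import stats

def heterogeneity(read_positions, genome_length):
    """KS statistic between observed read start positions and a uniform
    placement across the genome (a sketch of the heterogeneity idea)."""
    scaled = np.sort(read_positions) / genome_length   # map to [0, 1]
    result = stats.kstest(scaled, 'uniform')           # compare against U(0, 1)
    return result.statistic                            # D_n

# Reads piled into one small region -> heterogeneity close to 1 (donor-like)
positions = np.random.default_rng(0).integers(10_000, 11_000, size=500)
print(heterogeneity(positions, genome_length=1_000_000))
```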

Validity and Heterogeneity describe how much of the reference is covered and how evenly the reads are distributed, respectively. Therefore, they can be used to classify acceptor and donor candidates for an HGT event.

Acceptor and donor classification

For an acceptor, the mapping coverage is relatively high and homogeneous. Hence, we expect a high validity score and a low heterogeneity score. In contrast, we expect a low validity and a high heterogeneity for donors. These assumptions can be used to define a property score that reflects how well the attributes of a given organism match the properties of an acceptor or donor:

\(property = validity - heterogeneity\)
with \(property \in (-1,1]\)

Therefore, the value for a completely covered acceptor with uniform read distribution would approach \(+1\). Likewise, the value for a donor that is only covered in a small region would approach \(-1\). We tested weighted scores that include additional information such as the number of unique reads or k-mer counts (taken, e.g., directly from Kraken), but they did not improve the classification of suitable candidates. For both criteria, a high count did not correlate with the candidate being an acceptor or donor (data not shown). For acceptors, however, there is strong evidence from sheer read numbers, so the property is weighted by the mapped read fraction:

\(property\ score = \frac{\#mapped\_reads}{\#total\_reads}\cdot property\)

Here, \(\#mapped\_reads/\#total\_reads\) is the fraction of all mapped reads that map to the specific acceptor candidate.

For the donor, however, the size of the transferred region is not known in advance; hence, we do not expect specific read number evidence and therefore omit the weighting, which yields \(property\) as the metric. We select the acceptors with the highest property scores (approaching \(1\)) and the donors with the lowest property (approaching \(-1\)) as candidates.

In case acceptor and donor are very similar, the donor might not express the attributes we are looking for. In particular, the donor might have a significant read number evidence arising from acceptor reads also mapping to the donor.

As a result, the property score will likely indicate an acceptor instead of a donor (a value slightly above \(0\)), while still being relatively low compared to the property scores of other acceptors.

To account for this, we also classify candidates with a \(property\ score > 0\) as possible acceptor-like donors. The entries with the lowest property among them are selected as acceptor-like donors.
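Putting the classification together, here is a condensed sketch with made-up metric values (the real pipeline takes validity, heterogeneity and read fractions from the MicrobeGPS results):

```python
# Sketch of the acceptor/donor classification described above.
# Candidate tuples are hypothetical stand-ins for the MicrobeGPS metrics.
candidates = [
    # name, validity, heterogeneity, fraction of mapped reads on this reference
    ("acceptor_ref", 0.98, 0.05, 0.90),
    ("donor_ref",    0.10, 0.95, 0.02),
    ("similar_ref",  0.60, 0.40, 0.08),  # acceptor-like donor suspect
]

scored = []
for name, validity, heterogeneity, read_frac in candidates:
    prop = validity - heterogeneity          # property in (-1, 1]
    score = read_frac * prop                 # read-number-weighted (acceptors)
    scored.append((name, prop, score))

# Acceptors: highest property score; donors: lowest (most negative) property.
acceptors = sorted(scored, key=lambda c: c[2], reverse=True)
donors = sorted(scored, key=lambda c: c[1])

# Acceptor-like donors: positive property score, but the lowest property
# among those candidates.
acceptor_like = sorted((c for c in scored if c[2] > 0), key=lambda c: c[1])

print("best acceptor:", acceptors[0][0])        # acceptor_ref
print("best donor:", donors[0][0])              # donor_ref
print("acceptor-like donor:", acceptor_like[0][0])  # similar_ref
```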

For more information and publications, see the Citations section.