Reference Databases For Taxonomic Assignment In Metagenomics Of The Human
Results for Simulation Study 1
Using the same abundance setup as in , 150,000 reads are generated for each of the three complexity datasets, simLC, simMC, and simHC, with average length of 100 bp. For the simSC dataset, 100 genomes with the same abundance are randomly selected and 150,000 reads are generated. The characteristics of the datasets are listed in Table S1. For this simulation study, we compare TAMER with MEGAN. The proportions of reads correctly (TP) and incorrectly (FP) assigned at different taxonomy ranks are reported in Table 1. Here TP = number of correctly assigned reads / total number of reads×100, and FP = number of incorrectly assigned reads/ total number of reads×100. For instance, for the simLC data, 146,880 reads are assigned to the corresponding species correctly, and 30 reads are assigned incorrectly, then TP = 146,880/150,000×100 = 97.92 and FP = 30/150,000×100 = 0.02. Note that the sum of TP and FP is not 100 as some reads do not have hits in the reference database.
Figure 1. Reads assignment at rank Species and Genus for simLC dataset.
Numbers of reads assigned to rank (A) Species and (B) Genus using TAMER and MEGAN are compared with the true values (TRUTH) for the simLC dataset with 150,000 reads and average read length of 100 bp in simulation study 1.
The simLC dataset consists of 25,926 reads generated from E. coli str. K-12 substr. MG1655 and 124,074 reads generated from Methanoculleus marisnigri JR1. Totally there are about 160 million base pairs and the simulated error rate is 0.027. The estimated probability of observing a mismatched base pair is 0.025 by TAMER. Using MegaBLAST, hits are found for 97.94% of the 150,000 reads in 4,407 unique taxa. At rank Species, TAMER accurately assigns 25,221 reads to species Escherichia coli which is close to the true value of 25,926 reads, while MEGAN only assigns 5,583 reads to this taxon (Figure 1 (a)). At rank Genus, MEGAN assigns 5,974 reads to Escherichia which is only about 23% of the true value and the number of reads assigned by TAMER to that genus (Figure 1 (b)). Considering the low proportion of incorrect assignment (Table 1), TAMER accurately identifies and quantifies the different genomes at low taxonomic ranks.
Figure 2. Reads assignment at rank Species and Genus for simMC dataset.
Numbers of reads assigned to rank (A) Species and (B) Genus using TAMER and MEGAN are compared with the true values (TRUTH) for the simMC dataset with 150,000 reads and average read length of 100 bp in simulation study 1.
The simMC dataset consists of nine microbial organisms from phylum Proteobacteria with diverse relative abundance. Hits are found for 97.00% of the 150,000 reads in 9,925 unique taxa. TAMER is able to dramatically reduce the huge number of taxa, and accurately identifies the nine organisms and assigns the reads to the corresponding originated organisms. TAMER assigns 96.14% of the reads correctly at rank Species, while MEGAN only assigns 67.57% of reads (Table 1). At rank Genus, the proportion of assigned reads by MEGAN is increased to 72.95%, however it is 23% less than that by TAMER. The percentage of incorrectly assigned reads is about 0.8% for both TAMER and MEGAN at both ranks of Species and Genus. It is evident that the number of reads assigned to different taxa by TAMER is very close to the true value, while MEGAN assigns 6,643 less reads to Francisella tularensis and 39,184 less reads to Shigella dysenteriae than TAMER does (Figure 2 (a)). At rank Genus, TAMER assigns 39,191 reads to Shigella which is close to the true value and is about eight times as many as MEGAN does (Figure 2 (b)).
The simHC dataset consists of 11 microbial organisms. Using MegaBLAST, hits are found for 97.11% of 150,000 reads in 2,511 unique taxa. TAMER identifies all 11 genomes and assigns the reads accurately to the original organisms. For these 11 distantly related organisms, MEGAN also does a satisfactory work by assigning about 92% of reads at rank Species which is 5% less than TAMER does (Table 1). Population distributions of reads at rank Species (Figure S1) and Genus (Figure S2) show that the assignments of reads by both methods are similarly accurate.
Figure 3. Reads assignment at rank Genus for CARMA3 dataset.
Numbers of reads assigned to rank Genus using TAMER, MEGAN, and CARMA3 are compared with the true values (TRUTH) for the CARMA3 evaluation dataset in simulation study 2.
The simSC dataset is generated from 100 microbial organisms. About 96.90% of 150,000 reads have matches in 14,205 unique taxa. TAMER identifies 149 genomes with 103 of them having at least 5 assigned reads. Summarizing the results at different taxonomic ranks, TAMER assigns about 8% more reads than MEGAN at rank Species, and TAMER and MEGAN are comparable at higher taxonomic ranks (Table 1).
Figure 4. Heatmaps for oral samples.
Heatmaps for the abundant (A) classes and (B) genera represent the estimated proportion of reads assigned to each of the eight samples based on TAMER.
For simMC and simHC, we also perform a simulation study using 10,000 reads with average read length of 400 bp. With longer read length, the proportion of correctly assigned reads at low taxonomic ranks is improved for both methods. This further confirms the very well-known fact that longer reads are more sensitive in estimating the relative abundance of the multiple species. For simMC data, TAMER and MEGAN assign about 99.9% and 71.4% of reads correctly at rank Species, respectively, while the proportion of incorrectly assigned reads only increases about 0.1% for TAMER (Table S2). At rank Genus, TAMER assigns about 23% more reads correctly than MEGAN (99.91% for TAMER and 76.53% for MEGAN) while the false positive rate only increases about 0.08%. For simHC simulation study, the results of TAMER and MEGAN are highly comparable.
Results for Simulation Study 2
For the CARMA3 evaluation dataset, the results based on TAMER and MEGAN are listed in Table 2, where we also list the results of CARMA3 which are reported in the original paper . At rank Species, the percentage of correctly assigned reads is 99.24% for TAMER, 81.45% for MEGAN, and 4.57% for CARMA3 (Table 2). At rank Genus, the proportion of correctly assigned reads by TAMER (99.26%) is 7% and 35% more than MEGAN (91.52%) and CARMA3 (64.10%), respectively.
Consistent with the conclusions from simulation study 1, the numbers of reads assigned by TAMER are very close to the true values, the true positive rate is high, and the false positive rate is very low. TAMER gives more accurate assignments than MEGAN and CARMA3 at rank Genus (Figure 3). For example, it assigns about 14 times as many as reads to Shigella than MEGAN and CARMA3.
Results for Real Data Analysis
Identifying and quantifying bacterial species in the normal and diseased samples will help understand the development of dental caries. About 46% of the 2 million reads have hits and could be assigned to taxonomic ranks by TAMER. The number of identified species varies from about 700 to 1,400 across the eight samples. Totally 2,500 unique species are detected from this study, about 1,300 of them have at least 5 assigned reads, and about 400 species are shared by all samples.
Estimated proportions of reads for the dominant classes based on TAMER are shown in Figure 4 (a). Generally, normal sample contains more Bacilli and Gammaproteobacteria but less Bacteroidia than the diseased sample, which agrees with taxonomic assignment using MEGAN approach  (Figure S3). We also observe a large variation among the individual samples although the eight samples were selected with homogeneous clinical features. For instance, Actinobacteria is abundant in the two control samples, and depleted in the remaining samples except for one sample from within cavities where it shows high proportion. Betaproteobacteria is high in one of the two controls and one of the samples with treated cavities, but low in the remaining six samples. Examining the population distribution at the genus level (Figure 4 (b)), Streptococcus is enriched in the normal samples, Prevotella and Veillonella are associated with the disease, and Fusobacterium is not abundant in the disease samples. Our findings about these genera are also reported in a recent study  which hence further verified our results.
Using BLAST, about 97% of reads in sample 1 and 94% of reads in sample 2 have hits in the nt database and could be assigned to taxonomic ranks by TAMER. There are about 900 and 1,400 species detected in sample 1 and 2, respectively. TAMER assigns more reads than MEGAN and CARMA3 at different taxonomic ranks (Table 3). At rank Species, TAMER assigns about 50% more reads than MEGAN and about 90% more reads than CARMA3 for sample 1. Candidatus Pelagibacter ubique is dominant in both samples (Figures S4). In fact this organism is highly dominant in both salt and fresh water worldwide . At rank Genus, the differences among the number of assigned reads using different methods become smaller. However TAMER still assigns about 18% more reads than MEGAN and about 37% more reads than CARMA3 for sample 1. The two seawater samples are characterized as differing from each other based on relative frequency with sample 1 containing more Shewanella and Burkholderia than sample 2 (Figures S5), which is consistent with previous conclusions , .
Protein-level sequence classification
Kaiju translates metagenomic sequencing reads into the six possible reading frames and searches for maximum exact matches (MEMs) of amino acid sequences in a given database of annotated proteins from microbial reference genomes. If matches to one or more database sequences are found for a read, Kaiju outputs the taxonomic identifier of the corresponding taxon, or it determines the LCA in the case of equally good matches to different taxa. Kaiju’s underlying sequence comparison algorithm uses the Burrows–Wheeler transform (BWT) of the protein database, which enables exact string matching in time proportional to the length of the query, to achieve a high classification speed.
In k-mer-based methods, the size of k governs the sensitivity and precision of the search. If k is chosen too large, no identical k-mers between read and database might be found, especially for short or erroneous reads, as well as for evolutionary distant sequences. If k is chosen too small, more false positive matches will be found. Therefore, in order to not be restricted by a prespecified k-mer size, Kaiju finds MEMs between reads and database to achieve both a high sensitivity and precision. Reads are directly assigned to a species or strain, or in case of ambiguity, to higher level nodes in the taxonomic tree. For example, if a read contains an amino acid sequence that is identical in two different species of the same genus then the read will be classified to this genus. Kaiju also offers the possibility to extend matches by allowing a certain number of amino acid substitutions at the end of an exact match in a greedy heuristic approach using the BLOSUM62 substitution matrix. See the Methods section for a detailed description of Kaiju’s algorithm.
Genome exclusion benchmark
Benchmarking a classifier’s accuracy can be done by simulation studies, which, knowing the ground truth about the origin of the simulated reads, can assess the sensitivity and precision of the classification. However, the benchmark protocol needs to reflect the real obstacles in metagenomic studies, which do not only include the bias and errors of the sequencing technology, but also the microbial composition of the sample at hand. Thus, we devised a simulation benchmark, which emulates the often limited availability of reference genomes and its impact on the classification performance when faced with a novel strain or species found in the metagenomic sample. To this end, we created a reference database of 2,724 bacterial and archaeal genomes and selected the subset of genomes belonging to genera that have at least 2 and most 10 genomes in the database. For each of the 882 genomes in this subset, we simulated 4 sets of Illumina and 1 set of Roche/454 sequencing reads and created a version of the reference database excluding that genome. This stripped reference (now containing 2,723 genomes) was then used to classify the simulated reads and we measured the number of classified reads, sensitivity and precision on genus, as well as phylum-level (see Methods). The number of genomes per genus serves as an indicator for the difficulty of the classification problem. For example, it is much harder to assign a novel genome to its genus when there is only one other genome of the same genus already available in the database. On the other hand, if there are 10 genomes available in a genus, it is typically much easier to classify the reads from the excluded genome to its genus with 9 remaining genomes available.
We compared the performance of Kaiju with the two k-mer-based programs Kraken and Clark, which performed best in speed and accuracy in a recent benchmark study18. Both Kraken and Clark use a reference database comprising whole genomes and construct an index of the contained nucleotide k-mers. While Kraken uses a default length of k=31, the user can chose k in Clark during database construction and values of k=20 and k=31 are recommended for highest sensitivity and highest precision, respectively. Therefore we chose values of k=20 and k=31 in Clark to illustrate the influence of the choice of k on the classification performance. Kaiju was run in the fastest MEM mode (with minimum fragment length m=11), as well as in the heuristic Greedy mode (with minimum score s=65), allowing either only one (Greedy-1) or up to five (Greedy-5) amino acid substitutions during the search.
Genomes were binned into categories in the range 2–10 according to the total number of genomes in the genus. Sensitivity and precision were calculated as the mean across all genomes in each category for each program and the five different types of simulated reads. Figure 1 compares the genus-level sensitivity and precision and Supplementary Fig. 1 shows the mean percentage of classification attempts for each genus category.
As expected, all programs have the lowest percentage of classified reads and the lowest sensitivity in those genera with only few available genomes and highest sensitivity in genera with seven or more genomes. Second, the read length is a major determinant for sensitivity as there is a much higher chance of finding a matching sequence to the reference database with increasing read length. Especially Kaiju gains a further increase of sensitivity from longer reads, as the chance of an overlap to a protein-coding region additionally increases with read length. For example, when looking at the Illumina single-end 100 nt reads, Greedy-5 achieves the highest sensitivity of 29% of all programs in genera with only two genomes, whereas Clark-k31 has the lowest sensitivity of 16%. In contrast for Illumina paired-end 250 nt reads, Greedy-5 achieves 59% sensitivity, whereas Clark-k31 only achieves 36%. With increasing number of genomes per genus, the difference between Greedy-5 and both Clark and Kraken shrinks to a few per cent, as the chance of finding at least one k-mer per read increases with more available reference genomes. Kaiju’s MEM mode has lower sensitivity compared with Greedy modes in all cases, because it only searches for exact matches, which is especially visible in short reads.
Similarly, the precision of all programs is lowest in genera with only two genomes and increases with higher number of available genomes. However, the differences between the programs is much smaller compared with sensitivity, with Clark-k31 showing the highest precision by a small margin in most cases. When comparing Clark-k31 and Kraken, Clark has consistently a little bit lower sensitivity and a bit higher precision than Kraken. The difference between Clark-k20 and Clark-k31 nicely illustrates the trade-off between sensitivity and precision depending on the k-mer size. However, the loss in precision is generally higher than the gain in sensitivity when using k=20.
Supplementary Fig. 2 shows the phylum-level sensitivity and precision. At this level, the difference in sensitivity between Kaiju and Kraken is generally higher, because more reads are assigned to ranks higher than genus by Kaiju’s LCA algorithm, whereas Kraken’s weighted path algorithm usually assigns reads to the lowest possible level. Again, the increase in sensitivity with increasing read length is higher in Kaiju compared with Kraken and Clark. For example, in genera with only two genomes, Greedy-5 achieves between 41% (Illumina single-end 100 nt) and 84% (Illumina paired-end 250 nt), whereas Clark achieves between 17 and 44%. On phylum-level, all modes of Kaiju achieve ∼10% higher sensitivity than Clark and Kraken up to the highest category with genera containing 10 genomes.
Phylum-level precision is generally much higher (>90%) for all methods and all read types compared with genus-level, because the chance of false positive matches outside the correct phylum is lower. Again, Clark-k20 consistently yields a much lower precision compared with Clark-k31 and the other programs, however, it also gains more sensitivity on phylum-level classification compared with genus-level. This can be attributed to the removal of k-mers that are shared across genera for the genus-level classification, which, however, can be used on the phylum-level.
Figure 2 shows the mean genus-level and phylum-level sensitivity and precision across all 882 measured genomes for the five different read types. The biggest gap for sensitivity and precision between the read types occurs for all programs between both paired-end and single-end 100 nt and the single-end 250 nt Illumina reads. Highest sensitivity is achieved by Greedy-5, followed by Greedy-1, MEM, Kraken and Clark in the paired-end 250 nt reads. Precision is highest for Clark closely followed by Kraken both on genus-level and phylum-level. Especially in the 100 nt reads, Kaiju's precision is lower, but the gain in sensitivity remains higher than the loss in precision. For the 250 nt reads and longer, Kaiju’s precision is marginally lower than Kraken and Clark-k31, while sensitivity is much higher.
In this analysis, we used cutoff values of minimum required match length m=11 in Kaiju’s MEM mode and minimum required match score s=65 in the Greedy modes. These parameters govern the accuracy of the classification, similar to the choice of k in the k-mer-based classifiers. Thus, we also examined Kaiju’s accuracy using different values for m and s. Supplementary Figure 3 shows the trade-off between sensitivity and precision of the classification depending on the choice of m or s. Similar to the choice of k in Clark, the sensitivity is highest and precision is lowest for small cutoff values. Increasing the cutoffs results in lower sensitivity but higher precision. However, the increase in sensitivity between m=11 and m=12 is higher than the loss in precision in all data sets both on genus-level, as well as phylum-level. Similarly in the Greedy modes, s=65 also yields higher gain in sensitivity than loss in precision.
To assess how many reads can actually be classified in real metagenomic data sets, we arbitrarily selected 10 previously published data sets from different microbiomes that were sequenced using various different HTS instruments. The two data sets from human saliva and vagina samples were already used by Ounit et al.9. The other eight samples are from human and cat gut, a freshwater lake, the Amazon river plume and Baltic sea, xeric desert soil and from two bioreactors that were inoculated with microbes from Wadden Sea sediment and compost environments. Supplementary Table 1 lists metadata and accession numbers for the data sets. The same database comprising 2,724 genomes from our exclusion benchmark serves as a reference database. We classified the 10 data sets using Kraken (k=31) and Kaiju in MEM and Greedy-5 modes with more conservative cutoff values of m=12 and s=70, respectively, which showed on average a similar precision as Kraken across the five types of reads in our exclusion benchmark, see Supplementary Fig. 3. We also mapped the four human and cat samples to their respective host genomes using BWA19, and the percentage of mapped reads was at most 2%.
Figure 3 shows the percentage of classified reads from each data set for MEM, Greedy-5 and Kraken, as well as the overlap and combined percentage of Greedy-5 and Kraken. Generally, Kaiju’s MEM mode classifies between 13.1% (Human vagina) and 48.8% (Bioreactor sediment) more reads than Kraken, which is further increased to 17.8 and 56.6% in Kaiju’s Greedy-5 mode. The percentages of reads that are classified by Kraken, but unclassified by Greedy-5 range between 0.3% (Desert soil and Lake) and 4.4% (Human gut). Across all data sets, the number of reads that were classified by both Greedy-5 and Kraken (overlap) varies between 2.8% (Lake) and 42.4% (Human vagina). By merging the results from Greedy-5 and Kraken, between 24.7% (Desert soil) and 73.1% (Bioreactor sediment) of the total reads can be classified.
As expected, the environmental samples, especially from the extreme xeric desert, but also the aquatic microbiomes, pose the biggest challenges for taxonomic assignment. In those samples Kaiju’s protein-level comparison with substitutions allows for a more sensitive sequence comparison resulting in more classified reads. However, even in the human microbiomes Kaiju’s protein-level classification adds >20% additional classified reads to Kraken’s result.
We also run each data set through Clark (k=31) with its phylum-level database and it classified fewer reads than Kraken in all cases (data not shown). In principle, if only a small fraction of reads were classified, classification could be done using a smaller k-mer size in Kraken and Clark or smaller cutoff values in Kaiju to increase the number of classified reads. The trade-off, however, would be a decreased precision as shown in our benchmark and also discussed in the study by Ounit et al.9.
HiSeq and MiSeq mock communities
In addition to the real metagenomes, we also measured Kaiju’s and Kraken’s performance using the same metrics and reference database on the HiSeq and MiSeq mock community data sets from previous benchmarks8,9. They comprise 10k real sequencing reads from 10 bacterial strains with mean read length of 92 nt (HiSeq) and 156 nt (MiSeq). All strains belong to genera that are associated with human microbiomes or human pathogens and have typically many reference genomes available. Supplementary Table 2 shows sensitivity and precision on both genus- and phylum-level of Kaiju in Greedy-5 mode and Kraken (k=31) using the same reference database as above. In the HiSeq data set, Kaiju has 73.3% sensitivity (Kraken: 78.0%) and 94.4% precision (Kraken: 99.2%) on genus-level, and 78.1% sensitivity (Kraken: 78.8%) and 98.3% precision (Kraken: 99.7%) on phylum-level. Because the short reads can only yield short amino acid fragments that are more likely found across genera, many reads are assigned to higher ranks resulting in a lower genus-level sensitivity. In addition, the short read length results in generally lower overlap with protein-coding regions and therefore Kraken yields a higher sensitivity, because it can classify those. In the MiSeq data set, the difference between both programs on genus-level is similar, whereas Greedy-5 yields 8% higher sensitivity and 1% higher precision on phylum-level compared with Kraken.
Runtime and memory
The read data set for the runtime benchmark contained 27.24 m reads comprised of 10k randomly sampled reads from each of the 2,724 genomes in our accuracy benchmark, which served again as the reference database. For the five different types of reads, the classification speed of Clark and Kraken using k=31 and of Kaiju’s modes MEM, Greedy-1 and Greedy-5 was measured using 25 parallel threads (see Methods section for specification of the hardware).
Figure 4 shows the number of processed reads per second (r.p.s.). The classification of the short single-end 100 nt reads is the fastest in all programs (Kaiju MEM: 173k r.p.s., Kraken: 165k r.p.s., Clark: 93k r.p.s.), whereas classification of the paired-end 250 nt (MEM: 97k r.p.s., Kraken: 24k r.p.s., Clark: 19k r.p.s.) takes the longest time. In the long reads, Kaiju can benefit from search space pruning by finding long MEMs first, whereas Kraken and Clark have to analyse more k-mers compared with the shorter reads. Naturally, Kaiju’s MEM mode is much faster than the Greedy modes, which extend the search space and also need to calculate the scores for each match. Depending on the read type, Greedy-5 classifies between 36k and 76k r.p.s. Greedy-1 with only one allowed mismatch is faster than Greedy-5 and can classify between 54k and 100k r.p.s.. Interestingly, Kaiju’s Greedy mode is faster in longer and paired-end reads compared with the 100 nt reads. This is due to the pruning of the search space by discarding query sequences that cannot achieve higher scores than the best scoring match, which is usually found earlier in longer and paired-end reads (see Methods). While Kaiju MEM is the fastest program in most cases, especially for the short reads, Greedy-5 generally takes the longest time, nicely demonstrating the trade-off between speed and sensitivity.
The measured peak memory consumption during the classification is 5.6 GB for Kaiju, 72 GB for Kraken and between 65 and 78 GB for Clark depending on the read length.
The construction of Kaiju’s index from the protein sequence database takes 8 min with peak memory usage of 24 GB using 25 threads (Kraken: 1h26m/165 GB, Clark: 3h57m/152 GB). The memory requirement is largely determined by the number of parallel threads for sorting the suffix array, which can, for example, be reduced to only 6.6 GB with only five threads. Kaiju’s final index size on disk is 4.8 GB (Kraken: 73 GB, Clark: 39 GB).