TY - JOUR
T1 - Open-Source Sequence Clustering Methods Improve the State Of the Art
AU - Kopylova, Evguenia
AU - Navas-Molina, Jose A.
AU - Mercier, Céline
AU - Xu, Zhenjiang Zech
AU - Mahé, Frédéric
AU - He, Yan
AU - Zhou, Hong Wei
AU - Rognes, Torbjørn
AU - Caporaso, J. Gregory
AU - Knight, Rob
N1 - Funding Information:
This work was partially supported by the Howard Hughes Medical Institute and the Alfred P. Sloan Foundation.
Funding Information:
HHS | National Institutes of Health (NIH) provided funding to Evguenia Kopylova, José Antonio Navas-Molina, and Rob Knight under grant number 1S10OD012300. Deutsche Forschungsgemeinschaft (DFG) provided funding to Frédéric Mahé under grant number DU1319/1-1.
Publisher Copyright:
Copyright © 2016 Kopylova et al.
PY - 2016/1/1
Y1 - 2016/1/1
N2 - Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release. IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1).
AB - Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release. IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1).
KW - Amplicon sequencing
KW - Microbial community analysis
KW - Operational taxonomic units
KW - Sequence clustering
UR - http://www.scopus.com/inward/record.url?scp=85041918285&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85041918285&partnerID=8YFLogxK
U2 - 10.1128/mSystems.00003-15
DO - 10.1128/mSystems.00003-15
M3 - Article
AN - SCOPUS:85041918285
SN - 2379-5077
VL - 1
JO - mSystems
JF - mSystems
IS - 1
M1 - e00003-15
ER -