TY - JOUR
T1 - Effect of the mutation rate and background size on the quality of pathogen identification
AU - Reed, Chris
AU - Fofanov, Viacheslav
AU - Putonti, Catherine
AU - Chumakov, Sergei
AU - Slezak, Tom
AU - Fofanov, Yuriy
N1 - Funding Information:
Part of this work was funded by the Department of Homeland Security Science and Technology Directorate, award NBCHC070054 (Y.F.), and the Texas Learning and Computation Center (Y.F.). A portion of this work was performed under the auspices of the US Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48 (T.S.). The authors would like to thank Marisa W. Lam for testing early versions of software involved in this work at LLNL.
PY - 2007/10/15
Y1 - 2007/10/15
N2 - Motivation: Genomic-based methods have significant potential for fast and accurate identification of organisms or even genes of interest in complex environmental samples (air, water, soil, food, etc.), especially when isolation of the target organism cannot be performed for a variety of reasons. Despite this potential, the presence of unknown, variable and usually large quantities of background DNA can cause interference resulting in false positive outcomes. Results: In order to estimate how the genomic diversity of the background (total length of all of the different genomes present in the background), target length and target mutation rate affect the probability of misidentifications, we introduce a mathematical definition for the quality of an individual signature in the presence of a background based on its length and the number of mismatches needed to transform the signature into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of different sizes of backgrounds and the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicts that any genomic signature will be unable to identify the target if the target's mutation rate is > 5%.
AB - Motivation: Genomic-based methods have significant potential for fast and accurate identification of organisms or even genes of interest in complex environmental samples (air, water, soil, food, etc.), especially when isolation of the target organism cannot be performed for a variety of reasons. Despite this potential, the presence of unknown, variable and usually large quantities of background DNA can cause interference resulting in false positive outcomes. Results: In order to estimate how the genomic diversity of the background (total length of all of the different genomes present in the background), target length and target mutation rate affect the probability of misidentifications, we introduce a mathematical definition for the quality of an individual signature in the presence of a background based on its length and the number of mismatches needed to transform the signature into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of different sizes of backgrounds and the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicts that any genomic signature will be unable to identify the target if the target's mutation rate is > 5%.
UR - http://www.scopus.com/inward/record.url?scp=35748932362&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35748932362&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btm420
DO - 10.1093/bioinformatics/btm420
M3 - Article
C2 - 17881407
AN - SCOPUS:35748932362
SN - 1367-4803
VL - 23
SP - 2665
EP - 2671
JO - Bioinformatics
JF - Bioinformatics
IS - 20
ER -