TY - JOUR
T1 - Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning
AU - Hocking, Toby Dylan
AU - Goerner-Potvin, Patricia
AU - Morin, Andreanne
AU - Shao, Xiaojian
AU - Pastinen, Tomi
AU - Bourque, Guillaume
N1 - Funding Information:
This work was supported by computing resources provided by Calcul Quebec and Compute Canada, by Natural Sciences and Engineering Research Council of Canada grant RGPGR 448167-2013, and by Canadian Institutes of Health Research grants EP1-120608 and EP1-120609, awarded to GB.
Publisher Copyright:
© 2017 The Author.
PY - 2017/2/15
Y1 - 2017/2/15
AB - Motivation: Many peak detection algorithms have been proposed for ChIP-seq data analysis, but it is not obvious which algorithm and what parameters are optimal for any given dataset. In contrast, regions with and without obvious peaks can be easily labeled by visual inspection of aligned read counts in a genome browser. We propose a supervised machine learning approach for ChIP-seq data analysis, using labels that encode qualitative judgments about which genomic regions contain or do not contain peaks. The main idea is to manually label a small subset of the genome, and then learn a model that makes consistent peak predictions on the rest of the genome. Results: We created 7 new histone mark datasets with 12 826 visually determined labels, and analyzed 3 existing transcription factor datasets. We observed that default peak detection parameters yield high false positive rates, which can be reduced by learning parameters using a relatively small training set of labeled data from the same experiment type. We also observed that labels from different people are highly consistent. Overall, these data indicate that our supervised labeling method is useful for quantitatively training and testing peak detection algorithms.
UR - http://www.scopus.com/inward/record.url?scp=85028304189&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85028304189&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btw672
DO - 10.1093/bioinformatics/btw672
M3 - Article
C2 - 27797775
AN - SCOPUS:85028304189
SN - 1367-4803
VL - 33
SP - 491
EP - 499
JO - Bioinformatics
JF - Bioinformatics
IS - 4
ER -