TY - JOUR
T1 - Does choice of mutation tool matter?
AU - Gopinath, Rahul
AU - Ahmed, Iftekhar
AU - Alipour, Mohammad Amin
AU - Jensen, Carlos
AU - Groce, Alex
N1 - Publisher Copyright:
© 2016, Springer Science+Business Media New York.
PY - 2017/9/1
Y1 - 2017/9/1
N2 - Though mutation analysis is the primary means of evaluating the quality of test suites, it suffers from inadequate standardization. Mutation analysis tools vary based on language, the phase of compilation at which mutants are generated, and target audience. Mutation tools rarely implement the complete set of operators proposed in the literature, and most implement at least a few domain-specific mutation operators. Thus, different tools may not always agree on the mutant kills of a test suite. Few criteria exist to guide a practitioner in choosing the right tool, either for evaluating the effectiveness of a test suite or for comparing different testing techniques. We investigate an ensemble of measures for evaluating the efficacy of mutants produced by different tools. These include the traditional difficulty of detection, the strength of minimal sets, and the diversity of mutants, as well as the information carried by the mutants produced. We find that mutation tools rarely agree. The disagreement between scores can be large, and the variation due to characteristics of the project, even after accounting for differences due to test suites, is a significant factor. However, the mean difference between tools is very small, indicating that no single tool consistently skews mutation scores high or low across all projects. These results suggest that experiments yielding small differences in mutation score, especially those using a single tool or a small number of projects, may not be reliable. There is a clear need for greater standardization of mutation analysis. We propose one approach for such standardization.
AB - Though mutation analysis is the primary means of evaluating the quality of test suites, it suffers from inadequate standardization. Mutation analysis tools vary based on language, the phase of compilation at which mutants are generated, and target audience. Mutation tools rarely implement the complete set of operators proposed in the literature, and most implement at least a few domain-specific mutation operators. Thus, different tools may not always agree on the mutant kills of a test suite. Few criteria exist to guide a practitioner in choosing the right tool, either for evaluating the effectiveness of a test suite or for comparing different testing techniques. We investigate an ensemble of measures for evaluating the efficacy of mutants produced by different tools. These include the traditional difficulty of detection, the strength of minimal sets, and the diversity of mutants, as well as the information carried by the mutants produced. We find that mutation tools rarely agree. The disagreement between scores can be large, and the variation due to characteristics of the project, even after accounting for differences due to test suites, is a significant factor. However, the mean difference between tools is very small, indicating that no single tool consistently skews mutation scores high or low across all projects. These results suggest that experiments yielding small differences in mutation score, especially those using a single tool or a small number of projects, may not be reliable. There is a clear need for greater standardization of mutation analysis. We propose one approach for such standardization.
KW - Empirical analysis
KW - Mutation analysis
KW - Software testing
UR - http://www.scopus.com/inward/record.url?scp=84966667381&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84966667381&partnerID=8YFLogxK
U2 - 10.1007/s11219-016-9317-7
DO - 10.1007/s11219-016-9317-7
M3 - Article
AN - SCOPUS:84966667381
SN - 0963-9314
VL - 25
SP - 871
EP - 920
JO - Software Quality Journal
JF - Software Quality Journal
IS - 3
ER -