TY - CONF
T1 - Where are my intelligent assistant's mistakes? A systematic testing approach
AU - Kulesza, Todd
AU - Burnett, Margaret
AU - Stumpf, Simone
AU - Wong, Weng Keen
AU - Das, Shubhomoy
AU - Groce, Alex
AU - Shinsel, Amber
AU - Bice, Forrest
AU - McIntosh, Kevin
PY - 2011
Y1 - 2011
AB - Intelligent assistants are handling increasingly critical tasks, but until now, end users have had no way to systematically assess where their assistants make mistakes. For some intelligent assistants, this is a serious problem: if the assistant is doing work that is important, such as assisting with qualitative research or monitoring an elderly parent's safety, the user may pay a high cost for unnoticed mistakes. This paper addresses the problem with WYSIWYT/ML (What You See Is What You Test for Machine Learning), a human/computer partnership that enables end users to systematically test intelligent assistants. Our empirical evaluation shows that WYSIWYT/ML helped end users find assistants' mistakes significantly more effectively than ad hoc testing. Not only did it allow users to assess an assistant's work on an average of 117 predictions in only 10 minutes, it also scaled to a much larger data set, assessing an assistant's work on 623 out of 1,448 predictions using only the users' original 10 minutes' testing effort.
KW - Intelligent assistants
KW - end-user development
KW - end-user programming
KW - end-user software engineering
KW - machine learning
KW - testing
UR - http://www.scopus.com/inward/record.url?scp=79959982085&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-21530-8_14
DO - 10.1007/978-3-642-21530-8_14
M3 - Conference contribution
AN - SCOPUS:79959982085
SN - 9783642215292
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 171
EP - 186
BT - End-User Development - Third International Symposium, IS-EUD 2011, Proceedings
T2 - 3rd International Symposium on End-User Development, IS-EUD 2011
Y2 - 7 June 2011 through 10 June 2011
ER -