Among the major questions a practicing tester faces are deciding where to focus additional testing effort, and deciding when to stop testing. "Test the least-tested code, and stop when all code is well tested" is a reasonable answer. Many measures of "testedness" have been proposed; unfortunately, we do not know whether these are truly effective. In this paper we propose a novel evaluation of two of the most important and widely used measures of test suite quality. The first measure is statement coverage, the simplest and best-known code coverage measure. The second measure is mutation score, a supposedly more powerful, though expensive, measure. We evaluate these measures using the actual criterion of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug-fixes than a "poorly tested" element. If not, then it seems likely that we are not effectively measuring testedness. Using a large number of open-source Java programs from GitHub and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug-fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements for various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug-fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.