Stop using the language of formal hypothesis testing

Punchy title, eh? This post was inspired by a mate of mine, whom I will name if they want me to, but who at this point can remain anonymous. The question arose as to what constitutes a Type I error and what constitutes a Type II error when we are talking about the matching of DNA profiles. There are a number of publicly available talks and books that use these terms with respect to this subject. The quantities of actual interest are the rates or probabilities of false inclusion and false exclusion. That is, what is the probability that a person will be falsely included as a contributor to an evidential DNA sample, or, alternatively, falsely excluded as a contributor to an evidential DNA sample? I can only assume, because I really do not know, that the authors chose to switch from these relatively easily understood terms to Neyman and Pearson's Type I and Type II errors in an attempt to be more formal, or to show off a deeper statistical knowledge. Whatever the intent, the result is an increased level of confusion.

It all depends on the null hypothesis

You might reasonably think that Type I and Type II errors are well-defined, and therefore that there should be no scope for confusion. However, that is predicated on there being common agreement as to what the null hypothesis is and what the alternative is. In most formal hypothesis tests the null hypothesis is that the quantity of interest is equal to some hypothesised value. For example: in the two-sample t-test, the null hypothesis is that there is no difference between the population means. In a chi-squared test of independence, the null hypothesis is that the joint probabilities are the products of the marginal probabilities, i.e. $latex p_{ij} = p_i p_j$. Therefore, when it comes to DNA profiles you might, again quite reasonably, think that the null hypothesis is that a person of interest's (POI's) profile matches the crime scene sample (or that the POI is a contributor to the crime scene sample), and consequently that a Type I error is incurred when that POI is incorrectly designated as not having contributed to the sample, i.e. a false exclusion. However, the exact opposite is true. The null hypothesis in this example is, in fact, one of innocence: the null hypothesis is that the POI is not a contributor. Therefore, a Type I error occurs when we falsely include the POI as a contributor to the stain. Confused? You are not alone. And I hope you can now see why I suggest we stop using this language for this situation. It is confusing, and it does not impart the necessary information without additional clarification. In contrast, everyone understands what we mean by false inclusion and false exclusion.
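To make the textbook meaning of a Type I error concrete, here is a minimal simulation sketch (a toy example of my own, not from any of the works discussed): when the null hypothesis of equal population means is true, a two-sample t-test at significance level 0.05 rejects the null, i.e. commits a Type I error, in roughly 5% of repeated experiments.

```python
# Toy simulation: Type I error rate of a two-sample t-test when H0 is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_sims = 10_000

rejections = 0
for _ in range(n_sims):
    # Both samples come from the same population, so H0 is true by construction
    x = rng.normal(loc=0.0, scale=1.0, size=30)
    y = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_ind(x, y)
    if p_value < alpha:
        rejections += 1  # rejecting a true null: a Type I error

rate = rejections / n_sims
print(f"Estimated Type I error rate: {rate:.3f}")  # close to alpha = 0.05
```

The point of the sketch is only that "Type I error" is tied to whatever the null hypothesis happens to be; the label carries no information until you say what the null is.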

Hang on a minute, don’t you use this language?

A number of my readers will know that I often talk and write about the interpretation of trace evidence, that is, evidence types such as glass, paint, ink, electrical tape and so on. In this work I often talk about formal hypothesis tests, usually in the context of discussing and criticising frequentist approaches to evidence interpretation, but even in the Bayesian approach there are remnants. These approaches use classical hypothesis tests, such as the two-sample t-test and Hotelling's $latex T^2$, to help decide whether the evidence recovered from a POI has a common source with the crime scene evidence. The null hypothesis under consideration is usually one of equality of population means, the thinking being that equal population means are equivalent to a common source. This is not, however, a null hypothesis of innocence. If you explain the problem to another statistician, especially one with experience of clinical trials, they will often say, "Oh, you really should use an equivalence test." This may be true, but we almost never have the information to carry this out. It also misses the point: we are not actually required to make a decision in forensic science, but rather to help the court understand whether the evidence increases or decreases the odds of the hypotheses under consideration (which may simply be that the POI is guilty, or not guilty, although things are rarely at that level or that simple).

And it gets worse

Many authors are interested in understanding the performance characteristics of a procedure that produces a likelihood ratio (LR). In order to do this, it is common to compute

$latex \Pr(LR > 1 | H_2~\mathrm{is~true})$

and

$latex \Pr(LR < 1 | H_1~\mathrm{is~true})$

These probabilities are termed the false positive and false negative rates respectively. The value of 1 is chosen because when the likelihood ratio is one, the evidence is equally likely under each hypothesis and thus has no weight. These false positive and false negative rates are sometimes called Type I and Type II error rates, even though there is no actual hypothesis test under consideration. There is no need for these terms, and using them only adds confusion.
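These two rates can be estimated by simulation without ever mentioning a hypothesis test. Here is a minimal sketch under an assumed toy model (my own choice, purely for illustration): the evidence score is normally distributed with mean $latex \mu_1$ under $latex H_1$ and mean $latex \mu_2$ under $latex H_2$, and the LR is the ratio of the two densities.

```python
# Toy model: estimate Pr(LR > 1 | H2 true) and Pr(LR < 1 | H1 true) by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100_000
mu1, mu2, sd = 1.0, -1.0, 1.0  # assumed score distributions under H1 and H2

def lr(x):
    # Likelihood ratio: density of the score under H1 over density under H2
    return stats.norm.pdf(x, mu1, sd) / stats.norm.pdf(x, mu2, sd)

x_h1 = rng.normal(mu1, sd, n)  # scores generated when H1 is true
x_h2 = rng.normal(mu2, sd, n)  # scores generated when H2 is true

false_positive = np.mean(lr(x_h2) > 1)  # Pr(LR > 1 | H2 true)
false_negative = np.mean(lr(x_h1) < 1)  # Pr(LR < 1 | H1 true)
print(f"false positive rate: {false_positive:.3f}")
print(f"false negative rate: {false_negative:.3f}")
```

Calling these quantities the false positive and false negative rates, as the code does, says exactly what they are; relabelling them Type I and Type II error rates adds nothing.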

Finally

I hope I have convinced you to stop using this language outside the realm of formal hypothesis testing. I might argue that even there it is not very helpful; I often find myself having to look up the definitions to decide which is sensitivity and which is specificity.
