# Stop using the language of formal hypothesis testing

Punchy title, eh? This post was inspired by a mate of mine whom I will name if they want me to, but who at this point can remain anonymous. The question arose as to what constituted a Type I error and what constituted a Type II error when we were talking about matching of DNA profiles. There are a number of publicly available talks and books that use these terms with respect to this subject. The quantities of actual interest are the rates or probabilities of false inclusion and false exclusion. That is, what is the probability that a non-contributor will be falsely included as a contributor to an evidential DNA sample, or, alternatively, that a true contributor will be falsely excluded as a contributor to an evidential DNA sample? I can only assume, because I really do not know, that the authors chose to switch from these relatively easily understood terms to Neyman and Pearson’s Type I and Type II errors in an attempt to be more formal, or to show off a deeper statistical knowledge. Whatever the intent, the result is an increased level of confusion.

### It all depends on the null hypothesis

You might reasonably think that Type I and Type II errors are well defined, and therefore that there should be no scope for confusion. However, that is predicated on there being common agreement as to what is the null hypothesis and what is the alternative. In most formal hypothesis tests the null hypothesis is that the quantity of interest is equal to some hypothesised value. For example, in the two-sample t-test, the null hypothesis is that there is no difference between the population means. In a chi-squared test of independence, the null hypothesis is that the joint probabilities are the product of the marginal probabilities, e.g. $latex p_{ij} = p_{i}p_{j}$. Therefore, when it comes to DNA profiles you might, again quite reasonably, think that the null hypothesis is that a person of interest’s (POI) profile matches the crime scene sample (or that the POI is a contributor to the crime scene sample), and that a Type I error is incurred when that POI is incorrectly designated as not having contributed to the sample, i.e. a false exclusion. However, the exact opposite is true. The null hypothesis in this example is, in fact, one of innocence, i.e. the null hypothesis is that the POI is not a contributor. Therefore, a Type I error occurs when we falsely include the POI as a contributor to the stain. Confused? You are not alone. And I hope you can see now why I suggest we stop using this language for this situation. It is confusing and it does not impart the necessary information without additional clarification. In contrast, everyone understands what we mean by false inclusion and false exclusion.
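
To make the dependence on the null explicit, here is the same point in symbols. This is only a restatement of the paragraph above; the words "include" and "exclude" are shorthand introduced here for the decision made about the POI. With a null hypothesis of innocence, the two error rates are

$latex \Pr(\mathrm{Type~I~error}) = \Pr(\mathrm{include~POI} \mid \mathrm{POI~is~not~a~contributor})$

and

$latex \Pr(\mathrm{Type~II~error}) = \Pr(\mathrm{exclude~POI} \mid \mathrm{POI~is~a~contributor})$

so Type I is a false inclusion and Type II is a false exclusion. If the null hypothesis were instead that the POI is a contributor, the two labels would simply swap, while the false inclusion and false exclusion rates themselves would not change at all.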

### Hang on a minute, don’t you use this language?

A number of my readers will know that I often talk and write about the interpretation of trace evidence, that is, evidence types such as glass, paint, ink, electrical tape and so on. In this work, I often talk about formal hypothesis tests, usually in the context of discussing and criticising frequentist approaches to evidence interpretation, but even in the Bayesian approach there are remnants. These approaches use classical hypothesis tests such as the two-sample t-test and Hotelling’s $latex T^2$ to help decide if the evidence recovered from a POI has a common source, that being the crime scene source. The null hypothesis under consideration is usually one of equality of population means, the thinking being that equal population means are equivalent to a common source. This is not, however, a null hypothesis of innocence. If you explain the problem to another statistician, especially one who has experience of clinical trials, then they will often say “Oh you really should use an equivalence test.” This may be true, but we almost never have the information to carry this out. It also misses the point: we are not actually required to make a decision in forensic science, but rather to help the court understand whether the evidence increases or decreases the odds of the hypotheses under consideration (which may be simply that the POI is guilty, or not guilty, but things are rarely at that level or that simple).
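
For readers who have not met equivalence tests, the contrast with the usual test can be written down in a couple of lines. The symbols are assumptions for illustration only: $latex \mu_c$ and $latex \mu_r$ stand for the population means of the crime scene source and of the source of the recovered material, and $latex \delta > 0$ is an equivalence margin that has to be chosen in advance. The usual test of equal means takes

$latex H_0: \mu_c = \mu_r \quad \mathrm{versus} \quad H_1: \mu_c \neq \mu_r$

whereas an equivalence test reverses the roles, so that a common source has to be demonstrated rather than assumed:

$latex H_0: |\mu_c - \mu_r| \geq \delta \quad \mathrm{versus} \quad H_1: |\mu_c - \mu_r| < \delta$

A defensible value for $latex \delta$ is one of the things such a test needs before it can be carried out.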

### And it gets worse

Many authors are interested in understanding the performance characteristics of a procedure that produces a likelihood ratio (LR). In order to do this, it is common to compute

$latex \Pr(LR > 1 \mid H_2~\mathrm{is~true})$

and

$latex \Pr(LR < 1 \mid H_1~\mathrm{is~true})$

These probabilities are termed the false positive and false negative rates respectively. The value of 1 is chosen because, when the likelihood ratio is one, the evidence is equally likely under each hypothesis and thus has no weight. These false positive and false negative rates are sometimes called Type I and Type II error rates, even though there is no actual hypothesis test under consideration. There is no need for the terms, and confusion is only added by using them.
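
As a rough illustration of how these two probabilities are usually estimated, here is a minimal simulation sketch in Python. Everything specific in it is an assumption made for the example and not part of any real casework model: the evidence is a single measurement that is normally distributed with mean 1 under $latex H_1$ and mean 0 under $latex H_2$, with a common standard deviation of 1, and the function names `log_lr` and `misleading_rates` are hypothetical.

```python
import math
import random


def log_lr(x, mean_h1=1.0, mean_h2=0.0, sd=1.0):
    """Log likelihood ratio for one observation x under two normal models
    with a shared standard deviation (a toy model, assumed for illustration)."""
    # log f1(x) - log f2(x); the normalising constants cancel.
    return ((x - mean_h2) ** 2 - (x - mean_h1) ** 2) / (2.0 * sd ** 2)


def misleading_rates(n_sims=100_000, mean_h1=1.0, mean_h2=0.0, sd=1.0, seed=1):
    """Estimate Pr(LR > 1 | H2 true) and Pr(LR < 1 | H1 true) by simulation."""
    rng = random.Random(seed)
    # Simulate evidence when H2 is true and count how often the LR favours H1.
    false_pos = sum(
        log_lr(rng.gauss(mean_h2, sd), mean_h1, mean_h2, sd) > 0.0
        for _ in range(n_sims)
    )
    # Simulate evidence when H1 is true and count how often the LR favours H2.
    false_neg = sum(
        log_lr(rng.gauss(mean_h1, sd), mean_h1, mean_h2, sd) < 0.0
        for _ in range(n_sims)
    )
    return false_pos / n_sims, false_neg / n_sims


if __name__ == "__main__":
    fp, fn = misleading_rates()
    print(f"Pr(LR > 1 | H2 true) is roughly {fp:.3f}")
    print(f"Pr(LR < 1 | H1 true) is roughly {fn:.3f}")
```

Under these assumed parameters the LR exceeds 1 exactly when the measurement exceeds 0.5, so both estimates should land close to the standard normal probability below -0.5, which is about 0.31. Note that nothing in the sketch is a hypothesis test: there is no null, no significance level and no rejection region, only two conditional probabilities, and "false positive" and "false negative" describe them without any help from Type I or Type II.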

### Finally

I hope I have convinced you to stop using this language outside the realms of formal hypothesis testing. I might argue that even there it is not very helpful. I often find myself looking at the definitions to decide which is sensitivity and which is specificity.

# Forensic anthropologists — lend me your data

Please note: This is not a new post, but a restored post from August that I lost in a WordPress upgrade.

## Friends, Romans, forensic anthropologists, lend me your data

I have been reading the Journal of Forensic Sciences (JFS) over the last couple of days to see what sort of research is being done in forensic science and to see how many studies are using statistics to make, or to reinforce, their conclusions. The answer to the second question is “quite a few.” There has been quite significant adoption of multivariate analysis, most commonly PCA, in a wide variety of forensic disciplines, and that is very pleasing to see.

## Forensic anthropology

Anthropologists, and in particular forensic anthropologists, have long been heavy users of statistical methodology. Many studies use linear regression, linear discriminant analysis (LDA), principal component analysis and logistic regression. The well-known and widely used forensic anthropology computer programme FORDISC uses LDA. It is interesting to me to see the appearance of some newer or different classification techniques such as k-nearest neighbour, quadratic discriminant analysis, classification and regression trees, support vector machines, random forests, and neural networks.

Forensic anthropology features heavily in JFS, and the papers contain a large amount of statistical analysis of data. The focus of the articles is often on classification of remains into age, gender, or racial groups, or on age estimation. The articles are generally quite interesting and well written.

## Show me the data

However, there is almost never any provision of the raw data, and my experience whilst writing my data analysis book was that not one of the dozen or so anthropologists I wrote to responded to my requests for data. Not even a polite “sorry but we are unable to release the data.” I understand that in all scientific disciplines data can be expensive, in terms of time or money, to collect, and so a researcher might justifiably want to retain a data set as long as possible to get as much research value from it as possible. However, surely there must be a point where the data could be released into the public domain? The University of Tennessee Knoxville does have a forensic anthropology databank, but, at least from the webpage, it seems that there is an emphasis on deposits rather than withdrawals.

## A challenge

I therefore issue a challenge to the forensic anthropology community – release some of your data into the wild. It will benefit your discipline as it will others, and you might find your work cited more as people give you credit for producing the data that they are applying their novel techniques to.