
Estimating the mean and standard deviation from a histogram or interval data

NB: I am unsure why the LaTeX is not rendering here. The plug-in says that it has not been tested with WordPress 5.4, but it seems to render other posts correctly.

This post is about how to estimate the mean and standard deviation of a data set when we do not have the original values, only “binned” data or a histogram. This is not particularly complex. In fact, we used to teach this in our first year statistics course, and perhaps we still do. However, it came up yesterday in a post to Stack Overflow which someone closed because they thought the person was asking an ill-formed question. It is true that the question poster could have illustrated her question a little better, but equally the person who closed the post did not get it either.

The situation is this. Imagine you are working with data that has been collected as part of the standard sort of demographics that survey writers like to ask. In particular, we often get asked to provide an age range, rather than our exact age. I will simulate a little data that looks like this so that we can see how well the method I describe actually works.

set.seed(123)
library(dplyr)
df = data.frame(trueAge = round(rnorm(100, mean = 37, sd = 12.5)),
                stringsAsFactors = FALSE) %>% 
  mutate(ageInt = case_when(
    trueAge >= 0 & trueAge <= 17 ~ "0-17",
    trueAge >= 18 & trueAge <= 24 ~ "18-24",
    trueAge >= 25 & trueAge <= 29 ~ "25-29",
    trueAge >= 30 & trueAge <= 34 ~ "30-34",
    trueAge >= 35 & trueAge <= 39 ~ "35-39",
    trueAge >= 40 & trueAge <= 44 ~ "40-44",
    trueAge >= 45 & trueAge <= 49 ~ "45-49",
    trueAge >= 50 & trueAge <= 54 ~ "50-54",
    trueAge >= 55 & trueAge <= 59 ~ "55-59",
    trueAge >= 60 & trueAge <= 64 ~ "60-64",
    trueAge >= 65 & trueAge <= 69 ~ "65-69",
    trueAge >= 70 & trueAge <= 74 ~ "70-74",
    trueAge >= 75 & trueAge <= 79 ~ "75-79",
    trueAge >= 80 & trueAge <= 99 ~ "80-99"
  ))
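As an aside, the same binning can be sketched more compactly with base R's cut function. This is an alternative to the case_when call above, not what the rest of the post uses. The vectors breaks and labels are names introduced here for illustration, and the sketch assumes ages are whole numbers below 100.

```r
# Same simulated ages as above, generated without dplyr
set.seed(123)
trueAge = round(rnorm(100, mean = 37, sd = 12.5))

# Lower edge of each interval, plus a final upper edge of 100
breaks = c(0, 18, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 100)
labels = c("0-17", "18-24", "25-29", "30-34", "35-39", "40-44", "45-49",
           "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80-99")

# right = FALSE makes each interval closed on the left: [0, 18), [18, 25), ...
# For whole-number ages this is the same as 0-17, 18-24, and so on.
ageInt = as.character(cut(trueAge, breaks = breaks, labels = labels, right = FALSE))

# Number of observations in each age range
table(ageInt)
```

Any age outside the range of breaks would become NA, so the break points need to cover the data.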

My data.frame, df, has 100 random observations on two variables: trueAge and ageInt. Imagine now that we only have the second. How would I go about estimating the mean and the standard deviation from such data? In the original post the questioner was trying to use the sd function on ageInt to get the standard deviation. Clearly this was never going to work, because ageInt holds character labels rather than numbers, and this is what seemed to confuse the Stack Overflow contributor who closed the question. The “trick”, if one really wants to call it that, is simply to replace the intervals with their midpoints. For example, “0-17” gets replaced with 8.5, “18-24” gets replaced with 21, and so on. The remainder of this post shows how to do that programmatically.

I will break this into two steps. Firstly I am going to extract the end points of each interval using a regular expression, and then I will convert these into numbers and average them. There are lots of ways to do this, but I will use stringr as it provides a simpler mechanism for recovering regular expression capture groups.

library(stringr)
endPoints = data.frame(str_match(df$ageInt, "^([0-9]+)[-]([0-9]+)$")[,2:3], stringsAsFactors = FALSE)

In case someone wants to tell me that I can use \\d+ instead of [0-9]+, thanks but I know this. I find the latter more readable than the former. I will now add these to my data.frame and convert them to numeric values at the same time. I will also calculate the midpoints.

df = df %>% 
  mutate(X1 = as.numeric(endPoints$X1),
         X2 = as.numeric(endPoints$X2),
         midPoint = 0.5 * (X1 + X2))
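For readers who would rather avoid the stringr dependency, here is a minimal base R sketch of the same two steps (split each label at the hyphen, then average the end points). The vector intervals is a small made-up example, and lower, upper and parts are names introduced here. It assumes every label has exactly the form "low-high".

```r
# A few example interval labels (illustrative only)
intervals = c("0-17", "18-24", "80-99")

# Split each label at the literal hyphen
parts = strsplit(intervals, "-", fixed = TRUE)

# Pull out the lower and upper end points and convert them to numbers
lower = as.numeric(sapply(parts, `[`, 1))
upper = as.numeric(sapply(parts, `[`, 2))

# The midpoint of each interval
midPoint = 0.5 * (lower + upper)
midPoint
# [1]  8.5 21.0 89.5
```

The regular expression version is stricter, since a label that does not match the pattern produces NA instead of being split silently.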

Now we are in a position to estimate the standard deviation, and we can do this by using df$midPoint as the input to the sd function:

> sd(df$midPoint)
[1] 11.84036

So how did we do? Here is the sample standard deviation of the actual age values:

> sd(df$trueAge)
[1] 11.41703

I would say we did pretty well! The title of this article says we could do this with a histogram as well. How do we do that? Let us say I did not have trueAge, but I had a histogram of the ages instead.

Could I estimate the standard deviation from this? The answer is, of course, “yes.” We simply follow the same steps as before. We can see from the plot that the x-axis starts at 5 and finishes at 65, and that each bin is 5 years wide. Therefore the midpoints will be a regular sequence from 7.5 to 62.5 progressing in steps of 5. We can code that in R with

mids = seq(from = 7.5, to = 62.5, by = 5)

and although it is painful, we can read off the counts in each bin fairly accurately from the histogram:

counts = c(1, 1, 2, 10, 10, 20, 15, 15, 11,  7,  5,  3)

I could go two ways at this point. One is to use counts with mids in conjunction with the rep function to generate a vector of length 100 with the appropriate midpoint for each value, e.g.

ageInt = rep(mids, counts)

However, I am going to use a bit of statistics theory, namely the formulae for the mean and variance of a discrete random variable. You might recall having seen these before. If X is a discrete random variable which can take any value x from the set Ω with probability f(x) = Pr(X = x), then the expected value of X is

\(
{\mathrm E}[X] = \sum_{x \in \Omega}xf(x)
\)

and the variance

\(
{\mathrm E}[(X-\mu)^2] = \sum_{x \in \Omega}(x-\mu)^2f(x)
\)

where μ = E[X]. This works for us because our counts can be converted to frequencies, and we can use the frequencies to approximate the probabilities in the expressions above, i.e.

\(
f(x) \approx \hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n}I(x_i = x)
\)
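The R code that follows does not evaluate the variance formula directly. It relies on a standard shortcut, which comes from expanding the square and using μ = E[X]:

\(
{\mathrm E}[(X-\mu)^2] = {\mathrm E}[X^2] - 2\mu{\mathrm E}[X] + \mu^2 = {\mathrm E}[X^2] - \mu^2
\)

where

\(
{\mathrm E}[X^2] = \sum_{x \in \Omega}x^2f(x)
\)

so the standard deviation is the square root of E[X²] minus the square of E[X].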

Therefore in R, we type:

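freqs = counts / sum(counts)     # relative frequencies; sum(counts) is 100 here
EX = sum(mids * freqs)           # estimated mean, E[X]
EX2 = sum(mids^2 * freqs)        # estimated E[X^2]
sigma = sqrt(EX2 - EX^2)         # sd from Var(X) = E[X^2] - E[X]^2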

And this estimates the standard deviation as

> sigma
[1] 11.4

This is closer to the true sample standard deviation of 11.42 than our previous estimate of 11.84, mainly because the histogram splits the 0-17 age group into narrower bins.
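To tie the two approaches together, here is a small sketch that wraps the calculation in a function and also reports the estimated mean, which the title promises. The function name binnedSummary and its return names are made up for this illustration. One detail worth knowing: the discrete random variable formula divides by n, whereas sd divides by n - 1, so the two differ by a factor of sqrt(n / (n - 1)). With 100 observations that difference is small.

```r
# Hypothetical helper (not part of any package): mean and standard
# deviation from bin midpoints and bin counts.
binnedSummary = function(mids, counts) {
  n = sum(counts)
  freqs = counts / n                 # relative frequencies
  m = sum(mids * freqs)              # estimated mean
  v = sum((mids - m)^2 * freqs)      # variance with divisor n
  c(mean = m,
    sdPop = sqrt(v),                       # divisor n
    sdSample = sqrt(v * n / (n - 1)))      # divisor n - 1, as used by sd()
}

mids = seq(from = 7.5, to = 62.5, by = 5)
counts = c(1, 1, 2, 10, 10, 20, 15, 15, 11, 7, 5, 3)

res = binnedSummary(mids, counts)
res   # mean is 37.7, sdPop is 11.4, sdSample is about 11.46

# The n - 1 version agrees with sd() applied to the expanded vector
all.equal(unname(res["sdSample"]), sd(rep(mids, counts)))
```

The estimated mean of 37.7 is close to the value of 37 used to simulate the data.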
