
Demystifying Degrees of Freedom

  1. Definition of Degrees of Freedom (DF)

  2. Understanding DF

  3. Everyday Analogy

  4. Distinguishing Between Nominal and Effective Sample Size

  5. Application in Clustered Data

  6. Significance

Anyone who has done hypothesis testing has encountered tests that depend on the so-called ‘degrees of freedom’. However, most textbooks do not explain what degrees of freedom actually mean, and the explanations that do exist tend to be formulaic and short on intuition. Indeed, the phrase ‘degrees of freedom’ was introduced by Ronald Fisher in 1922, without any explanation of how the formulae to compute them were derived.

The degrees of freedom in statistics can be defined as the number of independent observations in the sample minus the number of relevant estimated summary statistics, parameters, or constraints. For instance, let’s say that you have 3 independent observations of the number of hours people sleep, e.g., 𝑥1 = 1 hour, 𝑥2 = 2 hours, and 𝑥3 = 3 hours, and that nothing else is known about the sample, so there are no restrictions. In this case, the number of degrees of freedom is 3. Now suppose instead that the mean (a summary statistic) is known to be 5. Then the degrees of freedom are actually 2. This is because we can still set 𝑥1 = 1 and 𝑥2 = 2, but once we do this, we are no longer free to choose 𝑥3. The three values must sum to 3 × 5 = 15, so 𝑥3 has to be 15 − 1 − 2 = 12. This is the only way to reach a mean of 5. Thus, the number of degrees of freedom is 2 instead of 3, as we could only choose the first two 𝑥-values.
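The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function name `last_value_given_mean` is hypothetical and not part of any library. It shows that once the mean and all but one of the values are fixed, the remaining value is fully determined.

```python
def last_value_given_mean(free_values, mean):
    """Return the one value that is forced once the mean and the other values are fixed.

    Hypothetical helper for illustration only.
    """
    n = len(free_values) + 1          # total number of observations
    required_total = n * mean         # the sum that all n values must reach
    return required_total - sum(free_values)


# Two values are chosen freely; the third is not.
x3 = last_value_given_mean([1, 2], mean=5)
print(x3)  # 12
```

Changing either of the first two values changes the forced third value, but it never becomes a free choice, which is why only 2 degrees of freedom remain.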

Everyone has been confronted with the concept of degrees of freedom in their daily lives. For instance, consider a student who has three courses to study and is deciding which one to study on Monday, which on Tuesday, and which on Wednesday. The first course can be scheduled freely, and so can the second, but once those two courses are scheduled, the last course falls on the remaining day by default. Thus, the number of degrees of freedom in this case is two, not three. And if there is a constraint, for instance that the first course has to be scheduled on Monday because its exam is on Tuesday, then the number of degrees of freedom drops to one.

Getting back to statistics, this is why it is important to distinguish between the nominal sample size, i.e., the number of observations, and the effective sample size, i.e., the number of degrees of freedom. For instance, let’s say that you have a sample of 1,000 observations, but you also want to estimate the effect of 999 variables on the outcome. In this case, your nominal sample size is 1,000, but the effective sample size is 1 (1,000 observations minus the 999 parameters). Hence, your model is virtually meaningless, as it is based on one effective observation.

Another common example is clustered data. Let’s say that you observe the test scores of 50 students in two classes of 25 students, with each class having a different teacher. It is likely that the test scores of students within a class are correlated, as they have the same teacher, but that they are uncorrelated across classes, as students in different classes have different teachers. We say that there are two clusters. Although the nominal sample size is 50, the effective sample size is much smaller, and in the extreme case where scores within a class are perfectly correlated, it is 2. This is because once you have the test score of one student in class 1, a second student from the same class gives you little additional information, as their test scores are correlated. It is only when you also observe a student from class 2 that you receive genuinely new information about test scores.
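Both examples can be put into numbers with a short Python sketch. Note the assumptions: the function names are hypothetical, and the clustered case uses the standard design-effect approximation n / (1 + (m − 1)ρ) for equal-sized clusters, where m is the cluster size and ρ is the within-cluster (intraclass) correlation. That formula is not given in the text above; it is brought in here only to show how the effective sample size moves between the nominal 50 and the limiting value of 2.

```python
def residual_df(n_observations, n_parameters):
    """Degrees of freedom left after estimating n_parameters from n_observations."""
    return n_observations - n_parameters


def effective_sample_size(n_observations, cluster_size, icc):
    """Approximate effective sample size for equal-sized clusters.

    Assumes the design-effect approximation n / (1 + (m - 1) * icc),
    where icc is the within-cluster correlation (an assumption of this sketch).
    """
    design_effect = 1 + (cluster_size - 1) * icc
    return n_observations / design_effect


# 1,000 observations and 999 estimated parameters leave one degree of freedom.
print(residual_df(1000, 999))               # 1

# 50 students in two classes of 25.
print(effective_sample_size(50, 25, 1.0))   # 2.0  (perfectly correlated within class)
print(effective_sample_size(50, 25, 0.0))   # 50.0 (no within-class correlation)
```

Under this approximation, any within-class correlation between 0 and 1 gives an effective sample size somewhere between 50 and 2, so the two-cluster figure is best read as the worst case.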

In sum, the degrees of freedom are an important tool for distinguishing between the total number of observations and the number of observations that are effectively used to calculate various useful statistics.