Assumptions Under Which The OLS Estimators are BLUE

- Ordinary Least Squares (OLS) Linear Regression
- Gauss-Markov Theorem
- Assumptions Underlying BLUE Estimator
- Reformulations of Gauss-Markov Assumptions
- Modern Formulation
- OLS Estimator and Bias
- Recent Developments
An Ordinary Least Squares (OLS) linear regression is one of the most common and widely used techniques in empirical research. In a nutshell, an OLS linear regression relates an outcome Y to one or more independent variables X by minimizing the sum of squared errors. The errors are the differences between the observed outcomes and the outcomes predicted using X.
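As a minimal sketch of what this means in practice, the following Python snippet fits a line by OLS using the normal equations and checks the result against numpy's least-squares routine. The data, parameter values, and variable names are made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated example data: X is an independent variable, Y the outcome.
n = 100
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=n)   # assumed "true" model, for illustration only

# Design matrix with a column of ones for the intercept.
Z = np.column_stack([np.ones(n), X])

# OLS coefficients: the values that minimize the sum of squared errors,
# obtained here from the normal equations (Z'Z) b = Z'Y.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)

# The same estimates via numpy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(Z, Y, rcond=None)

residuals = Y - Z @ beta_hat          # observed minus predicted outcomes
sse = np.sum(residuals ** 2)          # the quantity OLS minimizes

print("OLS estimates (intercept, slope):", beta_hat)
print("Check against lstsq:", beta_lstsq)
print("Sum of squared errors:", sse)
```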
The reason why the OLS linear regression has had such a large impact on research likely lies in its simplicity and its desirable properties. Namely, the Gauss-Markov theorem states that—under certain assumptions—the OLS estimator is BLUE: the Best Linear Unbiased Estimator, that is, the linear unbiased estimator with the lowest sampling variance. In other words, the expected value of the OLS estimates you would obtain across repeated samples equals the true population parameters (unbiased). What is more, the OLS estimates also vary less from sample to sample than those of any other linear unbiased estimator (best).
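To make the 'unbiased' part concrete, here is a hedged simulation sketch: we repeatedly draw samples from a model with known parameters (chosen arbitrarily for the demonstration) and check that the OLS slope estimates average out to the true slope:

```python
import numpy as np

rng = np.random.default_rng(1)

true_intercept, true_slope = 1.0, 0.7   # assumed population parameters for the demo
n, n_samples = 50, 5000

# Keep the X-values fixed across repeated samples, as the classical setup assumes.
X = rng.uniform(0, 5, size=n)
Z = np.column_stack([np.ones(n), X])

slope_estimates = np.empty(n_samples)
for s in range(n_samples):
    eps = rng.normal(0, 1, size=n)                      # well-behaved errors
    Y = true_intercept + true_slope * X + eps
    slope_estimates[s] = np.linalg.solve(Z.T @ Z, Z.T @ Y)[1]

# Unbiasedness: averaged over many samples, the estimates center on the true slope.
print("Mean OLS slope across samples:", slope_estimates.mean())   # ≈ 0.7
print("Variance of the slope estimates:", slope_estimates.var())
```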
So what are the assumptions under which the OLS estimator is BLUE? This is not an easy question to answer because there have been many reformulations and extensions of the Gauss-Markov theorem over the years. In addition, many assumptions have different names in different disciplines but actually mean the same thing. In what follows, we will try to provide an exhaustive list of the assumptions, although the exposition, including the original notation, will be simplified for the general reader.
Let’s start with the earliest formulation by Carl Friedrich Gauss in 1821. Gauss’ paper was written in Latin, but was later translated to French by Joseph Bertrand in 1855. Gauss considered a linear additive model of the form:
Y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + … + β_k X_{ki} + ε_i
Where Y_i is an outcome for individual i, e.g., earnings, sleep, or health, and X_{1i}, X_{2i}, …, X_{ki} are the independent variables of interest, e.g., education, gender, or ethnicity. The error ε_i can be seen as measurement error, but also as any variable that we do not observe in our data and thus cannot include as an X in our model, e.g., motivation, ability, or personality, which are typically very hard to measure. The parameters we need to estimate are β_0, β_1, β_2, …, β_k. All β-parameters apart from the intercept β_0 give the effect of the corresponding independent variable on the outcome. Gauss showed that the OLS estimator of the unknown β-parameters is BLUE under the following conditions:
- Linearity in parameters. Technically, this is not an assumption Gauss made explicitly, but he did start from the above model, which is indeed linear in parameters. Another formulation of this assumption that means the same thing—but that is difficult to explain without matrix algebra—is that the estimator of the β-parameters is a linear function of Y. What this assumption means is that the β-parameters enter the model linearly: they cannot be squared, cubed, or in logarithmic form, they cannot be multiplied with each other, and so on. This, however, does not mean that the X-variables have to be linear. For instance, the following models are linear in the parameters, but not linear in X:
Y_i = β_0 + β_1 X_{1i} + β_2 X_{2i}^2 + β_3 X_{3i}^3 + ε_i
Y_i = β_0 + β_1 ln X_{1i} + β_2 ln X_{2i} + ε_i
Y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + β_3 X_{1i} X_{2i} + ε_i
This model, however, is neither linear in the parameters nor in X:
Y_i = β_0 X_{1i}^{β_1} X_{2i}^{β_2} ε_i
Note, however, that we can make the above model linear in the parameters, by taking the logarithm of both sides. We then get:
ln Y_i = ln β_0 + β_1 ln X_{1i} + β_2 ln X_{2i} + ln ε_i
This model is still nonlinear in the parameter β_0, but we can define new parameters α_0 = ln β_0, α_1 = β_1, and α_2 = β_2. We then get:
ln Y_i = α_0 + α_1 ln X_{1i} + α_2 ln X_{2i} + ln ε_i
It should be noted, however, that the parameters that minimize the sum of squared errors of the transformed equation do not necessarily minimize the sum of squared errors of the original equation.
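As a hedged illustration of this log-transformation approach, the following Python sketch generates data from a multiplicative model like the one above (all parameter values are made-up assumptions) and estimates the transformed, parameter-linear equation by OLS:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200
X1 = rng.uniform(1, 10, size=n)
X2 = rng.uniform(1, 10, size=n)

# Assumed multiplicative data-generating process: Y = b0 * X1^b1 * X2^b2 * error.
b0, b1, b2 = 2.0, 0.8, -0.3
error = np.exp(rng.normal(0, 0.1, size=n))      # positive, multiplicative error
Y = b0 * X1**b1 * X2**b2 * error

# Taking logs makes the model linear in the parameters,
# so it can be estimated by OLS on ln(Y), ln(X1), ln(X2).
Z = np.column_stack([np.ones(n), np.log(X1), np.log(X2)])
alpha = np.linalg.solve(Z.T @ Z, Z.T @ np.log(Y))

print("alpha_0 (= ln b0):", alpha[0], "vs", np.log(b0))
print("alpha_1 (= b1):   ", alpha[1], "vs", b1)
print("alpha_2 (= b2):   ", alpha[2], "vs", b2)
```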
- X-variables are fixed in repeated samples. As above, this is not an assumption Gauss made explicitly, but his formulation indirectly suggests it. It means that the X-variables are non-random and observable (by extension, the β-parameters are non-random but unobservable). We can choose their values, and we can set them to be equal in each sample we take. For instance, if we conduct an experiment on the effect of gym training on health, we can divide people into a group that trains 2 hours (X = 2), 1 hour (X = 1), 30 minutes (X = 0.5), and no hours in the gym (X = 0), and we can form these same four groups in each sample of people we take. The errors, on the other hand, are random and unobservable.
- The errors are independently and identically distributed with an expected value of zero and a constant variance. Mathematically, this is expressed as ε_i ~ i.i.d.(0, σ²). This assumption is the most confusing one, because it is often shortened to ‘independence’, which causes the rest of the assumption to be erroneously forgotten. Essentially, it says that the error terms have an expected value of zero, a constant and finite variance, and are uncorrelated with one another, both in the cross section (at one point in time) and over time. It therefore covers three separate assumptions (although, mathematically, the i.i.d. assumption is stronger than these three assumptions combined):
- The expected value of the error term in the population is zero. Mathematically, this can be expressed as E(ε_i) = 0. Intuitively, this means that the variables we do not observe, and that are therefore in the error term, e.g., motivation or personality, may affect our outcome, but the effects of all these unobserved variables counteract each other so that their mean effect is zero.
- Homoscedasticity. Mathematically, Var(ε_i) = σ² for all i, with σ² < ∞. Intuitively, this means that the variance of the errors is constant across observations at one point in time and finite (not infinite). If you take different values of X, the spread of Y around the regression line does not change. If this is not the case, the assumption is violated and we speak of heteroscedasticity. For instance, consider the effect of education (X) on earnings (Y). People with only high school are likely to vary a lot in earnings: some may be working for a minimum wage, while others are successful entrepreneurs. People with a Ph.D., on the other hand, are not as likely to vary in earnings, as most of them will be earning high wages; it is unlikely for workers with a Ph.D. to be unemployed or to earn the minimum wage. Hence, the variance will be higher for low levels of education than for high levels of education, violating the homoscedasticity assumption (a small simulation sketch after this list illustrates such heteroscedastic errors).
- No autocorrelation. Also referred to as ‘no serial correlation’. Mathematically, Cov(ε_i, ε_j) = 0 for i ≠ j. Intuitively, this means that the errors are uncorrelated across observations, and in particular over time: the past value of an outcome's error should not provide information about its current value. For instance, there is likely to be autocorrelation in earnings, as an individual is unlikely to change their earnings much over a short interval; in that case, the assumption of no autocorrelation would be violated. An example of no autocorrelation and of autocorrelation is plotted in the figures below.
[Figures: left panel, errors with no autocorrelation; right panel, autocorrelated errors]
- Normally distributed error terms. The errors thus follow a bell-shaped curve. Intuitively, this means that most errors cluster toward the middle of the range of values, whereas the rest taper off symmetrically toward the extremes, so that about 68.2% of the errors fall within one standard deviation of the mean and about 95.4% within two standard deviations.
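As referenced in the homoscedasticity item above, here is a rough simulation sketch of a violation of that assumption. The error variance is made to shrink as X grows, loosely mimicking the education and earnings example; all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 2000
X = rng.uniform(0, 10, size=n)

# Heteroscedastic errors: the standard deviation shrinks as X grows.
eps = rng.normal(0, 1, size=n) * (3.0 - 0.25 * X)
Y = 1.0 + 0.5 * X + eps

Z = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
residuals = Y - Z @ beta_hat

low_X = residuals[X < 3]
high_X = residuals[X > 7]
print("Residual variance for low X: ", low_X.var())   # noticeably larger
print("Residual variance for high X:", high_X.var())  # noticeably smaller
```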
An attentive reader will notice that the theorem is not called the Gauss theorem, but the Gauss-Markov theorem. This is because in 1900, Andrei Andreyevich Markov showed that the assumption of normality—and also the strict i.i.d. assumption—are actually not necessary: even without normally distributed errors, the OLS estimator remains BLUE. Moreover, thanks to the so-called Central Limit Theorem, the usual inference based on the OLS estimates remains approximately valid as long as the sample size is sufficiently large (a short simulation sketch after the list below illustrates this). What ‘sufficiently large’ means is unclear, but it is common to consider a sample size of at least 30 observations as sufficiently large. In very small samples, or to make sure that the OLS estimator is equivalent to another very common estimator, the Maximum Likelihood Estimator, the normality assumption is still necessary. Thus, Markov reformulated the assumptions under which the OLS estimator is BLUE. These assumptions are as follows:
- Linearity in parameters.
- X-variables are fixed in repeated samples.
- The expected value of the error term in the population is zero.
- Homoscedasticity.
- No autocorrelation.
As above, the first two assumptions were not explicitly mentioned by Markov, but his formulation indirectly suggests them. Note that the Gauss-Markov assumptions correspond to the Gauss assumptions, with the exception of the normality assumption, which has been relaxed.
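As referenced above, here is a rough simulation sketch of what dropping the normality assumption looks like. The parameter values and the choice of a skewed (exponential) error distribution are illustrative assumptions, not taken from the original sources:

```python
import numpy as np

rng = np.random.default_rng(4)

true_slope = 0.7
n, n_samples = 100, 10000

X = rng.uniform(0, 5, size=n)              # X-values held fixed across samples
Z = np.column_stack([np.ones(n), X])

slopes = np.empty(n_samples)
for s in range(n_samples):
    eps = rng.exponential(1.0, size=n) - 1.0   # skewed, clearly non-normal errors with mean zero
    Y = 1.0 + true_slope * X + eps
    slopes[s] = np.linalg.solve(Z.T @ Z, Z.T @ Y)[1]

mean, sd = slopes.mean(), slopes.std()
print("Mean slope:", mean)                                             # still ≈ 0.7 without normal errors
print("Share within 1 sd:", np.mean(np.abs(slopes - mean) < sd))       # ≈ 0.68
print("Share within 2 sd:", np.mean(np.abs(slopes - mean) < 2 * sd))   # ≈ 0.95
```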
In modern textbooks, which typically state the theorem in matrix algebra, the Gauss-Markov assumptions are usually reformulated as follows:
- Linearity in parameters.
- Strict exogeneity. Mathematically, E(ε_i | X_{1i}, …, X_{ki}) = 0. This is essentially an extension of the assumption that the expected value of the error term in the population is zero, namely E(ε_i) = 0. It should be noted, however, that assuming that the unconditional expected value of the error term is zero is weaker than assuming that its expected value conditional on the X-variables is zero. Namely, if E(ε_i | X_{1i}, …, X_{ki}) = 0, then by the law of iterated expectations E(ε_i) = E[E(ε_i | X_{1i}, …, X_{ki})] = 0, but the reverse does not hold.
- Full rank. The name of this assumption comes from matrix algebra. Essentially, it assumes that there is no perfect multicollinearity: no X-variable can be written as an exact linear combination of the other X-variables (for instance, two X-variables with a correlation of exactly 1 or −1). Note that only perfect multicollinearity violates this assumption; the X-variables can be correlated as long as the correlation between them is not perfect.
- Spherical errors. Again, the name of this assumption comes from matrix algebra. Essentially, this assumption means that the errors should be homoscedastic and that there should be no autocorrelation.
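In matrix notation, and writing σ² for the common error variance and I_n for the n × n identity matrix (notation we introduce here for illustration), the spherical-errors assumption can be stated compactly as:

```latex
\operatorname{Var}(\varepsilon \mid X) \;=\; \sigma^2 I_n \;=\;
\begin{pmatrix}
\sigma^2 & 0        & \cdots & 0      \\
0        & \sigma^2 & \cdots & 0      \\
\vdots   & \vdots   & \ddots & \vdots \\
0        & 0        & \cdots & \sigma^2
\end{pmatrix}
```

Here the equal diagonal entries correspond to homoscedasticity, and the zero off-diagonal entries correspond to no autocorrelation.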