Introduction to Linear Regression

Often, in practice, one is called upon to solve problems involving sets of variables when it is known that there exists some inherent relationship among the variables. For example, in an industrial situation it may be known that the tar content in the outlet stream in a chemical process is related to the inlet temperature. It may be of interest to develop a method of prediction, that is, a procedure for estimating the tar content for various levels of the inlet temperature from experimental information.

Now, of course, it is highly likely that for many experimental runs in which the inlet temperature is the same, the outlet tar content will not be the same. This is much like what happens when we study several automobiles with the same engine volume: they will not all have the same gas mileage. Houses in the same part of the country that have the same square footage of living space will not all sell for the same price.

Tar content, gas mileage, and the price of houses (in thousands of dollars) are natural dependent variables, or responses, in these three scenarios. Inlet temperature, engine volume (cubic feet), and square feet of living space are, respectively, natural independent variables, or regressors.

A reasonable form of a relationship between the response $Y$ and the regressor $x$ is the linear relationship

$$Y = \beta_0 + \beta_1 x,$$

where, of course, $\beta_0$ is the intercept and $\beta_1$ is the slope. The relationship is illustrated in the following figure:

Figure 8.1: A linear relationship; $\beta_0$: intercept; $\beta_1$: slope. (Walpole et al., 2017).

If the relationship is exact, then it is a deterministic relationship between two scientific variables and there is no random or probabilistic component to it. However, in the examples listed above, as well as in countless other scientific and engineering phenomena, the relationship is not deterministic (i.e., a given $x$ does not always give the same value for $Y$). As a result, important problems here are probabilistic in nature, since the relationship above cannot be viewed as being exact.

The concept of regression analysis deals with finding the best relationship between $Y$ and $x$, quantifying the strength of that relationship, and using methods that allow for prediction of the response values $Y$ given values of the regressor $x$.

In many applications, there will be more than one regressor (i.e., more than one independent variable that helps to explain $Y$). For example, in the case where the response is the price of a house, one would expect the age of the house to contribute to the explanation of the price, so in this case the multiple regression structure might be written

$$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2,$$

where $Y$ is price, $x_1$ is square footage, and $x_2$ is age in years.

As a second illustration of multiple regression, a chemical engineer may be concerned with the amount of hydrogen lost from samples of a particular metal when the material is placed in storage. In this case, there may be two inputs, storage time in hours and storage temperature in degrees centigrade. The response would then be hydrogen loss in parts per million.

In this chapter, we deal with the topic of simple linear regression, treating only the case of a single regressor variable $x$ in which the relationship between $y$ and $x$ is linear.

Denote a random sample of size $n$ by the set $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$. If additional samples were taken using exactly the same values of $x$, we should expect the $y$ values to vary. Hence, the value $y_i$ in the ordered pair $(x_i, y_i)$ is a value of some random variable $Y_i$.

The Simple Linear Regression (SLR) Model

We have already confined the terminology regression analysis to situations in which relationships among variables are not deterministic (i.e., not exact). In other words, there must be a random component to the equation that relates the variables.

This random component takes into account considerations that are not being measured or, in fact, are not understood by the scientists or engineers. Indeed, in most applications of regression, the linear equation, say $Y = \beta_0 + \beta_1 x$, is an approximation that is a simplification of something unknown and much more complicated.

For example, in our illustration involving the response tar content and inlet temperature, $\mu_{Y|x} = \beta_0 + \beta_1 x$ is likely a reasonable approximation that may be operative within a confined range on $x$. More often than not, the models that are simplifications of more complicated and unknown structures are linear in nature (i.e., linear in the parameters $\beta_0$ and $\beta_1$ or, in the case of the model involving the price, size, and age of the house, linear in the parameters $\beta_0$, $\beta_1$, and $\beta_2$). These linear structures are simple and empirical in nature and are thus called empirical models.

Statistical Model Definition

An analysis of the relationship between $Y$ and $x$ requires the statement of a statistical model. A model is often used by a statistician as a representation of an ideal that essentially defines how we perceive that the data were generated by the system in question. The model must include the set $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$ of data involving $n$ pairs of $(x, y)$ values.

One must bear in mind that the value $y_i$ depends on $x_i$ via a linear structure that also has the random component involved. The basis for the use of a statistical model relates to how the random variable $Y$ moves with $x$ and the random component. The model also includes what is assumed about the statistical properties of the random component.

Definition:

The response $Y$ is related to the independent variable $x$ through the equation

$$Y = \beta_0 + \beta_1 x + \varepsilon.$$

In the above, $\beta_0$ and $\beta_1$ are unknown intercept and slope parameters, respectively, and $\varepsilon$ is a random variable that is assumed to be distributed with $E(\varepsilon) = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. The quantity $\sigma^2$ is often called the error variance or residual variance.

From the model above, several things become apparent:

  1. The quantity $Y$ is a random variable since $\varepsilon$ is random.
  2. The value $x$ of the regressor variable is not random and, in fact, is measured with negligible error.
  3. The quantity $\varepsilon$, often called a random error or random disturbance, has constant variance $\sigma^2$. This portion of the assumptions is often called the homogeneous variance assumption.
  4. The presence of this random error, $\varepsilon$, keeps the model from becoming simply a deterministic equation.

Now, the fact that $E(\varepsilon) = 0$ implies that at a specific $x$ the $y$-values are distributed around the true, or population, regression line $y = \beta_0 + \beta_1 x$. If the model is well chosen (i.e., there are no additional important regressors and the linear approximation is good within the ranges of the data), then positive and negative errors around the true regression are reasonable.
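
To make this data-generating view concrete, here is a minimal simulation sketch (not from the text; the parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1.5$ are assumed purely for illustration) showing responses scattering around a true regression line:

```python
# A minimal sketch: data generated from the hypothetical model
# Y = 2 + 0.5*x + eps, with eps ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(seed=1)

beta0, beta1, sigma = 2.0, 0.5, 1.5      # assumed "true" parameters
x = np.linspace(0, 20, 25)               # fixed regressor values
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)   # random errors
y = beta0 + beta1 * x + eps              # observed responses

# At each x the mean of Y is beta0 + beta1*x; the observed y deviates by eps.
print("first few (x, E[Y|x], observed y):")
for xi, yi in zip(x[:5], y[:5]):
    print(f"x = {xi:5.2f}   E[Y|x] = {beta0 + beta1*xi:5.2f}   y = {yi:5.2f}")
```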

Important Note:

We must keep in mind that in practice $\beta_0$ and $\beta_1$ are not known and must be estimated from data. In addition, the model described above is conceptual in nature. As a result, we never observe the actual error values $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ in practice and thus we can never draw the true regression line (but we assume it is there). We can only draw an estimated line.

The following figure depicts the nature of hypothetical $(x, y)$ data scattered around a true regression line for a case in which only a small number of observations are available:

Figure 8.4: Hypothetical $(x, y)$ data scattered around the true regression line. (Walpole et al., 2017).

Let us emphasize that what we see in this figure is not the line that is used by the scientist or engineer. Rather, the picture merely describes what the assumptions mean! The regression that the user has at his or her disposal will now be described.

The Fitted Regression Line

An important aspect of regression analysis is, very simply, to estimate the parameters $\beta_0$ and $\beta_1$ (i.e., estimate the so-called regression coefficients). The method of estimation will be discussed in the next section. Suppose we denote the estimates $b_0$ for $\beta_0$ and $b_1$ for $\beta_1$. Then the estimated or fitted regression line is given by

$$\hat{y} = b_0 + b_1 x,$$

where $\hat{y}$ is the predicted or fitted value. Obviously, the fitted line is an estimate of the true regression line. We expect that the fitted line should be closer to the true regression line when a large amount of data are available.

In the following example, we illustrate the fitted line for a real-life pollution study. One of the more challenging problems confronting the water pollution control field is presented by the tanning industry. Tannery wastes are chemically complex. They are characterized by high values of chemical oxygen demand, volatile solids, and other pollution measures.

Consider the experimental data in the following table, which were obtained from 33 samples of chemically treated waste in a study conducted at Virginia Tech. Readings on $x$, the percent reduction in total solids, and $y$, the percent reduction in chemical oxygen demand, were recorded.

| Solids Reduction, $x$ (%) | Oxygen Demand Reduction, $y$ (%) | Solids Reduction, $x$ (%) | Oxygen Demand Reduction, $y$ (%) |
|---|---|---|---|
| 3 | 5 | 36 | 34 |
| 7 | 11 | 37 | 36 |
| 11 | 21 | 38 | 38 |
| 15 | 16 | 39 | 37 |
| 18 | 16 | 39 | 36 |
| 27 | 28 | 39 | 45 |
| 29 | 27 | 40 | 39 |
| 30 | 25 | 41 | 41 |
| 30 | 35 | 42 | 40 |
| 31 | 30 | 42 | 44 |
| 31 | 40 | 43 | 37 |
| 32 | 32 | 44 | 44 |
| 33 | 34 | 45 | 46 |
| 33 | 32 | 46 | 46 |
| 34 | 34 | 47 | 49 |
| 36 | 37 | 50 | 51 |
| 36 | 38 |  |  |

Figure 8.5: Measures of Reduction in Solids and Oxygen Demand

The data from this table are plotted in a scatter diagram in the following figure:

Figure 8.6: Scatter diagram with regression lines. (Walpole et al., 2017).

From an inspection of this scatter diagram, it can be seen that the points closely follow a straight line, indicating that the assumption of linearity between the two variables appears to be reasonable.

The fitted regression line and a hypothetical true regression line are shown on the scatter diagram.

Another Look at the Model Assumptions

It may be instructive to revisit the simple linear regression model presented previously and discuss in a graphical sense how it relates to the so-called true regression. Let us expand on Figure 8.1 by illustrating not merely where the $(x_i, y_i)$ fall on a graph but also what the implication is of the normality assumption on the $\varepsilon_i$.

Suppose we have a simple linear regression with evenly spaced values of $x$ and a single $y$-value at each $x$. Consider the graph in the following figure:

Individual observations around the true regression line. (Walpole et al., 2017).

This illustration should give the reader a clear representation of the model and the assumptions involved. The line in the graph is the true regression line. The points plotted are actual $(x, y)$ points which are scattered about the line. Each point is on its own normal distribution, with the center of the distribution (i.e., the mean of $Y$) falling on the line.

This is certainly expected since $E(Y) = \beta_0 + \beta_1 x$. As a result, the true regression line goes through the means of the response, and the actual observations are on the distribution around the means. Note also that all distributions have the same variance, which we referred to as $\sigma^2$.

Of course, the deviation between an individual $y$ and the point on the line will be its individual $\varepsilon$ value. This is clear since

$$y_i - \mu_{Y|x_i} = y_i - (\beta_0 + \beta_1 x_i) = \varepsilon_i.$$

Thus, at a given $x$, $Y$ and the corresponding $\varepsilon$ both have variance $\sigma^2$.

Note:

We have written the true regression line here as $\mu_{Y|x} = \beta_0 + \beta_1 x$ in order to reaffirm that the line goes through the mean of the random variable $Y$.

Least Squares and the Fitted Model

In this section, we discuss the method of fitting an estimated regression line to the data. This is tantamount to the determination of the estimates $b_0$ for $\beta_0$ and $b_1$ for $\beta_1$. This of course allows for the computation of predicted values from the fitted line and other types of analyses and diagnostic information that will ascertain the strength of the relationship and the adequacy of the fitted model.

Before we discuss the method of least squares estimation, it is important to introduce the concept of a residual. A residual is essentially an error in the fit of the model $\hat{y} = b_0 + b_1 x$.

Residual: Error in Fit

Given a set of regression data $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$ and a fitted model, $\hat{y}_i = b_0 + b_1 x_i$, the $i$-th residual $e_i$ is given by

Definition:

$$e_i = y_i - \hat{y}_i, \qquad i = 1, 2, \ldots, n.$$

Obviously, if a set of $n$ residuals is large, then the fit of the model is not good. Small residuals are a sign of a good fit. Another interesting relationship, which is useful at times, is

$$y_i = b_0 + b_1 x_i + e_i.$$

The use of the above equation should result in clarification of the distinction between the residuals, $e_i$, and the conceptual model errors, $\varepsilon_i$. One must bear in mind that whereas the $\varepsilon_i$ are not observed, the $e_i$ not only are observed but also play an important role in the total analysis.

The following figure depicts the line fit to this set of data, namely $\hat{y} = b_0 + b_1 x$, and the line reflecting the model $\mu_{Y|x} = \beta_0 + \beta_1 x$. Now, of course, $\beta_0$ and $\beta_1$ are unknown parameters. The fitted line is an estimate of the line produced by the statistical model. Keep in mind that the line $\mu_{Y|x} = \beta_0 + \beta_1 x$ is not known.

Comparing $\varepsilon_i$ with the residual, $e_i$. (Walpole et al., 2017).

The Method of Least Squares

We shall find $b_0$ and $b_1$, the estimates of $\beta_0$ and $\beta_1$, so that the sum of the squares of the residuals is a minimum. The residual sum of squares is often called the sum of squares of the errors about the regression line and is denoted by $SSE$. This minimization procedure for estimating the parameters is called the method of least squares.

Hence, we shall find $b_0$ and $b_1$ so as to minimize

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2.$$

Differentiating $SSE$ with respect to $b_0$ and $b_1$, we have

$$\frac{\partial (SSE)}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i), \qquad \frac{\partial (SSE)}{\partial b_1} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)\, x_i.$$

Setting the partial derivatives equal to zero and rearranging the terms, we obtain the equations (called the normal equations)

$$n b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i,$$

which may be solved simultaneously to yield computing formulas for $b_0$ and $b_1$.
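
As a quick aside, the normal equations are just a $2 \times 2$ linear system in $b_0$ and $b_1$; a minimal sketch (with a made-up data set) that solves them numerically:

```python
# Sketch: the two normal equations form a 2x2 linear system A @ [b0, b1] = c,
# which can be solved directly. The data set is made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])
n = x.size

A = np.array([[n,        x.sum()],
              [x.sum(),  (x**2).sum()]])
c = np.array([y.sum(), (x * y).sum()])

b0, b1 = np.linalg.solve(A, c)   # simultaneous solution of the normal equations
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```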

Estimating the Regression Coefficients

Formula: Least Squares Estimators:

Given the sample $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$, the least squares estimates $b_0$ and $b_1$ of the regression coefficients $\beta_0$ and $\beta_1$ are computed from the formulas

$$b_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and

$$b_0 = \frac{\sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i}{n} = \bar{y} - b_1 \bar{x}.$$

The calculations of $b_0$ and $b_1$, using the data from the pollution study, are illustrated by the following example.

Example: Estimate the regression line for the pollution data

Estimate the regression line for the pollution data from the table presented earlier.

Solution:
From the data with $n = 33$ observations:

$$\sum_{i=1}^{33} x_i = 1104, \qquad \sum_{i=1}^{33} y_i = 1124, \qquad \sum_{i=1}^{33} x_i y_i = 41{,}355, \qquad \sum_{i=1}^{33} x_i^2 = 41{,}086.$$

Therefore,

$$b_1 = \frac{(33)(41{,}355) - (1104)(1124)}{(33)(41{,}086) - (1104)^2} = 0.903643$$

and

$$b_0 = \frac{1124 - (0.903643)(1104)}{33} = 3.829633.$$

Thus, the estimated regression line is given by

$$\hat{y} = 3.8296 + 0.9036\, x.$$

Using the regression line from this example, we would predict approximately a 31% reduction in the chemical oxygen demand when the reduction in the total solids is 30%, since $\hat{y} = 3.8296 + 0.9036(30) \approx 30.9$. This value may be interpreted as an estimate of the population mean $\mu_{Y|30}$ or as an estimate of a new observation when the reduction in total solids is 30%.

Such estimates, however, are subject to error. Even if the experiment were controlled so that the reduction in total solids was 30%, it is unlikely that we would measure a reduction in the chemical oxygen demand exactly equal to 31%. In fact, the original data show that measurements of 25% and 35% were recorded for the reduction in oxygen demand when the reduction in total solids was kept at 30%.
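
The same estimates can be verified numerically. A short sketch, using the $(x, y)$ pairs as transcribed from the table above together with the computing formulas for $b_1$ and $b_0$:

```python
# Sketch: least squares fit of the chemical oxygen demand data using the
# computing formulas above. The (x, y) pairs are as transcribed from the table.
import numpy as np

x = np.array([ 3,  7, 11, 15, 18, 27, 29, 30, 30, 31, 31, 32, 33, 33, 34, 36, 36,
              36, 37, 38, 39, 39, 39, 40, 41, 42, 42, 43, 44, 45, 46, 47, 50], dtype=float)
y = np.array([ 5, 11, 21, 16, 16, 28, 27, 25, 35, 30, 40, 32, 34, 32, 34, 34, 37,
              38, 36, 38, 37, 36, 45, 39, 41, 40, 44, 37, 44, 46, 46, 49, 51], dtype=float)
n = x.size

b1 = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
b0 = y.mean() - b1 * x.mean()
print(f"n = {n}, fitted line: y-hat = {b0:.4f} + {b1:.4f} x")
print(f"prediction at x = 30: {b0 + b1*30:.2f}")
```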

Properties of the Least Squares Estimators

Model With Only Y (Before X Enters)

Before considering the regressor $x$ to explain $Y$, we start with the simplest model, in which $Y$ varies purely due to random fluctuations. In this case, our estimate of the mean response is simply the average of the $y$ values:

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

This serves as our alternative to any model involving $x$ (or several $x$'s in multiple regression). When we introduce a regressor, we're essentially asking: "Can $x$ explain the variation in $Y$ better than just using the overall mean $\bar{y}$?"

Distributional Properties of the Estimators

In addition to the assumptions that the error term in the model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

is a random variable with mean $0$ and constant variance $\sigma^2$, suppose that we make the further assumption that $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent from run to run in the experiment.

The values $b_0$ and $b_1$, based on a given sample of $n$ observations, are only estimates of the true parameters $\beta_0$ and $\beta_1$. If the experiment is repeated over and over again, each time using the same fixed values of $x$, the resulting estimates will most likely differ from experiment to experiment.

The distributional assumptions imply that the $Y_i$, $i = 1, 2, \ldots, n$, are also independently distributed, with mean $\mu_{Y|x_i} = \beta_0 + \beta_1 x_i$ and equal variances $\sigma^2$.

The least squares estimators have the following properties:
For the slope estimator $b_1$:

  • Mean: $E(b_1) = \beta_1$ (unbiased)
  • Variance: $\operatorname{Var}(b_1) = \dfrac{\sigma^2}{S_{xx}}$, where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • Distribution: normal, $b_1 \sim N\!\left(\beta_1,\ \sigma^2 / S_{xx}\right)$

For the intercept estimator $b_0$:

  • Mean: $E(b_0) = \beta_0$ (unbiased)
  • Distribution: normal, with $\operatorname{Var}(b_0) = \dfrac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n\, S_{xx}}$

The point $(\bar{x}, \bar{y})$ is always on the fitted regression line.

The accuracy of these estimators depends on:

  1. The error variance $\sigma^2$ - smaller is better
  2. The spread of the $x$ values - a larger spread (larger $S_{xx}$) gives a more accurate slope estimate
  3. The sample size $n$ - a larger sample gives better estimates
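
These distributional properties can be checked empirically. A simulation sketch (the "true" values $\beta_0 = 3$, $\beta_1 = 0.9$, $\sigma = 3$ are assumed purely for illustration) that repeats the experiment many times with the same fixed $x$ values:

```python
# Sketch: repeat the experiment many times and check that b1 averages to beta1
# and that its variance is close to sigma^2 / Sxx.
import numpy as np

rng = np.random.default_rng(seed=2)
beta0, beta1, sigma = 3.0, 0.9, 3.0          # assumed true parameters
x = np.linspace(5, 50, 30)                   # fixed regressor values
Sxx = np.sum((x - x.mean())**2)

b1_estimates = []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b1_estimates.append(b1)

b1_estimates = np.array(b1_estimates)
print(f"mean of b1 estimates: {b1_estimates.mean():.4f}  (true beta1 = {beta1})")
print(f"variance of b1:       {b1_estimates.var():.6f}  (sigma^2/Sxx = {sigma**2/Sxx:.6f})")
```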

To draw inferences on $\beta_0$ and $\beta_1$, we need an estimate of $\sigma^2$. Using the notation

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$

we have the following.

Theorem:

An unbiased estimate of $\sigma^2$ is

$$s^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{S_{yy} - b_1 S_{xy}}{n-2}.$$

The quantity $s^2$ is called the mean squared error (MSE). The divisor $n - 2$ represents the degrees of freedom (we subtract 2 because we estimated two parameters: $\beta_0$ and $\beta_1$).

Inferences Concerning the Regression Coefficients

We can test hypotheses regarding $\beta_0$ and $\beta_1$ and construct confidence intervals. Assuming normality of the errors, the statistic

$$T = \frac{b_1 - \beta_1}{s / \sqrt{S_{xx}}}$$

follows a $t$-distribution with $n - 2$ degrees of freedom. This statistic can be used to construct a $100(1-\alpha)\%$ confidence interval for the coefficient $\beta_1$.

Formula: Confidence Interval for $\beta_1$

A $100(1-\alpha)\%$ confidence interval for the parameter $\beta_1$ in the regression line $\mu_{Y|x} = \beta_0 + \beta_1 x$ is

$$b_1 - t_{\alpha/2} \frac{s}{\sqrt{S_{xx}}} < \beta_1 < b_1 + t_{\alpha/2} \frac{s}{\sqrt{S_{xx}}},$$

where $t_{\alpha/2}$ is a value of the $t$-distribution with $n - 2$ degrees of freedom.
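
A minimal computational sketch of this interval (the small data set below is made up for illustration; scipy's $t$ quantile supplies $t_{\alpha/2}$):

```python
# Sketch: s^2 = SSE/(n-2) and a 95% confidence interval for beta1,
# computed on a small made-up data set.
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0])
y = np.array([3.1, 4.0, 5.8, 6.1, 7.9, 8.4, 10.2, 10.9])
n = x.size

Sxx = np.sum((x - x.mean())**2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Syy = np.sum((y - y.mean())**2)

b1 = Sxy / Sxx
s = np.sqrt((Syy - b1 * Sxy) / (n - 2))       # square root of the MSE

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # t_{alpha/2} for a 95% interval
half_width = t_crit * s / np.sqrt(Sxx)
print(f"b1 = {b1:.4f}")
print(f"95% CI for beta1: ({b1 - half_width:.4f}, {b1 + half_width:.4f})")
```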

Hypothesis Testing on the Slope

To test the null hypothesis $H_0\!: \beta_1 = \beta_{10}$ against a suitable alternative, we again use the $t$-distribution with $n - 2$ degrees of freedom to establish a critical region and then base our decision on the value of

$$t = \frac{b_1 - \beta_{10}}{s / \sqrt{S_{xx}}}.$$

One important $t$-test on the slope is the test of the hypothesis

$$H_0\!: \beta_1 = 0 \quad \text{versus} \quad H_1\!: \beta_1 \neq 0.$$

When the null hypothesis is not rejected, the conclusion is that there is no significant linear relationship between $E(Y)$ and the independent variable $x$. The plot of the data for the previous example would suggest that a linear relationship exists. However, in some applications in which $\sigma^2$ is large and thus considerable "noise" is present in the data, a plot, while useful, may not produce clear information for the researcher. Rejection of $H_0$ above implies that a significant linear regression exists.

The $t$-test for the hypothesis above is of the simpler form

$$t = \frac{b_1}{s / \sqrt{S_{xx}}} = \frac{b_1 \sqrt{S_{xx}}}{s}.$$
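
In practice this test is available directly from standard libraries. A sketch (again with made-up data) using scipy.stats.linregress, whose reported p-value corresponds to this two-sided test of a zero slope:

```python
# Sketch: the test of H0: beta1 = 0 via scipy's linregress, which reports the
# slope, its standard error s/sqrt(Sxx), and the two-sided p-value of the t-test.
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0])   # made-up data
y = np.array([3.1, 4.0, 5.8, 6.1, 7.9, 8.4, 10.2, 10.9])

res = stats.linregress(x, y)
t_stat = res.slope / res.stderr          # same as b1 / (s / sqrt(Sxx))
print(f"b1 = {res.slope:.4f}, se(b1) = {res.stderr:.4f}")
print(f"t = {t_stat:.2f}, p-value = {res.pvalue:.4g}")
```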

The failure to reject $H_0\!: \beta_1 = 0$ suggests that there is no linear relationship between $Y$ and $x$. The following figure is an illustration of the implication of this result. It may mean that changing $x$ has little impact on changes in $Y$, as seen in (a). However, it may also indicate that the true relationship is nonlinear, as indicated by (b).

The hypothesis $H_0\!: \beta_1 = 0$ is not rejected - scenarios where no linear relationship is evident. (Walpole et al., 2017).

When $H_0\!: \beta_1 = 0$ is rejected, there is an implication that the linear term in $x$ residing in the model explains a significant portion of variability in $Y$. The two plots in the following figure illustrate possible scenarios. As depicted in (a) of the figure, rejection of $H_0$ may suggest that the relationship is, indeed, linear. As indicated in (b), it may suggest that while the model does contain a linear effect, a better representation may be found by including a polynomial (perhaps quadratic) term (i.e., terms that supplement the linear term).

The hypothesis H₀: β₁ = 0 is rejected - scenarios showing linear relationships. (Walpole et al., 2017).

A Measure of Quality of Fit: Coefficient of Determination

$R^2$ is the proportion of the variation in $y$ explained by the model:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},$$

where:

  • $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = S_{yy}$ (Total Sum of Squares)
  • $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ (Regression Sum of Squares)
  • $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (Error Sum of Squares)

$SST$ represents the variation in the response values that ideally would be explained by the model. The value $SSE$ is the variation due to error, or variation unexplained. Clearly, if $SSE = 0$, all variation is explained. The quantity that represents variation explained is $SST - SSE = SSR$.

Note that if the fit is perfect, all residuals are zero, and thus $R^2 = 1$. But if $SSE$ is only slightly smaller than $SST$, then $R^2 \approx 0$.

Plots depicting a very good fit ($R^2 \approx 1$) and a poor fit ($R^2 \approx 0$). (Walpole et al., 2017).
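
A minimal sketch (made-up data) of the sum-of-squares decomposition and of $R^2$ computed both ways:

```python
# Sketch: computing SST, SSR, SSE and R^2 for a fitted line on a made-up data set.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.7, 3.9, 4.1, 5.2, 5.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean())**2)      # total variation in y
SSR = np.sum((y_hat - y.mean())**2)  # variation explained by the regression
SSE = np.sum((y - y_hat)**2)         # unexplained (residual) variation

print(f"SST = {SST:.4f}, SSR = {SSR:.4f}, SSE = {SSE:.4f}")
print(f"R^2 = SSR/SST = {SSR/SST:.4f}  (also 1 - SSE/SST = {1 - SSE/SST:.4f})")
```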

Pitfalls in the Use of

A large $R^2$ does NOT mean "the model is good".

Analysts quote values of $R^2$ quite often, perhaps due to its simplicity. However, there are pitfalls in its interpretation. The reliability of $R^2$ is a function of the size of the regression data set and the type of application. Clearly, $0 \le R^2 \le 1$, and the upper bound is achieved when the fit to the data is perfect (i.e., all of the residuals are zero). What is an acceptable value of $R^2$? This is a difficult question to answer. A chemist, charged with doing a linear calibration of a high-precision piece of equipment, certainly expects to experience a very high $R^2$-value (perhaps exceeding 0.99), while a behavioral scientist, dealing in data impacted by variability in human behavior, may feel fortunate to experience an $R^2$ as large as 0.70. An experienced model fitter senses when a value is large enough, given the situation confronted. Clearly, some scientific phenomena lend themselves to modeling with more precision than others.

The $R^2$ criterion is dangerous to use for comparing competing models for the same data set. Adding additional terms to the model (e.g., an additional regressor) decreases $SSE$ and thus increases $R^2$ (or at least does not decrease it). This implies that $R^2$ can be made artificially high by the unwise practice of overfitting (i.e., the inclusion of too many model terms). Thus, the inevitable increase in $R^2$ enjoyed by adding an additional term does not imply the additional term was needed. In fact, the simpler model may be superior for predicting response values.

TODO: review this section.

Prediction

There are several reasons for building a linear regression model. One of the most important is to predict response values at one or more values of the independent variable. In this section, we focus on the errors associated with prediction and the construction of appropriate intervals for predicted values.

The fitted regression equation may be used for two distinct purposes:

  1. Estimate the mean response $\mu_{Y|x_0}$, where $x_0$ is a specific value of the regressor
  2. Predict a single future value $y_0$ of the response variable $Y_0$ when $x = x_0$

We would expect the error of prediction to be higher in the case of predicting a single value than when estimating a mean. This difference will affect the width of our prediction intervals.

Confidence Interval for the Mean Response

Suppose the experimenter wishes to construct a $100(1-\alpha)\%$ confidence interval for $\mu_{Y|x_0}$. We use the point estimator $\hat{Y}_0 = b_0 + b_1 x_0$ to estimate $\mu_{Y|x_0} = \beta_0 + \beta_1 x_0$.

It can be shown that the sampling distribution of $\hat{Y}_0$ is normal with:

Mean: $\mu_{\hat{Y}_0} = E(\hat{Y}_0) = \beta_0 + \beta_1 x_0 = \mu_{Y|x_0}$

Variance: $\sigma_{\hat{Y}_0}^2 = \sigma^2 \left[ \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right]$

This variance formula follows from the fact that $\operatorname{Cov}(\bar{Y}, b_1) = 0$.

Therefore, a $100(1-\alpha)\%$ confidence interval for the mean response can be constructed using the statistic

$$T = \frac{\hat{Y}_0 - \mu_{Y|x_0}}{s \sqrt{1/n + (x_0 - \bar{x})^2 / S_{xx}}},$$

which follows a $t$-distribution with $n - 2$ degrees of freedom.

Formula: Confidence Interval for Mean Response

A $100(1-\alpha)\%$ confidence interval for the mean response $\mu_{Y|x_0}$ is

$$\hat{y}_0 - t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \;<\; \mu_{Y|x_0} \;<\; \hat{y}_0 + t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},$$

where $t_{\alpha/2}$ is the critical value from the $t$-distribution with $n - 2$ degrees of freedom.
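
A computational sketch of this interval (the data set is made up for illustration, and the evaluation point $x_0 = 28$ is arbitrary):

```python
# Sketch: a 95% confidence interval for the mean response at x0, following the
# formula above, on a made-up data set.
import numpy as np
from scipy import stats

x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0])
y = np.array([12.1, 17.3, 20.9, 27.2, 30.8, 36.5, 41.1, 44.8])
n, x0 = x.size, 28.0

Sxx = np.sum((x - x.mean())**2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Syy = np.sum((y - y.mean())**2)
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt((Syy - b1 * Sxy) / (n - 2))

y0_hat = b0 + b1 * x0
se_mean = s * np.sqrt(1.0 / n + (x0 - x.mean())**2 / Sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)
print(f"95% CI for the mean response at x0 = {x0}: "
      f"({y0_hat - t_crit * se_mean:.3f}, {y0_hat + t_crit * se_mean:.3f})")
```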

Example: Confidence Interval for Mean Response

Using the pollution data from our previous examples, construct 95% confidence limits for the mean response $\mu_{Y|x_0}$ when $x_0 = 20\%$ solids reduction.

Solution:
From the regression equation, for $x_0 = 20\%$ solids reduction:

$$\hat{y}_0 = 3.8296 + 0.9036(20) = 21.9025.$$

We have:

  • $n = 33$, $\bar{x} = 33.4545$, $S_{xx} = 4152.18$, $s = 3.2295$
  • $t_{0.025} \approx 2.045$ for $31$ degrees of freedom

Therefore, the $95\%$ confidence interval for $\mu_{Y|20}$ is:

$$21.9025 - 2.045(3.2295)\sqrt{\frac{1}{33} + \frac{(20 - 33.4545)^2}{4152.18}} \;<\; \mu_{Y|20} \;<\; 21.9025 + 2.045(3.2295)\sqrt{\frac{1}{33} + \frac{(20 - 33.4545)^2}{4152.18}}.$$

This simplifies to:

$$20.1071 < \mu_{Y|20} < 23.6979.$$

We are $95\%$ confident that the population mean reduction in chemical oxygen demand is between $20.11\%$ and $23.70\%$ when solids reduction is $20\%$.

Confidence limits for the mean value of $Y|x$, showing the regression line with confidence bands. (Walpole et al., 2017).

Prediction Interval for Individual Response

Another type of interval that is often confused with the confidence interval for the mean is the prediction interval for a future observed response. In many instances, the prediction interval is more relevant to the scientist or engineer than the confidence interval on the mean.

For example, in practical applications, there is often interest not only in estimating the mean response at a specific value $x_0$ but also in constructing an interval that reflects the error in predicting a future individual observation at that value.

To obtain a prediction interval for a single value $y_0$ of the variable $Y_0$, we need to estimate the variance of the difference $\hat{Y}_0 - Y_0$. We can think of this difference as a value of the random variable $\hat{Y}_0 - Y_0$.

The sampling distribution of $\hat{Y}_0 - Y_0$ is normal with:

Mean: $\mu_{\hat{Y}_0 - Y_0} = E(\hat{Y}_0 - Y_0) = 0$

Variance: $\sigma_{\hat{Y}_0 - Y_0}^2 = \sigma^2 \left[ 1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right]$

Note the additional "$1$" term in the variance, which accounts for the variability of the individual future observation $Y_0$ about its mean.

Therefore, a $100(1-\alpha)\%$ prediction interval can be constructed using the statistic

$$T = \frac{\hat{Y}_0 - Y_0}{s \sqrt{1 + 1/n + (x_0 - \bar{x})^2 / S_{xx}}},$$

which follows a $t$-distribution with $n - 2$ degrees of freedom.

Formula: Prediction Interval for Individual Response

A $100(1-\alpha)\%$ prediction interval for a single response $y_0$ is

$$\hat{y}_0 - t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \;<\; y_0 \;<\; \hat{y}_0 + t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},$$

where $t_{\alpha/2}$ is the critical value from the $t$-distribution with $n - 2$ degrees of freedom.
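
The corresponding sketch for the prediction interval (same made-up data as above; note the extra "1 +" inside the square root):

```python
# Sketch: a 95% prediction interval for a single future response at x0.
import numpy as np
from scipy import stats

x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0])
y = np.array([12.1, 17.3, 20.9, 27.2, 30.8, 36.5, 41.1, 44.8])
n, x0 = x.size, 28.0

Sxx = np.sum((x - x.mean())**2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Syy = np.sum((y - y.mean())**2)
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt((Syy - b1 * Sxy) / (n - 2))

y0_hat = b0 + b1 * x0
se_pred = s * np.sqrt(1.0 + 1.0 / n + (x0 - x.mean())**2 / Sxx)   # extra "1 +"
t_crit = stats.t.ppf(0.975, df=n - 2)
print(f"95% prediction interval at x0 = {x0}: "
      f"({y0_hat - t_crit * se_pred:.3f}, {y0_hat + t_crit * se_pred:.3f})")
```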

Example: Prediction Interval for Individual Response

Using the pollution data, construct a 95% prediction interval for $y_0$ when $x_0 = 20\%$.

Solution:
We have the same values as before:

  • $n = 33$, $\bar{x} = 33.4545$, $S_{xx} = 4152.18$
  • $\hat{y}_0 = 21.9025$, $s = 3.2295$
  • $t_{0.025} \approx 2.045$ with $31$ degrees of freedom

Therefore, the $95\%$ prediction interval for $y_0$ is:

$$21.9025 - 2.045(3.2295)\sqrt{1 + \frac{1}{33} + \frac{(20 - 33.4545)^2}{4152.18}} \;<\; y_0 \;<\; 21.9025 + 2.045(3.2295)\sqrt{1 + \frac{1}{33} + \frac{(20 - 33.4545)^2}{4152.18}}.$$

This simplifies to:

$$15.0585 < y_0 < 28.7464.$$

Confidence and prediction intervals for the chemical oxygen demand reduction data; inside bands indicate the confidence limits for the mean responses and outside bands indicate the prediction limits for the future responses. (Walpole et al., 2017).

Key Differences Between Confidence and Prediction Intervals

Important Distinctions:

Confidence Interval for Mean Response ($\mu_{Y|x_0}$):

  • Estimates a population parameter (the mean response)
  • Narrower interval
  • Variance: $\sigma^2 \left[ \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right]$
  • Interpretation: We are $100(1-\alpha)\%$ confident that the true mean response lies within this interval

Prediction Interval for Individual Response ($y_0$):

  • Predicts a future individual observation (not a parameter)
  • Wider interval due to additional uncertainty
  • Variance: $\sigma^2 \left[ 1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right]$
  • Interpretation: There is a $100(1-\alpha)\%$ probability that a future observation will fall within this interval

The key difference is the additional "$1$" in the variance for prediction intervals (an extra $\sigma^2$), which accounts for the inherent variability of individual observations around the regression line.

Both intervals become wider as $x_0$ moves farther from $\bar{x}$, reflecting increased uncertainty when extrapolating away from the center of the data. The plot shows that prediction intervals are consistently wider than confidence intervals, with both being narrowest at $x_0 = \bar{x}$.

Correlation

Up to this point, we have assumed that the independent regressor variable $x$ is a physical or scientific variable that is measured with negligible error, but not a random variable. In fact, in this context, $x$ is often called a mathematical variable. In many applications of regression techniques, it is more realistic to assume that both $X$ and $Y$ are random variables, and that the measurements $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$ are observations from a population having the joint density function $f(x, y)$.

We shall consider the problem of measuring the relationship between two random variables $X$ and $Y$. For example:

  • If $X$ and $Y$ represent the length and circumference of a particular kind of bone in the adult body, we might expect large values of $X$ to be associated with large values of $Y$, and vice versa.
  • If $X$ represents the age of a used automobile and $Y$ represents the retail book value, we would expect large values of $X$ to correspond to small values of $Y$.

Correlation analysis attempts to measure the strength of such relationships between two variables by means of a single number called a correlation coefficient.

The Bivariate Normal Distribution

In theory, it is often assumed that the conditional distribution $f(y|x)$ of $Y$, for fixed values of $X$, is normal with mean $\mu_{Y|x} = \beta_0 + \beta_1 x$ and variance $\sigma_{Y|x}^2 = \sigma^2$, and that $X$ is likewise normally distributed with mean $\mu_X$ and variance $\sigma_X^2$.

The joint density of $X$ and $Y$ is then:

$$f(x, y) = f(y|x)\, f(x) = \frac{1}{2\pi \sigma_X \sigma} \exp\!\left\{ -\frac{1}{2}\left[ \left( \frac{y - \beta_0 - \beta_1 x}{\sigma} \right)^2 + \left( \frac{x - \mu_X}{\sigma_X} \right)^2 \right] \right\}$$

for $-\infty < x < \infty$ and $-\infty < y < \infty$.

Let us write the random variable $Y$ in the form

$$Y = \beta_0 + \beta_1 X + \varepsilon,$$

where $X$ is now a random variable independent of the random error $\varepsilon$. Since the mean of the random error $\varepsilon$ is zero, it follows that

$$\mu_Y = \beta_0 + \beta_1 \mu_X \qquad \text{and} \qquad \sigma_Y^2 = \sigma^2 + \beta_1^2 \sigma_X^2.$$

After substitution and algebraic manipulation, we obtain the bivariate normal distribution

$$f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\!\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left( \frac{x - \mu_X}{\sigma_X} \right)^2 - 2\rho \left( \frac{x - \mu_X}{\sigma_X} \right)\!\left( \frac{y - \mu_Y}{\sigma_Y} \right) + \left( \frac{y - \mu_Y}{\sigma_Y} \right)^2 \right] \right\}$$

for $-\infty < x < \infty$ and $-\infty < y < \infty$, where

$$\rho^2 = 1 - \frac{\sigma^2}{\sigma_Y^2} = \frac{\beta_1^2 \sigma_X^2}{\sigma_Y^2}.$$

Definition:

The constant $\rho$ is called the population correlation coefficient and measures the linear association between the two random variables $X$ and $Y$.

Properties of $\rho$:

  1. Range: $-1 \le \rho \le 1$
  2. Zero correlation: $\rho = 0$ when $\beta_1 = 0$ (no linear relationship)
  3. Perfect correlation: $\rho = \pm 1$ when $\sigma^2 = 0$ (perfect linear relationship)
    • $\rho = +1$: perfect positive linear relationship ($\beta_1 > 0$)
    • $\rho = -1$: perfect negative linear relationship ($\beta_1 < 0$)

Values of $\rho$ close to $\pm 1$ imply strong linear association between $X$ and $Y$, whereas values near zero indicate little or no linear correlation.

The Sample Correlation Coefficient

To obtain a sample estimate of $\rho$, recall from our earlier work that the error sum of squares is

$$SSE = S_{yy} - b_1 S_{xy}.$$

Dividing both sides by $S_{yy}$ and replacing $S_{xy}$ by $b_1 S_{xx}$, we obtain

$$\frac{SSE}{S_{yy}} = 1 - b_1^2 \frac{S_{xx}}{S_{yy}}.$$

Since $0 \le SSE \le S_{yy}$, we conclude that $b_1^2 S_{xx} / S_{yy}$ must be between $0$ and $1$. Consequently, $b_1 \sqrt{S_{xx}/S_{yy}}$ must range from $-1$ to $+1$, with:

  • Negative values corresponding to lines with negative slopes
  • Positive values corresponding to lines with positive slopes
  • Values of $\pm 1$ occurring when $SSE = 0$ (perfect linear relationship)

Definition: Sample Correlation Coefficient

The sample correlation coefficient $r$ (also called the Pearson product-moment correlation coefficient) is:

$$r = b_1 \sqrt{\frac{S_{xx}}{S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}.$$
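
A minimal sketch (made-up data) computing $r$ from the $S$-notation formula and checking it against numpy's built-in routine:

```python
# Sketch: the sample correlation coefficient from the S-notation formula,
# checked against numpy's corrcoef, on a made-up data set.
import numpy as np

x = np.array([0.41, 0.45, 0.50, 0.52, 0.55, 0.58, 0.60])
y = np.array([29.0, 35.0, 47.0, 55.0, 66.0, 78.0, 87.0])

Sxx = np.sum((x - x.mean())**2)
Syy = np.sum((y - y.mean())**2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

r = Sxy / np.sqrt(Sxx * Syy)
print(f"r = {r:.4f}")
print(f"numpy check: {np.corrcoef(x, y)[0, 1]:.4f}")
print(f"r^2 = {r**2:.4f}  (proportion of variation in y explained)")
```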

Coefficient of Determination

For values of $r$ between $-1$ and $+1$, we must be careful in our interpretation. Values of $r$ such as $0.3$ and $0.6$, for example, indicate two positive correlations, one somewhat stronger than the other, but it is wrong to conclude that $r = 0.6$ indicates a linear relationship twice as good as $r = 0.3$.

However, if we write

$$r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = \frac{S_{yy} - SSE}{S_{yy}},$$

then $r^2$, the sample coefficient of determination, represents the proportion of the variation in $y$ explained by the regression of $Y$ on $x$.

Important Interpretation:

$r^2$ expresses the proportion of the total variation in the values of the variable $Y$ that can be accounted for by a linear relationship with the values of $X$.

For example, a correlation of $r = 0.6$ means that $0.36$, or $36\%$, of the total variation in $Y$ is accounted for by the linear relationship with $X$.

Example: Forest Products Correlation Study

In a study of anatomical characteristics of plantation-grown loblolly pine, 29 trees were randomly selected. The data in the following table show the specific gravity (g/cm³) and the modulus of rupture (kPa) for each tree. Compute and interpret the sample correlation coefficient.

| Specific Gravity (g/cm³) | Modulus of Rupture (kPa) | Specific Gravity (g/cm³) | Modulus of Rupture (kPa) |
|---|---|---|---|
| 0.414 | 29,186 | 0.581 | 85,156 |
| 0.383 | 29,266 | 0.557 | 69,571 |
| 0.399 | 26,215 | 0.550 | 84,160 |
| 0.402 | 30,162 | 0.531 | 73,466 |
| 0.442 | 38,867 | 0.550 | 78,610 |
| 0.422 | 37,831 | 0.556 | 67,657 |
| 0.466 | 44,576 | 0.523 | 74,017 |
| 0.500 | 46,097 | 0.602 | 87,291 |
| 0.514 | 59,698 | 0.569 | 86,836 |
| 0.530 | 67,705 | 0.544 | 82,540 |
| 0.569 | 66,088 | 0.557 | 81,699 |
| 0.558 | 78,486 | 0.530 | 82,096 |
| 0.577 | 89,869 | 0.547 | 75,657 |
| 0.572 | 77,369 | 0.585 | 80,490 |
| 0.548 | 67,095 |  |  |

Solution:
From the data we compute $S_{xx}$, $S_{yy}$, and $S_{xy}$, and therefore:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = 0.9435.$$

A correlation coefficient of $0.9435$ indicates a very strong positive linear relationship between specific gravity and modulus of rupture. Since $r^2 = 0.8902$, approximately $89\%$ of the variation in modulus of rupture is accounted for by the linear relationship with specific gravity.

Hypothesis Testing for Correlation

A test of the hypothesis $H_0\!: \rho = 0$ versus an appropriate alternative is equivalent to testing $H_0\!: \beta_1 = 0$ for the simple linear regression model. The $t$-statistic can be written as

$$t = \frac{b_1}{s / \sqrt{S_{xx}}} = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}},$$

which follows a $t$-distribution with $n - 2$ degrees of freedom.

Example: Testing for No Linear Association

For the forest products data, test the hypothesis that there is no linear association between the variables, at $\alpha = 0.05$.

Solution:

  1. Hypotheses: $H_0\!: \rho = 0$, $H_1\!: \rho \neq 0$
  2. Critical region: $t < -2.052$ or $t > 2.052$ (with $27$ degrees of freedom)
  3. Computations:

$$t = \frac{0.9435\sqrt{27}}{\sqrt{1-0.9435^2}} = 14.79, \qquad P < 0.0001$$

  4. Decision: Since $14.79 > 2.052$, reject $H_0$ and conclude that a significant linear association exists.

For testing the more general hypothesis $H_0\!: \rho = \rho_0$ against a suitable alternative, when $X$ and $Y$ follow the bivariate normal distribution, we use the transformation

$$z = \frac{\sqrt{n-3}}{2} \ln\!\left[ \frac{(1+r)(1-\rho_0)}{(1-r)(1+\rho_0)} \right],$$

which follows approximately the standard normal distribution.

Example: Testing Specific Correlation Value

For the forest products data, test $H_0\!: \rho = 0.9$ against $H_1\!: \rho > 0.9$ at $\alpha = 0.05$.

Solution:

  1. Critical region: $z > 1.645$
  2. Computations:

$$z = \frac{\sqrt{26}}{2}\ln\left[\frac{(1+0.9435)(0.1)}{(1-0.9435)(1.9)}\right] = 1.51, \qquad P = 0.0655$$

  3. Decision: Since $1.51 < 1.645$ (equivalently, $P = 0.0655 > 0.05$), we fail to reject $H_0\!: \rho = 0.9$.
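
Both correlation tests can be reproduced directly from $r$ and $n$ (here $r = 0.9435$ and $n = 29$, as in the forest products example); a minimal sketch:

```python
# Sketch: the two correlation tests above, applied to an already-computed
# sample correlation (r = 0.9435, n = 29, as in the forest products example).
import numpy as np
from scipy import stats

r, n = 0.9435, 29

# Test of H0: rho = 0 (equivalent to H0: beta1 = 0)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_t = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, two-sided P = {p_t:.2e}")

# Test of H0: rho = 0.9 against H1: rho > 0.9, normal approximation
rho0 = 0.9
z = (np.sqrt(n - 3) / 2) * np.log(((1 + r) * (1 - rho0)) / ((1 - r) * (1 + rho0)))
p_z = stats.norm.sf(z)
print(f"z = {z:.2f}, one-sided P = {p_z:.4f}")
```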

Important Caveats About Correlation

Critical Points to Remember:

  1. Correlation measures linear relationship only: A correlation coefficient close to zero indicates lack of linear relationship, not necessarily lack of association.

  2. Correlation does not imply causation: High correlation between two variables does not mean one causes the other.

  3. Model assumptions matter: Results are only as good as the assumed bivariate normal model.

  4. Preliminary plotting is essential: Always examine scatter plots before interpreting correlation coefficients.

Scatter diagrams showing: (a) zero correlation with no association, and (b) zero correlation with strong nonlinear relationship. (Walpole et al., 2017).

A value of the sample correlation coefficient $r$ close to zero can result from:

  • Purely random scatter (Figure a): implying little or no causal relationship
  • Strong nonlinear relationship (Figure b): where a quadratic or other nonlinear relationship exists

This emphasizes that $r \approx 0$ implies a lack of linearity, not necessarily a lack of association. If a strong quadratic relationship exists between $X$ and $Y$, we can still obtain a correlation near zero, indicating the need for nonlinear analysis methods.