A brief guide to linear regression (including MLR)
In my Multivariate Statistics course, I elected to cover the section on multiple linear regression (MLR).  The following is a packet of useful information on this technique.  As part of the class requirements, and by way of illustrating this technique, I also applied MLR in a mock study on Playboy Playmates, which is also on this site.



Part I: A review of Regression (aka SLR: Simple Linear Regression)
Description: A method used to generate a mathematical equation which will describe the nature of the relationship between two variables.

Differences between correlation and regression:

  • 1. Regression assumes causation; correlation does not.
  • 2. Regression generates a mathematical model; correlation does not.
Uses of regression models:

  • 1. Description
    • A model is a more compact description of a set of data.
    • Formulating a model allows you to assess the relative degree to which each predictor variable accounts for variation in the criterion variable.
  • 2. Prediction
    • Extrapolation
    • Interpretation
Objectives of regression analysis:

  • to determine whether or not a relationship exists between two variables.
  • to describe the nature of the relationship (should one exist) in the form of a mathematical equation.
  • to assess the degree of accuracy of description or prediction achieved by the regression equation.
  • to assess the relative importance of the various predictor variables in their contribution to the variation in the criterion variable (specific to multiple regression).
Classification of variables:

  • Criterion variable: the dependent variable which the model will attempt to predict.
  • Predictor variables: the independent variables which influence the criterion variable.

  Part II: Developing a regression model
    Two main issues to be resolved…
      • 1. Which variables to include (which are true "predictors" and which are incidental conditions in the system).
      • 2. The relative contributions of each of those variables.  These values will be affected by the interaction between the variables decided upon in #1.
      Regression requirements (both SLR and multiple regression):
      • 1. All variables are continuous.
      • 2. The independent variables are fixed (i.e. they are all under the control of the investigator) while the dependent variable is random.
      • 3. The data will describe a linear function.  You may have to transform some of the data in order to accomplish this.
      • 4. At each level of the independent variable, the dependent variables are all independently and normally distributed.
      • 5. At each level of the independent variable, samples of the dependent variables are all homoscedastic.


      How a model is generated:

        A regression model draws the Line of Best Fit: the line which travels through the individual data points with the smallest sum of squared residuals.  This is accomplished by using the Method of Least Squares.
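
        Here is a minimal least-squares sketch in Python (the packet's examples reference SAS; Python with numpy is assumed here purely for illustration, and the data are made up):

          import numpy as np

          x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # predictor
          y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])  # criterion

          # Method of Least Squares: closed-form slope and intercept that
          # minimize the sum of squared residuals.
          slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
          intercept = y.mean() - slope * x.mean()

          residuals = y - (intercept + slope * x)
          print(f"y = {intercept:.3f} + {slope:.3f} * x")
          print(f"sum of residuals = {residuals.sum():.10f}")  # ~0, as noted below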


      Residuals:
      • The sum of the residuals = 0; therefore the average of the residuals = 0.

      General equation for a linear bivariate relationship:

         <dependent variable> (units) = intercept + slope * <independent variable> (units)
        The y-intercept represents the value of the criterion variable when the predictor variable(s) equal zero.


      Limitations:

      • 1. Regression tells us nothing about causation!  You have to establish causation prior to developing a regression model.
      • 2. Sample size: Most authors recommend that one should have at least 10 to 20 times as many observations (cases, respondents) as one has variables, otherwise the estimates of the regression line are probably very unstable and unlikely to replicate if one were to do the study over.

    Part III: The Coefficient of Determination
    The Coefficient of Determination, r^2:
      • reported as the measure of fit between the independent and dependent variables.
      • the measure of the proportion (or percentage) of variation accounted for by the model.
      • It ranges from 0 to 1.
      • r^2 = SSregression / SStotal = variation explained by the model / total variation.
      • It is not directly tested for significance.
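
      A short sketch (same made-up data as above) showing r^2 computed from the sums of squares, matching the SSregression / SStotal definition:

        import numpy as np

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
        y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

        slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        intercept = y.mean() - slope * x.mean()
        y_hat = intercept + slope * x

        ss_total = np.sum((y - y.mean()) ** 2)           # total variation
        ss_regression = np.sum((y_hat - y.mean()) ** 2)  # explained by the model
        ss_residual = np.sum((y - y_hat) ** 2)           # unexplained variation

        r_squared = ss_regression / ss_total
        print(f"r^2 = {r_squared:.4f}")  # also equals 1 - ss_residual / ss_total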


      When reporting a regression model, give:

      • The complete model with the independent and dependent variables named
      • The probability of the model
      • r^2


      For example: The following highly significant (p < 0.0001, r^2 = 0.91) linear model was found between hematocrit and age in 9 men:
          hematocrit (%) = 65.5 - 0.563 * (age, years).

      Outliers

      • Single points which would dramatically affect the regression line
      • Ways to test for them:
        • 1. eyeball a scatterplot (proc plot in SAS)
        • 2. analysis of residuals
      • Cook's D statistical assessment is also useful for identifying these troublesome points.
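
      A sketch of outlier screening with Cook's D via statsmodels (an assumed package; the course itself used SAS).  The data are made up, with the last point deliberately aberrant:

        import numpy as np
        import statsmodels.api as sm

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
        y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 25.0])  # last point is an outlier

        results = sm.OLS(y, sm.add_constant(x)).fit()
        cooks_d, _ = results.get_influence().cooks_distance

        # One common rule of thumb flags points with D > 4/n for a closer look.
        for i, d in enumerate(cooks_d):
            flag = "  <-- influential?" if d > 4 / len(x) else ""
            print(f"point {i}: D = {d:.3f}{flag}")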


      Confidence belts:
      Errors in predicting y are due to 3 sources:

      • 1. s_y·x (the standard error of estimate): the variation of the points around the true regression line.
      • 2. error in estimating the overall elevation (y-intercept) of the true regression line.
      • 3. error in estimating the slope of the true regression line.
      Confidence belts take these into account.
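
      A sketch of a confidence belt using statsmodels' get_prediction (an assumed tool; made-up data).  Note how the band widens toward the ends of the X range, where the intercept and slope errors compound:

        import numpy as np
        import statsmodels.api as sm

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
        y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
        X = sm.add_constant(x)

        results = sm.OLS(y, X).fit()

        # 95% confidence band for the mean of y at each observed x.
        lower, upper = results.get_prediction(X).conf_int(alpha=0.05).T
        for xi, lo, hi in zip(x, lower, upper):
            print(f"x = {xi:.1f}: 95% CI for mean y = ({lo:.2f}, {hi:.2f})")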

      Hypotheses testing:
      What can you do with one regression model?

      • Test to see if it is statistically significant (i.e. is the slope significantly different from zero?): Ho: b = 0
      • Allows us to be certain that the observed linear equation was not simply a chance departure from a horizontal line.
      Can be accomplished in two ways:
      • 1. F test = MSregr / MSresid
      • 2. t test = slope / standard error of the slope
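
      Both tests on one model (made-up data; statsmodels assumed), showing that in SLR the two are equivalent since t^2 = F:

        import numpy as np
        import statsmodels.api as sm

        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
        y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

        results = sm.OLS(y, sm.add_constant(x)).fit()

        # 1. F test = MSregression / MSresidual
        F = results.mse_model / results.mse_resid
        # 2. t test = slope / standard error of the slope
        t = results.params[1] / results.bse[1]

        print(f"F = {F:.2f} (p = {results.f_pvalue:.4g})")
        print(f"t = {t:.2f}, t^2 = {t ** 2:.2f}")  # t^2 equals F in SLR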


      What can you do with two regression models?**

      • Compare the slopes of two models (the most common comparison made; a sketch follows the footnotes below).*
      • Compare the elevations of two models.*
      • Compare predicted Y values for a given X between two models.

        *Use approximately the same range of X when comparing two models.
        ** none of these tests can be accomplished purely via SAS.  However, SAS will provide you with intermediate values required for the calculations.
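
      A hedged sketch of the slope comparison (one standard approach: a t test on the difference between slopes with a pooled residual variance; the made-up groups below cover the same range of X, per the first footnote):

        import numpy as np
        from scipy import stats

        def fit(x, y):
            """Return slope, residual SS, and SS of x for one regression."""
            b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
            a = y.mean() - b * x.mean()
            return b, np.sum((y - (a + b * x)) ** 2), np.sum((x - x.mean()) ** 2)

        x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
        y1 = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
        x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
        y2 = np.array([1.8, 3.1, 4.6, 5.9, 7.2, 8.8])

        b1, ssr1, ssx1 = fit(x1, y1)
        b2, ssr2, ssx2 = fit(x2, y2)

        # Pool the residual variance across both regressions.
        df = (len(x1) - 2) + (len(x2) - 2)
        s2 = (ssr1 + ssr2) / df
        se_diff = np.sqrt(s2 / ssx1 + s2 / ssx2)

        t = (b1 - b2) / se_diff
        p = 2 * stats.t.sf(abs(t), df)
        print(f"slopes: {b1:.3f} vs {b2:.3f}; t = {t:.2f}, p = {p:.4f}")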

      Additional considerations regarding regression analysis:
      • The range of the independent variable is important: i.e. you could have a "window" in which the relationship appears linear.
      • Incorporate all the data, not just the means (or medians).  This increases the sample size and ensures that the raw data display a relationship, not just the means.
      • Make sure you know which is the independent variable and which is the dependent.  If you can't establish causation, then you will have to report correlation instead of a regression model.
      • Just because a model is statistically significant does not mean that it is the best model (i.e. it might not be a linear relationship).
      • Don't force a linear model on a data set.  The model could be significant even though the relationship is not linear.  Nonsignificance does not mean that there is no relationship, just that there is no linear one.
      Nonlinear  regression:
        Some relationships are not truly linear.  In these cases, transforming the data may allow you to generate a linear model.  If a transformed model fits best, you can algebraically rearrange the equation so that the model is expressed in terms of the untransformed data.

        Possible nonlinear associations include:

        • semi-logarithmic: either log X or log Y.
        • double logarithmic: both log X and log Y.
        • polynomial: X^2, X^3, X^-2, X^-2/3, etc.
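
        A sketch of the double-logarithmic case (made-up data following a power relationship y = 2 * x^1.5 plus noise): fit a line on the log-log scale, then re-express the model in terms of the untransformed data:

          import numpy as np

          x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
          y = 2.0 * x ** 1.5 * np.array([1.02, 0.97, 1.01, 0.99, 1.03, 0.98])

          # Linear fit on the transformed scale: log y = log a + b * log x
          b, log_a = np.polyfit(np.log(x), np.log(y), 1)

          # Back-transform: y = a * x^b in the untransformed variables.
          a = np.exp(log_a)
          print(f"log-scale model: log y = {log_a:.3f} + {b:.3f} * log x")
          print(f"untransformed model: y = {a:.3f} * x^{b:.3f}")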
        Also, there are many software packages currently available which will attempt to model your data through nonlinear methods such as curve fitting.  These approaches do not attempt to transform data and "force" it into linearity when a better model might be more appropriate.  One such package (with an informative promotional site) is Curvefit by GraphPad Software, Inc. (see http://www.curvefit.com/index.htm).

    Part IV: Multiple regression
    Often more than one independent variable contributes to the value of a dependent variable.
    Approach multiple regression with a practical mind.  The ultimate goal of multiple regression is to generate the simplest, most compact model which will accurately describe and/or predict the value of the dependent variable of interest.  This means choosing the independent variables which contribute the most to the value of the dependent variable.  There are many tests which will make these evaluations for you.
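
    A minimal multiple regression sketch (two made-up predictors; Python with statsmodels assumed, since the course itself used SAS):

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(0)
      n = 30
      x1 = rng.normal(50, 10, n)                         # first predictor
      x2 = rng.normal(100, 15, n)                        # second predictor
      y = 5 + 0.8 * x1 + 0.3 * x2 + rng.normal(0, 5, n)  # criterion

      X = sm.add_constant(np.column_stack([x1, x2]))
      results = sm.OLS(y, X).fit()

      print(results.params)    # intercept plus one raw coefficient per predictor
      print(results.rsquared)  # fit of the full model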
    Beta coefficients
  • Tell you the relative contribution of a predictor to the model.
  • Beta coefficients will change when variables are added or removed from a model.
  • These partial contributions are estimated by standardizing the variables (converting them to z scores) before fitting, which puts the coefficients on a common scale.


    Note: beta coefficients can inform us only of the relative importance of each of the predictor variables, not their absolute contributions, since there are still the joint contributions of two or more variables taken together that cannot be disentangled.  The relative importance of any two predictor variables depends upon which other predictor variables have been included in the analysis.
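
    A sketch of the standardization step (made-up data; statsmodels assumed): z-score every variable, refit, and the resulting coefficients are the beta weights:

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(1)
      n = 30
      x1 = rng.normal(50, 10, n)
      x2 = rng.normal(100, 15, n)
      y = 5 + 0.8 * x1 + 0.3 * x2 + rng.normal(0, 5, n)

      def z(v):
          return (v - v.mean()) / v.std(ddof=1)

      # With every variable standardized, the intercept vanishes and the
      # coefficients become beta weights: relative, not absolute, contributions.
      betas = sm.OLS(z(y), np.column_stack([z(x1), z(x2)])).fit().params
      print(f"beta for x1 = {betas[0]:.3f}, beta for x2 = {betas[1]:.3f}")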

    Stepwise procedures:
    These procedures stop at the point when the introduction of another variable would account for only a trivial or statistically insignificant proportion of the unexplained variance.


    I The step-down (aka backward elimination) procedure:

  • start with all the predictor variables.
  • sequentially eliminate the least predictive one at a time.
  • stop when the elimination of the next variable would sacrifice a significant amount of explained variance in the criterion variable.

    II The step-up (aka forward addition) procedure:

  • just the opposite: start with no predictor variables, sequentially add the most predictive variable one at a time, and stop when the next addition would not account for a significant amount of additional variance in the criterion variable.


    Note: The step-up and step-down procedures will not always result in the same regression equation.  It is even possible for them to arrive at the same R^2 with completely different sets of variables eliminated from the analysis.
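
    A hedged sketch of the step-down procedure using p-values as the criterion (one common variant; the packet does not prescribe a specific cutoff).  The made-up data include one deliberately useless predictor, x3:

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(2)
      n = 40
      cols = {"x1": rng.normal(size=n), "x2": rng.normal(size=n),
              "x3": rng.normal(size=n)}                   # x3 is unrelated to y
      y = 2 + 1.5 * cols["x1"] + 0.8 * cols["x2"] + rng.normal(0, 0.5, n)

      # Start with all predictors; repeatedly drop the least predictive one
      # until every remaining variable is significant at alpha.
      names, alpha = list(cols), 0.05
      while names:
          X = sm.add_constant(np.column_stack([cols[k] for k in names]))
          p = sm.OLS(y, X).fit().pvalues[1:]              # skip the intercept
          worst = int(np.argmax(p))
          if p[worst] <= alpha:
              break                                       # all survivors significant
          print(f"dropping {names[worst]} (p = {p[worst]:.3f})")
          names.pop(worst)

      print("retained:", names)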


    III All-regressions:

  • individually evaluate models containing all possible combinations of the predictor variables.

  • The Maximum R^2 procedure:

    These attempt to find the best 1-, 2-, 3-, …, n-variable model.  In this way you can choose the most appropriately sized model from the output of the most accurate candidates.
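
    A sketch of the all-regressions / maximum R^2 idea (made-up data; statsmodels assumed): fit every subset of predictors and report the best model of each size:

      from itertools import combinations
      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(3)
      n = 40
      preds = {"x1": rng.normal(size=n), "x2": rng.normal(size=n),
               "x3": rng.normal(size=n)}
      y = 2 + 1.5 * preds["x1"] + 0.8 * preds["x2"] + rng.normal(0, 0.5, n)

      def r2(subset):
          X = sm.add_constant(np.column_stack([preds[k] for k in subset]))
          return sm.OLS(y, X).fit().rsquared

      # For each model size, keep the subset with the maximum R^2.
      for size in range(1, len(preds) + 1):
          best = max(combinations(preds, size), key=r2)
          print(f"best {size}-variable model: {best}, R^2 = {r2(best):.4f}")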


    Validation of the regression equation:

    Apply the regression equation to a fresh sample of objects to see how well it does in fact predict values on the criterion variable.  This will demonstrate whether or not the decision to include the predictor variables was based purely on chance relationships.
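
    A sketch of the idea with a held-out half of a made-up sample:

      import numpy as np

      rng = np.random.default_rng(4)
      x = rng.uniform(0, 10, 40)
      y = 3 + 2 * x + rng.normal(0, 1, 40)

      # Fit the equation on the first 20 observations only.
      b, a = np.polyfit(x[:20], y[:20], 1)

      # Apply it to the fresh sample (the last 20) and check prediction error.
      pred = a + b * x[20:]
      rmse = np.sqrt(np.mean((y[20:] - pred) ** 2))
      print(f"hold-out RMSE = {rmse:.3f}")  # small relative to the spread of y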


    Pairwise combinations:

  • The number of pairwise combinations of variables is given by the formula x(x - 1) / 2, where x is the number of variables (e.g., 5 variables yield 5 * 4 / 2 = 10 pairs).
  • This is useful for anticipating the size of your correlation matrix.

    Collinearity (aka variable redundancy, aka multicollinearity):

  • This is the problem of using two or more predictor variables which are highly correlated with one another.
  • A related problem is using variables that are directly related to one another.  For example, costs, sales, and profits are directly related to one another.  To include any two of these variables in a model is effectively to include the third.
  • In such a situation, the computer cannot do the thinking for you.
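
    A sketch of screening for redundancy with variance inflation factors (statsmodels' variance_inflation_factor is assumed).  Here x2 is built to be nearly a copy of x1, so its VIF should be large:

      import numpy as np
      import statsmodels.api as sm
      from statsmodels.stats.outliers_influence import variance_inflation_factor

      rng = np.random.default_rng(5)
      n = 40
      x1 = rng.normal(size=n)
      x2 = x1 + rng.normal(0, 0.05, n)   # nearly redundant with x1
      x3 = rng.normal(size=n)

      X = sm.add_constant(np.column_stack([x1, x2, x3]))
      # A common rule of thumb treats VIF > 10 as serious redundancy.
      for i, name in enumerate(["x1", "x2", "x3"], start=1):
          print(f"{name}: VIF = {variance_inflation_factor(X, i):.1f}")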




Copyright Alexplorer.