*************************************************************; * (1) Regression on one covariate * (2) Regression on two covariates * * How does a variable yy depend on another variable xx? * (xx is called an `independent variable' or a `covariate') * For example, weight on height ? * income on education ? * plant height on amount of fertilizer? * * How can we measure the quality of the fit to the data? * What is the best linear relationship? * How can we test model assumptions? * * With two covariates, we can ask about dependence of * weight on height and income? * income on education and parental income? * plant height on amount of fertilizer and amount of sunlight? * *************************************************************; title 'REGRESSION with one covariate - YOURNAME'; options ls=75 ps=40 pageno=1 nocenter; data onereg; input yy xx; datalines; 113 11.3 122 11.7 197 12.5 226 12.5 107 12.8 211 13.3 374 14.6 326 14.8 227 15.0 310 15.1 257 15.4 256 15.6 261 15.9 296 16.2 155 16.7 218 16.8 198 17.0 219 17.1 188 17.3 177 17.7 run; *************************************************************; * First, let's see what the data looks like in a plot: *************************************************************; proc plot; title2 'A SCATTERPLOT OF Y ON X'; plot yy*xx = '*'; run; *************************************************************; * The plot does not look exactly linear, but let's analyze it anyway. * The basic model for a simple regression of Y_i on X_i is * * Y_i = mu + beta*X_i + e_i * * where the e_i are independent N(0,sigma^2) for fixed sigma^2. * * Here we test the hypothesis H_0:beta=0, which is the hypothesis * that the Y_i depend at least somewhat on X_i in a linear way. * * We use `proc glm' to test H_0 as before. While we are at it, we * add the `predicted' and `residual' values to the dataset for * later analyses. *************************************************************; proc glm; title2 'PROC GLM: REGRESSION of yy on xx'; model yy=xx; output p=pred r=resid; run; proc plot; title2 'USE PROC PLOT TO DISPLAY CURVE AND PREDICTED VALUES'; title3 ' ON THE SAME GRAPH'; title4 '(This may not work in SAS versions earlier than 9.)'; plot yy*xx = '*' pred*xx='P' / overlay; run; *************************************************************; * As with a one-way ANOVA, three identical P-values appear in the output, * one for the `Model' (here, that mu + beta*X_i fits Y_i better than * pure noise). There are also two additional P-values, called `Type I and * Type III'. As before, the Types I and III P-values are identical with * the model P-value. Here none of the three P-values are significant. * If we had more than one covariate on the right-hand side of the equation, * then the Types I and III P-values would be different from the Model * P-value and usually different from one another. * * Note that the Model R^2 is very small (R^2=0.08). This says that the * Y_i are not `explained' very well by a linear function of X_i , * which we can also surmise from the scatterplot. * * The estimated values of mu and beta in the regression `Y=mu+beta*X' * appear in the `Parameter Estimates' part of the proc glm output. * Here the estimated line is Y = 68.46 + 10.25*X (approx). * * It is bad news for the fit of the regression if the residuals fail to * be independent of the predictor variables. Let's check this: * *************************************************************; proc plot; title2 'PROC GLM RESIDUAL PLOT'; plot resid*xx = '*'; run; *************************************************************; * This shows that the residuals tend to be high for midrange values of X * and low for outside values of X. This is another indication that the * model `Y = mu + beta*X' does not fit the data very well, and that a * quadratic function might fit better. Let's also check if the residuals * appear normal: *************************************************************; proc univariate normal plot; title2 'PROC UNIVARIATE: Checking the residuals'; title3 'for fit to a normal distribution'; var resid; run; *************************************************************; * The Shapiro-Wilks test is not rejected, which is perhaps not surprising * since the fit is so bad. This means that the residuals are large and * perhaps randomly spread out. The box plot of the residuals shows a skew * towards negative values, which suggests that the residuals are not * normal, even though they are not significantly non-normal by the * Shapiro-Wilks test. * * There is a SAS procedure similar to `proc glm', called `proc reg', * that is customized for regressions. This can do plots, the main * ANOVA statistical analysis, and some residual analyses in the * same procedure. * * The argument `lineprinter' is required in `proc reg' for text-mode plots. * Otherwise, if there are `plot' commands, `proc reg' will look for a * non-text-printer graphics device and will crash if one is not found. * * Note that the ANOVA table and parameter estimates are exactly the * same as in the `proc glm' output. * *************************************************************; proc reg lineprinter; title2 'PROC REG: REGRESSION of yy on xx'; model yy=xx; plot yy*xx='Y' predicted.*xx='P' /overlay; plot residual.*xx; run; *************************************************************; * In `proc reg', the option / r in the model statement --- as in * * model yy = xx / r; * * outputs a line for EACH OBSERVATION with a number of goodness-of-fit * statistics. Each line has the observed value, the predicted value, * the residual, the ``Studentized residual'', and ``Cook's D value''. * These allow you to check for ``outliers''. Due to the fact that * estimated coefficients are sums of squares of data, inferences can be * unduly influenced by a few ``outliers'' with unusually large values * or residuals. Outliers may be typographical errors, or else may be * special cases that do not belong in the dataset, such as a rat in * a sample of mice. * * Removing the influence of outliers is called ``data cleaning''. We * do not worry about this yet, because we are not yet happy with the * model itself. * * The Y by X and residual plots suggest that Y should be a quadratic * function of X, not a linear function. This is also the cause of * the low value of the model R^2. * * Let's try, instead, a two-covariate model with independent variables * xx and xx*xx. That is, we will consider the model * * Y_i = mu + beta_1*X_i + beta_2*X_i*X_i + e_i * * In some versions of SAS, `proc glm' and `proc reg' accept the syntax * `model yy=xx xx*xx' without error messages but either ignore the * quadratic term xx*xx or else do the regression correctly but * obtain incorrect values for the residuals. * * To be safe, we reopen the data set and add a new variable (or column) * xx2 defined by xx2=xx*xx (for xx squared) and avoid tempting SAS to * make a mistake. * * In general, the analysis of a regression on two variables considers only * possible linear functions of the two variables and does not take into * account that one variable might be a function of the other. Thus xx * and xx*xx are treated as independent variables in any case. * *************************************************************; data onereg; set onereg; xx2=xx*xx; run; proc print; title2 'Y ON X AND X^2: THE DATA AS SAS SEES IT'; var yy xx xx2; run; proc glm; title2 'PROC GLM: QUADRATIC REGRESSION: yy on xx and xx^2'; title3 "Include residuals, `Studentized residuals', and "; title4 " `Cook's Residuals' as columns to the dataset"; model yy=xx xx2; output r=resid rstudent=rstudent cookd=cookd; run; proc print; title2 'Y ON X AND X^2: AFTER THE REGRESSION'; run; proc plot; title3 'PROC GLM QUADRATIC RESIDUAL PLOT'; plot resid*xx; run; *************************************************************; * Note that the Model R^2 is much larger (R^2=0.60 instead of R^2=0.08). * The values of X and the residuals look much more independent in the * residual plot than before, with the possible exception of a mildly * high value with Resid=100, X=14.5 (approximately) and a mildly low * value with Resid=-110, X=12.8 . * * With two covariates, the `proc glm' output has one P-value for the * Model as well as separate tables of P-values for the two slopes * that are generally different in the Type I and Type III tables. * * In general, Type I P-values mean ``and then'' or ``after before'': * that is, the P-value is for the effect of that variable on the * regression AFTER the effect of all previous variables has been * allowed for, but NOT CORRECTING FOR variables that are listed AFTER * that variable in the Type I table. * * Thus, Type I P-values are for ``cumulative effects''. This is consistent * with the fact that the SS values add up to the Model SS for the fitted * values in the Model table. * * In contrast, Type III P-values mean ``last effect'': that is, the * P-value is for the effect of that variable on the regression * after the effect of ALL OTHER VARIABLES have been allowed for in * the regression. Note that Type III SS contributions DO NOT add up to * the Model SS. * * In particular, the LAST LINE in the Type I and Type III tables will * always be the same, since for the last line ``and then'' (after * all previous variables) is the same as allowing for all other * variables. * * The P-values in the PARAMETER ESTIMATE table for the two slopes are * derived by a different method, but turn out to be identical with * the Type III (``last effect'') P-values. The main difference is * that Type III P-values are based on F-tests with null distribution * F(1,df), where df is the error degrees of freedom, while the * PARAMETER ESTIMATE P-values are Student t-tests with null distribution * t(df). These are equivalent since t(df)^2 is the same as F(1,df). * Note that the squares of the T statistics in the PARAMETER ESTIMATE * table are exactly the same as the Type III F values. * * The `rstudent' (Studentized residuals) and `cookd' columns are useful for * checking for outliers. Studentized residuals are residuals divided by * an estimate of the standard deviation for those values of the * covariates, and should look like independent standard normal variates. * They do so in this case, with the possible exception of * * Obs 5 (Y=107, X=12.8, Resid=107) Value -2.845 and * Obs 7 (Y=374, X=14.6, Resid=213) Value 2.48. * * These are the two values that stand out (slightly) in the residual plot. * * The Cook's D-statistic is another check for outliers. As a rough rule, * one should not be concerned unless D>0.30, so that the two possible * outliers are OK by this standard. However, it wouldn't hurt to check * that there were no recording errors for these two values. * * Let's also check if the residuals are normal: * *************************************************************; proc univariate normal plot; title2 'PROC UNIVARIATE: Checking QUADRATIC residuals'; var resid; run; *************************************************************; * The normal plots now look much more symmetric. * * The Type I P-values in the `proc glm' and `proc reg' output are asymmetric * in the sense that (for model yy=xx xx2) the P-value for XX is `first * effect' and the P-value for XX2 is for its important AFTER XX has been * allowed for. Let's reverse the order of variables to see the `opposite' * Type I P-values: * * Note that the effect of xx AFTER xx2 is much more significant than XX * by itself in either of the previous models (with MSE from either a one- * or a two-variable model), even though xx2 is not significant as a * first effect. *************************************************************; proc glm; title2 'PROC GLM: QUADRATIC REGRESSION: yy on xx and xx^2'; title3 " xx^2 listed BEFORE xx in the model statement."; model yy=xx2 xx; output r=resid rstudent=rstudent cookd=cookd; run;