************************************************; * An impartial panel judges the taste of apples at different levels * of 5 covariates: 3 soil nutrients and 2 other variables, shade * and water. Two potential problems: Two of the soil nutrients * (Na and K) have similar effects on plants, and shade and water * may be correlated among farms, so that their individual effects * may be difficult to separate. * * Which of the covariates are the most important in the regression? * Are any important? Are some covariates better than others? * In general, are more covariates always better than fewer * covariates? * * Program note: * The ``label'' statement in the data step provides SAS with more * descriptive names for the variables. Most SAS procedures use these * labels for clearer output, but a few do not. ************************************************; title 'REGRESSION for taste on 5 covariates - YOURNAME'; options ls=75 ps=60 pageno=1 nocenter; data apples; input yy nat kk pp shade water; label yy='Apple Taste' nat='Sodium' kk='Potassium' pp='Phosphorus' shade='Shade' water='Water'; datalines; 2876 20.0 38 2488 2.42 216 2078 11.1 13 2998 1.62 321 3052 19.8 31 3835 2.79 376 2265 13.9 19 2360 1.65 265 940 17.0 24 233 0.86 18 2815 16.9 26 3922 2.70 369 2661 11.6 16 4343 2.40 453 2181 14.3 22 3110 2.05 267 2052 10.5 13 2869 1.63 286 2064 18.2 31 2335 2.17 252 1551 8.3 8 1784 0.84 185 2338 20.4 36 2601 2.47 275 1753 8.7 18 2124 1.27 201 2110 7.5 4 4408 1.85 411 ; ************************************************; * Do a multiple regression of yy (taste) on the 5 covariates using * `proc reg': ************************************************; title2 'PROC REG of taste on 5 variables'; title3 'EVEN THOUGH THE MODEL HAS P=0.0016 WITH Rsquare=0.88, NONE of the'; title4 ' INDIVIDUAL PARAMETER ESTIMATES are ANYWHERE NEAR significant !'; title5 'At least, as a consolation, there are no apparent problems with the'; title6 " Studentized residuals."; proc reg; model yy = nat kk pp shade water / r; run; ************************************************; * In the `proc reg' output above, note that not only are none of the * individual covariates statistically significant in the Parameter * Estimate table, but also the coefficients of Na and Shade are * negative. * * This implies that Na and Shade make apple taste worse, even though the * next output will show that Na and Shade are positively correlated * with apple taste. * * This is due to the fact that Na and K (and similarly Shade and Water) * are highly correlated among the observations, which makes their * estimated regression coefficients unstable. * * Are the covariates correlated with the observed value (yy or taste), * and do they have any obvious patterns of correlations among * themselves? Perhaps this will give us a clue as to why the model * P-value is highly significant while, apparently, none of the * covariates themselves have any effect. * * Note the handy notation ``nat--water'' (with two hyphens) below, * which means all columns in the dataset between `nat' and `water', * inclusively. To use this notation, you have to keep track of the * order of the variables in the current SAS dataset, which is the same * as the order in which the variables are mentioned in the data step. * * This notation is in contrast with `nn1-nn5' (for example) with one * hyphen, which expands to `nn1 nn2 nn3 nn4 nn5' (in that order) * whether those variables have been defined before or not. * * In the correlation matrix below, the first row (or column) gives * correlations between yy (taste) and the covariates. This measures * the power of each covariate by itself to predict the observed value. * * If the first row and the first column are ignored, the rest of the * correlation matrix gives correlations among the covariates. Note * that they are highly correlated with one another. The reason why * no single covariate is significant in the Parameter Estimate table * is that the effect of each individual covariate is already contained * in the effects of all of the other covariates. * * In more detail, Na and K (sodium and potassium) are highly correlated * (rho=0.95312) and also P (phosphorus) and Water (rho=0.97877). * Shade is correlated with all of the other variables (rho>=0.57962). * * Because of this, it would be sensible to run a regression on just * Na (or K) and P (or Water) and ignore Shade, on the grounds that * (Na,P) or (Na,Water) by themselves contain all of the useful * information in the covariates. There appears to be no way to use * the data about (Na,K) or (P,Water) in such a way that the * individual effects can be disentangled. * * In larger and more complex data sets, relationships between the * covariates will not be as clear, and it is important to be able * to have more general methods to handle correlated covariates * in a regression. * ************************************************; proc corr; title2 'CORRELATIONS of taste WITH explanatory variables'; title3 'and BETWEEN PAIRS OF explanatory variables'; var yy nat--water; run; ************************************************; * As a contrast, let's apply `proc glm' to the same regression: * * Proc glm gives separate `Type I' and `Type III' tables for the * effects of the covariates. In general: * * CORRELATIONS of yy versus explanatory variables measure the * effect of that covariate as a FIRST or ONLY EFFECT, as mentioned * above. However, the correlations only use the data for yy and * that variable, and not the MSE for all five covariates. Thus the * nonsignificant correlations between yy and Na and K do not mean * that Na and K cannot have a significant effect after other * covariates are subtracted from the MSE. * * TYPE I TABLE P-values measure the (cumulative) effect of each * covariate AFTER THE EFFECTS of PREVIOUS COVARIATES in the Type I * table list have been subtracted off. * * Note that the Type I P-values are consistent with our earlier * intuition that (Na,K) and (P,Water) are correlated pairs, with * Shade being correlated with everything. * * TYPE III TABLE P-value measure the effect of each covariate as a * LAST EFFECT: That is, after ALL OTHER COVARIATES have been * allowed for. In particular, this means that the last lines of * Type I and III tables should always be the same. * * For mathematical reasons, the P-values in the `Parameter Estimate' * tables are the same as LAST EFFECT (Type III) P-values, even * though they are derived differently. Note that the P-values are * identical with those in the Type III table, and that the * F-statistics in the Type III table are exactly the squares of * the T-statistics in the Parameter Estimate table. * ************************************************; title2 'APPLY PROC GLM to the same data'; title3 'SOME effects are significant in the Type I table,'; title4 ' but NOTHING IS SIGNIFICANT in the Type III table.'; title5 'Predicted and residual values are saved for later analysis.'; proc glm; model yy = nat--water; output p=pred r=resid; run; ************************************************; * As long as we are at it, let's check the residuals for the * full regression. * * The best single residual check for a multiple regression is the * relationship between the residuals and the predicted values, * since the latter depend on all of the covariates. * * A more complete analysis would also look at residual plots for * each of the covariates (nat--water). Among other things, this * would allow you to check whether any covariates should be * included with quadratic or exponential terms. Since the model * R^2=0.88, this is not likely in this case. In any event, we * skip the residual plots for individual variables to save paper. * (They look about the same as the residual*predicted plot.) * In general, however, it is a good idea to look at residual * plots for each of the covariates. * * The Shapiro-Wilk test for normality and the box-plot and normal * plot are uneventful. * * Program note: * The option / vpos=30 reduces the page height for a nicer * residual x predicted values plot * Saying options ps=30 before the `proc plot' and then * options ps=60 afterwards would have a similar effect. ; ************************************************; proc plot; title2 'Full-model residuals by Predicted Values and Covariates'; plot resid*pred / vpos=30; * The following command, which is commented out, would produce ; * six plots, of the residuals with pred and then with each ; * of the 5 covariates. ; * plot resid*(pred nat--water) / vpos=30; run; proc univariate normal plot; title2 'NORMAL DIAGNOSTICS for residuals'; var resid; run; ************************************************; * We noticed from the correlation output that the covariates (Na,K) * and (P,Water) were strongly correlated pairs, and that Shade * was correlated with all of the other covariates. * * Let's try two simpler models for a comparison: ************************************************; proc glm data=apples; title2 'Regression of YY on Na P only'; title3 'Both Na and P are highly significant in both the Type I and'; title4 ' Type III tables and also in the Parameter Estimate table.'; title5 'This means that the numerical effects of Na and P levels on'; title6 ' taste can be trusted'; title7 'If two correlated variables are not individually significant,'; title8 ' then the numerical effects of either cannot be resolved.'; model yy = nat pp; * Save `apples' with 3 diagnostic variables to a new dataset `appletwo'; * Otherwise, it would be difficult to find the new variables after * the next `proc glm'; output out=appletwo r=resid rstudent=rstudent cookd=cookd; run; proc glm data=apples; title2 'Regression of YY on Na P Shade only'; title3 'Even though the Model R^2 is higher with the 3rd covariate,'; title4 ' Shade is not significant in the Type I table and destroys'; title5 ' the significance of Na and P in the Type III table.'; title6 'Thus (Na P) is enough, and (Na P Shade) is too much.'; model yy = nat pp shade; run; ************************************************; * Now display the residuals etc from the previous (Na P) model; ************************************************; proc print data=appletwo; title2 'Regression of YY on Na P only'; title3 "Residuals, Stud.residuals, and Cook's D for each observation:"; var yy resid rstudent cookd; run; ************************************************; * For safety, let's try dropping the observation with the large * Cook's D value from the original dataset. * A condition like `if ne