******************************************************;
* SEE COMMENTS BELOW about writing SAS output directly in HTML
*   or PDF or Microsoft Word RTF format.
*
* MODEL SELECTION for regressions:
*
* An impartial panel judges the taste of apples at different levels
*   of 5 covariates: 3 soil nutrients, shade, and water.
*
* Which of the covariates are most important?
*
* See `apples.sas' for the general structure of the problem.
*   The Model test is highly significant (P=0.0016) with
*   Rsquare=0.8793. However, in the Parameter Estimate Table, no
*   single covariate is significant. Further analysis shows that the
*   variables (Na,KK) and (PP,Water) are highly correlated and that
*   Shade is significantly correlated with all other variables.
*
* Can we pick a smaller subset of covariates that is as good
*   as all five, using a formal procedure in case we didn't know
*   in advance about the relationships between the variables?
*
* General methods for choosing subsets of the covariates are called
*   methods of MODEL SELECTION. Some widely used methods are:
*
* RSQUARE
*   Choose the best model with k covariates by
*   maximizing  R^2 = SSMod/SSTot = 1 - SSE/SSTot
*
*   This gives an ordered list of k-variable models for each k,
*   but doesn't give a criterion for choosing between different k.
*   Note that adding another variable to any model never
*   decreases R^2.
*
* ADJUSTED RSQUARE (ADJRSQ)
*   Choose the best model for all k by maximizing
*   Adjusted Rsquare = 1 - (SSE/(n-p))/(SSTot/(n-1))
*   where n is the number of observations and p is the number of
*   fitted parameters, including the intercept.
*
*   This is like RSQUARE, but includes a penalty for using more
*   covariates. In general, AdjRsq decreases with added covariates
*   unless the added covariates are actually helpful. While RSQUARE
*   is never negative, ADJRSQ can be negative.
*
* The disadvantage of RSQUARE and ADJRSQ is that all possible subsets
*   have to be examined. This would be 32 models for 5 covariates or
*   128 models for 7 covariates.
*   Classical ``stepwise'' methods work sequentially on covariates
*   and are computationally much faster if there are a huge number
*   of covariates, but they may not find the optimum model:
*
* FORWARD (forwards stepwise regression)
*   This procedure starts with no covariates and then adds
*   covariates one at a time. At each stage, the procedure chooses
*   the covariate not in the model with the most significant Type I
*   F-test P-value. The procedure stops when all covariates not
*   already chosen have Type I P-values greater than a parameter
*   SLENTRY (default 0.50). The value of SLENTRY can be customized.
*
* STEPWISE regression is the same as forwards (stepwise) regression,
*   except it has an additional step or series of steps at each stage
*   of dropping any variable already in the model whose
*   Type I P-value is greater than or equal to a parameter SLSTAY.
*   For example, if variables X1 X2 X3 are entered into the model
*   in sequence, but X1 is nearly a linear combination of X2 and X3,
*   then X1 may be dropped just after X3 is entered. In SAS's STEPWISE
*   regression, the default of both SLENTRY and SLSTAY is 0.15.
*
* BACKWARD (backwards stepwise regression)
*   This starts with all covariates in the model and then, at each
*   step, drops the covariate in the model with the least
*   significant (largest) Type III P-value. The procedure
*   continues until all remaining covariates in the model have Type III
*   P-values less than SLSTAY, which has a default value of 0.10.
*
* For the Apple regression, a summary of the models selected by
*   each method is:
*
*   RSQUARE:   Shade (k=1),  K P (k=2),  K P Water (k=3)
*   ADJRSQ:    K P
*   FORWARD:   Na Shade
*   STEPWISE:  Na Shade
*   BACKWARD:  K P
*
* Thus BACKWARD does the best of the three stepwise methods in terms of
*   matching the RSQUARE and ADJRSQ results, but none of the 2- and
*   3-variable models listed above are statistically distinguishable.
*
*
* TRAINING SETS AND CROSS-VALIDATION:
*   We are usually interested in finding the best way of explaining
*   variation of a target variable (here Taste) in terms of covariates
*   in general, not necessarily just for a particular dataset. A fitted
*   regression should explain part of the relationship between the
*   target variable and its covariates, but may also explain accidental
*   variability in that dataset that does not generalize beyond that
*   particular dataset (that is, ``random noise'').
*
* A regression that fits random variation in a particular dataset to
*   the exclusion of information that would generalize to other,
*   similar datasets is said to have OVERFIT the original data. The
*   negative regression coefficients of Na (Sodium) and Shade in the
*   5-variable model could be considered as examples of overfitting.
*
* One way of judging whether a regression has overfit data is to split
*   the original data into a TRAINING SET (of records) and a TEST DATASET
*   with the remaining records. Regressions on the training set that
*   do poorly on the test dataset have probably overfit the training
*   dataset and can be ignored.
*
* CROSS-VALIDATION is a way of measuring the accuracy of a regression
*   without splitting the original data into fixed training and test
*   datasets. One way of doing this is, for each observation, to treat
*   that observation as a test dataset of one observation and to
*   derive a regression from the remaining observations, considered
*   as a training dataset. The error of the regression derived from
*   the training set is then found for the single test observation.
*   The sum of the squares of these errors, with each observation
*   in turn used as the test dataset and the remaining observations
*   used to derive a different regression, defines a measure of the
*   accuracy of a particular method for deriving regressions.
*
* In this example, with only 14 observations for 5 covariates, we do
*   not have enough data for separate training and test datasets, but
*   we could have compared cross-validation errors for different methods
*   of model selection. We do not do this here, but we will consider
*   cross-validation in procedures in the future.
*
*
* ODS: SAS's ODS (``Output Delivery System'') can be used to format
*   SAS tables and output in a very large number of different ways.
*
* The following `ods' command tells SAS to write SAS output as an HTML
*   file, in addition to the usual text (*.lst) output file. This is
*   the only ODS command that is needed to accomplish this: The rest of
*   the SAS input file is exactly the same as before. Other ODS options
*   can write PDF or RTF (Microsoft Word) output.
*
* In general, it is a good practice to close all ODS output commands at
*   the end of a SAS input file. There are hints that, in PC Windows SAS,
*   ODS commands remain open across several programs until you either
*   close them or exit PC Windows SAS.
*   Here we end the program with
*       ods html close;      * See end of program;
******************************************************;

ods html file='applemodsel.htm';     * Write output as HTML file;
* ods listing close;                 * This would stop *.lst output;
* ods pdf file='applemodsel.pdf';    * This would also write PDF;
* ods rtf file='applemodsel.rtf';    * This would write in Microsoft Word;

title 'MODEL SELECTION for apple taste on 5 covariates - YOURNAME';
options ls=75 ps=60 pageno=1 nocenter;

data apples;
   input yy Nat Kk Pp Shade Water;
   label yy='AppleTaste' Nat='Sodium' Kk='Potassium'
         Pp='Phosphorus' Shade='Shade' Water='Water';
   datalines;
2876 20.0 38 2488 2.42 216
2078 11.1 13 2998 1.62 321
3052 19.8 31 3835 2.79 376
2265 13.9 19 2360 1.65 265
 940 17.0 24  233 0.86  18
2815 16.9 26 3922 2.70 369
2661 11.6 16 4343 2.40 453
2181 14.3 22 3110 2.05 267
2052 10.5 13 2869 1.63 286
2064 18.2 31 2335 2.17 252
1551  8.3  8 1784 0.84 185
2338 20.4 36 2601 2.47 275
1753  8.7 18 2124 1.27 201
2110  7.5  4 4408 1.85 411
;

******************************************************;
* Do regression of taste on the 5 covariates
* Note the use of the SAS abbreviation Nat--Water for all variables
*   in the dataset in that range of columns.
******************************************************;
proc reg data=apples;
   title2 'PROC REG of taste on 5 variables';
   title3 'MODEL RSQUARE=0.8793 (P=0.0016), but NOTHING is significant';
   title4 '   in the Parameter Estimate table.';
   model yy = Nat--Water;
run;

******************************************************;
* Model selection using five different methods.
******************************************************;
proc reg data=apples;
   title2 'MODEL SELECTION of taste on 5 variables';
   model yy = Nat--Water / selection=rsquare;
   model yy = Nat--Water / selection=adjrsq;
   model yy = Nat--Water / selection=forward;
   model yy = Nat--Water / selection=stepwise;
   model yy = Nat--Water / selection=backward;
run;

******************************************************;
* For safety, close all open ODS output commands;
******************************************************;
ods html close;
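
******************************************************;
* OPTIONAL SKETCH (an illustration, not part of the original analysis)
*   The notes at the top of this file say that the entry and stay
*   thresholds of the sequential methods can be customized. In
*   PROC REG this is done with the SLENTRY= and SLSTAY= options of
*   the MODEL statement. The threshold values used below are
*   assumptions chosen only for illustration, so the models selected
*   may differ from those listed in the summary at the top of this
*   file. To send this output to the HTML file as well, move this
*   step ahead of the ods html close statement.
******************************************************;
proc reg data=apples;
   title2 'SKETCH: sequential selection with customized thresholds';
   model yy = Nat--Water / selection=forward  slentry=0.15;
   model yy = Nat--Water / selection=backward slstay=0.05;
   model yy = Nat--Water / selection=stepwise slentry=0.10 slstay=0.10;
run;
quit;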