*****************************************************; * An impartial panel judges the taste of apples at different levels * of 5 covariates: 3 soil nutrients and 2 amounts of shade and water. * * Which of the covariates are most important? Are any important? * * SAS contains a powerful matrix-language package `Proc IML', * where IML stands for `Interactive Matrix Language'. * * The following shows that you can get exactly the same results using * matrix theory and formulas like betahat = (X'X)^{-1}X'Y as you * would from SAS's regression procedures like Proc GLM or Proc REG * * Specifically, we * (1) Export the 14x1 vector Y and 14x6 design matrix X to Proc IML, * (2) Compute and display beta=(X'X)^{-1}X'Y, * (3) Compute the full Parameter Estimate Table for parameters beta_i, * as well as finding the residuals and Studentized residuals, * (4) Compute the model R^2 and model ANOVA table, * (5) Find the eigenvalues of the correlation matrix of the columns of X * as a way of seeing how singular it is, * (6) Export the Parameter Estimate Table from Proc IML to SAS so that * it can be displayed in SAS * ******************************************************; title 'A REGRESSION on 5 covariates - YOURNAME'; options ls=75 ps=60 pageno=1 nocenter; data apples; input yy nat kk pp shade water; * We need an extra column of ones for the design matrix X in Proc IML; * Defining it here is the easiest way to get it; con=1; label yy='AppleTaste' nat='Sodium' kk='Potassium' pp='Phosphorus' shade='Shade' water='Water'; datalines; 2876 20.0 38 2488 2.42 216 2078 11.1 13 2998 1.62 321 3052 19.8 31 3835 2.79 376 2265 13.9 19 2360 1.65 265 940 17.0 24 233 0.86 18 2815 16.9 26 3922 2.70 369 2661 11.6 16 4343 2.40 453 2181 14.3 22 3110 2.05 267 2052 10.5 13 2869 1.63 286 2064 18.2 31 2335 2.17 252 1551 8.3 8 1784 0.84 185 2338 20.4 36 2601 2.47 275 1753 8.7 18 2124 1.27 201 2110 7.5 4 4408 1.85 411 ; ******************************************************; * Use Proc REG to do a multiple regression on the 5 covariates ******************************************************; proc reg; title2 'PROC REG for AppleTaste on 5 covariates'; title3 'The MODEL TEST is highly significant, but'; title4 'NO COVARIATES are significant in the Param.Est. table.'; model yy = nat kk pp shade water / r; run; title2 "NOW USE MATRIX THEORY (SAS's PROC IML) to do the same analysis"; title3 "(PROC IML IS SAS's MATRIX LANGUAGE)"; ******************************************************; * Start the IML environment ******************************************************; proc iml; use apples; * Make this the default IML SAS dataset; * Define the Design Matrix (X) and vector of observed values (Y); * The design matrix is 14 x 6. The observed values are 14 x 1; * Recall that `con' is con=1; * The `all' means all records (rows), while `var' lists which columns; read all var { con nat kk pp shade water } into xx; read all var { yy } into yy; * Display the design matrix and print a heading; * Note the format statement for nicer output; print "Design Matrix (X)", xx [format=8.2]; * Calculate beta = (X'X)^{-1}X'Y ; xpx=xx`*xx; * Matrix transpose is X` (backwards prime); xpxinv=inv(xpx); * Inverse of X'X ; xpy=xx`*yy; * The vector X'Y ; print "The matrix X'X:", xpx [format=11.2]; print "(X'X)^{-1} and X'Y:", xpxinv [format=9.4] " " xpy [format=9.1]; beta=xpxinv*xpy; * The estimated regression parameters; print "Regression Results using Proc IML:", "Recall BETA = (X'X)^{-1}X'Y", "The following should be IDENTICAL with the Proc REG output:"; * Also calculate Standard Error(beta), T-statistics, and Student-T * P-values for H_0:beta_i=0 ; * First, define a column vector of variable labels for labels ; * In literal definitions of vectors and matrices, SPACES mean the ; * next column while COMMAS mean the next row. Thus a ; * comma-separated list means a column vector and a ; * space-separated list is a row vector.; varnames = { 'Intercept', 'Sodium', 'Potassium', 'Phosphorus', 'Shade', 'Water' }; print "The estimated regression parameters in Proc IML are", varnames " " beta [format=14.6]; print ,,; * Add some vertical space; * Now do the parameter estimate table: * First, we need MSE for the T-statistic denominators:; * The number of rows and the number of columns:; n=nrow(xx); * Number of observations; k=ncol(xx); * Number of parameters including the intercept; * The fitted values, the residuals, SSE, and MSE ; yfit=xx*beta; * The fitted values (as a column vector); resid=yy-yfit; * The residuals (as a column vector); sse=ssq(resid); * SSE = Sum of squares of residuals; dferror=n-k; * Degrees of freedom of SSE; mse=sse/dferror; * MSE = SSE/dferror; * Now for the Parameter Estimate Table: * ; * Recall that Cov(beta)(matrix) = sigma^2 (X prime X)(inverse), * so that Var(beta_i) = sigma^2 ((X'X)^{-1})_ii * * vecdiag(matrix) gives the diagonal of `matrix' as a vector * * Most numerical functions, for example sqrt(), are applied * COMPONENT BY COMPONENT when applied to a vector or matrix * A#B says to multiply matrices COMPONENTWISE, not as matrices * The next three statements are carried out COMPONENT BY COMPONENT; stdb=sqrt(mse*vecdiag(xpxinv)); * A VECTOR of std_dev(beta_i); tt=beta/stdb; * A VECTOR of T-statistics; probt=1-probf(tt#tt,1,dferror); * A VECTOR of Student-t P-values; print "Parameter estimate table using Proc IML",, varnames beta [format=14.6] stdb [format=14.6] tt [format=14.6] probt [format=14.6]; * As long as we are at it, let's calculate the Studentized residuals: * One can show that Cov(Yhat) = sigma^2 K = sigma^2 X'(X'X)^{-1}X ; * and that Cov(res) = sigma^2 (I-K) * The Studentized residuals are res/sqrt(Cov(res)) * i(n) is the nxn unit matrix; rr=mse*(i(n)-xx*xpxinv*xx`); * Covariance matrix of residuals; resstdev=sqrt(vecdiag(rr)); * Standard deviations of residuals; studtres=resid/resstdev; * VECTOR of STUDENTIZED residuals; print "Studentized Residuals from PROC IML:", "Y, Yfit, Residuals, ResStdev, and Studentized residuals:", yy yfit [format=10.2] resid [format=10.3] resstdev [format=10.3] studtres [format=12.5]; print ,; * Add space on the page at the bottom of the last table; * Compute the model ANOVA table and calculate R^2:; * Find sigmahat=sqrt(MSE), SSMOD, and SSCTOT ; rootmse=sqrt(mse); * RootMSE = Est(sigma); ybar=sum(yy)/n; * Mean of the Y's: Sum(vec) sums a vector; ssmod=ssq(yfit-ybar); * Model sum of squares; ssctot=ssq(yy-ybar); * Corrected total sum of squares; * Find the model R^2 and the F-statistic for the model test; rsquare=ssmod/ssctot; * Model R^2; dfmod=k-1; * Model degrees of freedom; msmod=ssmod/dfmod; * Model mean sum of squares; fstat=msmod/mse; * F-statistic for Model Test; * The model F-test:; * Probf(A,n1,n2) is Prob[F(n1,n2) le A] for an F distribution ; * 1-Probf(F,d1,d2) gives an F-test P-value; * 1-Probf(A*A,1,df) gives a 2-sided Student-t P-value; pfstat = 1-probf(fstat,dfmod,dferror); * Model F-test P-value; * Show the model results: Model R^2 and Model F test; print "Model ANOVA results from Proc IML:", rsquare [format=.5] dfmod dferror " " fstat [format=.4] " " pfstat " " rootmse [format=.2]; * Replicate the ANOVA table: ; * Start with a 4x1 text column vector of row headings; rownames = { "SSMOD", "SSE", "SUM", "SSTOT" }; * Allocate room for a 4x1 vector of SSs ; sstab=j(4,1); sstab[1]=ssmod; sstab[2]=sse; sstab[3]=ssmod+sse; sstab[4]=ssctot; * Now degrees of freedom ; degfree=j(4,1); degfree[1]=dfmod; degfree[2]=dferror; degfree[3]=dfmod+dferror; degfree[4]=n-1; print "THE ANOVA TABLE:", " " rownames " " sstab [format=12.3] degfree; print ,,,,,,,; * Add some vertical space; * A conditioning matrix for the design matrix X'X: * The eigenvalues of X'X, suitably normalized, suggest how many * independent covariates that X really has * See MatrixIml.sas for computation of correlation matrix; xxc = xx - j(n)*xx/n; * Center X at column means; covf = xxc`*xxc/(n-1); * Covariate matrix of SAS variables; covar = covf[2:k, 2:k]; * Ignore row and column of intercept; gg = inv(sqrt(diag(covar))); * Diagonal matrix 1/StdDev(X_i); corr = gg*covar*gg; * Correlation matrix of SAS variables; print "THE CORRELATION MATRIX OF THE COLUMNS OF X IS", corr [format=9.4]; call eigen(egvals,egvecs,corr); * Find eigenvalues and eigenvectors; print "THE EIGENVALUES AND EIGENVECTORS OF CORR ARE", egvals [format=8.6] " " egvecs [format=9.4], , "(The columns of Egvecs are the eigenvectors of Corr)"; * You can export a matrix to a SAS dataset as well as read one in. * This copies the parameter estimates table to a SAS dataset and * quits Proc IML; create parmtab var { varnames beta stdb tt probt }; * This commend is essential; append; quit; * Quit Proc IML; * Repeat the Parameter Estimate table using Proc print; title2 'BACK TO REGULAR SAS'; title3 'Using PROC PRINT to display the PARAMETER ESTIMATE TABLE'; title4 ' IMPORTED from PROC IML'; proc print data=parmtab; run;