***********************************************; * Using SAS to work with matrices * * SAS's Proc IML (``Interactive Matrix Language'') is a general matrix * package that can be used to add and multiply matrices, find matrix * inverses, solve systems of linear equations, display matrices, * and export results to SAS for use in statistical routines. * * The following gives two sets of examples: First, * (1) we import two matrices A, B into Proc IML from SAS datasets and * find A*B, C=B*A, inv(C), and solve C X=Y, * (2) we export inv(C),X,Y back to SAS and display it, * (3) we generate a 30x5 random matrix X in SAS, * (4) we calculate the 5x5 covariance and correlation matrices of the * columns of X, * (5) we find the eigenvalues and eigenvectors of the correlation * matrix of the columns of X, * (6) using matrix operations, we compute a 5x5 matrix of T-statistics * and Pearson-correlation-P-values for the columns of X. * ************************************************; title 'MATRIX CALCULATIONS IN SAS - YOURNAME'; options ls=75 ps=60 pageno=1 nocenter; ************************************************; * Note that SAS datasets are matrices already, except that * some columns can be text: A matrix in SAS is either a SAS * dataset or else a rectangular portion of a SAS dataset. * Of course, we can't multiply or invert matrices unless * all of their columns are numbers. * First, we create a 5x3 matrix A and a 3x5 matrix B * as two SAS datasets. * * Recall that aa1-aa3 is SAS shorthand for aa1 aa2 aa3 * ************************************************; data adat; input aa1-aa3; datalines; 5 4 1 7 9 7 0 1 1 12 -2 1 7 6 5 run; data bdat; input bb1-bb5; datalines; 5 4 1 4 5 4 3 1 3 4 1 3 -2 3 11 run; ************************************************; * Now use SAS's Proc IML to work with these matrices ; ************************************************; title2 "NOW WE ARE IN PROC IML, SAS'S MATRIX ENVIRONMENT"; proc iml; * START the IML environment; use adat; * Make this the default IML SAS dataset; read all var { aa1 aa2 aa3 } into aa; * Define matrix A (5x3); use bdat; * Make this the default IML SAS dataset; read all var { bb1 bb2 bb3 bb4 bb5 } into bb; * Define B (3x5); print "The matrices A (5x3) and B (3x5) are", aa " " bb; ************************************************; * NOTES: * * (1) ALL STATEMENTS in proc iml MUST END with a semi-colon * (as in regular SAS); * * (2) `var {} into MM' says to build a matrix called MM * using the columns in in the current SAS dataset, * * `read all' says use all rows in the default IML SAS dataset, * so that the matrix has the same number of rows as the dataset * has records. * * `read point { 1 2 3 } var ...' would mean to use only those row * numbers, creating in this case a matrix with three rows; * * ALTERNATIVELY, you could define matrices as rectangular sections * of a single SAS dataset or larger matrix by saying, for example, * * read all var { aa1 aa2 aa3 xx yy zz bb1 bb2 bb3 bb4 bb5 } * into alldat; * * aa=alldat[1:5, 1:3]; * Rows 1-5 of columns 1-3; * bb=alldat[1:3, 7:11]; * bb is a 3x5 matrix from columns 7-11; * * Unfortunately, you cannot use the useful abbreviations aa1-aa3 ; * and bb1-bb5 within proc iml; * * Now compute the 3x3 matrix C=BA and the 5x5 matrix D=AB; * * Isn't this easier than doing it by hand? ; ************************************************; cc=bb*aa; dd=aa*bb; ************************************************; * (3) You can write directly to the output (*.lst) file in * proc iml using `print'. In print statements, * * COMMAS always mean a new line * SPACES mean spacing on the SAME LINE, if there is room * (otherwise Proc IML goes to the next line) ; * ; ************************************************; print "The matrices C=BA (3x3) and D=AB (5x5) are", cc " " dd; print "EXERCISE: Check these calculations yourself!"; * The TRACE of a matrix is the sum of its diagonal terms, which has ; * the property than trace(AB)=trace(BA), even if the matrices ; * AB and BA are of different sizes. ; * * In general in Proc IML * sum(vector) or sum(matrix) is the sum of the entries * vecdiag(matrix) is the diagonal of the matrix, viewed as a vector ; trc=sum(vecdiag(cc)); trd=sum(vecdiag(dd)); print "The traces Trace(C=BA) and Trace(D=AB) are " trc " " trd; * The inverse of the matrix C=BA is; cinv=inv(cc); * Multiply C(inverse)*C, which should be the identity matrix ; cii = cinv*cc; * We want to see how close cii is to the 3x3 identity matrix * We first create an identity matrix I of the same size as C=BA * It is safer to use a Proc IML command to get the size of C rather than * just saying 3, since a literal `3' could create unnecessary problems * later if you used a different C ; n = nrow(cc); * The number of rows in C, here 3; ii = i(n); * The 3x3 Identity matrix ; * abs(matrix) returns a matrix of exactly the same size as `matrix'; * with all components replaced by their absolute values * Thus sum(abs(cii - ii)) is the sum of the absolute values of * the errors in inverting the matrix C=BA; cerror = sum(abs(cii - ii)); print "The matrices CINV and CINV*C and the error in inverting C are", cinv [format=8.3] " " cii [format=8.3] " " cerror; print "Note the using of formatting for proper spacing of matrix entries."; yy = { 21.71, -11.31, 21.91 }; * A literal 3x1 column vector; xx=cinv*yy; * Solve C X = Y by setting X = C(inv) Y ; print "The matrix C and the solution X of C X = Y are", cc " " xx [format=7.2] " " yy; print "EXERCISE: Check that X is a solution yourself!"; print ,; * A little bit of vertical space; * Finally, let's copy the matrix Cinv, xx, and yy to a SAS dataset ; * in case we might need it in SAS; * Unlike columns in a matrix, columns in a SAS dataset must have names, * which are the names of the corresponding SAS variables. * This is handled by exporting sets of column vectors so that SAS can * use the column vector names as SAS variable names. ; * The first step is to convert the 3x3 matrix Cinv into 3 column vectors; * so that SAS will have proper column names ; ci1 = cinv[1:3, 1]; * Rows 1-3 of 1st column of CINV ; ci2 = cinv[1:3, 2]; * Rows 1-3 of 2nd column of CINV ; ci3 = cinv[1:3, 3]; * Rows 1-3 of 3rd column of CINV ; * CREATE a new SAS dataset with 3 rows and 5 columns ; create dat5 var { ci1 ci2 ci3 xx yy }; append; * Otherwise nothing will be written to dat5; quit; * Quit Proc IML temporarily; * Now we are back in SAS: * Let's use proc print to display the new SAS dataset dat5; title2 'WE ARE NOW BACK IN REGULAR SAS'; proc print data=dat5; title3 'PROC PRINT: The dataset that we exported to SAS is:'; run; * Next, let's create a dataset with 30 rows and 5 columns of * random data. Recall that `ranuni()' defines a uniform * variate in (0,1), where is the initial random-number seed ; data randdat; input nn; * For each row of data after datalines; do ii=1 to nn; * ii is the row number: create 5 columns ; x1=10*ranuni(37); x2=10*ranuni(37); x3=10*ranuni(37); x4=10*ranuni(37); x5=10*ranuni(37); x2=(x1+x2)/2; * Force a (Col1,Col2) correlation; output; * `output' writes a row to `randdat' ; end; * `end' means the end of the `do-loop'; datalines; 30 run; proc print data=randdat; title2 'PROC PRINT IN SAS: A DATASET WITH 5 RANDOM VECTORS'; run; * Let's re-enter `proc iml' with our new matrix. * First, set title lines for proc iml output ; title2 "WE ARE BACK IN PROC IML AGAIN"; title3 "Find the covariance and correlation matrix of a set of random vectors"; proc iml; * RE-START the IML environment; use randdat; * Read the 5 random vectors into a 30x5 matrix; read all var { x1 x2 x3 x4 x5 } into xx; print "THE DATASET XX WITH FIVE COLUMN VECTORS IS", xx [format=9.4]; n=nrow(xx); * The number of rows of xx; jj=j(n); * An nxn matrix with all 1s ; xbar=jj*xx/n; * This defines COLUMN MEANS of xx as a nx5 matrix ; xx=xx-xbar; * xx is now centered at its column means; covar=xx`*xx/(n-1); * The matrix transpose is X` (backwards prime) ; * covar is a 5x5 matrix '; k=ncol(xx); * The number of columns of xx (here=5); * Covar is the 5x5 sample variance/covariance matrix of the columns of X; print "THE COVARIANCE MATRIX OF THE 5 COLUMNS OF X IS", covar [format=9.4]; * Note Corr(X,Y) = Cov(X)/(Sqrt(Var(X))*Sqrt(Var(Y)) ; * * If D is a diagonal matrix, then D*A multiplies the ROWS of A by * the diagonal elements of D, and A*D multiplies the COLUMNS of A. * Thus we want Corr(matrix) = D*Covar*D for appropriate D. * * diag(matrix) creates a diagonal matrix from `matrix' by setting all * entries=0 except for entries on the diagonal * Similarly, * diag(vector) creates a diagonal matrix with `vector' along * the diagonal. * * Most numerical functions, for example sqrt(), are applied * COMPONENT BY COMPONENT when applied to a vector or matrix * ; gdiag=diag(covar); * Diagonal matrix with Var(X_i) on diagonal; gg=sqrt(inv(gdiag)); * Now 1/Sqrt(Var(X_i)) on the diagonal; corr=gg*covar*gg; * Corr is now the correlation matrix; print "THE CORRELATION MATRIX OF THE 5 COLUMNS OF X IS", corr [format=9.4]; print "EIGENVALUES OF THE CORRELATION MATRIX:"; ************************************************; * Any symmetric matrix A (such as Corr) has an orthogonal set of * EIGENVECTORS: That is, solutions v=v_i of * * Av = la v, vector v=v_i ne 0, la=la_i * * Here la is called the EIGENVALUE of v for A. The fact that there * are enough of these to form an othogonal basis for the vector space * is called the SPECTRAL THEOREM for symmetric matrices. * * The SAS statement call(egvals,egvecs,corr) defines a 5x1 vector * of eigenvalues EGVALS of the matrix corr and also 5x5 matrix of * eigenvectors EGVECS. The i-th column of EGVECS is an eigenvector * of the i-th eigevalue in EGVALS for Corr. All of these eigenvectors * are normalized to have length one. Thus, if vi=Egvecs[1:5,i] * (so that vi is the i-th column), * * Corr*vi = Egval[i]*vi (*) * * The columns of Egvecs are normalized and orthogonal, so that * EGVECS is also an orthogonal matrix, and is in fact an orthogonal * matrix that diagonalizes Corr: That is, * ; * Egvecs*Diag(Egvals)*Egvecs` = Corr (**) ; * Egvecs`*Corr*Egvecs = Diag(Egvals) (**) ; * * where Egvecs`*Egvecs = Egvecs*Egvecs` = I_5 * (Which is true for any orthogonal matrix) * * It is easy to verify (**) from (*): Since the columns of EGVECS ; * (or the rows of EGVECS`) are orthogonal, yi=Egvecs`vi is the ; * i-th unit vector in R^n. Then Diag*yi = la_i y_i if la_i is the ; * i-th diagonal entry, and Egvecs*yi = the i-th column of EGVECS ; * = vi again. * * Find the eigenvalues (egvals) and eigenvectors (egvecs) of corr: ************************************************; call eigen (egvals, egvecs, corr); print "THE EIGENVALUES AND EIGENVECTORS OF CORR ARE", egvals [format=8.6] " " egvecs [format=9.4], , "(The columns of Egvecs are the eigenvectors of Corr)"; sumevals=sum(egvals); trcorr=sum(vecdiag(corr)); print "Sum of eigenvalues and tr(corr):" sumevals " " trcorr; print "CHECKING THAT EGVECS IS AN ORTHOGONAL MATRIX:"; dd = egvecs`*egvecs; print "EGVECS`*EGVECS is " dd [format=8.2]; dderr = sum(abs(dd-i(k))); print "THE INVERSION ERROR IS " dderr; * As a check:; * Recall that `diag(vector)' produces a diagonal matrix ; * with `vector' along the diagonal. ; cc = egvecs`*corr*egvecs; ccdif= cc - diag(egvals); ccerror = sum(abs(ccdif)); print "C = Egvecs`*Corr*Egvecs and the diagonalization error are", cc [format=8.3] " " ccerror; * Finally, let's see if we can use Proc IML to calculate the P-values * for correlation coefficients:; * If R is the Pearson correlation coefficient for independent * normal data, then T = Sqrt(n-2)*R/Sqrt(1-R*R) has a Student-t * distribution with n-2 degrees of freedom. * ; * In general, ; * M#M says to multiply matrices COMPONENTWISE, NOT as matrices ; * A/B means DIVIDE COMPONENTWISE (A and B must be the same size) ; * Probf(A,d1,d2) is Prob[F(d1,d2) le A] for an F distribution ; * Recall T(d)^2 = F(1,d), so that 1-Probf(T#T,1,d) returns ; * two-sided Student-t P-values ; * First, put `missing values' on the diagonal of corr to avoid * T=infinity errors. SAS handles missing values properly when * doing matrix calculations. In this case, it means that * diagonal entries will be carried over as `missing'.; do i=1 to k; corr[i,i]=.; end; * `end' means end of do-loop; * In the next two statements, all operations are COMPONENTWISE; * A MATRIX of Student T-statistics: ; * Recall that j(k) is the kxk matrix with all 1s, * so that j(k) - A#A is a matrix of values 1 - a(i,j)^2; tstat = sqrt(n-2)*corr/sqrt(j(k) - corr#corr); * A MATRIX of P-values; pvals = 1-probf(tstat#tstat,1,n-2); print "A MATRIX OF T STATISTICS FOR CORR (H_0:rho=0) IS", tstat [format=9.4]; print ,,; * Space down by two; print "A MATRIX OF P-VALUES FOR CORR (H_0:rho=0) IS", pvals [format=9.4]; print "THE CORRELATION MATRIX AGAIN", corr [format=9.4]; quit; * End Proc IML; ************************************************; * How do the correlations and P-values that we just calculated * compare with proc corr output? * In `proc corr', `nosimple' means to skip means etc of variables ************************************************; proc corr data=randdat nosimple; title2 "USING PROC CORR IN SAS:"; title3 "NOTE THAT CORRs and P-VALUEs ARE EXACTLY THE SAME !"; var x1 x2 x3 x4 x5; run;