***********************************************;
* Using SAS to work with matrices
*
* SAS's Proc IML (``Interactive Matrix Language'') is a general matrix
*   package that can be used to add and multiply matrices, find matrix
*   inverses, solve systems of linear equations, display matrices,
*   and export results to SAS for use in statistical routines.
*
* The following gives two sets of examples:  First,
* (1) we import two matrices A, B into Proc IML from SAS datasets and
*    find A*B, C=B*A, inv(C), and solve C X=Y,
* (2) we export inv(C),X,Y back to SAS and display it,
* (3) we generate a 30x5 random matrix X in SAS,
* (4) we calculate the 5x5 covariance and correlation matrices of the
*   columns of X,
* (5) we find the eigenvalues and eigenvectors of the correlation
*   matrix of the columns of X,
* (6) using matrix operations, we compute a 5x5 matrix of T-statistics
*   and Pearson-correlation-P-values for the columns of X.
*
************************************************;

title 'MATRIX CALCULATIONS IN SAS - YOURNAME';
options ls=75 ps=60 pageno=1 nocenter;

************************************************;
* Note that SAS datasets are matrices already, except that
*   some columns can be text:  A matrix in SAS is either a SAS
*   dataset or else a rectangular portion of a SAS dataset.
*   Of course, we can't multiply or invert matrices unless
*   all of their columns are numbers.

* First, we create a 5x3 matrix A and a 3x5 matrix B
*   as two SAS datasets.
*
* Recall that aa1-aa3 is SAS shorthand for aa1 aa2 aa3
*
************************************************;

data adat;
     input aa1-aa3;
datalines;
      5   4   1
      7   9   7
      0   1   1
     12  -2   1
      7   6   5
     run;

data bdat;
     input bb1-bb5;
datalines;
     5   4   1   4   5 
     4   3   1   3   4 
     1   3  -2   3  11 
     run;


************************************************;
* Now use SAS's Proc IML to work with these matrices      ;
************************************************;

title2 "NOW WE ARE IN PROC IML, SAS'S MATRIX ENVIRONMENT"; 

proc iml;    * START the IML environment;

use adat;    * Make this the default IML SAS dataset;

read all var { aa1 aa2 aa3 } into aa;  * Define matrix A (5x3);

use bdat;    * Make this the default IML SAS dataset;

read all var { bb1 bb2 bb3 bb4 bb5 } into bb;  * Define B (3x5);

print "The matrices A (5x3) and B (3x5) are",
   aa "     " bb;

************************************************;
* NOTES:
*
* (1) ALL STATEMENTS in proc iml MUST END with a semi-colon
*     (as in regular SAS);
*
* (2) `var {<list>} into MM'   says to build a matrix called MM
*   using the columns in <list> in the current SAS dataset, 
*
* `read all' says use all rows in the default IML SAS dataset,
*    so that the matrix has the same number of rows as the dataset
*    has records.
*
* `read point { 1 2 3 } var ...'  would mean to use only those row
*    numbers, creating in this case a matrix with three rows;
*
* ALTERNATIVELY, you could define matrices as rectangular sections
*   of a single SAS dataset or larger matrix by saying, for example,
*
*     read all var { aa1 aa2 aa3  xx yy zz  bb1 bb2 bb3 bb4 bb5 }
*        into alldat;
*
*     aa=alldat[1:5, 1:3];    * Rows 1-5 of columns 1-3;
*     bb=alldat[1:3, 7:11];   * bb is a 3x5 matrix from columns 7-11;
*
* Unfortunately, you cannot use the useful abbreviations  aa1-aa3  ;
*   and bb1-bb5  within proc iml;
* 
* Now compute the 3x3 matrix C=BA and the 5x5 matrix D=AB;
*
* Isn't this easier than doing it by hand? ;
************************************************;


cc=bb*aa;  dd=aa*bb;


************************************************;
* (3) You can write directly to the output (*.lst) file in
*  proc iml using `print'.  In print statements,
*
*   COMMAS always mean a new line
*   SPACES mean spacing on the SAME LINE, if there is room 
*     (otherwise Proc IML goes to the next line)  ;
*                                          ;
************************************************;


print "The matrices C=BA (3x3) and D=AB (5x5) are", cc "   " dd;

print "EXERCISE: Check these calculations yourself!";

* The TRACE of a matrix is the sum of its diagonal terms, which has ;
*   the property than trace(AB)=trace(BA), even if the matrices ;
*   AB and BA are of different sizes.  ;
*
* In general in Proc IML
*   sum(vector) or sum(matrix) is the sum of the entries
*   vecdiag(matrix) is the diagonal of the matrix, viewed as a vector ;

trc=sum(vecdiag(cc));
trd=sum(vecdiag(dd));

print "The traces Trace(C=BA) and Trace(D=AB) are " trc "   " trd;

* The inverse of the matrix C=BA is;

cinv=inv(cc);

* Multiply C(inverse)*C, which should be the identity matrix ;

cii = cinv*cc;

* We want to see how close cii is to the 3x3 identity matrix
* We first create an identity matrix I of the same size as C=BA
* It is safer to use a Proc IML command to get the size of C rather than
*   just saying 3, since a literal `3' could create unnecessary problems
*   later if you used a different C ;

n = nrow(cc);   * The number of rows in C, here 3;
ii = i(n);      * The 3x3 Identity matrix ;

* abs(matrix) returns a matrix of exactly the same size as `matrix';
*   with all components replaced by their absolute values
* Thus sum(abs(cii - ii)) is the sum of the absolute values of
*   the errors in inverting the matrix C=BA;

cerror = sum(abs(cii - ii));

print "The matrices CINV and CINV*C and the error in inverting C are",
    cinv  [format=8.3] "   " cii [format=8.3] "   " cerror;

print 
 "Note the using of formatting for proper spacing of matrix entries.";

yy = { 21.71, -11.31, 21.91 };   * A literal 3x1 column vector;
xx=cinv*yy;     * Solve C X = Y by setting X = C(inv) Y ;

print "The matrix C and the solution X of C X = Y are",
  cc "    " xx [format=7.2] "    " yy;

print "EXERCISE: Check that X is a solution yourself!";

print ,;     * A little bit of vertical space;

* Finally, let's copy the matrix Cinv, xx, and yy to a SAS dataset ;
*   in case we might need it in SAS;
* Unlike columns in a matrix, columns in a SAS dataset must have names,
*   which are the names of the corresponding SAS variables.
* This is handled by exporting sets of column vectors so that SAS can
*   use the column vector names as SAS variable names.	;

* The first step is to convert the 3x3 matrix Cinv into 3 column vectors;
*   so that SAS will have proper column names ;

ci1 = cinv[1:3, 1];   * Rows 1-3 of 1st column of CINV ;
ci2 = cinv[1:3, 2];   * Rows 1-3 of 2nd column of CINV ;
ci3 = cinv[1:3, 3];   * Rows 1-3 of 3rd column of CINV ;

* CREATE a new SAS dataset with 3 rows and 5 columns ;

create dat5 var { ci1 ci2 ci3 xx yy };
append;    * Otherwise nothing will be written to dat5;

quit;      * Quit Proc IML temporarily;

* Now we are back in SAS:
* Let's use proc print to display the new SAS dataset dat5;

title2 'WE ARE NOW BACK IN REGULAR SAS';

proc print data=dat5;
     title3 'PROC PRINT: The dataset that we exported to SAS is:';
     run;


* Next, let's create a dataset with 30 rows and 5 columns of 
* random data.  Recall that `ranuni(<num>)' defines a uniform
*   variate in (0,1), where <num> is the initial random-number seed ;

data randdat;
     input nn;          * For each row of data after datalines;
     do ii=1 to nn;     * ii is the row number: create 5 columns ;
       x1=10*ranuni(37);  x2=10*ranuni(37);
       x3=10*ranuni(37);  x4=10*ranuni(37);
       x5=10*ranuni(37);
       x2=(x1+x2)/2;    * Force a (Col1,Col2) correlation;
       output;      * `output' writes a row to `randdat' ;
       end;         * `end' means the end of the `do-loop';
datalines;
     30
     run;


proc print data=randdat;
     title2 'PROC PRINT IN SAS: A DATASET WITH 5 RANDOM VECTORS';
     run;


* Let's re-enter `proc iml' with our new matrix.
* First, set title lines for proc iml output ;

title2 "WE ARE BACK IN PROC IML AGAIN";
title3
 "Find the covariance and correlation matrix of a set of random vectors";

proc iml;       * RE-START the IML environment;

use randdat;    * Read the 5 random vectors into a 30x5 matrix;

read all var { x1 x2 x3 x4 x5 } into xx;

print "THE DATASET XX WITH FIVE COLUMN VECTORS IS", xx [format=9.4];

n=nrow(xx);      * The number of rows of xx;
jj=j(n);         * An nxn matrix with all 1s ;
xbar=jj*xx/n;    * This defines COLUMN MEANS of xx as a nx5 matrix ;
xx=xx-xbar;      * xx is now centered at its column means;

covar=xx`*xx/(n-1);    * The matrix transpose is X` (backwards prime)  ;
                       * covar is a 5x5 matrix ';
k=ncol(xx);            * The number of columns of xx (here=5);

* Covar is the 5x5 sample variance/covariance matrix of the columns of X;

print "THE COVARIANCE MATRIX OF THE 5 COLUMNS OF X IS",
   covar [format=9.4];

* Note Corr(X,Y) = Cov(X)/(Sqrt(Var(X))*Sqrt(Var(Y))   ;
*
* If D is a diagonal matrix, then D*A multiplies the ROWS of A by
*   the diagonal elements of D, and A*D multiplies the COLUMNS of A.
*   Thus we want  Corr(matrix) = D*Covar*D  for appropriate D.
*
* diag(matrix) creates a diagonal matrix from `matrix' by setting all
*   entries=0 except for entries on the diagonal
* Similarly,
*   diag(vector) creates a diagonal matrix with `vector' along
*     the diagonal.
*
* Most numerical functions, for example sqrt(), are applied
*   COMPONENT BY COMPONENT when applied to a vector or matrix
*                                         ;

gdiag=diag(covar);    * Diagonal matrix with Var(X_i) on diagonal;
gg=sqrt(inv(gdiag));  * Now 1/Sqrt(Var(X_i)) on the diagonal;

corr=gg*covar*gg;     * Corr is now the correlation matrix;

print "THE CORRELATION MATRIX OF THE 5 COLUMNS OF X IS",
   corr [format=9.4];

print "EIGENVALUES OF THE CORRELATION MATRIX:";

************************************************;
* Any symmetric matrix A (such as Corr) has an orthogonal set of
*   EIGENVECTORS: That is, solutions v=v_i of
*
*     Av = la v,   vector v=v_i ne 0,  la=la_i
*
*   Here la is called the EIGENVALUE of v for A.  The fact that there
*   are enough of these to form an othogonal basis for the vector space
*   is called the SPECTRAL THEOREM for symmetric matrices.
*
* The SAS statement   call(egvals,egvecs,corr)  defines a 5x1 vector
*   of eigenvalues EGVALS of the matrix corr and also 5x5 matrix of
*   eigenvectors EGVECS.  The i-th column of EGVECS is an eigenvector
*   of the i-th eigevalue in EGVALS for Corr. All of these eigenvectors
*   are normalized to have length one. Thus, if vi=Egvecs[1:5,i]
*   (so that vi is the i-th column),
*
*      Corr*vi = Egval[i]*vi  (*)
*
*   The columns of Egvecs are normalized and orthogonal, so that
*   EGVECS is also an orthogonal matrix, and is in fact an orthogonal
*   matrix that diagonalizes Corr:  That is,
*                                                                ;
*       Egvecs*Diag(Egvals)*Egvecs` = Corr     (**) ;
*       Egvecs`*Corr*Egvecs = Diag(Egvals)     (**) ;
*
*   where  Egvecs`*Egvecs = Egvecs*Egvecs` = I_5
*   (Which is true for any orthogonal matrix)
*
* It is easy to verify (**) from (*):  Since the columns of EGVECS ;
*   (or the rows of EGVECS`) are orthogonal, yi=Egvecs`vi is the ;
*   i-th unit vector in R^n. Then Diag*yi = la_i y_i if la_i is the ;
*   i-th diagonal entry, and Egvecs*yi = the i-th column of EGVECS ;
*   = vi again.
*
* Find the eigenvalues (egvals) and eigenvectors (egvecs) of corr:
************************************************;

call eigen (egvals, egvecs, corr);

print "THE EIGENVALUES AND EIGENVECTORS OF CORR ARE",
   egvals [format=8.6] "    " egvecs [format=9.4], ,
  "(The columns of Egvecs are the eigenvectors of Corr)";

sumevals=sum(egvals);  trcorr=sum(vecdiag(corr));
print "Sum of eigenvalues and tr(corr):" sumevals "   " trcorr;

print "CHECKING THAT EGVECS IS AN ORTHOGONAL MATRIX:";
dd = egvecs`*egvecs;
print "EGVECS`*EGVECS is " dd [format=8.2];

dderr = sum(abs(dd-i(k)));
print "THE INVERSION ERROR IS " dderr;

* As a check:;
* Recall that `diag(vector)' produces a diagonal matrix ;
*   with `vector' along the diagonal. ;

cc = egvecs`*corr*egvecs;
ccdif= cc - diag(egvals);
ccerror = sum(abs(ccdif));

print "C = Egvecs`*Corr*Egvecs and the diagonalization error are",
   cc [format=8.3] "    " ccerror;

* Finally, let's see if we can use Proc IML to calculate the P-values
*   for correlation coefficients:;

* If R is the Pearson correlation coefficient for independent
*   normal data, then T = Sqrt(n-2)*R/Sqrt(1-R*R)  has a Student-t
*   distribution with  n-2 degrees of freedom.
*                    ;
* In general, ;
*   M#M says to multiply matrices COMPONENTWISE, NOT as matrices ;
*   A/B means DIVIDE COMPONENTWISE (A and B must be the same size) ;
*   Probf(A,d1,d2)  is   Prob[F(d1,d2) le A] for an F distribution ;
*   Recall T(d)^2 = F(1,d), so that 1-Probf(T#T,1,d) returns ;
*     two-sided Student-t P-values ;

* First, put `missing values' on the diagonal of corr to avoid
*   T=infinity errors. SAS handles missing values properly when
*   doing matrix calculations. In this case, it means that
*   diagonal entries will be carried over as `missing'.;

do i=1 to k;  corr[i,i]=.;  end;  * `end' means end of do-loop;

* In the next two statements, all operations are COMPONENTWISE;

* A MATRIX of Student T-statistics: ;
* Recall that j(k) is the kxk matrix with all 1s,
*   so that  j(k) - A#A  is a matrix of values  1 - a(i,j)^2;

tstat = sqrt(n-2)*corr/sqrt(j(k) - corr#corr);

* A MATRIX of P-values;
pvals = 1-probf(tstat#tstat,1,n-2);

print "A MATRIX OF T STATISTICS FOR CORR (H_0:rho=0) IS",
   tstat [format=9.4];

print ,,;   * Space down by two;

print "A MATRIX OF P-VALUES FOR CORR (H_0:rho=0) IS",
   pvals [format=9.4];

print "THE CORRELATION MATRIX AGAIN",
   corr [format=9.4];

quit;    * End Proc IML;

************************************************;
* How do the correlations and P-values that we just calculated
*   compare with proc corr output?
* In `proc corr', `nosimple' means to skip means etc of variables
************************************************;

proc corr data=randdat nosimple;
     title2 "USING PROC CORR IN SAS:";
     title3 "NOTE THAT CORRs and P-VALUEs ARE EXACTLY THE SAME !";
     var x1 x2 x3 x4 x5;
     run;