************************************************************; * One-way layout for vector-valued data * (MANOVA for Multivariate Analysis of Variance) * * Assume that we are given k d-dimensional VECTOR-VALUED samples * * Y_1a Y_2a ... Y_(n_a)a for 1 le a le k * * where n_a is the size of the a-th sample, so that Y_ia is * a nxk matrix if the sample sizes n_a=n are constant. * * We assume that the Y_ia are independent N(mu_a,Sigma). That is, * the k samples have the same covariance matrix, but may have * different mean vectors. Samples in a one-way layout are often * called ``treatment groups'', even if there are no treatments * involved. * * As in the classical univariate one-way ANOVA case, we want to test * * H0:mu_1=mu_2=mu_3=...=mu_k for VECTOR-VALUED means mu_a * * In the example below, k=3 and H_0 means that three sets of skulls * have approximately the same shape. * * For d=1, the classical one way ANOVA test is based on a * ``treatment-group sum of square'' * * H = Sum(a=1,k) n_a (Ybar_a - Ybar)^2 * * where Ybar_a is the mean of the a-th treatment group, n_a is * the size of the a-th treatment group, and Ybar is the overall mean, * and the ``error sum of squares'' * * E = Sum(a=1,k) Sum(j=1,n_a) (Y_ja - Ybar_a)^2 * * Given H_0, H is distributed as sigma^2 chi^2(k-1), where the * 1x1 matrix Sigma is the number sigma^2 with d=1, E is * distributed as sigma^2 chi^2(N-k), where N=Sum(a=1,k) n_a is * the total sample size, and H and E are independent. This leads * to the classical statistic * * F = ((1/(k-1))H/((1/(N-k))E) = (N-k)/(k-1) H/E * * which is distributed as F(k-1,N-k) given H_0, and which is used * to test H_0 if d=1. * * In the multivariate case, H and E become the dxd MATRICES * * H = Sum(a=1,k) n_a (Ybar_a - Ybar)(Ybar_a - Ybar)' * * E = Sum(a=1,k) Sum(j=1,n_a) (Y_ja - Ybar_a)(Y_ja - Ybar_a)' * * A difficulty now is that we don't know what H/E means if H and E * are dxd matrices. * * As in the one-sample and two-sample cases, one can show that E * has the Wishart distribution W(d,N-k,Sigma), so that E is * invertible if N ge k+d. However, the three matrices * * (*) HE^{-1} E^{-1}H E^{-1/2}HE^{-1/2} * * are equally plausible for test ``statistics'', even though they * are matrices. In general, however, these three matrices are * different. * * Fortunately, the three matrices in (*) have the same EIGENVALUES, * since the eigenvalues are the same as the solutions of the equation * (for example) * * det(HE^{-1} - zI) = det(H - zE)/det(E) = 0 * * Since the third matrix in (*) is positive semi-definite, there are * d non-negative eigenvalues, which we can assume are in the order * * la_1 ge la_2 ge ... ge la_d ge 0 * * If r_H=rank(H)0, and * the remaining la_i=0. * * For a MANOVA, the form of H implies that r_H = max{d, k-1}. * In particular, if there are two samples (k=2), then r_H=1 and * there is only one nonzero eigenvalue, la_1>0. In this case, * one can show that la_1 is a constant times the Hotelling * T^2 statistic, which we already know has an F distribution. * (See the Multivariate notes on the Math439 Web site.) * * In general, the likelihood-ratio test statistic for H_0 is * * Lambda = det(E)/det(E+H) = Prod(i=1,d) 1/(1+la_i) * * which is called the Wilk's Lambda statistic. However, there are * three other commonly tests of H_0, which involve different * combinations of the la_i. If there is only one non-zero la_i, * then all four tests are equivalent, which is the case for one- * or two-sample Hotelling T^2 tests and in general if r_H=1. * * See MLizards.sas for notes about the SAS code below. * * Note that the eigenvalues la_i and the eigenvectors of E^{-1}H * are listed in the output. In the following example, k=3 and * there are only two nonzero eigenvalues. ************************************************************; title 'MANOVA TEST FOR THREE SETS OF SKULLS - YOURNAME'; title2 'Skulls from 3 eras: (1) 4000BC (2) 3300BC (3) 1850BC'; title3 'DATA FROM Johnson&Wichern Table 6.13 p344'; options nodate ls=75 ps=60 pageno=1 nocenter; data skulls; infile "EgyptSkulls.dat" firstobs=8; * infile "c:\stat\multivar\EgyptSkulls.dat" firstobs=8; input Ord Time y1 y2 y3 y4; * Alternative informative names for the variables; * These will often also appear in SAS output; label y1="Breadth" y2="Height" y3="Depth" y4="NoseToTop"; run; proc print data=skulls; title4 'THE DATA AS SAS SEES IT'; id ord; run; ************************************************************; * See MPairedSamp for a discussion of the syntax of Proc Glm * for multivariate ANOVA tests *************************************************************; proc glm; title3 'MANOVA TEST FOR THREE SETS OF SKULLS: y1-y4'; class Time; model y1 y2 y3 y4 = Time; manova h = Time / printe; run; ************************************************************; * The univariate tests suggest that only y1 and y3 are changing. * Let's see if we get more significance: *************************************************************; proc glm; title3 'MANOVA TEST FOR THREE SETS OF SKULLS: y1,y3 only'; title4 'This is more significant than with y1 y2 y3 y4!'; class Time; model y1 y3 = Time / nouni; manova h = Time / printe; run; ************************************************************; * How are the variables changing over Time? * The following displays means and variances of each variable * for each epoch. * SAS Note: ``By Var'' generally means for each value of Var * Note the abbreviation y1-y4 for y1 y2 y3 y4 * Breadth seems to be increasing and Depth decreasing *************************************************************; proc means mean var maxdec=4; title3 'PROC MEANS: Means for each time period'; title4 'y1 seems to be increasing, y3 decreasing'; by Time; var y1-y4; run; data skulls; set skulls; if Time=0 then Tsym='A'; else if Time=700 then Tsym='B'; else if Time=2150 then Tsym='C'; else Tsym='?'; run; proc plot data=skulls; title2 'Skulls from three eras: A=4000BC B=3300BC C=1850BC'; title3 'PROC PLOT: Y1=Breadth vs Y3=Depth'; plot y1*y3=Tsym; run; ************************************************************; * Display the correlation matrix, including with Time, * is also useful, along with the Pearson P-values for * significance of correlations. *************************************************************; proc corr; title3 'PROC CORR: Correlation matrix for Time and y1-y4'; var Time y1-y4; run;