Math 439 Homework 4 - Fall 2008

HOMEWORK #4 due 11-11

In the following, _ means subscript and ^ means superscript.

NOTE: In all problem sets that use SAS, arrange your answers into three parts, in the following order:

(I) Your answers to all questions, either written by hand or using a word processor,

(II) The SAS program source files (*.sas files) that you used in the problem set,

(III) The output from the SAS programs in Part II.

In Part I, you can refer to plots or tables or large matrices that problems ask for by saying (for example), ``The scatterplot or matrix for Problem 3 is on page 17 of the SAS output.'' If necessary, add page numbers to the SAS output, so that (for example) you don't have several different page 1s in Part III.

1. Heights and weights for 58 tribesmen on a tropical island are given in Table 1, with 33 from one tribe and 25 from a second tribe. Both tribes are known to be highly inbred. Both tribes are considered to be odd by other tribes on the island.

Table 1 - Heights and Weights for Tribesmen on a Tropical Island

Tribe A (n=33):
    47  166       49  163       51  152       51  157
    52  143       53  145       53  153       54  136
    54  152       54  155       54  163       55  136
    56  140       56  149       57  132       57  133
    58  125       58  135       60  129       60  129
    62  138       66  110       66  130       66  145
    67  113       68  118       68  135       69  116
    70  124       70  132       74  119       75  100
    77   97
Tribe B (n=25):
    49  170       51  158       51  164       54  162
    57  148       59  135       59  155       61  123
    61  133       64  122       65  121       65  129
    65  132       65  138       66  142       67  134
    67  141       67  148       68  121       68  136
    69  148       70  114       71  128       73  120
    73  130

(i) Do the two tribes differ significantly by height? by weight? Do Student t-tests to test the appropriate hypothesis in each case.

(ii) Do the two tribes differ in (height,weight) together, considered as a vector? Do a Hotelling T^2 test to find out.

(iii) What are the (Pearson) correlation coefficients between height and weight within each tribe? Are they similar?

(iv) What are the covariance matrices of (height,weight) within each tribe? What is the pooled covariance matrix for the two tribes together?
(Hint: Use either dedicated SAS procedures or proc iml within SAS or both.)
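
As an optional cross-check outside SAS, the pooled covariance matrix and the two-sample Hotelling T^2 statistic in parts (ii) and (iv) can be sketched in a few lines of Python. This is a minimal sketch for p=2 variables only, not a replacement for the SAS output. The function names (cov_matrix, pooled_cov, hotelling_t2) and the toy data are hypothetical and are not the tribes in Table 1.

```python
from statistics import mean

def cov_matrix(rows):
    """Sample covariance matrix (divisor n-1) of a list of (x, y) pairs."""
    n = len(rows)
    mx = mean(r[0] for r in rows)
    my = mean(r[1] for r in rows)
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def pooled_cov(a, b):
    """Pooled covariance: ((n1-1) S1 + (n2-1) S2) / (n1 + n2 - 2)."""
    n1, n2 = len(a), len(b)
    s1, s2 = cov_matrix(a), cov_matrix(b)
    return [[((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2)
             for j in range(2)] for i in range(2)]

def hotelling_t2(a, b):
    """Two-sample Hotelling T^2 and the equivalent F statistic (p = 2)."""
    n1, n2, p = len(a), len(b), 2
    # Difference of the two group mean vectors.
    d = [mean(r[k] for r in a) - mean(r[k] for r in b) for k in range(2)]
    s = pooled_cov(a, b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    # Quadratic form d' S^{-1} d, using the explicit inverse of a 2x2 matrix.
    q = (s[1][1] * d[0] ** 2 - 2 * s[0][1] * d[0] * d[1]
         + s[0][0] * d[1] ** 2) / det
    t2 = (n1 * n2 / (n1 + n2)) * q
    # Under H0, f has an F distribution with (p, n1+n2-p-1) degrees of freedom.
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f

# Made-up toy data for illustration, NOT the data in Table 1.
group1 = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
group2 = [(2.0, 4.0), (3.0, 3.0), (4.0, 5.0)]
print(pooled_cov(group1, group2))
print(hotelling_t2(group1, group2))
```

The explicit 2x2 inverse keeps the sketch free of any matrix library. For more than two variables, a matrix package (or proc iml) would be the natural choice.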

2. Use Proc IML in SAS to do the following:

(i) Define and display a 40x6 matrix X whose entries are realizations of independent normally-distributed random variables with mean zero and variance one. The 40x6=240 displayed values should be mostly in the range -2 to 2 with a few values outside that range.

(Hints: In Proc IML, the command B=J(m,n) generates an mxn matrix all of whose entries equal one. If Y is any mxn matrix, W=normal(Y) generates an mxn matrix W whose entries are realizations of independent normally-distributed random variables with mean zero and variance one. Thus W=normal(Y) depends on Y only through its dimensions.)

(ii) Display the matrix W=X'X. Note that this is a 6x6 matrix that is a realization of a Wishart distribution W_6(40,I_6), where I_6 is the 6x6 identity matrix. In particular, the diagonal elements will be independent realizations of a chi-square distribution with 40 degrees of freedom.

(iii) Find and display the 6x6 correlation matrix Q of the columns of X.

(Hint: See the proc iml in MPairedSamp.sas or PCAApples.sas on the Math439 Web site. As a check, the diagonal elements of Q should be all 1s and the off-diagonal terms will be in the range of -1 to 1.)

(iv) Find and display the 6 eigenvalues of Q. (Hint: Note the use of the subroutine call eigen(evals,evecs,aa) in PCAApples.sas.)

(v) Compute la_max/la_min, where la_max is the largest and la_min is the smallest of the 6 eigenvalues. (You can do this part by hand.)

(Remark: If you have done this correctly, then la_max/la_min should be somewhere in the range of 2 to 6 or nearby. For many real data sets with more than 3 or 4 covariates, the value of la_max/la_min is much larger than this. This is an indication that the true dimensionality of many multidimensional data sets is much smaller than the actual number of covariates.)
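
To see what sort of value of la_max/la_min to expect, the steps of Problem 2 can be mimicked outside SAS. The following is a rough sketch in plain Python, under these assumptions: random.gauss stands in for normal() in Proc IML, the seed 439 is arbitrary, and the eigenvalues are found by a simple cyclic Jacobi iteration rather than by whatever method SAS uses. All function names are hypothetical. The numbers will not match your SAS output, since the random values differ.

```python
import math
import random

def transpose_times(x):
    """Return X'X for a matrix stored as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)]
            for i in range(p)]

def correlation_matrix(x):
    """Pearson correlation matrix of the columns of x."""
    n, p = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in x]
    s = transpose_times(centered)  # proportional to the covariance matrix
    return [[s[i][j] / math.sqrt(s[i][i] * s[j][j]) for j in range(p)]
            for i in range(p)]

def jacobi_eigenvalues(a, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in a]  # work on a copy
    p = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-300:
                    continue
                # Angle that zeroes the (i, j) entry: tan(2*theta) = 2b/(a-d).
                theta = 0.5 * math.atan2(2 * a[i][j], a[i][i] - a[j][j])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # rotate rows i and j
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k] = c * aik + s * ajk
                    a[j][k] = -s * aik + c * ajk
                for k in range(p):  # rotate columns i and j
                    aki, akj = a[k][i], a[k][j]
                    a[k][i] = c * aki + s * akj
                    a[k][j] = -s * aki + c * akj
    return sorted((a[i][i] for i in range(p)), reverse=True)

random.seed(439)  # arbitrary seed, so that reruns give the same matrix
x = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(40)]
w = transpose_times(x)        # 6x6 matrix X'X
q = correlation_matrix(x)     # 6x6 correlation matrix of the columns
evals = jacobi_eigenvalues(q)
ratio = evals[0] / evals[-1]  # la_max / la_min
print(evals)
print(ratio)
```

Since Q has ones on the diagonal, its eigenvalues must sum to 6. That is a quick sanity check on any eigenvalue routine, in SAS or elsewhere.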

3. Let X be a normally distributed random dx1 column vector with E(X)=0 and Cov(X)=A. (That is, X is N(0,A) where A is dxd.) Assume that A is invertible. Show that X'A^{-1}X has a chi-square distribution with d degrees of freedom. (Hint: Find a dxd matrix B such that N=BX is N(0,I_d) and note that X=B^{-1}N.)

4. Aggregate data for five demographic variables in 14 census tracts in the Madison, Wisconsin, area are given in Table 2.

     Table 2 - Data for 14 US census tracts near Madison, Wisconsin
# From Johnson&Wichern, ``Applied Multivariate Statistical Analysis'',
#   5th ed, 2002, Table 8.5, p470
# Variables: TotalPopn (1000s), Median Years of Schooling, TotalEmployed (1000s),
#   Health Services Employment (100s), Median Home Value ($10,000s)
             TotPop  MedSchYr TotEmploy  HealthEmp  MedValHom
  Tract01     5.935    14.2     2.265     2.27       2.91
  Tract02     1.523    13.1     0.597     0.75       2.62
  Tract03     2.599    12.7     1.237     1.11       1.72
  Tract04     4.009    15.2     1.649     0.81       3.02
  Tract05     4.687    14.7     2.312     2.50       2.22
  Tract06     8.044    15.6     3.641     4.51       2.36
  Tract07     2.766    13.3     1.244     1.03       1.97
  Tract08     6.538    17.0     2.618     2.39       1.85
  Tract09     6.451    12.9     3.147     5.52       2.01
  Tract10     3.314    12.2     1.606     2.18       1.82
  Tract11     3.777    13.0     2.119     2.83       1.80
  Tract12     1.530    13.8     0.798     0.84       4.25
  Tract13     2.768    13.6     1.336     1.75       2.64
  Tract14     6.585    14.9     2.763     1.91       3.17

(i) Use either SAS procedures or SAS's proc iml to do a Principal Components Analysis for the data in Table 2. How many principal components are required to explain at least 85% of the total variation in the data? (Hint: See e.g. PCAApples.sas on the Math439 Web site.)

(ii) How do the principal components in part (i) appear to characterize the census tracts? For each PC listed in part (i), how would census tracts that were high in that principal component tend to differ from the average? Ignore coefficients whose absolute value is smaller than about 0.35.

(iii) What percentage of the total variability of the data in Table 2 is explained by these principal components?

(iv) Construct a scree plot for the eigenvalues. Do the eigenvalues appear to drop off suddenly at one point?
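
The bookkeeping behind parts (i) and (iii) is simple once the eigenvalues are known: divide the running sum of the sorted eigenvalues by their total. A minimal sketch in Python follows. The function names are hypothetical, and the eigenvalues in the example are invented for illustration. They are NOT the eigenvalues for Table 2.

```python
def cumulative_proportions(eigenvalues):
    """Cumulative share of total variance, eigenvalues sorted largest first."""
    evals = sorted(eigenvalues, reverse=True)
    total = sum(evals)
    running, out = 0.0, []
    for ev in evals:
        running += ev
        out.append(running / total)
    return out

def components_needed(eigenvalues, threshold):
    """Smallest k such that the first k PCs explain at least `threshold`."""
    shares = cumulative_proportions(eigenvalues)
    for k, share in enumerate(shares, start=1):
        # Small tolerance so that a share equal to the threshold counts.
        if share >= threshold - 1e-12:
            return k
    return len(shares)

# Hypothetical eigenvalues chosen for illustration, NOT those of Table 2.
example = [2.5, 1.5, 0.6, 0.3, 0.1]
print(cumulative_proportions(example))
print(components_needed(example, 0.85))
```

In this invented example the first two components explain 80% of the variation and the first three explain 92%, so three components are needed to reach 85%. Plotting the sorted eigenvalues against their index gives the scree plot of part (iv).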

5. Annual reports for 1990 for the 10 largest US companies are given in Table 3.

     Table 3 - Data for the 10 largest US Corporations in 1990
# From Johnson&Wichern, ``Applied Multivariate Statistical Analysis'',
#   5th ed, Problem 1.4, p39, 2002
# Source: Fortune Magazine (April 23, 1990) pp. 346-367. (c) 1990 Time Inc.
# All numbers are in millions of dollars.
                      Sales   Profits  Assets
    General_Motors   126974    4224    173297
    Ford              96933    3835    160893
    Exxon             86656    3510     83219
    IBM               63438    3758     77734
    General_Electric  55264    3939    128344
    Mobil             50976    1809     39080
    Philip_Morris     39069    2946     38528
    Chrysler          36156     359     51038
    Du_Pont           35209    2480     34715
    Texaco            32416    2413     25636

(i) Use SAS's proc princomp (or a matrix package) to do a Principal Components Analysis for the data in Table 3. How many principal components are required to explain at least 90% of the variation in the data? What percentage of the variability of the data in Table 3 is explained by these principal components?

(ii) Construct a scree plot for the eigenvalues, as a way of showing graphically the relative sizes of the eigenvalues. (Hint: See MensTrackPCA.sas on the Math439 Web site.)

(iii) As one might have expected in advance, the first PC (principal component) is a measure of the overall size of the corporation, since larger (smaller) corporations tend to have uniformly larger (resp. smaller) amounts of sales, profits, and assets.

What does the second PC measure? For example, how do companies that have relatively large positive values of the second principal component differ from the average? Companies with large negative values?

Sort the 10 corporations by the values of their second PC. Which companies are at the top of the list? At the bottom of the list? What does each set of companies have in common?

(Hint: If you use Company for the company name in Table 3 in a SAS data step, include the command length Company $16; before the input command. Otherwise SAS will truncate the company name to 8 characters and you won't be able to tell General_Motors from General_Electric. The length command tells SAS to allow up to 16 characters.)
