Math 439 Takehome Final - Fall 2008


    TAKEHOME FINAL due Thursday 12-18 by 5:30 P.M.
    Hand in either to Professor Sawyer or to the receptionist in the Mathematics Office.

    NOTE: There should be NO COLLABORATION on the takehome final,
       other than for the mechanics of using the computer.

    Open textbook and notes (including course handouts).

    ORGANIZE YOUR WORK in the following manner:

    (i) your answers to all questions written out separately,
    (ii) all SAS programs that you use, if you use SAS for any problems, followed by
    (iii) all SAS output.
    ADD CONSECUTIVE PAGE NUMBERS to the output so that you can make references from part (i) to part (iii). For example, you should be able to say things like ``The answer to part (a) is 7.71. The scree plot for part (b) is on page #Y below.''

    NOTE: In the following, _ means subscript, ^ means superscript, le means `less than or equal', ge means `greater than or equal', and ne means `not equal'.

    Whole problems are equally weighted, but different parts of problems may be weighted differently.
    Seven (7) problems.

    
    

    1. (i) Let Z = (Y X)' be a 5x1 vector written in block form for a 3x1 vector Y and a 2x1 vector X. Suppose that Z has a five-dimensional joint normal distribution with covariance matrix and mean vector given in block form by

                 (  6   3   3  |  -1  -1 )              (  1  )
                 (  3   2   1  |   0  -1 )              (  3  )
      Cov(Z) =   (  3   1   7  |   1   1 )      E(Z) =  (  1  )
                 ( --------------------- )              ( ----)
                 ( -1   0   1  |   7   3 )              (  1  )
                 ( -1  -1   1  |   3   5 )              (  3  )  
    Find the (3x3) conditional covariance matrix Cov(Y | X=(5 3)') and the (3x1) conditional mean vector E(Y | X=(5 3)'). Do this either by hand or else by using SAS's proc iml or a comparable matrix language.

    (ii) Find the eigenvalues of the 3x3 matrices Cov(Y) and Cov(Y | X=(5 3)'). Are they similar or very different?

    (Warning: If you do this problem by hand, do not expect the numbers to come out even.)
    (Hints: You can create matrices with given values in proc iml by setting, for example, yy = { 1 2 3, 4 5 6, 7 8 9 }; (note the curly braces), where spaces separate entries within a row and commas start a new row. (This defines a 3x3 matrix.) Alternatively, you can define data sets using SAS data steps and import columns into a matrix in proc iml. (See examples on the Math439 Web site.) You can find eigenvalues and eigenvectors of symmetric matrices in proc iml using the function call eigen. (See PCAApples.sas on the Math439 Web site for an example.))
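    A minimal proc iml sketch along these lines (the partition names SigYY, SigYX, SigXX and the organization of the calculation below are one possible choice, not required code; the conditional-normal formulas are the standard ones from the text) might look roughly like:

        proc iml;
           /* 5x5 covariance matrix and 5x1 mean vector from the problem */
           SigmaZZ = {  6  3  3 -1 -1,
                        3  2  1  0 -1,
                        3  1  7  1  1,
                       -1  0  1  7  3,
                       -1 -1  1  3  5 };
           muZ = { 1, 3, 1, 1, 3 };
           x0  = { 5, 3 };                       /* conditioning value X = (5 3)' */

           /* partition into the Y block (rows/cols 1:3) and X block (rows/cols 4:5) */
           SigYY = SigmaZZ[1:3,1:3];
           SigYX = SigmaZZ[1:3,4:5];
           SigXX = SigmaZZ[4:5,4:5];
           muY = muZ[1:3];  muX = muZ[4:5];

           /* conditional covariance and mean of Y given X = x0  (` is transpose) */
           CovCond  = SigYY - SigYX*inv(SigXX)*SigYX`;
           MeanCond = muY + SigYX*inv(SigXX)*(x0 - muX);

           /* eigenvalues of Cov(Y) and of Cov(Y | X = x0) for part (ii) */
           call eigen(eval1, evec1, SigYY);
           call eigen(eval2, evec2, CovCond);
           print CovCond MeanCond, eval1 eval2;
        quit;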
    
    

    2.   An experimenter measures a response Y_i along with four covariates, which she imaginatively calls X1, X2, X3, and X4. (More exactly, X_i1, X_i2, X_i3, and X_i4 for the i-th observation.) She carries out a linear regression of Y_i on X1,X2,X3,X4 (including an intercept term) under the assumption that the errors are independent normal with the same error variance sigma^2. Data for the n=50 observations are contained in the file Experiment.dat.

    (i) Using proc iml in SAS or a similar matrix package, find (a) the least-squares or ML estimators of the five coefficients beta_i in the regression (including the intercept), (b) T-statistics for the tests H_0:beta_i=0, and (c) Student-t P-values for H_0 in each case.
    (ii) What is the estimate of sigma^2 (MSE) and the estimate of sigma (Sqrt(MSE))? What is the number of degrees of freedom in the associated T-tests? How was the number of degrees of freedom calculated?
    (Hint: See ExampReg3 on the Math439 Web site.)
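    One rough proc iml outline for this problem (the data-step layout assumed for Experiment.dat and all dataset and variable names below are assumptions; ExampReg3 shows the conventions actually used in class):

        data exper;                        /* assumed layout: Y X1 X2 X3 X4 on each line */
           infile 'Experiment.dat';
           input Y X1 X2 X3 X4;
        run;

        proc iml;
           use exper;
           read all var {X1 X2 X3 X4} into X0;
           read all var {Y} into Y;
           n = nrow(X0);
           X = j(n,1,1) || X0;             /* design matrix with an intercept column */
           XtXi = inv(X`*X);
           bhat = XtXi*X`*Y;               /* least-squares / ML estimates of the beta_i */
           dfe  = n - ncol(X);             /* residual degrees of freedom */
           mse  = ssq(Y - X*bhat)/dfe;     /* estimate of sigma^2 */
           se   = sqrt(mse*vecdiag(XtXi)); /* standard errors of the bhat_i */
           tstat = bhat/se;
           pval  = 2*(1 - probt(abs(tstat), dfe));   /* two-sided Student-t P-values */
           print bhat se tstat pval, mse dfe;
        quit;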
    
    

    3.   A colleague of the experimenter remarks that while the estimates of the coefficients beta_2 and beta_3 in Problem 2 appear different, theory suggests that the parameters may be the same. The colleague wonders whether the experimenter has found evidence that beta_2 ne beta_3.

    Carry out a Student t-test of the hypothesis H_0:beta_2=beta_3. How many degrees of freedom does the resulting t-test have?
    (Hints: (i) Let t be the vector (0 0 1 -1 0)'. Show that t'beta=beta_2-beta_3. Given the theoretical distribution of betahat, show that t'betahat is normally distributed with mean t'beta and variance sigma^2 V(t), where V(t) depends only on t and X'X. Given H_0, conclude that T=t'betahat/sqrt((MSE)*V(t)) has a Student-t distribution and continue.
    (ii) You should be able to do this problem by adding a few more lines of matrix code to the program that you wrote for Problem 2.)
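    Continuing the notation of the Problem 2 sketch above (bhat, XtXi, mse, and dfe are names introduced there, not required ones), the extra proc iml lines might look roughly like:

           t  = {0, 0, 1, -1, 0};          /* t'beta = beta_2 - beta_3 */
           Vt = t`*XtXi*t;                 /* so Var(t'bhat) = sigma^2 * Vt */
           Tstat = t`*bhat/sqrt(mse*Vt);
           pval  = 2*(1 - probt(abs(Tstat), dfe));   /* two-sided P-value under H_0 */
           print Tstat pval;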
    
    

    4.   The file HRatWeights.dat contains weekly gains over four weeks for three groups of rats. The first group (Group=1) was the control and was given no extra ingredients in their water, Group 2 was given thyroxin, and Group 3 was given thiouracil.

    (i) Is there a significant difference in weekly gains among the groups of rats, viewing the four weekly gains as vector-valued data? Carry out a MANOVA procedure (using, for example, SAS). Why does the output give 4 different P-values for the vector-valued procedure? Are they all significant or all nonsignificant? What are the P-values?
    (ii) Which of the four individual coordinates (weekly gains) are significant, using the corresponding univariate ANOVAs with three treatment groups? What are the P-values for the coordinates that are significant?
    (iii) Find the means of the four weekly weight gains for each group. Does any one group stand out as being different in weekly gains?
    (Hint: In SAS, try ``proc means mean; class group; var y1-y4; run;'', but use your own variable names instead of group and y1 y2 y3 y4.)
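    A possible skeleton for this problem (the dataset name, the input layout assumed for HRatWeights.dat, and the variable names Group and y1-y4 are assumptions):

        data rats;                         /* assumed layout: Group y1 y2 y3 y4 on each line */
           infile 'HRatWeights.dat';
           input Group y1-y4;
        run;

        proc glm data=rats;
           class Group;
           model y1-y4 = Group;            /* univariate ANOVAs for each weekly gain */
           manova h=Group;                 /* multivariate test comparing the three groups */
        run;

        proc means data=rats mean;         /* group means of the four weekly gains */
           class Group;
           var y1-y4;
        run;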
    
    

    5.  A naturalist makes 4 measurements (Height, Width, Tail Length, Length) on 50 lizards of a particular species as a function of the Altitude at which the lizard was collected. The data is in the file Dat4aLizards.dat on the Math439 Web site.

    (i) Carry out a multivariate regression of the four lizard measurements on Altitude using SAS or a comparable statistical package. What is the P-value for the multivariate regression? Why does the output list four different P-values for the multivariate regression on Altitude, and why are all of them the same?
    (Hint: See e.g. MrEgyptSkulls.sas on the Math439 Web site, as well as the handout on Multivariate Linear Models on the Math439 Web site.)
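    A possible skeleton (the dataset name and the column order assumed for Dat4aLizards.dat are assumptions; check the file and MrEgyptSkulls.sas for the actual conventions):

        data lizards;                      /* assumed column order; check Dat4aLizards.dat */
           infile 'Dat4aLizards.dat';
           input Height Width Tail_Length Length Altitude;
        run;

        proc glm data=lizards;
           model Height Width Tail_Length Length = Altitude;  /* four univariate regressions */
           manova h=Altitude;                                  /* multivariate test on Altitude */
        run;

    The same run also gives the four univariate regressions on Altitude needed in part (iii); if the slope estimates do not appear, add the solution option to the model statement.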

    (ii) What are the degrees of freedom of the F-distribution that is behind the four tests in part (i), both numerator and denominator? How were these calculated in this case? Do the numbers of degrees of freedom in the output agree with your predictions?
    (Hint: See the discussion of the Hotelling T^2 test in the handout on Multivariate Linear Models.)

    (iii) Which of the four simple univariate regressions for the four physical measurements on Altitude are significant? What are their P-values? What are the estimated slopes of the significant univariate regressions?

    (iv) Construct a two-dimensional scatterplot of the lizard measurements with Width on the Y-axis and Length on the X-axis, with a plotting symbol L for Altitude less than or equal to 10 and H for Altitude greater than 10. Can you see a trend from the upper left to the lower right in the scatterplot as the altitude increases or decreases? (If so, in which way?) Is this consistent with the parameter estimates and P-values in part (iii)?
    (Hint: Using SAS, you can define a plotting variable called (for example) ASym in the SAS data step by an if-then-else statement like ``if Altitude<=10 then ASym='L'; else ASym='H';''.)
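    One way to get such a plot is a sketch using the line-printer procedure proc plot, which takes a third variable as the plotting symbol (the dataset lizards here is the one read in the Problem 5 sketch above; the name lizplot is an arbitrary choice):

        data lizplot;
           set lizards;
           if Altitude <= 10 then ASym = 'L';
           else ASym = 'H';
        run;

        proc plot data=lizplot;
           plot Width*Length = ASym;       /* Width on the Y-axis, Length on the X-axis */
        run;

    (If SAS/GRAPH is available, proc gplot with symbol statements gives a higher-resolution version of the same picture.)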
    
    

    6.  Consider the lizards whose measurements are in the data set Dat4aLizards.dat and whose Altitude is either less than or equal to 8.0 (call these Type=1) or greater than or equal to 12.0 (call these Type=2). (The remaining lizards are discarded.) For simplicity, call the 4 lizard measurements y1 y2 y3 y4 instead of Height Width Tail_Length Length. The naturalist is interested in finding a rule that depends only on y1-y4 and that will classify most Type=1 lizards as Type=1 (i.e., low altitude) and most Type=2 lizards as Type=2.

    Use Fisher's Linear Discriminant method for these lizards to find a linear discriminant function

    L(data) = c_0 + c_1y1 + c_2y2 + c_3y3 + c_4y4

    with the property that L(data)>0 predicts Type=1 and L(data)<0 predicts Type=2. Assume that SAS's default assumptions for proc discrim hold for these lizards.

    (i) How many lizards are you using to find L(data)? That is, how many lizards are of Type 1 or Type 2?
    (ii) Find coefficients for the linear discriminant function L(data). (Hint: See the comments in Dgaussdiscrim.sas. In particular, the coefficients of the function L(data) can be computed as the difference between two vectors in the output of SAS's proc discrim.)
    (iii) How many mistakes (that is, misclassifications) does the discriminant function make when applied to the lizards that were used to derive it? (This is called a ``simple resubstitution'' analysis.)
    (iv) How many of the variables y1-y4, and which ones, are required to derive the discriminant function L(data) most efficiently? (See the discussion of proc stepdisc in the comments in Dgaussdiscrim.sas.) Do you end up with the same variables that were significant in Problem 5(iii)?
    (Hints: (i) Try including the statement ``if Type=1 or Type=2 then output;'' in the data step that reads Dat4aLizards.dat, assuming that lizards found at altitudes strictly between 8.0 and 12.0 are assigned Type=3. In a SAS data step, if you ever use the command ``output'', then ONLY those records for which you say ``output'' will appear in the corresponding SAS dataset. This gives you a way of dropping the excess Type=3 records. Make sure that this works by using a proc print statement.
    (ii) Note that you are not using cross-validation, nor are you reading in an additional ``moredat'' dataset to apply the rule to. Thus the call to ``proc discrim'' can be much simpler than in Dgaussdiscrim.sas.)
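    Putting the hints together, a possible skeleton (the dataset names liz2 and lizards and the use of y1-y4 as copies of the four measurements are assumptions; you could equally well re-read Dat4aLizards.dat directly, as hint (i) suggests, and Dgaussdiscrim.sas remains the template discussed in class):

        data liz2;                         /* assign Type from Altitude; keep only Types 1 and 2 */
           set lizards;
           y1 = Height; y2 = Width; y3 = Tail_Length; y4 = Length;
           if Altitude <= 8.0 then Type = 1;
           else if Altitude >= 12.0 then Type = 2;
           else Type = 3;
           if Type = 1 or Type = 2 then output;
        run;

        proc print data=liz2; run;         /* check that only Type 1 and Type 2 records remain */

        proc discrim data=liz2;            /* linear discriminant analysis with resubstitution */
           class Type;
           var y1-y4;
        run;

        proc stepdisc data=liz2;           /* for part (iv): stepwise variable selection */
           class Type;
           var y1-y4;
        run;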
    
    

    7.  Apply a logistic regression to the lizard variables y1-y4 for the lizards that you used in Problem 6. (That is, Type=1 or Type=2.) (Hint: See Dlogistic.sas on the Math439 Web site.)

    (i) What is the linear function (with intercept) of y1-y4 that is derived by estimating Prob(Type=1|Y) from

    logit(Prob(Type=1|Y)) = c_0 + c_1y1 + c_2y2 + c_3y3 + c_4y4 ?

    Recall that, given a lizard with dimensions Y, we classify the lizard as Type=1 if logit(Prob)>0, or equivalently if Prob(Type=1|Y) > Prob(Type=2|Y).
    (ii) How many mistakes does this procedure make when applied to the lizards that were used to derive it? Is it less than or more than the number of errors for Gaussian linear discriminant analysis in Problem 6? (Hint: See the code in Dlogistic.sas on the Math439 Web site.)
    (iii) How many of the variables y1-y4, and which ones, are required to derive the discriminant function most efficiently? (Most SAS regression procedures support variable subset selection procedures of various sorts. In this case, try adding / selection=backward before the semicolon at the end of the model statement in proc logistic.) Do you end up with the same subset of variables as in Problem 6?
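    A possible skeleton (the dataset liz2 is the one constructed for Problem 6; the output-dataset and variable names below are assumptions, and Dlogistic.sas shows the approach used in class):

        proc logistic data=liz2;
           model Type = y1-y4;             /* by default SAS models Prob(Type=1), the lower level;
                                              check the "Probability modeled" note in the output */
           output out=lizpred p=phat;      /* predicted Prob(Type=1 | Y) for each lizard */
        run;

        data errors;                       /* count resubstitution mistakes for part (ii) */
           set lizpred;
           PredType = 2 - (phat > 0.5);    /* Type 1 when Prob(Type=1) > 1/2, i.e. when logit > 0 */
           Mistake  = (PredType ne Type);
        run;

        proc freq data=errors;
           tables Type*PredType Mistake;
        run;

        proc logistic data=liz2;           /* part (iii): backward variable selection */
           model Type = y1-y4 / selection=backward;
        run;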
