Math 475 Homework 4 - Fall 2005

  • Click here for Prof. Sawyer's home page

    HOMEWORK #4 due 11-8

    Organize your homework in the following manner:
    (i) your answers to all questions,
    (ii) all of your SAS programs, followed by
    (iii) all of your SAS output.
    Add page numbers to your homework so that you can make references from part (i) to part (iii): for example, so that you can say things like, "The answer in part (a) is 17. The scatterplot for part (b) is on page #Y below." Include your SAS output even if you don't refer to it explicitly. Except for forward references like these, a grader should not have to look beyond part (i) of your homework unless he or she thinks that you have done something wrong.
    Include your name in a title statement so that your name will appear at the top of each output page.
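
    For example, a title statement placed near the top of a SAS program might look like the sketch below. The name shown is only a placeholder.

```sas
/* Placeholder name: substitute your own. The title then appears */
/* at the top of every output page produced after this statement. */
title 'Math 475 Homework 4 - Jane Doe';
```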
    If a problem asks you to do a statistical test, EXPLAIN CLEARLY what the null hypothesis H_0 is, what test you used, what the P-value is, and whether the data is significant, highly significant, or neither. Include this as part of your answer in part (i).
    
    

    1.   An experimenter is interested in how a quantity that she calls zubricity depends on three other quantities called drubness, viscosity, and speed. The experimenter is fairly certain that drubness has a significant effect on zubricity, but is not sure about viscosity and speed. Twenty measurements of zubricity, drubness, viscosity, and speed are recorded in Table 1. For definiteness, call the variables ZUBRIC, DRUBNESS, VISCOSTY, and SPEED.
     Table 1: Zubricity and Covariates
     -----------------------------------
      OBS  Zubric    Drubn  Visc  Speed
     -----------------------------------
       1    310       16     27     12
       2    210       17     36     10
       3    450       24     40     20
       4    390       24     44     15
       5    780       26     44      8
       6    330       28     53     18
       7    580       39     55     19
       8    330       22     56     24
       9    400       29     57     16
      10    230       28     58     17
      11    470       34     60     24
      12    510       35     61     17
      13    490       37     66     20
      14    450       36     68     11
      15    630       46     73     21
      16    400       38     78      6
      17    760       34     80     22
      18    590       47     83     17
      19    520       43     84     12
      20    540       44     89     17

    (i) Is there a significant regression of ZUBRIC on the three covariates? Use SAS (proc reg or proc glm) to find out. What is the model P-value? What is the model R^2? What is the value of the F-statistic that led to the model P-value? How many degrees of freedom does it have in its numerator and denominator?

    (ii) Which covariates are significant in the Parameter Estimate table in the output? What are their P-values? Is the experimenter correct that drubness has a significant effect on zubricity?

    (iii) Obtain residual plots (with the residuals as the Y variable) for the predicted value, drubness, viscosity, and speed. Do these look all right? That is, do they look like the residuals are normally distributed with values that are independent of the X-coordinates? If not, which observations appear to be responsible?

    (iv) Obtain a list of Studentized residuals and Cook's D values for all of the observations. Do any of these appear to be out of line? If so, which ones? Use a criterion of either 3.0 for Studentized residuals or else 0.700 for Cook's D (or both). (Note: You should be able to get all of the information that you need for parts (i)-(iv) from either one run of proc reg or else one run of proc glm plus a proc print for associated variables.)
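
    One way to set up the single proc reg run mentioned in the note is sketched below. The dataset names (zub, zubout) and the output variable names (pred, resid, stud, cooksd) are arbitrary choices made for illustration; other arrangements work equally well.

```sas
/* Read Table 1. Dataset and variable names are illustrative choices. */
data zub;
  input obs zubric drubness viscosty speed;
  datalines;
1 310 16 27 12
2 210 17 36 10
3 450 24 40 20
4 390 24 44 15
5 780 26 44 8
6 330 28 53 18
7 580 39 55 19
8 330 22 56 24
9 400 29 57 16
10 230 28 58 17
11 470 34 60 24
12 510 35 61 17
13 490 37 66 20
14 450 36 68 11
15 630 46 73 21
16 400 38 78 6
17 760 34 80 22
18 590 47 83 17
19 520 43 84 12
20 540 44 89 17
;
run;

/* Fit the full model and save predicted values, residuals, */
/* Studentized residuals, and Cook's D to a new dataset.    */
proc reg data=zub;
  model zubric = drubness viscosty speed;
  output out=zubout p=pred r=resid student=stud cookd=cooksd;
run;
quit;

/* List the diagnostics for every observation (part iv). */
proc print data=zubout;
  var obs zubric pred resid stud cooksd;
run;

/* Residual plots with the residual as the Y variable (part iii). */
proc plot data=zubout;
  plot resid*pred resid*drubness resid*viscosty resid*speed;
run;
```

    Note that student= requests internally Studentized residuals; proc reg also offers rstudent= for the externally Studentized version.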

    The experimenter was disappointed with the regression output, since variables that she thought should have been significant were not significant. After looking at the data again, she began to wonder about observations #5, #10, and #17. After checking with her technician, she found that she had misread the technician's handwriting, and that the zubricity value of 780 in observation #5 should have been 480 and the value of 760 in observation #17 should have been 460. Observation #10 was correct as it was originally recorded.

    (v) Change the values of Zubricity in the two incorrect observations and re-analyze the regression. What is the new model P-value? What is the new model R^2? Which covariates are significant in the new Parameter Estimate table? Does the output now support the experimenter's hypothesis that drubness has a significant effect on zubricity?
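
    The two corrections can be made in a short data step rather than by retyping the table. The sketch below assumes that Table 1 has already been read into a dataset called zub with an observation-number variable obs; both names are assumptions made for illustration.

```sas
/* Assumes a dataset ZUB with variables obs, zubric, drubness, */
/* viscosty, and speed already exists (names are hypothetical). */
data zubfix;
  set zub;
  if obs = 5  then zubric = 480;   /* was misread as 780 */
  if obs = 17 then zubric = 460;   /* was misread as 760 */
run;

/* Re-run the same regression on the corrected data. */
proc reg data=zubfix;
  model zubric = drubness viscosty speed;
run;
quit;
```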
    
    

    2.   A manufacturer tests the hardness of 30 alloys as a function of the amount of 5 different additives. The hardness, the date, and the amounts of the 5 additives for each of the alloys are given in Table 2.

     Table 2: Alloy Hardness as a Function of Five Additives
     --------------------------------------------------------
       Hardness  Date     AA      BB      CC       DD     EE
       ------------------------------------------------------
         2190    Mon      9.9    190    488.53    1464     4
         2475    Mon     10.8    305    372.95    1623     3
         2185    Tue     17.6    375    309.80    3106     2
         1964    Tue      2.1    261    433.25     113     4
         2115    Tue     14.6    199    529.34    2145     3
         1721    Tue     12.4    536    147.51    1717     2
         2217    Wed     13.0    322    357.55    2479     3
         2879    Wed     18.4    311    413.35    3124     2
         2523    Thu     11.2    265    452.12    2144     3
         2003    Thu      8.3    393    300.76    1308     3
         2733    Fri     19.3    213    522.37    3357     3
         2866    Fri     26.2    315    399.55    4778     1
         2295    Fri      5.2    343    418.58     477     4
         1994    Fri     12.7    249    426.77    1956     3
         2092    Fri     12.2    281    419.65    1587     3
         2345    Mon     19.1    292    433.03    2897     2
         2788    Mon     22.7    269    441.17    3958     2
         2595    Mon     21.0    416    321.95    3284     1
         2268    Tue     21.5    394    324.71    3906     2
         3032    Tue     23.8    393    292.20    3739     1
         2875    Tue     26.7    282    396.95    4777     1
         2765    Tue     14.0    123    635.74    2243     4
         1900    Tue      4.8    440    254.23     580     3
         1874    Tue     15.0    395    335.37    2270     2
         2132    Tue      9.4    318    375.88    1074     3
         2125    Tue     11.6    396    306.15    1479     2
         2145    Tue     12.2    381    334.74    2093     2
         2775    Wed     21.4    319    368.25    3094     2
         1979    Wed      9.6    470    236.03    1513     2
         2292    Wed     15.9    270    442.34    2280     3 

    (i) Is there a significant regression of Hardness on the 5 additives? What is the Model P-value? What is the Model R^2? What additives are significant in the Type I SS table? In the Type III table? (Hint: Use proc glm.)
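
    A minimal sketch of the proc glm run in the hint follows. It assumes that Table 2 has been read into a dataset called alloy with variables hardness, aa, bb, cc, dd, and ee; these names are assumptions, not part of the assignment.

```sas
/* Assumes a dataset ALLOY holding Table 2 (hypothetical names). */
/* proc glm prints both the Type I and Type III SS tables by     */
/* default for this model.                                       */
proc glm data=alloy;
  model hardness = aa bb cc dd ee;
run;
quit;
```

    Type I sums of squares are sequential, so the Type I table depends on the order in which the covariates are listed in the model statement; the Type III table does not.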

    (ii) What set or subset of covariates gives the best-fitting model as judged by the largest adjusted R^2? The second-best fitting model? The third-best fitting model? How many covariates were involved in each case? Are these the variables that you would have guessed from the Type I or Type III tables in the output for part (i)? (Hint: Run proc reg with / selection=adjrsq.)

    (iii) Run a backwards regression of Hardness on the 5 additive variables. What variables does SAS choose for the regression? Are these the same as in part (ii)? (Hint: Run proc reg with / selection=backward.)

    (iv) Run a (forwards) stepwise regression of Hardness on the 5 additive variables. What variables does SAS choose for the regression? Are these the same as in part (ii)? (Hint: Run proc reg with / selection=stepwise.)
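
    The three selection methods in the hints for parts (ii)-(iv) can be requested in one proc reg run with three model statements, as in the sketch below. The dataset name alloy and the variable names are assumptions made for illustration.

```sas
/* Assumes a dataset ALLOY holding Table 2 (hypothetical names). */
proc reg data=alloy;
  /* Part (ii): rank subsets of covariates by adjusted R-square */
  model hardness = aa bb cc dd ee / selection=adjrsq;
  /* Part (iii): backward elimination */
  model hardness = aa bb cc dd ee / selection=backward;
  /* Part (iv): stepwise selection */
  model hardness = aa bb cc dd ee / selection=stepwise;
run;
quit;
```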

    (v) Is there a significant regression of Hardness on the covariates for the model with the largest adjusted R^2? What is the Model P-value? What is the Model R^2? How does it compare with the Model R^2 for the full model? What additives are significant in the Type I SS table? In the Type III table?
    
    

    3. Annual reports for the 10 largest US corporations in 1990 are given in Table 3.

         Table 3 - Data for the 10 largest US Corporations in 1990
    # Source: Fortune Magazine (April 23, 1990), pp. 346-367, (c) 1990 Time Inc.
    # All numbers are in millions of dollars.
                          Sales   Profits  Assets
        General_Motors   126974    4224    173297
        Ford              96933    3835    160893
        Exxon             86656    3510     83219
        IBM               63438    3758     77734
        General_Electric  55264    3939    128344
        Mobil             50976    1809     39080
        Philip_Morris     39069    2946     38528
        Chrysler          36156     359     51038
        Du_Pont           35209    2480     34715
        Texaco            32416    2413     25636
     
    (i) Do a Principal Components Analysis for these 10 corporations to explain the variability of the financial data in Table 3. How many principal components are required to explain at least 90% of the variation in the data?

    (ii) As one might have expected in advance, the first Principal Component (Prin1) is a measure of the overall size of the corporation, since larger (or smaller) corporations are likely to have larger (or smaller) amounts of sales, profits, and assets. Thus, in this case, one is primarily interested in the proportion of variation that is explained by Prin2 after the variability due to Prin1 is accounted for.

    What does the second principal component measure? To help understand what Prin2 says about the financial condition of these 10 corporations over the year, sort and display the list of companies by descending values of Prin2. Include profits, sales, and assets in the display, in that order. Which companies are at the top of the list? at the bottom of the list? What can you say about what caused them to be at the top or bottom of this sorted list?
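
    Parts (i) and (ii) can be carried out along the lines of the sketch below. The dataset names (corp, corpout) and the variable name company are arbitrary choices made for illustration.

```sas
/* Read Table 3. Dataset and variable names are illustrative choices. */
data corp;
  input company :$20. sales profits assets;
  datalines;
General_Motors 126974 4224 173297
Ford 96933 3835 160893
Exxon 86656 3510 83219
IBM 63438 3758 77734
General_Electric 55264 3939 128344
Mobil 50976 1809 39080
Philip_Morris 39069 2946 38528
Chrysler 36156 359 51038
Du_Pont 35209 2480 34715
Texaco 32416 2413 25636
;
run;

/* Principal components; out= saves the scores Prin1, Prin2, Prin3. */
proc princomp data=corp out=corpout;
  var sales profits assets;
run;

/* Sort by descending Prin2 and display as requested in part (ii). */
proc sort data=corpout;
  by descending prin2;
run;

proc print data=corpout;
  var company prin2 profits sales assets;
run;
```

    By default proc princomp analyzes the correlation matrix, so each variable is standardized before the components are computed; the cov option requests the covariance matrix instead.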

    (iii) The analysis in part (ii) might be criticized on the grounds that the analysis, which depends on quadratic sums of various quantities, might be dominated by the largest companies. Note that Sales varies by nearly a factor of four in Table 3 and Assets varies by more than a factor of six. Repeat the analysis with Sales, Profits, and Assets replaced by their logarithms in an attempt to ameliorate this problem. (Use SAS commands logvar=log10(var) instead of logvar=log(var) to get base-10 logarithms, so that displays will be more intuitive.)

    Do a principal component analysis of the log-transformed data. Do you obtain similar results to what you obtained in part (ii)? In particular, is the sort of companies on Prin2 for the log-transformed data similar to the sort in part (ii)? Are the top five companies the same?
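
    A sketch of the log-transformed analysis follows. It assumes that Table 3 has already been read into a dataset called corp with variables company, sales, profits, and assets; the dataset names and the new variable names (lsales, lprofits, lassets) are assumptions made for illustration.

```sas
/* Assumes a dataset CORP holding Table 3 (hypothetical names). */
data logcorp;
  set corp;
  lsales   = log10(sales);     /* base-10 logarithms */
  lprofits = log10(profits);
  lassets  = log10(assets);
run;

proc princomp data=logcorp out=logout;
  var lsales lprofits lassets;
run;

/* Sort on Prin2 for comparison with the sort in part (ii). */
proc sort data=logout;
  by descending prin2;
run;

proc print data=logout;
  var company prin2 lprofits lsales lassets;
run;
```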
    
    

    4. A biologist is interested in the population structure of a particular lizard (Cophosaurus texanus). Data with three different measurements from 25 lizards in this species are collected (see Table 4). The biologist would like to use these data to show that the lizards in this species are highly variable in shape, perhaps in response to specialization to different subhabitats within the home range of the lizard.

         Table 4 - Dimensions of a sample of 25 lizards
    # From Johnson & Wichern, "Applied Multivariate Statistical Analysis",
    #   5th ed, Table 1.3, p17, 2002
    # Source: J&W say, data courtesy of Kevin E. Bonine
    # Mass is in grams. SVL (snout-vent length) and HLS (hind-limb span)
    #   are in millimeters.
        Obs     Mass     SVL     HLS
         1      5.526    59.0    113.5
         2     10.401    75.0    142.0
         3      9.213    69.0    124.0
         4      8.953    67.5    125.0
         5      7.063    62.0    129.5
         6      6.610    62.0    123.0
         7     11.273    74.0    140.0
         8      2.447    47.0     97.0
         9     15.493    86.5    162.0
        10      9.004    69.0    126.5
        11      8.199    70.5    136.0
        12      6.601    64.5    116.0
        13      7.622    67.5    135.0
        14     10.067    73.0    136.5
        15     10.091    73.0    135.5
        16     10.888    77.0    139.0
        17      7.610    61.5    118.0
        18      7.733    66.5    133.5
        19     12.015    79.5    150.0
        20     10.049    74.0    137.0
        21      5.149    59.5    116.0
        22      9.158    68.0    123.0
        23     12.132    75.0    141.0
        24      6.978    66.5    117.0
        25      6.890    63.0    117.0 
    (i) Do a Principal Components Analysis for this sample of lizards to explain the variability of the data in Table 4. How many principal components are required to explain at least 90% of the variation in the data? What percentage of the variation is explained by the first principal component?
    Does your analysis support the biologist's conjecture that there is considerable variation in shape among these lizards, other than trivial variation in overall size, which might just be due to age or sex?

    (ii) Construct a Prin2*Prin1 plot of the data in Table 4 with the observation number next to each point in order to illustrate the data. Note that the scale of Prin2 is more compressed than that of Prin1. What are the observation numbers of the largest and smallest lizards in this plot, with size measured by Prin1?

    (Hint: The command plot Y*X='*' $ Obs;  in proc plot will put the value of Obs next to each point in the scatterplot.)
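
    Put together with proc princomp, the hint might be used as in the sketch below. It assumes that Table 4 has been read into a dataset called lizard with variables obs, mass, svl, and hls; the dataset names are assumptions made for illustration.

```sas
/* Assumes a dataset LIZARD holding Table 4 (hypothetical names). */
proc princomp data=lizard out=lizout;
  var mass svl hls;
run;

/* Plot Prin2 against Prin1, labeling each point with its Obs value. */
proc plot data=lizout;
  plot prin2*prin1='*' $ obs;
run;
```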
