Math 475 Homework 3 - Fall 2005


    HOMEWORK #3 due 10-18

    Text references are to Cody & Smith, ``Applied statistics and the SAS programming language'', 5th edn.

    Organize your homework in the following manner:
    (i) your answers to all questions,
    (ii) all of your SAS programs, followed by
    (iii) all of your SAS output.
    Add page numbers to your homework so that you can make references from part (i) to part (iii): for example, so that you can say things like, ``The answer in part (a) is 17. The scatterplot for part (b) is on page #Y below.'' Include your SAS output even if you don't refer to it explicitly. Except for forward references like these, a grader should not have to look beyond part (i) of your homework unless he or she thinks that you have done something wrong.
    Include your name in a title statement so that your name will appear at the top of each output page.
    If a problem asks you to do a statistical test, EXPLAIN CLEARLY what the null hypothesis H_0 is, what test you used, what the P-value is, and whether the data is significant, highly significant, or neither. Include this as part of your answer in part (i).
    
    

    1.   The responses of 35 patients to 5 experimental drugs were:
        DrA1:   13.32   18.87   14.61   15.02   15.42   16.23   14.01
        DrA2:   17.01   18.14   18.06   18.46   15.91   16.94   14.50
        DrH1:   17.83   18.13   19.89   19.01   16.84   19.53   14.77
        DrC2:   20.83   19.87   21.04   17.12   20.50   17.55   20.17
        DrC3:   19.62   19.03   20.11   20.52   21.05   20.21   25.91
     

    (i) Was there a significant difference in the responses to the different drugs, as measured by a one-way ANOVA?  What is the P-value? What is the hypothesis H_0? What is the hypothesis H_1?

    (ii) What are the degrees of freedom of the F-test? How were they calculated? (Warning: An F-distribution has TWO degrees of freedom.)

    (iii) What is the estimate of the error standard deviation (that is, sigma)? Where did you find it in the output? How did SAS calculate it?

    (iv) Use the Duncan procedure to find out which PAIRS of treatments are significantly different at the alpha=0.05 level and also at the alpha=0.01 level. Are your conclusions different with the smaller value of alpha?

    (v) Use Tukey's procedure to find out which pairs differ at alpha=0.05. Do you obtain different conclusions from those from Duncan's procedure at alpha=0.05?

    (vi) Use Bonferroni's procedure to find out which pairs differ at alpha=0.05. Do you obtain different conclusions from those from Duncan's procedure at alpha=0.05?

    (vii) A consultant states that the two DrA drugs should behave similarly in the human body due to a similar chemical structure, but that the two DrC drugs should be metabolized differently. Using the same MSE as in the previous analyses, test whether or not the AVERAGE of the two DrA drugs is significantly different from the AVERAGE of the two DrC drugs. What is the P-value?   (Hint: Use a Contrast test. See for example OnewayMC.sas on the Math475 Web site.)

    REMARK: You should be able to do Problem 1 with one SAS program.
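
    As a cross-check on the SAS output for Problem 1, the quantities asked about in parts (ii), (iii), (vi) and (vii) can be recomputed by hand. The following is a minimal Python sketch, not the SAS program the assignment asks for. All variable names are illustrative, and the contrast coefficients (+1/2, +1/2, 0, -1/2, -1/2) are one possible scaling that is assumed here; any nonzero multiple gives the same test. P-values are not computed, since they require an F or t distribution function.

```python
from math import comb, sqrt

# Data copied from the Problem 1 table (7 patients per drug).
responses = {
    "DrA1": [13.32, 18.87, 14.61, 15.02, 15.42, 16.23, 14.01],
    "DrA2": [17.01, 18.14, 18.06, 18.46, 15.91, 16.94, 14.50],
    "DrH1": [17.83, 18.13, 19.89, 19.01, 16.84, 19.53, 14.77],
    "DrC2": [20.83, 19.87, 21.04, 17.12, 20.50, 17.55, 20.17],
    "DrC3": [19.62, 19.03, 20.11, 20.52, 21.05, 20.21, 25.91],
}

k = len(responses)                                   # number of groups
n_total = sum(len(v) for v in responses.values())    # total observations
means = {g: sum(v) / len(v) for g, v in responses.items()}
grand_mean = sum(sum(v) for v in responses.values()) / n_total

# Sums of squares for the one-way ANOVA.
ss_between = sum(len(v) * (means[g] - grand_mean) ** 2
                 for g, v in responses.items())
ss_within = sum((y - means[g]) ** 2
                for g, v in responses.items() for y in v)

# The F-test has TWO degrees of freedom: (k - 1, N - k).
df_between = k - 1
df_within = n_total - k
mse = ss_within / df_within          # mean squared error
f_stat = (ss_between / df_between) / mse
sigma_hat = sqrt(mse)                # estimate of the error standard deviation

# Contrast: average of the two DrA drugs minus average of the two DrC drugs.
# (Assumed scaling; the coefficients must sum to zero.)
coeffs = {"DrA1": 0.5, "DrA2": 0.5, "DrH1": 0.0, "DrC2": -0.5, "DrC3": -0.5}
estimate = sum(c * means[g] for g, c in coeffs.items())
std_err = sqrt(mse * sum(c ** 2 / len(responses[g]) for g, c in coeffs.items()))
t_stat = estimate / std_err          # compare to t with df_within d.f.

# Bonferroni: split alpha evenly over all pairwise comparisons.
n_pairs = comb(k, 2)
bonferroni_alpha = 0.05 / n_pairs

print(f"F({df_between},{df_within}) = {f_stat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"contrast estimate = {estimate:.3f}, t = {t_stat:.3f}")
print(f"{n_pairs} pairs, per-comparison alpha = {bonferroni_alpha}")
```

    The square of the contrast t statistic equals the F statistic of a one-degree-of-freedom contrast test, so it can be compared directly with the F value that SAS reports for the contrast.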
    
    

    2.   The average outputs, in tons per acre, of 30 test plots of cotton grown at five different levels of an insecticide, ordered by increasing level of the insecticide, were
       Level1    79    79    95   109   118   150
       Level2    84    95   100   105   119   135
       Level3   109   114   121   123   124   145
       Level4    91   106   119   150   151   151
       Level5   110   113   129   131   145   165
     

    (i) Construct a plot of the output scores versus insecticide level, with insecticide level on the X axis. Can you see differences in the means of the treatment groups? Is there a trend in mean output with increasing levels of the insecticide?

    (ii) Do these scores show a significant variation by treatment group, as measured by a one-way ANOVA? What are the degrees of freedom of the F-test? What is the hypothesis H_0? What is the hypothesis H_1? What is the P-value?

    (iii) Do the observed output scores show a significant increase with increasing amounts of insecticide?
    (Hint: Do a regression of the observed output scores on the insecticide level for the five treatment groups, with insecticide level coded as 1, 2, 3, 4, 5. As a rough measure, assume that amount of insecticide per acre varied approximately linearly over the five insecticide levels.)
    Is there a significant regression of output on insecticide level, coded in this way? What are the degrees of freedom of the F-test? What is the hypothesis H_0? What is the hypothesis H_1? What is the P-value?
    In this regression, what proportion of the variation in output is ``explained'' by the insecticide level? Is this smaller or larger than the proportion of variance ``explained'' by the ANOVA in part (ii)?
    What is the reason for the difference in significance between the conclusions of the two procedures?

    (iv) Write down the estimated regression line in part (iii).  How much additional cotton output is predicted, on the average, for each increase in insecticide level in this range of insecticide levels?
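
    The comparison in part (iii) between the regression and the one-way ANOVA can be sketched numerically. The following Python fragment is only an illustration under the coding suggested in the hint (levels coded 1 to 5); the variable names are made up for this sketch, and it is not a substitute for the SAS output, which also supplies the P-values.

```python
# Data copied from the Problem 2 table, keyed by coded insecticide level.
yields = {
    1: [79, 79, 95, 109, 118, 150],
    2: [84, 95, 100, 105, 119, 135],
    3: [109, 114, 121, 123, 124, 145],
    4: [91, 106, 119, 150, 151, 151],
    5: [110, 113, 129, 131, 145, 165],
}

# Flatten into parallel lists of x (level) and y (output).
x = [level for level, ys in yields.items() for _ in ys]
y = [value for ys in yields.values() for value in ys]
n = len(y)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Simple linear regression of output on coded level.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_regression = slope * sxy                      # SS explained by the line
r2_regression = ss_regression / ss_total

# One-way ANOVA on the same data: SS explained by the five group means.
ss_between = sum(len(ys) * (sum(ys) / len(ys) - y_bar) ** 2
                 for ys in yields.values())
r2_anova = ss_between / ss_total

# Degrees of freedom of the two F-tests.
df_regression = (1, n - 2)
df_anova = (len(yields) - 1, n - len(yields))

print(f"fitted line: output = {intercept:.3f} + {slope:.3f} * level")
print(f"R^2 regression = {r2_regression:.4f}, d.f. = {df_regression}")
print(f"R^2 ANOVA      = {r2_anova:.4f}, d.f. = {df_anova}")
```

    A straight line in the level is a special case of fitting a separate mean to each level, so the regression sum of squares can never exceed the between-group sum of squares. The two F-tests nevertheless divide their sums of squares by different numerator degrees of freedom, which is worth keeping in mind when comparing the two P-values.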
    
    

    3.   A zoo is interested in the dependence of blood pressure on stress in gnus. Blood pressure and stress (yy and stress) for each of 16 gnus under various conditions of stress are given in the following table. (In each of the following 16 pairs of data, yy is the first variable and stress the second variable.)
        47    3.0         50    1.8        110    7.9        1655   15.7
       179    9.1         55    5.2       1310   12.9        2773   15.1
        56    3.6         62    2.9       3052   16.8         126    7.2
       866   12.6        175    8.6       2731   16.7         249    9.0
      

    (i) Is there a significant regression of yy on stress with this data? What P-value does SAS report? What is the model R^2?

    (ii) Construct a plot of blood pressure yy versus stress. Include the predicted values on the same plot with plot symbol P as a comparison. Does the plot of yy versus stress look linear? How well does it follow the predicted values? (Hint: It might look slightly bowed down in the middle.)

    (iii) Construct a plot of the residuals for the regression of yy on stress against stress. Do the residuals look consistent with the assumptions of a linear regression? Do their signs and absolute values appear to be randomly distributed with respect to stress? (Hint: The negative residuals may be bunched together in the center.)

    (iv) Try regressing yy on both stress and stress*stress. (Hint: Introduce a new SAS variable stress2 for stress*stress.) What is the new model R^2? In a plot of yy on stress, do the predicted values appear to match yy more closely? Do the residuals have a more random-looking plot on stress? (Hint: Observations with higher values of stress may also have larger residuals.)

    (v) Try a regression of logyy=log(yy) on stress and stress*stress. What is the new model R^2? Do the predicted values of logyy appear to match the observed values more closely? Does the residual plot show less dependence on stress?

    DO ALL OF PROBLEM 3 in one SAS Program.
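
    The three models in parts (i), (iv) and (v) differ only in their design columns and response, so their R^2 values can be cross-checked with one small least-squares routine. The sketch below is in Python rather than SAS and is only an illustration: the helper names solve and fit are hypothetical, the fit uses the normal equations (adequate for a problem this small), and log is taken to be the natural logarithm, as in SAS.

```python
from math import log

# (yy, stress) pairs from the Problem 3 table, read row by row.
pairs = [
    (47, 3.0), (50, 1.8), (110, 7.9), (1655, 15.7),
    (179, 9.1), (55, 5.2), (1310, 12.9), (2773, 15.1),
    (56, 3.6), (62, 2.9), (3052, 16.8), (126, 7.2),
    (866, 12.6), (175, 8.6), (2731, 16.7), (249, 9.0),
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):                # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


def fit(rows, y):
    """Least-squares fit of y on the columns of rows; returns (beta, R^2)."""
    ncol = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(ncol)]
           for i in range(ncol)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(ncol)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return beta, 1.0 - ss_error / ss_total


yy = [float(pair[0]) for pair in pairs]
stress = [pair[1] for pair in pairs]
linear_rows = [[1.0, s] for s in stress]             # intercept, stress
quadratic_rows = [[1.0, s, s * s] for s in stress]   # adds stress2 = stress*stress

_, r2_linear = fit(linear_rows, yy)
_, r2_quadratic = fit(quadratic_rows, yy)
_, r2_log_quadratic = fit(quadratic_rows, [log(v) for v in yy])

print(f"R^2, yy on stress:                {r2_linear:.4f}")
print(f"R^2, yy on stress and stress2:    {r2_quadratic:.4f}")
print(f"R^2, logyy on stress and stress2: {r2_log_quadratic:.4f}")
```

    Adding stress2 to the model can never lower R^2, so the first two values are directly comparable. The third is measured on the log scale, so it describes the fit to logyy rather than to yy; the residual plots remain the better guide to which model is appropriate.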

    
    

    4.   An experimenter measures 30 instances of an observed value Y along with two covariates. Being utterly devoid of imagination, the experimenter calls the covariates AA and BB. The 30 instances of values Y,AA,BB  are
         1.    714      366.3     1421
         2.   1022      435.8     1737
         3.    267      276.1      532
         4.    287      199.6      571
         5.    716      257.4     1115
         6.    434      203.5     1011
         7.    943      248.1     1676
         8.    356      186.5      712
         9.    423      246.2      624
        10.    698      196.3     1312
        11.     92      227.8     1151
        12.    227      206.4      687
        13.    589      178.7     1215
        14.    716      296.6     1099
        15.    324      235.9      843
        16.    552      449.1     1504
        17.    741      259.7     1227
        18.    437      291.7      439
        19.    143      198.0      265
        20.    409      336.0      939
        21.    654      279.8      438
        22.    666      243.0      379
        23.    479      318.3     1208
        24.    212      176.5      217
        25.    375      266.4      674
        26.    184      226.0      522
        27.    220      114.4      683
        28.    392      231.4      929
        29.    555      203.5      662
        30.    862      328.6      906

    (i) Is there a significant regression of Y on the covariates AA and BB? Run proc glm in SAS to find out. What is the model P-value? What is the model R^2? What is the value of the F-statistic that led to the model P-value? How many degrees of freedom does it have in its numerator and denominator?

    (ii) Which covariates are significant in the Type I table in the proc glm output? What are their P-values?

    (iii) Which covariates are significant in the Type III table in the proc glm output? What are their P-values? Why are the answers different from those in part (ii)?

    (iv) Plot the residuals of the regression (on the Y-axis) against the predicted values (on the X-axis). Do the residuals look random in this plot, except perhaps for one observation? Construct similar plots for the residuals on AA and BB. Do they also look random?

    (v) Use proc reg to construct a table of Studentized residuals and CookD statistics for each observation. Which observation corresponds to the odd observation in part (iv)? What is the CookD value for that particular observation? Does the CookD value seem large? In general, do any of the observations have CookD values that are large?

    (Hints for part (v): (1) The usual rule of thumb is to compare CookD values to the distribution of F(p,n-p), where here p=2 and n=30. If Prob(F(p,n-p) > CookD_value) < 0.50 (50%) (here if CookD_value > 0.711), then that observation could have an extremely disproportionate effect on the regression coefficients and the fitted values. If Prob(F(p,n-p)>CookD_value) < 0.70 (70%) (here if CookD_value > 0.361), then that value should be considered suspicious.
    (2) You should be able to tell which are the offending observations in the residual plots from the values of their residuals and predicted values. However, if you prefer a high-tech way of finding out which points correspond to which observations, you can either
    (a) For plots generated by proc reg, enter a ``paint'' command like (for example) paint ord=17 / symbol='X'; BEFORE the plot statements, or
    (b) For plots generated by proc plot, enter the plot statement as (for example) plot Y*X $ ord   or   plot Y*X='*' $ ord. The $ causes the value of the variable ord to be displayed next to each plotted point.)

    (vi) Run proc glm or proc reg on the data set without the apparent outlier in the residual plots. Do the parameter estimates seem qualitatively the same as before? The P-values of the parameters? The Rsquare value?

    (vii) Run proc glm or proc reg starting from the original data set but without the observation with the large CookD value. Do the parameter estimates seem qualitatively the same as before? The P-values of the parameters? The Rsquare value? Which observation made the most difference?
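
    The two CookD cutoffs quoted in the hints for part (v), 0.711 and 0.361, can be reproduced from the F(p,n-p) distribution with p=2 and n=30. The Python sketch below relies on a closed form that holds only when the numerator degrees of freedom equal 2, namely Prob(F(2,d) > x) = (1 + 2x/d)^(-d/2); the function names are made up for this illustration.

```python
def f_upper_tail_numerator_2(x, denom_df):
    """Prob(F > x) for an F distribution with 2 and denom_df degrees of freedom.

    This closed form is valid only for numerator degrees of freedom 2.
    """
    return (1.0 + 2.0 * x / denom_df) ** (-denom_df / 2.0)


def cutoff(upper_tail_prob, denom_df):
    """Value x with Prob(F(2, denom_df) > x) = upper_tail_prob."""
    return (denom_df / 2.0) * (upper_tail_prob ** (-2.0 / denom_df) - 1.0)


p, n = 2, 30                  # values given in the hint
denom_df = n - p              # 28

influential_cutoff = cutoff(0.50, denom_df)   # Prob(F > CookD) < 0.50
suspicious_cutoff = cutoff(0.70, denom_df)    # Prob(F > CookD) < 0.70

print(f"CookD above {influential_cutoff:.3f}: possibly disproportionate effect")
print(f"CookD above {suspicious_cutoff:.3f}: suspicious")
```

    A CookD value from the proc reg table can also be passed to f_upper_tail_numerator_2 directly to see which side of the 50% and 70% marks it falls on.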
