Math 475 Homework 2 - Fall 2010


    HOMEWORK #2 due Tuesday 10-12

    Six problems.

    Text references are to the textbook, Cody & Smith, ``Applied statistics and the SAS programming language''

    NOTE: See the main Math475 Web page for how to organize a homework assignment using SAS. In particular,
            ALWAYS INCLUDE YOUR NAME in a title statement in your SAS programs, so that your name will appear at the top of each output page.
            ALL HOMEWORKS MUST BE ORGANIZED in the following order:
            (Part 1) First, your answers to all the problems in the homework, whether you use SAS for that problem or not. If the problem asks you to generate a graph or table, refer to the graph or table by page number in the SAS output (see below). (Xeroxing a page or two from the SAS output or cutting and pasting into a Word file or TeX source file is also OK.)
            (Part 2) Second, all SAS programs that you used to obtain the output for any of the problems. If possible, similar problems should be done with the same SAS program. (In other words, write one SAS program for several problems if that makes things easier, using SAS title or title2 statements to separate the problems in your output.)
            (Part 3) Third, all output for all the SAS programs in the previous step.
            If an answer in Part 1 requires a table or a scatterplot that you need to refer to, make sure that your SAS output has overall increasing (unique) page numbers and make references to Part 3 by page number, such as ``The scatterplot for Problem 2 part (b) is on page #X in the SAS output below.'' DO NOT say, ``see Page 3 in the SAS output'' if Part 3 has output from several SAS runs, each of which has its own Page 3. In that case, either write your own (increasing) page numbers on the SAS output, or else (for example) refer to ``Page 2-7 in the SAS output'' (for page 7 in the second set of SAS output) and write page numbers in the format ``2-7'' at the top of pages in your output.

    
    

    1.   A test is made of the effects of a new drug on people who are occasional sufferers from a newly discovered allergy that affects people only during the winter. Eighty (80) people are enrolled in the study. Forty (40) subjects are first asked if they had allergic symptoms during a particular year, then given the drug, and then asked again if they had allergic symptoms after the following year. The other half (40) are given the drug the first year but not the second year and, again, asked if they had allergic symptoms with and without the drug. Thus, there are two Yes-or-No responses from each enrollee, and, in particular, 8 individuals had no symptoms with the drug but did have symptoms without the drug. This experimental design helps to control for variable severity of the allergy among the subjects. The results were
               Table 1. Numbers of individuals with allergic symptoms
                 with and without a drug over two seasons
               
                               Without Drug
                                Yes     No     Totals
                          Yes    11     22       33
               With Drug       
                          No      8     39       47
               ------------------------------------
               Totals            19     61       80
     
    (i) On the basis of these data, does the drug tend to change significantly the incidence of allergy in vulnerable individuals?
    (ii) If the drug has an effect, would you recommend the drug to someone who suffers from this allergy? That is, does the drug help or hurt?
    (Warning: Although the data is in the form of a 2x2 contingency table, the Pearson chi-square test may not be appropriate. For example, a large number of (Yes,Yes) counts may simply mean that these particular individuals would have allergic symptoms no matter what. Similarly, a large number of (No,No) counts might be due to a subset of the sample who are almost never affected. Thus all of the usable information in the table is in the (Yes,No) and (No,Yes) counts. Before using either the Pearson or Fisher exact tests, read about some of the other contingency-table tests in Chapter 3 of the text.)
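The warning above points to a test that uses only the two discordant cells. One such test is McNemar's test; the following is a minimal sketch of it in Python, offered as a cross-check outside SAS and not as the required method. The function name `mcnemar` is hypothetical, and the exact P-value assumes that, under H_0, each discordant pair is equally likely to fall in either cell.

```python
from math import comb

def mcnemar(b, c):
    """McNemar test from the two discordant counts b and c.

    Returns (chi-square statistic with 1 df, exact two-sided binomial P-value).
    """
    n = b + c
    chi2 = (b - c) ** 2 / n
    # Under H_0 each discordant pair is equally likely to land in either cell,
    # so min(b, c) is compared with a Binomial(n, 1/2) distribution.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    return chi2, p_exact

# Discordant cells of Table 1:
# (Yes with drug, No without drug) = 22, (No with drug, Yes without drug) = 8.
chi2, p_exact = mcnemar(22, 8)
print(f"McNemar chi-square = {chi2:.3f}, exact two-sided P = {p_exact:.4f}")
```

The direction of the effect in part (ii) is read from which discordant cell is larger, not from the P-value.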
    
    

    2.   Suppose that the same treatment is given to patients suffering from four different but related diseases, which are labeled as Dis#A, Dis#B, Dis#C, and Dis#D. The numbers of individuals surviving for or dying within six months were collected in the following table.
        Table 2. Mortality results for four diseases
                     Dis#A          Dis#B          Dis#C          Dis#D   
                   Surv  Die      Surv  Die      Surv  Die      Surv  Die
        Treated     250  107       390  702       218  141       317  757
        Control     454  240       173  390       488  436       113  348
     
    Note that Dis#B and Dis#D appear to be more severe than the others, although all four diseases have high mortality rates in both treatment groups.

              (i) Does the treatment have a significant overall positive or negative effect on mortality over the four strata? Carry out a test that gives you a single P-value for all four tables and that is not subject to Simpson's Paradox. Do you accept or reject the hypothesis that treatment has no effect on survival? Do you get the same results for each of the diseases separately?

              (ii) Is the effect of the treatment positive or negative? That is, do relatively more treated individuals survive than control individuals? (Hint: Consider the phi coefficient for each disease.)

              (iii) Combine the diseases into one 2x2 table. What is the Pearson Chi-Square P-value for this possibly-incorrect table? Is this consistent with your answer to part (i)? What is the phi coefficient for the combined table? Is it consistent with your results in part (ii)? In the combined table, do relatively more treated individuals survive than control individuals, or vice versa?
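Parts (i)-(iii) contrast a stratified analysis with a pooled one. The sketch below, in Python as a cross-check outside SAS, computes the phi coefficient for each stratum and for the pooled table, together with an uncorrected Cochran-Mantel-Haenszel statistic. It assumes that the Cochran-Mantel-Haenszel test is an acceptable choice for the single-P-value test in part (i); the function names `phi` and `cmh` are hypothetical.

```python
from math import erfc, sqrt

# Table 2 as (treated_surv, treated_die, control_surv, control_die) per disease.
strata = {
    "Dis#A": (250, 107, 454, 240),
    "Dis#B": (390, 702, 173, 390),
    "Dis#C": (218, 141, 488, 436),
    "Dis#D": (317, 757, 113, 348),
}

def phi(a, b, c, d):
    """Phi coefficient of the 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def cmh(tables):
    """Cochran-Mantel-Haenszel statistic (1 df, no continuity correction)."""
    diff = var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        diff += a - (a + b) * (a + c) / n          # observed minus expected
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    stat = diff ** 2 / var
    p_value = erfc(sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value

stat, p_value = cmh(strata.values())
# Pooled table: add the four strata cell by cell.
pooled = tuple(sum(col) for col in zip(*strata.values()))

for name, cells in strata.items():
    print(f"{name}: phi = {phi(*cells):+.4f}")
print(f"pooled: phi = {phi(*pooled):+.4f}")
print(f"CMH statistic = {stat:.2f}, P = {p_value:.3g}")
```

With this cell ordering, a positive phi means that relatively more treated than control individuals survive.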

    
    

    3.   The average outputs in tons per acre in one season in 30 test plots for cotton grown at five different levels of an insecticide, ordered by increasing level of the insecticide, are
            Table 3. Output in tons per acre in test plots
                for five different levels of an insecticide
        
            Level1    79    79    95   109   118   150
            Level2    84    95   100   105   119   135
            Level3   109   114   121   123   124   145
            Level4    91   106   119   150   151   151
            Level5   110   113   129   131   145   165
     

    (i) Construct a plot of the output scores versus insecticide level, with insecticide level on the X axis. Can you see differences in the means of the treatment groups? Is there a trend in mean output with increasing levels of the insecticide?

    (ii) Do these scores show a significant variation by treatment group, as measured by a standard one-way layout ANOVA test? What are the degrees of freedom of the F-test (both numerator and denominator)? What is the hypothesis H_0? What is the hypothesis H_1? What is the P-value? (Hints: If you haven't seen or have forgotten one-way ANOVAs, see Chapter 7 in the text. If there are k ``treatment groups'' and a total of N observations over all treatment groups, then ``MS between'' (in the text's notation) has k-1 degrees of freedom and ``MS error'' has N-k degrees of freedom. NOTE: In the originally posted version of HW2, the last sentence was stated in the incorrect form, ``If there are k ``treatment groups'' and n observations per treatment group, then ``MS between'' (in the text's notation) has k-1 degrees of freedom and ``MS error'' has n-k degrees of freedom.'' However, hints are not binding and may occasionally be innocently misleading, and the two degrees of freedom are in the SAS output.)

    (iii) Do the individual observed output scores show a significant increase with increasing amounts of insecticide?
    (Hint: That is, do a regression of the observed output scores on the insecticide level for the five treatment groups, with insecticide level coded as 1, 2, 3, 4, 5. This assumes, as a rough approximation, that the amount of insecticide per acre varies linearly over the five insecticide levels.)
    Is there a significant regression of output on insecticide level, coded in this way? What are the degrees of freedom of the Model F-test, both numerator and denominator? What is the hypothesis H_0? What is the hypothesis H_1? What is the P-value?
    In this regression, what proportion of the variation in output is ``explained'' by the insecticide level? Is this smaller or larger than the proportion of variance ``explained'' by the ANOVA in part (ii)?
    What is the reason for the difference in significance between the conclusions of the two procedures?

    (iv) Write down the estimated regression line in part (iii).  How much additional cotton output is predicted, on the average, for each increase in insecticide level in this range of insecticide levels?
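The sums of squares behind parts (ii)-(iv) can be reproduced by hand. The following Python sketch, meant only as a check on the SAS output, computes the one-way ANOVA F statistic with its degrees of freedom and the simple regression of output on the coded level. The function names are hypothetical, and the P-values are left to SAS because the F distribution is not in the Python standard library.

```python
# Table 3: six plots at each insecticide level, coded 1..5.
groups = {
    1: [79, 79, 95, 109, 118, 150],
    2: [84, 95, 100, 105, 119, 135],
    3: [109, 114, 121, 123, 124, 145],
    4: [91, 106, 119, 150, 151, 151],
    5: [110, 113, 129, 131, 145, 165],
}

def one_way_anova(groups):
    """Return (F, df_between, df_error, R2) for a one-way layout."""
    values = [y for ys in groups.values() for y in ys]
    n_total, k = len(values), len(groups)
    grand = sum(values) / n_total
    ss_total = sum((y - grand) ** 2 for y in values)
    ss_between = sum(
        len(ys) * (sum(ys) / len(ys) - grand) ** 2 for ys in groups.values()
    )
    ss_error = ss_total - ss_between
    df_b, df_e = k - 1, n_total - k
    f_stat = (ss_between / df_b) / (ss_error / df_e)
    return f_stat, df_b, df_e, ss_between / ss_total

def simple_regression(groups):
    """Regress output on coded level; return (intercept, slope, F, df_error, R2)."""
    pairs = [(x, y) for x, ys in groups.items() for y in ys]
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    syy = sum((y - ybar) ** 2 for _, y in pairs)
    slope = sxy / sxx
    ss_model = slope * sxy  # regression sum of squares, 1 df
    f_stat = ss_model / ((syy - ss_model) / (n - 2))
    return ybar - slope * xbar, slope, f_stat, n - 2, ss_model / syy

f_a, df_b, df_e, r2_a = one_way_anova(groups)
b0, b1, f_r, df_r, r2_r = simple_regression(groups)
print(f"ANOVA: F({df_b}, {df_e}) = {f_a:.3f}, R2 = {r2_a:.4f}")
print(f"Regression: output = {b0:.3f} + {b1:.3f} * level, "
      f"F(1, {df_r}) = {f_r:.3f}, R2 = {r2_r:.4f}")
```

The regression model is a special case of the ANOVA model (group means constrained to a line), so its R2 cannot exceed the ANOVA R2, but it spends only one numerator degree of freedom.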
    
    

    4.   A zoo is interested in the dependence of blood pressure on stress in gnus. Blood pressure and stress (yy and stress) for each of 16 gnus under various conditions of stress are given in the following table. (In each of the 16 pairs of data in Table 4, yy is the first variable and stress the second variable.)
           Table 4. Blood pressure and stress for 16 gnus
        
            47    3.0         50    1.8        110    7.9        1655   15.7
           179    9.1         55    5.2       1310   12.9        2773   15.1
            56    3.6         62    2.9       3052   16.8         126    7.2
           866   12.6        175    8.6       2731   16.7         249    9.0
      

    (i) Is there a significant regression of yy on stress with this data? What P-value does SAS report? What is the model R^2?

    (ii) Construct a text plot of blood pressure yy versus stress. Include the predicted values on the same plot with plot symbol P as a comparison. Does the plot of yy versus stress look linear? How well does it follow the predicted values? (Hint: It might look slightly bowed down in the middle.)

    (iii) Construct a plot of the residuals for the regression of yy on stress against stress. Do the residuals look consistent with the assumptions of a linear regression? Do their signs and absolute values appear to be randomly distributed with respect to stress? (Hint: The negative residuals may be bunched together in the center.)

    (iv) Try regressing yy on both stress and stress*stress. (Hint: Introduce a new SAS variable stress2 for stress*stress.) What is the new model R^2? In a plot of yy on stress, do the predicted values appear to match yy more closely? Do the residuals have a more random-looking plot on stress? (Hint: Observations with higher values of stress may also have larger residuals.)

    (v) Try a regression of logyy=log(yy) on stress and stress*stress. What is the new model R^2? Do the predicted values of logyy appear to match the observed values more closely? Does the residual plot show less dependence on stress?
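The three models in parts (i), (iv), and (v) differ only in the response and in whether the squared term is included. The Python sketch below fits each by solving the normal equations and reports the model R2, as a rough check on the SAS output. The helper names `solve` and `fit` are hypothetical, and log is taken to be the natural logarithm, as in SAS.

```python
from math import log

# Table 4 as (yy, stress) pairs, read across each row.
gnus = [
    (47, 3.0), (50, 1.8), (110, 7.9), (1655, 15.7),
    (179, 9.1), (55, 5.2), (1310, 12.9), (2773, 15.1),
    (56, 3.6), (62, 2.9), (3052, 16.8), (126, 7.2),
    (866, 12.6), (175, 8.6), (2731, 16.7), (249, 9.0),
]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(rows, y):
    """Least squares of y on the columns of rows; return (coefficients, R2)."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return beta, 1 - ss_res / ss_tot

yy = [y for y, _ in gnus]
linear = [[1.0, s] for _, s in gnus]            # intercept, stress
quadratic = [[1.0, s, s * s] for _, s in gnus]  # intercept, stress, stress2
for label, rows, resp in [
    ("yy on stress", linear, yy),
    ("yy on stress, stress2", quadratic, yy),
    ("logyy on stress, stress2", quadratic, [log(y) for y in yy]),
]:
    beta, r2 = fit(rows, resp)
    print(f"{label}: R2 = {r2:.4f}")
```

Note that R2 values for yy and for logyy are on different response scales, so the residual plots, not R2 alone, are the better guide when comparing parts (iv) and (v).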

    
    

    5.   An experimenter is interested in how a quantity that she calls zubricity depends on three other quantities called drubness, viscosity, and speed. The experimenter is fairly certain that drubness has a significant effect on zubricity, but is not sure about viscosity and speed. Twenty measurements of zubricity, drubness, viscosity, and speed are recorded in Table 5. For definiteness, call the variables ZUBRIC, DRUBNESS, VISCOSTY, and SPEED.
            Table 5: Zubricity and Covariates
            -----------------------------------
             OBS  Zubric    Drubn  Visc  Speed
            -----------------------------------
              1    310       16     27     12
              2    210       17     36     10
              3    450       24     40     20
              4    390       24     44     15
              5    780       26     44      8
              6    330       28     53     18
              7    580       39     55     19
              8    330       22     56     24
              9    400       29     57     16
             10    230       28     58     17
             11    470       34     60     24
             12    510       35     61     17
             13    490       37     66     20
             14    450       36     68     11
             15    630       46     73     21
             16    400       38     78      6
             17    760       34     80     22
             18    590       47     83     17
             19    520       43     84     12
             20    540       44     89     17

    (i) Is there a significant regression of ZUBRIC on the three covariates? Use SAS (proc reg or proc glm) to find out. What is the model P-value? What is the model R^2? What is the value of the F-statistic that led to the model P-value? How many degrees of freedom does it have in its numerator and denominator? How did SAS arrive at these numbers?

    (ii) Which covariates are significant in the Parameter Estimate table in the output? What are their P-values? How many degrees of freedom do the T-statistics have for the tests in this table? Is the experimenter correct that Drubness has a significant effect on Zubricity?

    (iii) Obtain residual plots (with the residuals as the Y variable) for the predicted value, Drubness, Viscosity, and Speed. Do these look all right? That is, do they look like the residuals are normally distributed with values that are independent of the X-coordinates? Are there any noticeable outliers? If so, which observations are they?

    (iv) Obtain a list of Studentized residuals and Cook's D values for all of the observations. Do any of these appear to be out of line? If so, which ones? Use a criterion of either 3.0 for Studentized residuals or else 0.700 for Cook's D (or both). (Note: You should be able to get all of the information that you need for parts (i)-(iv) from either one run of proc reg or else one run of proc glm plus a proc print for associated variables.)

    (Hints: You should be able to tell which are the offending observations in the residual plots from the values of their residuals and predicted values. However, an easier way is to tag the values in the plots in such a way as to make them easy to identify.
                For plots generated by plot statements within a proc reg procedure, enter a ``paint'' command like (for example) paint obs=17 / symbol='X'; BEFORE the plot statement, where obs stands for the ordinal value of the point (that is, the row number or OBS value in the data set), or
                For plots generated by proc plot, enter the plot statement as (for example) plot Y*X $ obs;   or   plot Y*X='*' $ obs;. The $ obs option causes the ordinal value to be displayed next to each plotted point.
                NOTE: The originally posted form of HW2 had `ord' instead of `obs', assuming that `obs' is the first column in Table 5. Both syntaxes work with `ord' replaced by any SAS variable in the current dataset.)
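For reference, the two diagnostics in part (iv) can be written in terms of the ordinary residuals e_i and the leverages h_i. The formulas below are a sketch under the assumption that the Studentized residuals meant here are the internally Studentized ones (as opposed to the deleted version); p denotes the number of fitted parameters including the intercept, so p = 4 for this model, and s^2 is the error mean square.

```latex
h_i = x_i^{T} (X^{T}X)^{-1} x_i, \qquad
r_i = \frac{e_i}{s\sqrt{1 - h_i}}, \qquad
D_i = \frac{r_i^{2}}{p} \cdot \frac{h_i}{1 - h_i}
```

Here x_i^T is the i-th row of the design matrix X, r_i is the Studentized residual, and D_i is Cook's D for observation i.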

    The experimenter was disappointed with the regression output, since variables that she thought should have been significant were not significant. After looking at the data again, the experimenter began to wonder about observations #5, #10, and #17. After checking with her technician, she found that the technician's handwriting had been misread, and that the zubricity value of 780 in observation #5 should have been 480 and the value of 760 in observation #17 should have been 460. Observation #10 was correct as it was originally recorded.

    (v) Change the values of Zubricity in the two incorrect observations and re-analyze the regression. What is the new model P-value? Model R^2? Is it larger than before?
                Which covariates are significant in the new Parameter Estimate table? Does the output now support the experimenter's hypothesis that Drubness has a significant effect on Zubricity?
    
    

    6.   For the corrected data in Problem 5, generate the parameter estimates and the Student-t P-values using SAS's built-in matrix language, proc iml. (Hint: See ThreeRegIml.sas on the Math475 Web site.)
                (i) Did you get the same parameter estimates and P-values as SAS's built-in regression procedure in part (v) in Problem 5?
                (ii) What is the P-value for the significance of Drubness to one digit of accuracy in exponential notation? (For example, 5x10^{-7} or 3x10^{-3} or 7x10^{-11}. NOTE: In the originally posted version of HW2, `one digit of accuracy' was replaced by `one degree of freedom'. The examples are clearer than either wording.)
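The matrix computation that Problem 6 asks for follows the standard least-squares formulas, sketched below. The notation is an assumption for illustration: X is the 20 x 4 design matrix (a column of ones followed by DRUBNESS, VISCOSTY, and SPEED), y is the vector of corrected ZUBRIC values, n = 20, and p = 4.

```latex
\hat\beta = (X^{T}X)^{-1} X^{T} y, \qquad
s^{2} = \frac{(y - X\hat\beta)^{T}(y - X\hat\beta)}{n - p}, \qquad
t_j = \frac{\hat\beta_j}{s\sqrt{\left[(X^{T}X)^{-1}\right]_{jj}}}, \qquad
P_j = 2\,\Pr\!\left(T_{n-p} > |t_j|\right)
```

Each t_j is referred to a Student-t distribution with n - p = 16 degrees of freedom, which is where the two-sided P-values in the Parameter Estimate table come from.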
    
    
