Math 420 Homework 1 - Spring 2008

Click here for Prof. Sawyer's home page

Text references are to Statistics for Experimenters: Design, Innovation, and Discovery, 2nd edition, G. Box, J. S. Hunter, and W. G. Hunter, John Wiley and Sons, 2005, ISBN 978-0-471-71813-0.

HOMEWORK #1 due Wednesday 2-20

NOTE: Organize your homework in the following order:

(i) your answers to all questions (written answers, not SAS output),

(ii) all of your SAS programs, and

(iii) all of the SAS output that you got.

Add page numbers to your homework so that you can make references from part (i) to part (iii): for example, so that you can say things like, "The answer in part (a) is 17. The scatterplot for part (b) is on page #Y below." Include your SAS output even if you don't refer to it explicitly. Except for forward references like these, a grader should not have to look beyond part (i) of your homework unless he or she thinks that you have done something wrong.

Include your name in a title statement so that your name will appear at the top of each output page.

If a problem asks you to do a statistical test, EXPLAIN CLEARLY what the null hypothesis H_0 is, what test was used, what the P-value is, and whether the data is significant, highly significant, or neither. Include this as part of your answer in part (i).

Problem 1. Seven (7) cars were chosen from each of 5 similar types of car. In randomized order, each car was driven over a test course. Fuel efficiencies are given in Table 1.

    Table 1. Fuel efficiencies of cars of 5 types
    ------------------------------------------------
       A   189  202  194  215  206  241  166
       B   132  181  128  195  135  182  136
       C   200  187  181  200  194  200  170
       D   173  159  160  154  181  155  216
       E   183  227  154  225  172  185  234

Use SAS to answer the following questions:

(i) Display the data in your SAS dataset using proc print to make sure that you have entered the data correctly.

(Hints: See oneway.sas on the Math420 Web site. Note that oneway.sas has three alternative ways to import data from a one-way layout into a SAS dataset, one of which allows you to enter the data exactly as in Table 1. Use whatever format seems easiest.)
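As a hand check on the data entry (not a replacement for the SAS data step), the Table 1 layout can be reshaped into one record per car in a few lines of Python. This is only a sketch: the names TABLE1 and to_long are illustrative assumptions and have nothing to do with how oneway.sas reads the data.

```python
# Table 1 as laid out in the assignment: a type label, then 7 readings.
TABLE1 = """\
A 189 202 194 215 206 241 166
B 132 181 128 195 135 182 136
C 200 187 181 200 194 200 170
D 173 159 160 154 181 155 216
E 183 227 154 225 172 185 234
"""


def to_long(text):
    """Turn one-row-per-group text into a list of (group, value) records."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        group, values = fields[0], fields[1:]
        records.extend((group, float(v)) for v in values)
    return records


records = to_long(TABLE1)
for group, value in records:
    print(group, value)
```

The printed list should have 35 lines, 7 per type, which is the same thing that proc print lets you verify inside SAS.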

(ii) Display a vertical histogram of the data using group as a plotting symbol. Do the 5 treatment groups appear to have the same means? Are there any features of the data that you can see? (Hint: Try proc chart with the statement

vbar yield / subgroup=type;

where yield and type stand for your response and group variables.)

(iii) Use a one-way ANOVA to test the hypothesis H_0 that the five treatment groups have the same mean. What is the P-value? Do you accept or reject H_0? The P-value of an F-test, P = P(F_{p,q} \ge F_{obs}), depends on two degrees of freedom, p and q. What are the two degrees of freedom of the F-test in this case?
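The arithmetic behind the one-way ANOVA table can be sketched outside SAS as a cross-check on the sums of squares and the F ratio. The code below is an illustrative assumption, not the course's method: the names GROUPS and one_way_anova are made up here, and the P-value is deliberately left to SAS because the Python standard library has no F distribution.

```python
from statistics import mean

# Table 1 readings, keyed by car type.
GROUPS = {
    "A": [189, 202, 194, 215, 206, 241, 166],
    "B": [132, 181, 128, 195, 135, 182, 136],
    "C": [200, 187, 181, 200, 194, 200, 170],
    "D": [173, 159, 160, 154, 181, 155, 216],
    "E": [183, 227, 154, 225, 172, 185, 234],
}


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, and the observed F ratio."""
    values = [y for ys in groups.values() for y in ys]
    grand_mean = mean(values)
    # Between-group SS: group size times squared distance of group mean from grand mean.
    ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
    # Within-group SS: squared distance of each observation from its own group mean.
    ss_within = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    ss_total = sum((y - grand_mean) ** 2 for y in values)
    df_between = len(groups) - 1           # p: number of groups minus one
    df_within = len(values) - len(groups)  # q: observations minus number of groups
    f_obs = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "df_between": df_between,
        "df_within": df_within,
        "f_obs": f_obs,
    }


result = one_way_anova(GROUPS)
for name, value in result.items():
    print(f"{name:>10}: {value:.4f}")
```

The between and within sums of squares should add up to the total, and the printed values should agree with the ANOVA table in your SAS output.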

(iv) For the residuals, display a vertical histogram using group as the plotting symbol as in part (ii). Does the distribution look like it might be normal? Are there any obvious associations between group and the residuals?

(v) Use the Shapiro-Wilk test to test the hypothesis H_0 that the residuals are normal. Recall that the Shapiro-Wilk test is sensitive to outliers with respect to a normal distribution. What is the P-value? Do you accept or reject H_0?

(vi) Have SAS construct a normal plot of the residuals. Does the normal plot look approximately like a straight line through the point (0,0)? (Hints for parts (v)-(vi): Use

proc univariate normal plot

with var resid. See oneway.sas).
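To see what a normal plot of the residuals is made of, the following sketch pairs each sorted residual with a standard normal quantile. It rests on assumptions: the plotting positions (i - 0.375)/(n + 0.25) are one common convention (Blom's) and may not match what SAS draws, and the names residuals and normal_plot_points are invented for this example.

```python
from statistics import NormalDist, mean

# Table 1 readings, keyed by car type.
GROUPS = {
    "A": [189, 202, 194, 215, 206, 241, 166],
    "B": [132, 181, 128, 195, 135, 182, 136],
    "C": [200, 187, 181, 200, 194, 200, 170],
    "D": [173, 159, 160, 154, 181, 155, 216],
    "E": [183, 227, 154, 225, 172, 185, 234],
}


def residuals(groups):
    """Residual = observation minus its own group mean (the one-way ANOVA fit)."""
    return [y - mean(ys) for ys in groups.values() for y in ys]


def normal_plot_points(values):
    """Pair each sorted value with a standard normal quantile.

    The plotting position (i - 0.375) / (n + 0.25) is an assumed convention.
    """
    n = len(values)
    std_normal = NormalDist()
    return [
        (std_normal.inv_cdf((i - 0.375) / (n + 0.25)), v)
        for i, v in enumerate(sorted(values), start=1)
    ]


points = normal_plot_points(residuals(GROUPS))
for quantile, resid in points:
    print(f"{quantile:8.3f} {resid:8.2f}")
```

If the residuals are roughly normal, plotting the second column against the first gives points close to a straight line through (0,0), since the residuals have mean zero.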

Problem 2. In the text, Problem 4 page 171 considers the effects of four treatments A,B,C,D in 32 consecutive trials for a process that looks like it might be unstable over time.

(i) Display a vertical histogram of the data using group as a plotting symbol as in Problem 1.

(ii) Carry out a one-way ANOVA test for the hypothesis H_0 of equal means for the four treatments. Do you accept or reject H_0? What is the P-value?

(iii) Plot the residuals of the one-way ANOVA against time. Do they look random over time? Or does there appear to be an underlying drift in the process itself?

(iv) Try controlling for instability over time by introducing a blocking factor consisting of eight consecutive blocks of 4 observations each. That is, set Block=1 for observations n=1,2,3,4, Block=2 for observations n=5,6,7,8, up to Block=8 for n=29,30,31,32. Then test the hypothesis H_0 of part (ii) in a randomized block design with two factors, treatment and (time) Block. Do you accept or reject H_0 in the randomized block design? What is the P-value for Treatment? Is it less or more than in part (ii)?
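The block assignment in part (iv) is an integer division of the observation number. A minimal sketch of the rule, with the function name block_of chosen here only for illustration:

```python
def block_of(n, block_size=4):
    """Block number for observation n (1-based): 1..4 -> 1, 5..8 -> 2, and so on."""
    if n < 1:
        raise ValueError("observation numbers start at 1")
    return (n - 1) // block_size + 1


# Block labels for the 32 consecutive trials.
blocks = {n: block_of(n) for n in range(1, 33)}
for n, block in blocks.items():
    print(n, block)
```

The same arithmetic can be written in a SAS data step when you create the Block variable.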

(v) Plot the residuals of the two-way ANOVA against time. Do they look more random than in part (iii)?

Problem 3. Consider the burn data in Problem #2, Chapter 4, on page 170 of the text.

(i) What is this design called?

(ii) How can the design be randomized to avoid accidental association of one or more of the three factors with the order of the experiments?

(iii) Using an ANOVA, which of the factors are significant? What are the P-values of the significant factors?

(iv) Does any one subject stand out from the others in average healing time? (That is, with an unusually small or unusually long average healing time.) What is the P-value for the difference between that subject's score and the second largest score? Is it significant? Highly significant? Is it still significant after a Bonferroni correction for all possible pairwise comparisons of six subjects?

(Warning: Make sure that the Type I and Type III ANOVA tables in the ANOVA output are the same. If they are not the same, then your design is not balanced, and you have entered the data incorrectly.

Hints: Do a two-sample t-test with the MSE from the ANOVA table in place of the pooled variance. The Student-t degrees of freedom will be the same as the error degrees of freedom for the ANOVA. Recall that the Bonferroni correction for an observation that might have been chosen as the largest by chance among M different events is to multiply the P-value for the observation by M.)
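The two steps in the hint can be written out as small functions. This is a sketch under stated assumptions: the function names are invented, the numbers passed in at the bottom are hypothetical and are NOT the burn data from the text, and capping the corrected P-value at 1 is a common convention added here rather than part of the hint. The P-value for the t statistic still has to come from SAS or a t table.

```python
from math import comb, sqrt


def t_with_mse(mean_a, mean_b, mse, n_a, n_b):
    """Two-sample t statistic with the ANOVA MSE in place of the pooled variance."""
    return (mean_a - mean_b) / sqrt(mse * (1.0 / n_a + 1.0 / n_b))


def bonferroni(p_value, m):
    """Multiply a P-value by the number of comparisons m, capped at 1."""
    return min(1.0, m * p_value)


# Number of possible pairwise comparisons among six subjects.
n_pairs = comb(6, 2)

# Hypothetical inputs for illustration only; they are not from the text's data.
t_stat = t_with_mse(mean_a=10.0, mean_b=8.0, mse=4.0, n_a=2, n_b=2)
print(n_pairs, t_stat, bonferroni(0.01, n_pairs))
```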

Problem 4. An investigator wants to study the yield of a chemical process depending on a particular catalyst as a function of batch, level of acidity, settling time, and catalyst concentration. Since only 5 batches of the raw material were available, and each batch has only enough material for 5 runs, the investigator chose to use a Greco-Latin square design. She measured yield for 5 different runs for each batch with Acidity, Settling time (A,B,C,D,E), and Catalyst concentration (al,be,ga,de,ep) set as in Table 2 below. The yield results of her 25 runs are also in Table 2.

       Table 2. Yields of a chemical process
           Acid Concentration
 Batch      1          2          3          4          5
 ------------------------------------------------------------
  1       A,al=26    B,be=16    C,ga=19    D,de=16    E,ep=13
  2       B,ga=18    C,de=21    D,ep=18    E,al=11    A,be=21
  3       C,ep=20    D,al=12    E,be=16    A,ga=25    B,de=13
  4       D,be=15    E,ga=15    A,de=22    B,ep=14    C,al=17
  5       E,de=10    A,ep=24    B,al=17    C,be=17    D,ga=14

Use SAS to answer the following questions:

(i) For each of the four factors (Batch, Acidity, Settling time, Catalyst concentration), calculate the means of each of the levels. Which factor seems the most variable?
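The level means in part (i) can be checked by hand from Table 2. The sketch below parses the table cells and averages the yields by level of each factor; the dictionary keys (batch, acid, settle, catalyst) and the function level_means are naming assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Table 2 cells: rows are batches 1-5, columns are acid levels 1-5.
# Each cell is "settling time,catalyst concentration=yield".
TABLE2 = """\
A,al=26 B,be=16 C,ga=19 D,de=16 E,ep=13
B,ga=18 C,de=21 D,ep=18 E,al=11 A,be=21
C,ep=20 D,al=12 E,be=16 A,ga=25 B,de=13
D,be=15 E,ga=15 A,de=22 B,ep=14 C,al=17
E,de=10 A,ep=24 B,al=17 C,be=17 D,ga=14
"""

runs = []
for batch, line in enumerate(TABLE2.splitlines(), start=1):
    for acid, cell in enumerate(line.split(), start=1):
        labels, y = cell.split("=")
        settle, catalyst = labels.split(",")
        runs.append({"batch": batch, "acid": acid, "settle": settle,
                     "catalyst": catalyst, "yield": float(y)})


def level_means(runs, factor):
    """Mean yield at each level of the named factor."""
    by_level = defaultdict(list)
    for run in runs:
        by_level[run[factor]].append(run["yield"])
    return {level: mean(ys) for level, ys in sorted(by_level.items())}


for factor in ("batch", "acid", "settle", "catalyst"):
    print(factor, level_means(runs, factor))
```

Because every level of every factor appears exactly 5 times, each mean is an average of 5 yields.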

(ii) Using an ANOVA analysis, which of the four factors have a significant variation in level means? Of the factors that are significant, what are their P-values?

(iii) Of the factors that are significant, does there appear to be a particular factor level that gave a noticeably higher or lower level mean? Which level was it? (Just discuss; you needn't do a Bonferroni analysis.)

Problem 5. Consider the data in Problem 8, p227, of the text:

    Table 3. Yield as a function of three factors
     Run   Temp Catal pH   Week1  Week2
    ------------------------------------
      1     L    L    L    60.4    62.1
      2     H    L    L    75.4    73.1
      3     L    H    L    61.2    59.6
      4     H    H    L    67.3    66.7
      5     L    L    H    66.0    63.3
      6     H    L    H    82.9    82.4
      7     L    H    H    68.1    71.3
      8     H    H    H    75.3    77.1

where L,H are low and high values. Answer the following questions, using SAS if convenient:

(i) Treating Week as a true replication, so that Table 3 is a 2^3 design with two replications per cell, which of the main effects are significant? Which of the interactions are significant? What are the P-values of the significant effects?

(ii) Investigate any or all of the significant two-way interactions. Display an interaction plot for each significant interaction. What do these interactions tell you about the yield as a function of the factors involved?
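An interaction plot for two factors displays the four cell means obtained by averaging over everything else. A minimal sketch of that table of means follows; the function name interaction_means is an assumption, and the Temp by Catal pair is used only as an arbitrary example, not as a claim that this interaction is significant.

```python
from collections import defaultdict
from statistics import mean

FACTORS = ("Temp", "Catal", "pH")

# Table 3: factor levels followed by the Week1 and Week2 yields.
TABLE3 = [
    ("L", "L", "L", 60.4, 62.1),
    ("H", "L", "L", 75.4, 73.1),
    ("L", "H", "L", 61.2, 59.6),
    ("H", "H", "L", 67.3, 66.7),
    ("L", "L", "H", 66.0, 63.3),
    ("H", "L", "H", 82.9, 82.4),
    ("L", "H", "H", 68.1, 71.3),
    ("H", "H", "H", 75.3, 77.1),
]


def interaction_means(rows, factor_a, factor_b):
    """Mean yield in each (factor_a, factor_b) cell, averaged over the third
    factor and over both weeks."""
    ia, ib = FACTORS.index(factor_a), FACTORS.index(factor_b)
    cells = defaultdict(list)
    for row in rows:
        cells[(row[ia], row[ib])].extend(row[3:])
    return {key: mean(ys) for key, ys in cells.items()}


table = interaction_means(TABLE3, "Temp", "Catal")
for catal in ("L", "H"):
    for temp in ("L", "H"):
        print(f"Catal={catal} Temp={temp}: {table[(temp, catal)]:.2f}")
```

Plotting the means against Temp with one line per Catal level gives the interaction plot; lines that are far from parallel suggest an interaction.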

(iii) Display a residual plot for predicted values. That is, a plot of the residuals (on the Y axis) and the predicted values on the X axis. Are there any obvious outliers? Do the residuals appear normal?

Problem 6. Consider the data in Problem 20, p233, of the text:

    Table 4. Impurities in a Chemical Process
       as a Function of Four Factors
     Run   Conc NaOH Speed Temp   Impurities
    ---------------------------------------------
       1     L    L    L    L      38
       2     H    L    L    L      40
       3     L    H    L    L      27
       4     H    H    L    L      30
       5     L    L    H    L      58
       6     H    L    H    L      56
       7     L    H    H    L      30
       8     H    H    H    L      32
       9     L    L    L    H      59
      10     H    L    L    H      62
      11     L    H    L    H      53
      12     H    H    L    H      50
      13     L    L    H    H      79
      14     H    L    H    H      75
      15     L    H    H    H      53
      16     H    H    H    H      54

where L,H are low and high values. Answer the following questions, using SAS if convenient:

(i) Find the parameter estimates for all 15 effects in a four-factor factorial model. Which estimated effects seem the largest? Do the three- and four-way interactions seem of reasonable size in comparison with the others?

(Hint: It may be worthwhile to write L,H as -1,+1 and use a regression model. See FourFac.sas or NormPlot.sas on the Math420 Web site.)
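The -1,+1 coding in the hint can be sketched as follows. With this coding the columns of a full two-level factorial are orthogonal, so each least-squares coefficient is the sum of (column times response) divided by the number of runs, and the usual effect (high average minus low average) is twice the coefficient. The names DESIGN and coefficients are assumptions for this example and are not taken from FourFac.sas or NormPlot.sas.

```python
from itertools import combinations, product
from math import prod

FACTORS = ("Conc", "NaOH", "Speed", "Temp")

# Table 4 responses in run order 1..16 (Conc changes fastest, Temp slowest).
IMPURITIES = [38, 40, 27, 30, 58, 56, 30, 32, 59, 62, 53, 50, 79, 75, 53, 54]

# Rebuild the -1/+1 design in the same order as Table 4.
# product() varies its last position fastest, so each tuple is reversed.
DESIGN = [dict(zip(FACTORS, reversed(levels)))
          for levels in product((-1, 1), repeat=4)]


def coefficients(design, y):
    """Regression coefficients for the intercept and all 15 factorial terms."""
    n = len(y)
    coef = {"Intercept": sum(y) / n}
    for k in range(1, len(FACTORS) + 1):
        for term in combinations(FACTORS, k):
            column = [prod(row[f] for f in term) for row in design]
            coef["*".join(term)] = sum(c * yi for c, yi in zip(column, y)) / n
    return coef


coef = coefficients(DESIGN, IMPURITIES)
# List the terms from largest to smallest absolute coefficient.
for name, b in sorted(coef.items(), key=lambda item: -abs(item[1])):
    print(f"{name:<22} coefficient={b:8.3f}  effect={2 * b:8.3f}")
```

The model has 16 parameters for 16 runs, so it fits the data exactly and leaves no residual degrees of freedom; the "effect" column is not meaningful for the intercept row.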

(ii) Using the average of the five SS values for the four three-way interactions and the one four-way interaction as MSE, obtain P-values for the four main effects and the six two-way interactions. Which of these are significant? What are the P-values of the effects that are significant?
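Written out, the pooled error term and the resulting F-test look like this. The letters A, B, C, D are shorthand introduced here for Conc, NaOH, Speed, Temp (an assumption for notation only). Each term in a two-level factorial carries one degree of freedom, so the numerator has 1 degree of freedom and the pooled MSE has 5.

```latex
\[
\mathrm{MSE} = \frac{1}{5}\left( SS_{ABC} + SS_{ABD} + SS_{ACD} + SS_{BCD} + SS_{ABCD} \right),
\qquad
F_{\mathrm{obs}} = \frac{SS_{\mathrm{effect}}}{\mathrm{MSE}},
\qquad
P = P\left( F_{1,5} \ge F_{\mathrm{obs}} \right).
\]
```

If the effects are estimated by regression on -1,+1 columns with N = 16 runs, the sum of squares for a term with coefficient b is SS = N b^2.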

(iii) Generate an interaction plot for each two-way interaction that is significant. What does each interaction plot tell you about the effects of the two factors on the level of impurities?

(iv) Generate a normal probability plot and a normal PP plot for the 15 effect parameter estimates. Restricting to the normal probability plot, do any of the 15 points appear to be outliers? What effects do they correspond to? Are they the same as the significant effects in part (ii)?
