Math 420 Homework 3 - Spring 2008

Click here for Prof. Sawyer's home page

Text references are to

Statistics for Experimenters: Design, Innovation, and Discovery, 2nd edition,

G. Box, J. S. Hunter, and W. G. Hunter, John Wiley and Sons, 2005, ISBN 978-0471-71813-0

HOMEWORK #3 due Wednesday 4-21

NOTE: Organize your homework in the following order:

(i) your answers to all questions (written answers, not SAS output),

(ii) all of your SAS programs, and

(iii) all of the SAS output that you got.

Add page numbers to your homework so that you can make references from part (i) to part (iii): for example, so that you can say things like, "The answer in part (a) is 17. The scatterplot for part (b) is on page #Y below." Include your SAS output even if you don't refer to it explicitly. Except for forward references like these, a grader should not have to look beyond part (i) of your homework unless he or she thinks that you have done something wrong.

Include your name in a title statement so that your name will appear at the top of each output page.

Include the Problem Number in a title2 statement to make it clearer what output pages belong to what problem.

If a problem asks you to do a statistical test, EXPLAIN CLEARLY what the null hypothesis H_0 is, what test was used, what the P-value is, and whether the data is significant, highly significant, or neither. Include this as part of your answer in part (i).

Problem 1. A gasoline refinery manager wants to compare the efficiency of two different formulations of gasoline under three different seasonal conditions and for four different types of automobile. The resulting efficiencies are recorded in Table 1.

     Table 1 --- Gasoline Efficiencies for Two Blends
        for different Seasons and Automobiles

                    Blend1               Blend2
               Fall Winter Summer   Fall Winter Summer
      Auto1     191   206   183      156   172   179
      Auto2     192   209   202      179   187   190
      Auto3     159   197   188      215   204   243
      Auto4     187   188   203      248   245   231

The manager is primarily interested in overall efficiency of the two blends (factor A=Blend) for different levels of two blocking effects, B=Season and C=Automobile. He considers using a split-plot design for the principal factor A and two blocking factors B and C.

(i) Find the Mean Sum of Squares (MSS) statistics for each of the seven effects in the full 2x3x4 factorial design. (Since the design has one observation per cell, the seven effects use up all 23 degrees of freedom, so there is nothing left for an error term or for P-values.) For which effects are the MSS statistics the largest? On the basis of the MSS statistics, is it reasonable to assume that the terms for the between-blocking-factor interaction B*C and the three-way interaction A*B*C can be used to provide reasonable estimates of the error variance?

(ii) Using the effects B*C and A*B*C to estimate the error (whether they are reasonable or not), do the two gasoline blends differ significantly, averaged over Season and Automobile? Are there significant main effects for Season or Automobile? What are the P-values of the significant effects? For any main effect that is significant, which levels are associated with the smallest and largest estimated values of efficiency? (Hint: See SplitPlot.sas on the Math420 Web site. In the proc glm model statement, either list all three main effects first before listing the two interactions with A=Blend, or else ignore the Type I table in the proc glm output and use the Type III table for P-values.)

(iii) Are there significant interactions between Blend and either Season or Automobile? What are the P-values of the significant interactions? For any interaction that is significant, what does the interaction plot look like? What does the interaction mean?
(Hint: If you say proc plot; plot MeanYeff*XX=Blend; run; for an interaction plot, then SAS will use the first letter of the levels of Blend as the plotting symbol. Make sure that you define names for the two levels of Blend such that the first letters of the level names are distinct. Otherwise, either the interaction plot with Blend as the plotting symbol will be uninterpretable, or else you will have to go to the trouble of defining a separate plotting-symbol variable for Blend that does have distinct first letters.)
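
The remark in part (i) of Problem 1 about having nothing left for P-values can be checked by counting degrees of freedom. The following is a minimal Python sketch, not part of the course's SAS materials; the dictionary and variable names are chosen only for this illustration. It counts the degrees of freedom of each of the seven effects in a 2x3x4 design with one observation per cell, and the error degrees of freedom obtained if B*C and A*B*C are pooled as in part (ii).

```python
from itertools import combinations
from math import prod

# Number of levels of each factor: A=Blend, B=Season, C=Automobile.
levels = {"A": 2, "B": 3, "C": 4}

# Degrees of freedom of each main effect and interaction:
# the product of (levels - 1) over the factors in the term.
df = {
    "*".join(term): prod(levels[f] - 1 for f in term)
    for size in (1, 2, 3)
    for term in combinations(levels, size)
}

n_obs = prod(levels.values())                    # 24 cells, one observation each
error_df_full = (n_obs - 1) - sum(df.values())   # left over in the full model
error_df_pooled = df["B*C"] + df["A*B*C"]        # error df when B*C and A*B*C are pooled

for term, d in df.items():
    print(f"{term:6s} df = {d}")
print("error df, full model:", error_df_full)
print("error df, B*C and A*B*C pooled:", error_df_pooled)
```

The full model leaves zero degrees of freedom for error, which is why an error estimate has to come from effects that are assumed to be negligible.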

Problem 2. An experimenter wants to test the effect of 8 factors, which he calls A B C D E F G H, on a response variable YY associated with an industrial process. He can afford to do 16 runs and decides to use a 2_{IV}^{8-4} fractional factorial design. The High/Low settings of the eight factors and the output that he measures are in the following table.

 Table 2. Output of an industrial experiment with High/Low
    settings of 8 factors
   Obs     A  B  C  D    E  F  G  H   YY
   ---------------------------------------
     1.   -1 -1 -1 -1   -1 -1 -1 -1   69.8
     2.    1 -1 -1 -1    1  1  1 -1   89.8
     3.   -1  1 -1 -1    1  1 -1  1   62.6
     4.    1  1 -1 -1   -1 -1  1  1   83.0
     5.   -1 -1  1 -1    1 -1  1  1   73.1
     6.    1 -1  1 -1   -1  1 -1  1   79.5
     7.   -1  1  1 -1   -1  1  1 -1  123.5
     8.    1  1  1 -1    1 -1 -1 -1   89.3
     9.   -1 -1 -1  1   -1  1  1  1   41.7
    10.    1 -1 -1  1    1 -1 -1  1   33.4
    11.   -1  1 -1  1    1 -1  1 -1   72.7
    12.    1  1 -1  1   -1  1 -1 -1   50.2
    13.   -1 -1  1  1    1  1 -1 -1   44.8
    14.    1 -1  1  1   -1 -1  1 -1   87.8
    15.   -1  1  1  1   -1 -1 -1  1   32.1
    16.    1  1  1  1    1  1  1  1   83.8

Note that the columns for A B C D are in Yates order. Since this is a 2_{IV}^{8-4} design, the settings under E F G H are the same as E=ABC, F=ABD, G=ACD, and H=BCD. (Hint: See FracFac84.sas on the Math420 Web site for a similar analysis of a 2_{IV}^{8-4} design. Recall that this design confounds the 2^8=256 possible effects involving the 8 factors into 16 groups with 16 effects each. Of these, the main effects A B C D E F G H are confounded only with 3-way or higher interactions, and the 8*7/2=28 two-way interactions are confounded together into 7 groups with 4 two-way and 12 higher-order interactions each. A complete independent (that is, unconfounded) set of variables is A B C D E F G H AB AC AD BC BD CD ABCD.)
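
The construction described above can be sketched in a few lines. The following Python sketch is only an illustration (the course programs are in SAS, and all function names here are invented for this example). It rebuilds the 16 runs from the Yates-order columns A B C D and the generators E=ABC, F=ABD, G=ACD, H=BCD, and lists the effects that are confounded with a given effect, using AB as the example.

```python
from itertools import product

FACTORS = "ABCDEFGH"
GENERATORS = {"E": "ABC", "F": "ABD", "G": "ACD", "H": "BCD"}


def full_design():
    """16 runs: A B C D in Yates order (A changes fastest), E F G H from the generators."""
    design = []
    for d, c, b, a in product((-1, 1), repeat=4):
        row = {"A": a, "B": b, "C": c, "D": d}
        for name, word in GENERATORS.items():
            value = 1
            for letter in word:
                value *= row[letter]
            row[name] = value
        design.append(row)
    return design


def multiply(w1, w2):
    """Product of two effect words; a letter that appears twice cancels (x*x = 1)."""
    return "".join(sorted(set(w1) ^ set(w2)))


def defining_words():
    """The 16 words (including the identity '') generated by ABCE, ABDF, ACDG, BCDH."""
    words = {""}
    for name, word in GENERATORS.items():
        g = multiply(name, word)
        words |= {multiply(g, w) for w in words}
    return words


def alias_group(effect):
    """All 16 effects confounded with the given effect, shortest words first."""
    return sorted((multiply(effect, w) for w in defining_words()),
                  key=lambda w: (len(w), w))


design = full_design()
print([design[1][f] for f in FACTORS])                  # settings for run 2
print([w for w in alias_group("AB") if len(w) == 2])    # two-way aliases of AB
```

Comparing a few printed runs with Table 2 is a quick way to confirm that the generators were entered correctly.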

(i) Find the parameter estimates for all 15 effects in the design (other than the intercept) and sort them in decreasing order. Do any of the estimates appear to stick out on the high or low side?

(ii) Do the two-way interactions AB AC AD BC BD CD and four-way interaction ABCD appear to be relatively small in comparison with the largest estimates in absolute value? Do a regression analysis of the 8 main effects A B C D E F G H using the 7 two-way and four-way interactions to estimate the error. Which of the main effects are significant? What are the P-values of the significant effects? (Hint: In proc reg and proc glm in SAS, any effects that you do not list in the model statement are used for error.)

(iii) Construct normal probability plots and P-P plots of the 15 effect estimates in part (i). Do any appear to be outliers? Which effects do they correspond to?

(iv) Assume that the three outliers that you identified in part (iii) are active and the 5 other factors are inert. Do a 2^3 factorial design (with 2 observations per cell) analysis of the data in Table 2 using the three active factors. Which of the main effects in this analysis are significant? Which of the interactions? Find the P-values of the significant effects. Is this consistent with your answers in parts (ii) and (iii)? Which analysis would you trust more?
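
The computations behind parts (i) and (iii) of Problem 2 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the course's SAS approach. The function names are invented for this example, the responses are made-up numbers for a small 2^2 design (not data from this homework), and the plotting positions (i-0.5)/n are one common convention among several. For orthogonal columns coded as +1/-1, each least-squares parameter estimate is the dot product of the column with the response divided by the number of runs. A normal probability plot pairs the sorted estimates with standard normal scores.

```python
from statistics import NormalDist


def coefficient_estimates(columns, y):
    """Least-squares coefficients for orthogonal +1/-1 columns: (column . y) / N."""
    n = len(y)
    return {name: sum(c * v for c, v in zip(col, y)) / n
            for name, col in columns.items()}


def normal_plot_points(estimates):
    """Pair each estimate (sorted ascending) with a standard normal score."""
    ordered = sorted(estimates.items(), key=lambda kv: kv[1])
    n = len(ordered)
    z = NormalDist()  # standard normal distribution
    return [(name, value, z.inv_cdf((i + 0.5) / n))
            for i, (name, value) in enumerate(ordered)]


# Hypothetical 2^2 example with made-up responses.
A = [-1, 1, -1, 1]
B = [-1, -1, 1, 1]
columns = {"A": A, "B": B, "AB": [a * b for a, b in zip(A, B)]}
y = [10.0, 14.0, 11.0, 19.0]

estimates = coefficient_estimates(columns, y)

# Estimates sorted in decreasing order, as in part (i).
for name, value in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:3s} {value:6.2f}")

# Coordinates for a normal probability plot, as in part (iii).
points = normal_plot_points(estimates)
for name, value, score in points:
    print(f"{name:3s} estimate = {value:6.2f}   normal score = {score:6.3f}")
```

Estimates that fall well off the straight line through the bulk of the points are the candidates for active effects. Note that the "effect" in the sense of the difference between the High and Low averages is twice the coefficient computed here.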

Problem 3. The same experimenter does a second experiment under different conditions with a similar set of 8 factors, which he also calls A B C D E F G H, on a response variable ZZ for a second industrial process. Again, he can afford to do 16 runs and decides to use a 2_{IV}^{8-4} fractional factorial design. The High/Low settings of the eight factors and the output that he measures are in the following table.

 Table 3. Output of an industrial experiment with High/Low
    settings of 8 factors
   Obs     A  B  C  D    E  F  G  H   ZZ
   ---------------------------------------
     1.   -1 -1 -1 -1   -1 -1 -1 -1   74.8
     2.    1 -1 -1 -1    1  1  1 -1   94.8
     3.   -1  1 -1 -1    1  1 -1  1   85.6
     4.    1  1 -1 -1   -1 -1  1  1  102.0
     5.   -1 -1  1 -1    1 -1  1  1   86.1
     6.    1 -1  1 -1   -1  1 -1  1   88.5
     7.   -1  1  1 -1   -1  1  1 -1  106.5
     8.    1  1  1 -1    1 -1 -1 -1   80.3
     9.   -1 -1 -1  1   -1  1  1  1   50.7
    10.    1 -1 -1  1    1 -1 -1  1   50.4
    11.   -1  1 -1  1    1 -1  1 -1   63.7
    12.    1  1 -1  1   -1  1 -1 -1   37.2
    13.   -1 -1  1  1    1  1 -1 -1   65.8
    14.    1 -1  1  1   -1 -1  1 -1  104.8
    15.   -1  1  1  1   -1 -1 -1  1   67.1
    16.    1  1  1  1    1  1  1  1  118.8

See the comments after Table 2 in Problem 2.

(i) Find the parameter estimates for all 15 effects in the design (other than the intercept) and sort them in decreasing order. Do any of the estimates appear to stick out on the high or low side?

(ii) Construct normal probability plots and P-P plots of the 15 effect estimates in part (i). Do any appear to be outliers? Which effects do they correspond to?

(iii) The outliers that you identified in part (ii) should be consistent with 3 active factors with the remaining factors inert. Analyze the corresponding 2^3 factorial design (with 2 observations per cell) for the data in Table 3 using the three active factors. Which of the main effects in this analysis are significant? Which of the interactions? Find the P-values of the significant effects. Is this consistent with your answers in part (ii)?

Problem 4. (i) Use the Box-Meyer program BoxMeyer.exe on the Math420 Web site to analyze the data in Table 2. Which 2^1, 2^2, or 2^3 submodels have the highest F-statistics? Which have the highest Bayesian posterior probabilities? What factors have the highest Bayesian posterior probabilities of being active? Are these results consistent with your answers to Problem 2?

(ii) Use the Box-Meyer program to analyze the data in Table 3. Which submodels have the highest F-statistics? Which have the highest Bayesian posterior probabilities? What factors have the highest Bayesian posterior probabilities of being active? Are these results similar to your answers to Problem 3?

(iii) In a 2_{IV}^{8-4} design, what are the other three two-way interactions that are confounded with (for example) CD? If, for example, CD is significant, is it easy to tell that this is due to CD and not to one of the other three interactions? (Hint: From two of the defining relations, G=ACD and H=BCD, one concludes CD=BH=AG, so that you just have to find one more.)

(iv) In part (ii), why do four different models have similarly high values for the F-statistic and for the Bayesian posterior probability, even though three of the four models have a factor that does not show up as significant in Problem 2? (Hint: In a 2_{IV}^{8-4} design, show that the relation (for example) G=ACD implies that the seven effects A,C,D,AC,AD,CD,ACD are identical with the seven effects A,C,G,AC,AG,CG,ACG after a permutation. This implies that the two models ACD and ACG would have the same F-statistics and the same Bayesian posterior probability.)

Problem 5. A different experimenter wants to test the effect of 4 factors, which she calls A B C D, on a response variable Yield associated with an industrial process. She can afford to do 12 runs with different High/Low settings of A B C D and decides to use a Plackett-Burman design. The results are in Table 4. Recall that only the first four factors (A B C D) were used for High/Low settings.

     Table 4. Output of an industrial experiment with High/Low
        settings of 4 factors

    Row    A    B    C    D    e    f    g    h    j    k    l   Yield
      1    1   -1    1   -1   -1   -1    1    1    1   -1    1    88.5  
      2    1    1   -1    1   -1   -1   -1    1    1    1   -1    37.2  
      3   -1    1    1   -1    1   -1   -1   -1    1    1    1   106.5  
      4    1   -1    1    1   -1    1   -1   -1   -1    1    1   104.8  
      5    1    1   -1    1    1   -1    1   -1   -1   -1    1    37.2  
      6    1    1    1   -1    1    1   -1    1   -1   -1   -1    80.3  
      7   -1    1    1    1   -1    1    1   -1    1   -1   -1    67.1  
      8   -1   -1    1    1    1   -1    1    1   -1    1   -1    65.8  
      9   -1   -1   -1    1    1    1   -1    1    1   -1    1    50.7  
     10    1   -1   -1   -1    1    1    1   -1    1    1   -1    94.8  
     11   -1    1   -1   -1   -1    1    1    1   -1    1    1    85.6  
     12   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1    74.8
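
The settings in Table 4 follow the cyclic pattern commonly used for 12-run Plackett-Burman designs: rows 2 through 11 are successive cyclic shifts of row 1, and row 12 is all -1. The following Python sketch is only an illustration (the function name is invented here, and the course programs are in SAS). It regenerates the 11 columns from row 1 of Table 4 and checks that every column is balanced and that every pair of columns is orthogonal.

```python
def plackett_burman_12(first_row):
    """Build a 12-run design: 11 cyclic right shifts of first_row plus a row of -1's."""
    rows = [list(first_row)]
    for _ in range(10):
        prev = rows[-1]
        rows.append([prev[-1]] + prev[:-1])   # move the last entry to the front
    rows.append([-1] * len(first_row))
    return rows


# Row 1 of Table 4, columns A B C D e f g h j k l.
FIRST_ROW = [1, -1, 1, -1, -1, -1, 1, 1, 1, -1, 1]
pb_design = plackett_burman_12(FIRST_ROW)


def column_dot(j, k):
    """Dot product of columns j and k of the design."""
    return sum(row[j] * row[k] for row in pb_design)


n_cols = len(FIRST_ROW)
balanced = all(sum(row[j] for row in pb_design) == 0 for j in range(n_cols))
orthogonal = all(column_dot(j, k) == 0
                 for j in range(n_cols) for k in range(j + 1, n_cols))

for row in pb_design:
    print(" ".join(f"{v:2d}" for v in row))
print("balanced columns:", balanced)
print("pairwise orthogonal:", orthogonal)
```

Because the 11 columns are mutually orthogonal, each of the 11 parameter estimates can be computed independently of the others, even though only the first four columns correspond to real factors.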

(i) Find the parameter estimates for all 11 effects in the design (other than the intercept) and sort them in decreasing order. (We would expect at least 7 of them to be random.) Construct normal probability plots and P-P plots of the 11 effect estimates. Are any outliers noticeable? Can extreme effects be easily picked out?

(ii) Run the Box-Meyer program on the data in Table 4, using all 11 columns as a control for the first 4 columns. What are the highest-ranking 2^1, 2^2, and 2^3 submodels of the 11 factors in terms of the highest F-statistic? What are they in terms of the highest Bayesian posterior probability? Do any factors stick out as having a noticeably larger posterior probability of being active?

(iii) Analyze a 2^3 design for the data in Table 4 using the three factors in part (ii) that have the highest posterior probabilities of being active. What effects in that design are significant? What are the P-values of the significant effects? Is this consistent with what you observed from the normal plots?
