Math 475 TakeHome Final - Fall 2010

Click here for Prof. Sawyer's home page

TAKEHOME FINAL due on or before Mon 12-20-2010 at 4 PM

NOTE: There should be NO COLLABORATION on the takehome final,
other than for the mechanics of using the computer.

Text references are to the textbook, Cody & Smith, ``Applied statistics and the SAS programming language'', 5th edn

NOTE: See the main Math475 Web page for how to organize a homework assignment or takehome test using SAS. In particular,
ALWAYS INCLUDE YOUR NAME in a title statement in your SAS programs, so that your name will appear at the top of each output page.
ALL HOMEWORKS MUST BE ORGANIZED in the following order:
(Part 1) First, your answers to all the problems in the homework, whether you use SAS for that problem or not. If the problem asks you to generate a graph or table, refer to the graph or table by page number in the SAS output (see below). (Xeroxing a page or two from the SAS output or cutting and pasting into a Word file or TeX source file is also OK.)
(Part 2) Second, all SAS programs that you used to obtain the output for any of the problems. If possible, similar problems should be done with the same SAS program. (In other words, write one SAS program for several problems if that makes things easier, using SAS title or title2 statements to separate the problems in your output.)
(Part 3) Third, all output for all the SAS programs in the previous step.
If an answer in Part 1 requires a table or a scatterplot that you need to refer to, make sure that your SAS output has overall increasing (unique) page numbers and make references to Part 3 by page number, such as ``The scatterplot for Problem 2 part (b) is on page #X in the SAS output below.'' DO NOT say, ``see Page 3 in the SAS output'' if Part 3 has output from several SAS runs, each of which has its own Page 3. In that case, either write your own (increasing) page numbers on the SAS output, or else (for example) refer to ``Page 2-7 in the SAS output'' (for page 7 in the second set of SAS output) and write page numbers in the format ``2-7'' at the top of pages in your output.
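
As an illustration of the title and page-number conventions above, a program might begin along these lines. This is only a sketch: the name and problem labels are placeholders, and resetting PAGENO= at the top of a program is an assumed way of controlling page numbering within a single run, not a requirement of the course.

```sas
/* Placeholder name and labels: substitute your own. */
options number pageno=1;   /* print page numbers, starting this run at page 1 */

title  "Math 475 Takehome Final - Your Name Here";

title2 "Problem 1";
/* ... procedures for Problem 1 go here ... */

title2 "Problem 2";        /* replaces the earlier title2; the main title stays in effect */
/* ... procedures for Problem 2 go here ... */
```

If the output comes from several separate runs, each run restarts its own page numbering, so the ``2-7'' style of reference described above is still needed.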

Different parts of problems may not be equally weighted.
Five (5) problems.

Problem 1. Heights and weights for the employees of VaporLock Software Services are recorded in Table 1. Each table entry has the height, weight, and sex for one employee, in that order. The employees of this company are known to be unusual.

     Table 1 --- Height, Weight, Gender for 79 VaporLock Employees

     69  149  M       66  189  M       82  134  M       60  144  F
     71  113  F       69   98  F       72  179  M       65  198  M
     58  147  F       74   83  F       61  125  F       69  191  M
     70   98  F       66   98  F       64  198  M       61  117  F
     68  191  M       68  105  F       74  137  M       72  181  M
     77  129  M       70  145  M       64  126  F       75  132  M
     78  139  M       75  149  M       74  138  M       74  135  M
     72   80  F       61  114  F       66  113  F       67  160  M
     73  150  M       70  115  F       72   91  F       61   90  F
     68   79  F       76  149  M       67   94  F       69   90  F
     59  104  F       61  118  F       69   86  F       68   95  F
     57  134  F       56  139  F       70  180  M       78  165  M
     68  114  F       73   88  F       58  124  F       63  121  F
     69  174  M       65  126  F       77  128  M       79  136  M
     66   92  F       67  136  F       66  123  F       78  149  M
     68  139  M       70   94  F       62  105  F       71  117  F
     65  112  F       77  148  M       70  177  M       59  125  F
     76  179  M       63  139  F       70   97  F       69   88  F
     76  170  M       72  143  M       71  143  M       80  135  M
     78  161  M       58  131  F       69  178  M
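
Since Table 1 has several employees per line, one possible way to read it is list input with a trailing @@. The sketch below types in only the first two rows of the table; the dataset and variable names (vaporlock, height, weight, sex) are illustrative choices, not names required by the assignment.

```sas
data vaporlock;
   /* The trailing @@ holds the current input line, so several
      (height, weight, sex) triples can be read from one line. */
   input height weight sex $ @@;
datalines;
69  149  M       66  189  M       82  134  M       60  144  F
71  113  F       69   98  F       72  179  M       65  198  M
;
run;

/* Check that the values were read as intended. */
proc print data=vaporlock;
run;
```

The remaining rows of Table 1 can be pasted in the same way. The last row has only three employees, which list input with @@ handles without any special treatment.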

(i) For the employees grouped into two samples by sex, what are the two sample sizes? What are the two sample means for weight? Is there a significant difference in weight between the two sexes? What is the value of the t-statistic for the classical two-sample t-test? What is the P-value?

The classical t-test assumes that the variances of the two samples are the same. Is this a reasonable assumption for the weights in this case? What is a P-value for a hypothesis based on this assumption? Does this P-value mean that it is safe to assume that the variances are the same, or the opposite?

(ii) What is the Pearson correlation coefficient between height and weight for the individuals in Table 1, ignoring sex? Is it significantly different from zero? What is the P-value?

How was the P-value for the Pearson correlation coefficient calculated? What is the number of degrees of freedom of the test statistic that is used to calculate the P-value? What is the formula that expresses this test statistic in terms of rho (the sample correlation coefficient)?

(iii) What are the Pearson correlation coefficients between height and weight for employees within each sex? Are they significant? Do they have the same sign as the correlation coefficient in part (ii)? How can the correlation coefficients have one sign within groups but a different sign for the two groups combined? Construct a height by weight scatter plot using sex as the plotting symbol to illustrate your answer.

Problem 2. Lengths and widths were measured for two types of aphids (a small beetle) collected in a semitropical country. The entries in Table 2 are the lengths and widths, respectively, for 56 aphids. Units are in tenths of millimeters.

       Table 2 --- Lengths and widths (0.1mm) for 56 aphids.
     Type A (n=17):
        258  237      273  226      287  210      289  231
        304  237      309  207      311  237      314  234
        319  197      330  216      333  185      335  187
        342  189      352  195      357  200      365  201
        371  185
     Type B (n=39):
        239  241      256  228      260  213      266  207
        271  226      273  187      278  230      280  220
        281  183      284  200      286  191      291  214
        292  233      293  199      296  195      296  205
        300  228      302  200      303  198      303  203
        307  215      312  191      318  229      321  181
        322  193      322  193      322  219      323  197
        326  217      328  178      328  190      330  178
        335  187      339  175      340  191      346  177
        346  183      358  178      360  177

(i) (5 pts) Based on these samples, is there a significant difference in lengths between the two samples of beetles? Is there a significant difference in the widths? What is the number of degrees of freedom of the t-distribution in both cases? Based on the sample means, does the longer type of aphid also tend to be wider?

(ii) (15 pts) Is there a significant difference in length and width together (that is, in the vectors (length,width)) between the two types of aphids? Use a MANOVA test to find out. (A MANOVA test for two treatment groups is called a Hotelling T^2 test.)

What is the value of the equivalent F statistic? How many degrees of freedom does this F statistic have in the numerator and how many in the denominator? What is the P-value? (Hint: See the discussion in Nreading.sas and Ncoffee.sas on the Math475 Web site, and also in HotLizards.sas. In all three SAS example input files, the MANOVA code is at the end.)
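
For orientation, a two-group MANOVA in SAS is commonly run with PROC GLM and a MANOVA statement. The sketch below is not taken from the example files named in the hint. The names aphids, type, len, and wid are assumptions, and only the first row of each type from Table 2 is typed in.

```sas
data aphids;
   /* One (type, length, width) triple per aphid, several per line. */
   input type $ len wid @@;
datalines;
A 258 237   A 273 226   A 287 210   A 289 231
B 239 241   B 256 228   B 260 213   B 266 207
;
run;

proc glm data=aphids;
   class type;
   /* NOUNI suppresses the separate univariate ANOVAs for len and wid. */
   model len wid = type / nouni;
   /* Multivariate test of the type effect on (len, wid) jointly. */
   manova h=type;
run;
quit;
```

With the full data entered, the MANOVA output lists several multivariate statistics, each with an F value and its numerator and denominator degrees of freedom.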

Problem 3. A manufacturing company with four factories wants to control the number of defects in the main product that it manufactures. As a first step, the company wants to know where most of the variation of the defects is located: among factories, among groups (workgroups) working within the same factory, or from month to month within the same workgroup.

The company collects observations on the number of products with defects in a sample of 100 products for three randomly chosen different months from workgroups within the four factories. The data is collected in the following table.

      Table 3 --- Product defects by factory and workgroup
      Factory1:
          Group1  30  15  17    Group2  22  6  31    Group3  21  26  15
      Factory2:
          Group1  32  30  32    Group2  31  27  21    Group3  32  31  35
          Group4  27  50  36    Group5  21  29  34
      Factory3:
          Group1  20  30  29    Group2  21  27  21    Group3  28  23  33
          Group4  14  14  25
      Factory4:
          Group1  20  26  26    Group2  23  19  17    Group3  36  30  32
          Group4  11  29  14    Group5  17  22  35

Note that ``Group1'' does not refer to the same group in different factories, but only to the first workgroup from that factory that happened to send data to the parent company. (Treat the three observations for each workgroup as independent and identically distributed samples for that workgroup.)

(i) Using within-workgroup variation to estimate the error, was there significant variation in the numbers of defects over the 12 or more workgroups in the study, ignoring the factories that contain them? What is the P-value? What are the degrees of freedom of the resulting F statistic?

(ii) Analyze the appropriate ANOVA model taking into account both groups and factories. Is there significant variation in the number of defects by factory? Is there significant variation by workgroups within factories? What are the P-values in each case? What are the degrees of freedom of the two F statistics?
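
A nested layout like Table 3 is often entered with one line per workgroup and then expanded to one observation per month. The sketch below shows one common way to do that and to fit a model with workgroups nested within factories. The names defects, ndefect, and month are assumptions, and the use of a TEST statement is one choice among several that SAS offers, not necessarily the one intended in the course examples.

```sas
data defects;
   input factory group d1-d3;
   array d{3} d1-d3;
   /* Expand the three monthly counts into three observations. */
   do month = 1 to 3;
      ndefect = d{month};
      output;
   end;
   keep factory group month ndefect;
datalines;
1 1 30 15 17
1 2 22  6 31
1 3 21 26 15
2 1 32 30 32
2 2 31 27 21
2 3 32 31 35
2 4 27 50 36
2 5 21 29 34
3 1 20 30 29
3 2 21 27 21
3 3 28 23 33
3 4 14 14 25
4 1 20 26 26
4 2 23 19 17
4 3 36 30 32
4 4 11 29 14
4 5 17 22 35
;
run;

proc glm data=defects;
   class factory group;
   /* group(factory): workgroup labels are meaningful only within a factory. */
   model ndefect = factory group(factory);
   /* Test factory against workgroup-within-factory instead of residual error. */
   test h=factory e=group(factory);
run;
quit;
```

Writing group(factory) rather than group alone reflects the note above that ``Group1'' in one factory is unrelated to ``Group1'' in another.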

(iii) What are the MSS (Mean Sum of Squares) values for within-workgroup variation, between-factory variation, and variation between workgroups within factories? Do these values appear consistent with your answers to part (ii)?

(iv) Is there significant variation in the number of defects by factory, ignoring any group structure within each factory? (That is, assuming that everybody in a factory is in the same workgroup.) What is the P-value? What are the degrees of freedom of the F statistic? Why is this P-value different from the P-value for factory in part (ii)?

(v) For the analysis in part (ii), which pairs of factories produced output that was significantly different in quality? Use the Tukey multiple-comparison procedure to find out. Does any one factory stand out?

Problem 4. An engineer is interested in the running temperature of a mechanical device as a function of three variables: Heat-shield type, with two levels (H1,H2), Fan size, with three types (F1,F2,F3), and heat baffle type, with five levels (B1,B2,B3,B4,B5). One observation of the running temperature is made for each set of levels of the three variables. The running temperatures are listed in Table 4.

       Table 4 --- Running Temperatures of a Device
                 B1     B2     B3     B4     B5
       F1  H1   199    175    187    169    189
           H2   196    196    221    196    244
       F2  H1   203    182    178    181    193
           H2   176    179    217    245    244
       F3  H1   177    173    178    184    174
           H2   166    204    207    205    284

(i) Run a full-factorial model for H=Heat-shield type, F=Fan size, and B=Baffle type on the data in Table 4. Since there is only one observation per cell, you will not obtain any P-values, but you can compare the MS (mean sum-of-squares) terms for the 7 effects in a full-factorial model with three factors. Which three effects have the largest MS terms? Which three have the smallest MS terms?

(ii) Using H=Heat-shield type as the major variable, run a split-plot analysis on the three factors H, F=Fan size, and B=Baffle type. Recall that this procedure tests significance of the main effects of H, F, and B and the two interactions H*F and H*B by using the sum of the SS terms for F*B and H*F*B, or equivalently of the nested SS term F*B(H).

Does this seem like a reasonable procedure in this case, given the relative size of the MS terms for F*B and H*F*B in part (i)? In the resulting analysis, which of the five effects being tested are significant? What are the P-values of the significant effects? Construct an interaction plot for each interaction that is significant, and interpret the interaction plot.

(iii) Note that F and all of its second and higher-order interactions are non-significant in part (ii), and all have relatively small MS terms in part (i). Use this information to declare that the factor F is ``inactive'' and remove it from the model. This is equivalent to assuming that values for different levels of F for a fixed (H,B) cell are independent replications for that cell. Run a full factorial model for H and B (only) on the data in Table 4. This should give you P-values for H, B, and H*B, since there are only 10 (H,B) cells and there are 30 values in Table 4. Which of the effects for H, B, and H*B are now significant? What are their P-values? How does this compare with your results in part (ii)?

(iv) Again assuming that F=Fan size is inactive as in part (iii), compare the 10 (H,B) combinations using the Tukey multiple-comparison procedure. Which pairs of these 10 combinations are significantly different? Does any pair stand out? (Hint: Do a one-way ANOVA on the pairs (H,B), ignoring the factor F. You can define a variable for (H,B) pairs by proceeding as in the last SAS procedure in ThreeWay.sas on the Math475 Web site, or else by following one of the suggestions on pages 219-220 in the textbook.)
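
One possible way to build a single variable for the (H,B) pairs is sketched below. It is not necessarily the method used in ThreeWay.sas or in the textbook. The names temps, temp, and hb are assumptions, and the CATS function is used here simply because it is a convenient way to paste the two level numbers into one label.

```sas
data temps;
   /* Table 4 is laid out with F as the outer row block, H within F,
      and B across the columns, so nested DO loops recover the levels. */
   do f = 1 to 3;
      do h = 1 to 2;
         do b = 1 to 5;
            input temp @@;
            hb = cats('H', h, 'B', b);   /* combined label such as H1B3 */
            output;
         end;
      end;
   end;
datalines;
199 175 187 169 189
196 196 221 196 244
203 182 178 181 193
176 179 217 245 244
177 173 178 184 174
166 204 207 205 284
;
run;

proc glm data=temps;
   class hb;
   model temp = hb;          /* one-way ANOVA on the 10 (H,B) cells */
   means hb / tukey;         /* Tukey multiple comparisons of cell means */
run;
quit;
```

Because F is ignored in the model, the three fan sizes within each (H,B) cell act as the replications for that cell.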

Problem 5. A study of nerve fibers is made for 5 normal and 5 diabetic rats. The experimenter wants to learn how the cross-sectional areas of the nerve fibers of a particular nerve vary with the diabetic state, and also how this varies with the position along the nerve fibers (Proximal, Medial, or Distal). For definiteness, let Group be the factor whose levels are Normal (labeled Control in Table 5) and Diabetic, and NvLoc a factor with levels Proximal, Medial, and Distal. The nerve cross-sectional areas for the 10 rats are in Table 5.

   Table 5 --- Cross-Sectional Areas of Nerve Fibers in 10 Rats
 Group      Subj   Proximal  Medial  Distal
 Diabetic     1      529      446     373
 Diabetic     2      604      455     404
 Diabetic     3      523      500     378
 Diabetic     4      504      392     390
 Diabetic     5      518      486     375
 Control      6      394      360     513
 Control      7      352      395     529
 Control      8      370      317     571
 Control      9      261      370     586
 Control     10      348      400     530
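
Table 5 has one row per rat with three measurements, while an ANOVA procedure usually expects one observation per measurement. A minimal sketch of that rearrangement follows; the names nerves, subj, nvloc, and area are assumptions chosen to match the factor names in the problem statement.

```sas
data nerves;
   length group $ 8 nvloc $ 8;
   input group $ subj proximal medial distal;
   /* Write one observation per (rat, location) combination. */
   nvloc = 'Proximal'; area = proximal; output;
   nvloc = 'Medial';   area = medial;   output;
   nvloc = 'Distal';   area = distal;   output;
   keep group subj nvloc area;
datalines;
Diabetic   1  529  446  373
Diabetic   2  604  455  404
Diabetic   3  523  500  378
Diabetic   4  504  392  390
Diabetic   5  518  486  375
Control    6  394  360  513
Control    7  352  395  529
Control    8  370  317  571
Control    9  261  370  586
Control   10  348  400  530
;
run;

/* 30 observations are expected: 10 rats times 3 locations. */
proc print data=nerves;
run;
```

In this long layout, group, subj, and nvloc can all be listed in a CLASS statement.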

(i) Are the two factors Group and NvLoc crossed, nested, or neither? If they are nested, which is nested under which?

The data also has a third Subject factor, for the 10 individual rats. Is Subject crossed with Group? nested within Group? crossed with NvLoc? nested within NvLoc? Why?

(ii) Run a full factorial ANOVA model to test Group, NvLoc, and their interaction. Use nested Subject effects in the standard way to carry out the tests. (Hint: See the comments in NestedSubj2Fac.sas for the appropriate decomposition of a full factorial ANOVA model in this case. See comments in NCoffee.sas, NReading.sas, or the text for the ``standard way'' to test effects in nested subject models with one observation per cell.)

Which of these three effects are significant? highly significant? For the significant effects, what are the P-values, and what are the degrees of freedom in the numerator and denominator for the F-distributions involved?

(iii) Display an interaction plot for the two principal factors, Group and NvLoc, with the factor with the larger number of levels on the X-axis. Is an interaction suggested? Why? Is it significant?

(iv) Which of the levels of the main effects of Group and NvLoc are distinct, using Tukey's method to allow for multiple comparisons? Which are larger? What do the levels of the main effect of NvLoc mean? Are they averages for normal rats? diabetic rats? or both?