Math 475 Homework 5 - Fall 2005

  • Click here for Prof. Sawyer's home page

    HOMEWORK #5 due 12-6

    Organize your homework in the following manner:
    (i) your answers to all questions,
    (ii) all of your SAS programs, followed by
    (iii) all of your SAS output.
    Add page numbers to your homework so that you can make references from part (i) to part (iii): for example, so that you can say things like, "The answer in part (a) is 17. The scatterplot for part (b) is on page #Y below." Include your SAS output even if you don't refer to it explicitly. Except for forward references like these, a grader should not have to look beyond part (i) of your homework unless he or she thinks that you have done something wrong.
    Include your name in a title statement so that your name will appear at the top of each output page.
    If a problem asks you to do a statistical test, EXPLAIN CLEARLY what the null hypothesis H_0 is, what test you used, what the P-value is, and whether the data is significant, highly significant, or neither. Include this as part of your answer in part (i).
    
    

    1.   A warehouse manager is comparing motorized carts from three different manufacturers with the idea of purchasing one of the brands. She is primarily interested in the time (Y) that operators take to fetch and deliver a load in a cart. She also keeps track of the weight of each load in case that has a confounding effect. Trial runs are made for 15 loads for each motorized cart, for a total of 45 trial runs. Forty-five (45) different operators were used. The times and weights for the 45 trial runs were
           Table 1 - Times and Weights for Motorized Carts
        A  42  104    A  38   79    A  47   75    A  44   95    A  51  102
        A  44  107    A  54  110    A  39   98    A  44  106    A  56  101
        A  56  120    A  43   88    A  50   99    A  59  122    A  52   99
        B  56  107    B  42   85    B  49   98    B  54  106    B  44   88
        B  48  110    B  40   93    B  46  104    B  45   87    B  44   86
        B  44  101    B  46   86    B  46   87    B  62  121    B  55   80
        C  51   87    C  47   92    C  62   97    C  57  117    C  43   85
        C  66  120    C  59  101    C  52  115    C  57  107    C  46   99
        C  53  109    C  54   99    C  46   91    C  41   72    C  55  105

    Each triple of values in Table 1 denotes the cart type (one of three values A,B,C), the delivery time for that load (Y), and the weight of the load (X).
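    (Hint: One way to read Table 1 into SAS is sketched below. The data set name carts and the variable name Time are assumptions made for illustration; Carttype and Weight are the names suggested in part (i). The trailing @@ on the input statement lets SAS read several triples from each data line.)

```sas
/* Sketch only: the names carts and Time are illustrative assumptions. */
title 'YOUR NAME - Math 475 Homework 5';

data carts;
    input Carttype $ Time Weight @@;   /* @@ holds the line for more triples */
    datalines;
A  42  104    A  38   79    A  47   75    A  44   95    A  51  102
A  44  107    A  54  110    A  39   98    A  44  106    A  56  101
A  56  120    A  43   88    A  50   99    A  59  122    A  52   99
B  56  107    B  42   85    B  49   98    B  54  106    B  44   88
B  48  110    B  40   93    B  46  104    B  45   87    B  44   86
B  44  101    B  46   86    B  46   87    B  62  121    B  55   80
C  51   87    C  47   92    C  62   97    C  57  117    C  43   85
C  66  120    C  59  101    C  52  115    C  57  107    C  46   99
C  53  109    C  54   99    C  46   91    C  41   72    C  55  105
;
run;

/* Check that all 45 trial runs were read correctly. */
proc print data=carts;
run;
```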

    (i) Is there a significant variation in the cart brands, as measured by time efficiency, NOT allowing for load weight? (That is, ignoring the weight measurements. For definiteness, use Carttype for cart brand and Weight for the weight.) What is the P-value for the ANOVA? What is the model R^2?
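    (Hint: A one-way ANOVA of this kind can be run with proc glm, as in the minimal sketch below. It assumes that Table 1 is already in a data set called carts with variables Carttype, Time, and Weight; the names carts and Time are hypothetical.)

```sas
/* Sketch: assumes data set carts (hypothetical name) with     */
/* variables Carttype, Time (hypothetical name), and Weight.   */
proc glm data=carts;
    class Carttype;
    model Time = Carttype;    /* cart brand only; Weight is not in the model */
run;
quit;
```

    The overall F test and the R-Square value appear near the top of the proc glm output.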

    (ii) Do the carts vary significantly in efficiency, ALLOWING for load weight and cart brand, but NOT allowing for load weight and cart brand interactions, listing cart brand before weight in the regression for definiteness? What is the P-value for the ANOCOVA? What is the model R^2?
    Which appears to have a stronger effect on the times, the load weight or the type of cart? Which of the two are significant in the Type I table? What are their P-values? Which are significant in the Type III table? What are their P-values? Why are the P-values different between the two tables?
    Is the effect of cart brand significant in the Type I table? in the Type III table?
    Note that the significance of cart brand in the presence of a load weight correction would be a statistically valid measure of quality of cart brand, as long as the warehouse manager is convinced that the distribution of load weights in Table 1 is the same as she would encounter in practice.
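    (Hint: The analysis of covariance in part (ii) can be sketched as follows, again assuming a data set carts with variables Carttype, Time, and Weight; the names carts and Time are hypothetical. The ss1 and ss3 options ask proc glm for the Type I and Type III tables, which it also prints by default.)

```sas
/* Sketch: assumes data set carts (hypothetical name) with     */
/* variables Carttype, Time (hypothetical name), and Weight.   */
proc glm data=carts;
    class Carttype;
    /* Carttype is listed before Weight, as part (ii) requests. */
    model Time = Carttype Weight / ss1 ss3;
run;
quit;
```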

    (iii) Is there a significant interaction between the effect of weight and the type of cart, ALLOWING for load weight and cart brand interaction in the regression, using "interaction" in the usual sense for ANOCOVA models? Which effects are significant in the Type I table? in the Type III table? Do your conclusions change from part (ii)?
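    (Hint: In proc glm, an interaction between a class variable and a covariate is written with an asterisk. A minimal sketch, under the same assumption of a data set carts with variables Carttype, Time, and Weight, where carts and Time are hypothetical names:)

```sas
/* Sketch: assumes data set carts (hypothetical name) with     */
/* variables Carttype, Time (hypothetical name), and Weight.   */
proc glm data=carts;
    class Carttype;
    /* Carttype*Weight allows a separate Weight slope for each cart brand. */
    model Time = Carttype Weight Carttype*Weight / ss1 ss3;
run;
quit;
```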

    (iv) Find the (Pearson) correlation coefficients for time (Y) versus Weight within each cart type. (That is, run proc corr with by Carttype.) Do the within-cart-type correlation coefficients vary? Also, construct a scatterplot of time versus weight with cart type as the plotting symbol. Do these conclusions affect your answer to part (ii)?
    What would you advise the warehouse manager about the best choice among the three brands of cart?
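    (Hint: The within-brand correlations and the scatterplot in part (iv) can be produced along the following lines. This is a sketch that assumes a data set carts with variables Carttype, Time, and Weight; the names carts and Time are hypothetical. A by statement needs the data sorted by the by variable first. Here proc plot is one possible choice for the scatterplot; other plotting procedures would also work.)

```sas
/* Sketch: assumes data set carts (hypothetical name) with     */
/* variables Carttype, Time (hypothetical name), and Weight.   */
proc sort data=carts;
    by Carttype;              /* required before using by Carttype below */
run;

proc corr data=carts;
    by Carttype;              /* one correlation table per cart brand */
    var Time Weight;
run;

proc plot data=carts;
    /* The first character of Carttype (A, B, or C) is the plotting symbol. */
    plot Time*Weight = Carttype;
run;
quit;
```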
    
    

    2.   Variables AA, BB, and CC were measured for 32 subjects. Of these subjects, 12 later developed Condition X while the remaining 20 did not develop Condition X. The data are listed in Table 2.

    An experimenter is interested in finding which of the variables AA, BB, and CC are significantly related to developing Condition X. The experimenter is also interested in finding a rule that, given the values of AA, BB, and CC for a subject, predicts the probability that that subject will later develop Condition X.
              Table 2 - Covariates AA BB CC for 32 subjects
        that later either developed or did not develop Condition X
    
        Developed Condition X         Did NOT develop Condition X
       Subj   AA     BB     CC        Subj    AA      BB      CC
         1    69     83     51         13     36      55      39    
         2    51     74     32         14     50      69      44    
         3    27     68     33         15     36      59      28    
         4    55     85     46         16     31      26      44    
         5    27     99     34         17     31      49      47    
         6    44     68     38         18     32      45      50    
         7    49     88     57         19     40      59      33    
         8    28     64     66         20     49      51      42    
         9    32     58     46         21     38      70      47    
        10    47     81     39         22     46      63      26    
        11    35     77     31         23     46      64      47    
        12    30     69     62         24     67      94      43    
                                       25     47      60      56    
                                       26     56      62      45    
                                       27     39      64      27    
                                       28     52      71      24    
                                       29     33      62      52    
                                       30     57      63      48    
                                       31     39      78      23    
                                       32     48      70      55  
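    (Hint: One way to read Table 2 into SAS is sketched below. The data set name xnotx is the one used in the proc discrim hint in Problem 2; the group variable name CondX and its values Yes and No are assumptions made for illustration. Subjects 1-12 developed Condition X and subjects 13-32 did not.)

```sas
/* Sketch only: the variable CondX and its values are illustrative assumptions. */
data xnotx;
    input Subj AA BB CC @@;   /* @@ reads several subjects per data line */
    if Subj <= 12 then CondX = 'Yes';
    else CondX = 'No';
    datalines;
 1 69 83 51    2 51 74 32    3 27 68 33    4 55 85 46
 5 27 99 34    6 44 68 38    7 49 88 57    8 28 64 66
 9 32 58 46   10 47 81 39   11 35 77 31   12 30 69 62
13 36 55 39   14 50 69 44   15 36 59 28   16 31 26 44
17 31 49 47   18 32 45 50   19 40 59 33   20 49 51 42
21 38 70 47   22 46 63 26   23 46 64 47   24 67 94 43
25 47 60 56   26 56 62 45   27 39 64 27   28 52 71 24
29 33 62 52   30 57 63 48   31 39 78 23   32 48 70 55
;
run;
```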

    (i) Use the data in Table 2 to find a linear discriminant function

    L(data) = c_0 + c_1 AA + c_2 BB + c_3 CC

    with the property that L(data) > 0 predicts Condition X. Use SAS's default assumptions for proc discrim, namely that the variables AA, BB, and CC have a joint normal distribution within each group with the same covariance matrix in both groups, and begin with a prior belief of 0.50 that a randomly chosen subject has Condition X. (Hints: (a) See plogistic.sas on the Math475 Web site. (b) Look for Linear Discriminant Function in the SAS output for proc discrim followed by Coefficient Vector = COV(-1)Xbar_j or something similar. The coefficients in the linear discriminant function are the differences between the covariate coefficients for the two groups. The cutoff value is given by the difference between the Constant values.)
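    The construction described in hint (b) can be written out as follows. This is a sketch under the stated assumptions of equal priors and a common (pooled) covariance matrix S; the subscripts X and N (for the groups with and without Condition X) and the vector x = (AA, BB, CC) are notation introduced here for illustration.

```latex
% Sketch, assuming equal priors (0.50 each) and a pooled covariance matrix S.
% The group labels X and N and the symbol S are notation introduced here.
\begin{align*}
L_j(x) &= -\tfrac12\,\bar{x}_j^{T} S^{-1} \bar{x}_j + \bar{x}_j^{T} S^{-1} x,
  \qquad j \in \{X, N\}, \\
L(x) &= L_X(x) - L_N(x) = c_0 + c_1\,AA + c_2\,BB + c_3\,CC, \\
c_0 &= -\tfrac12\left(\bar{x}_X^{T} S^{-1} \bar{x}_X
       - \bar{x}_N^{T} S^{-1} \bar{x}_N\right),
  \qquad (c_1, c_2, c_3)^{T} = S^{-1}\left(\bar{x}_X - \bar{x}_N\right).
\end{align*}
```

    Here the first term of each L_j corresponds to the Constant for group j and the vector multiplying x corresponds to the Coefficient Vector for group j, so L(x) > 0 exactly when the Condition X group has the larger linear function.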

    Have SAS print out the means and standard deviations of each covariate within each group. (Hint: The option simple tells SAS to do this, as in (for example) proc discrim data=xnotx simple ....;. Alternatively, you can use proc means.) Which covariates seem to be the most divergent between the two groups as judged by the within-group means and standard deviations?
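    (Hint: A minimal proc discrim call for part (i) might look like the sketch below. It assumes that Table 2 is in the data set xnotx with a group variable called CondX; the name CondX is hypothetical. The priors equal statement corresponds to a prior of 0.50 for each of the two groups.)

```sas
/* Sketch: assumes data set xnotx with variables AA, BB, CC and */
/* a group variable CondX (hypothetical name).                  */
proc discrim data=xnotx simple;   /* simple prints within-group means and std devs */
    class CondX;
    var AA BB CC;
    priors equal;                 /* prior probability 0.50 for each group */
run;
```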

    (ii) Which variables have the highest coefficients in the discriminant function? Is this consistent with your answer to the previous part? Are the signs of these coefficients consistent with the differences in within-group means? That is, are large (or positive) values in a covariate suggestive of Condition X, or small (or negative) values?

    (iii) Using the data in Table 2 as a test data set, how many of the subjects are incorrectly classified? (This is called a Resubstitution analysis.) If you enter crossvalidate on the proc discrim command line, then SAS will also do a crossvalidation procedure in which each subject is classified on the basis of the discrimination rule defined by the other subjects, NOT INCLUDING that subject itself. (The resubstitution compares each subject with the rule for all subjects, including the subject itself, which influences the rule about what group that subject should belong to.) How does the number of misclassified subjects change under this crossvalidation?
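    (Hint: The crossvalidation run can be sketched as follows, assuming the data set xnotx with a group variable called CondX; the name CondX is hypothetical. The listerr and crosslisterr options are optional; they ask SAS to list the misclassified subjects under resubstitution and under crossvalidation.)

```sas
/* Sketch: assumes data set xnotx with variables AA, BB, CC and */
/* a group variable CondX (hypothetical name).                  */
proc discrim data=xnotx crossvalidate listerr crosslisterr;
    class CondX;
    var AA BB CC;
    priors equal;
run;
```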
    
    

    3. For the data in Table 2,

    (i) Use a logistic regression to predict the probability of developing Condition X given values of AA, BB, and CC.
    Is there an overall statistically significant effect of the three covariates together on whether or not a subject develops Condition X? What is the P-value? (If more than one test is available, pick one of them.) What is the number of degrees of freedom of the chi-square statistic?
    (Hint: See plogistic.sas on the Math475 Web site.)
    (ii) Which of the three variables AA, BB, and CC individually have a significant effect on the probability of developing Condition X in the logistic regression? Which have a highly significant effect? For the variables that have significant effects, what is the P-value for each? For each variable with a significant effect, does increasing the value of that variable make Condition X more likely to occur, or less likely? How can you tell from the output?
    (iii) Are your answers to part (ii) consistent with the means of the variables in the two groups? That is, if increasing a covariate also increases the probability of Condition X, is this consistent with the mean of that covariate being higher among the records with Condition X?
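    (Hint: A logistic regression for Problem 3 can be set up along the following lines. This is a sketch that assumes the data set xnotx with a character group variable CondX taking the values Yes and No; the name CondX and its values are hypothetical, and plogistic.sas may organize things differently. The event= option tells proc logistic to model the probability of CondX='Yes' rather than CondX='No'.)

```sas
/* Sketch: assumes data set xnotx with variables AA, BB, CC and   */
/* a group variable CondX with values 'Yes' and 'No' (assumed).   */
proc logistic data=xnotx;
    /* Model the probability that a subject develops Condition X. */
    model CondX(event='Yes') = AA BB CC;
run;
```

    The overall tests appear in the output under Testing Global Null Hypothesis: BETA=0, and the tests for the individual covariates appear in the table of maximum likelihood estimates.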
    
    
