Math 475 Homework 1 - Fall 2010


    HOMEWORK #1 due Thursday 9-16

    Text references are to the textbook, Cody & Smith, ``Applied statistics and the SAS programming language''

    NOTE: See the main Math475 Web page for how to organize a homework assignment using SAS. In particular,
            ALWAYS INCLUDE YOUR NAME in a title statement in your SAS programs, so that your name will appear at the top of each output page.
            ALL HOMEWORKS MUST BE ORGANIZED in the following order:
            (Part 1) First, your answers to all the problems in the homework, whether you use SAS for that problem or not. If the problem asks you to generate a graph or table, refer to the graph or table by page number in the SAS output (see below). (Xeroxing a page or two from the SAS output or cutting and pasting into a Word file or TeX source file is also OK.)
            (Part 2) Second, all SAS programs that you used to obtain the output for any of the problems. If possible, similar problems should be done with the same SAS program. (In other words, write one SAS program for several problems if that makes things easier, using SAS title or title2 statements to separate the problems in your output.)
            (Part 3) Third, all output for all the SAS programs in the previous step.
            If an answer in Part 1 requires a table or a scatterplot that you need to refer to, make sure that your SAS output has overall increasing (unique) page numbers and make references to Part 3 by page number, such as ``The scatterplot for Problem 2 part (b) is on page #X in the SAS output below.'' DO NOT say, ``see Page 3 in the SAS output'' if Part 3 has output from several SAS runs, each of which has its own Page 3. In that case, either write your own (increasing) page numbers on the SAS output, or else (for example) refer to ``Page 2-7 in the SAS output'' (for page 7 in the second set of SAS output) and write page numbers in the format ``2-7'' at the top of pages in your output.
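
    The title conventions above might look like the following skeleton. This is only a sketch: the name ``Jane Doe'' and the choice of problems are placeholders, not part of the assignment.

```sas
/* Hypothetical skeleton: replace Jane Doe with your own name. */
/* The first title line is printed at the top of every output page. */
title 'Math 475 Homework 1 - Jane Doe';

/* title2 sets a second title line; resetting it before each problem */
/* separates the problems in the output. */
title2 'Problem 1';
/* data step and procs for Problem 1 go here */

title2 'Problem 4';
/* data step and procs for Problem 4 go here */
```

    Because title and title2 are global statements, each stays in effect until it is replaced by a later statement of the same kind.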

    
    

    1.   (Similar to Problem #3-7 in the text) Some summary statistics for the occurrence of asthma and SES (socioeconomic class) are

                   Asthma       Yes       No
                   -------------------------
                   LowSES        39      101
                   HighSES       29      137
     
    Create a SAS data set from these data and test the hypothesis of independence of rows and columns. For this table, what is the Pearson chi-square P-value? The P-value for the two-sided Fisher exact test? On the basis of these data, do you accept or reject the hypothesis that there is no association between SES and Asthma?
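
    One possible way to set up this table is sketched below, assuming the counts are entered as one record per cell and weighted in proc freq. The data set name asthma and the variable names ses, asthma, and count are illustrative choices, not names required by the assignment.

```sas
title2 'Problem 1: asthma by SES';

data asthma;
    /* @@ holds the input line so that several (ses, asthma, count) */
    /* triples can be read from a single line of data. */
    input ses $ asthma $ count @@;
    datalines;
LowSES Yes 39   LowSES No 101
HighSES Yes 29  HighSES No 137
;
run;

proc freq data=asthma order=data;
    weight count;                /* each record stands for count individuals */
    tables ses*asthma / chisq;   /* for 2x2 tables, chisq also prints the Fisher exact test */
run;
```

    The order=data option keeps the rows and columns in the order in which they were entered rather than sorting them alphabetically.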
    
    

    2.   Twenty-five (25) individuals volunteered for a study. Confidential identifiers for the 25 individuals are given in the following table.

             Table 1. Confidential identifiers for 25 volunteers
                A10     B33     C22     D61     E88
                F91     G21     H42     I37     J19
                K90     L30     M98     N48     O11
                P77     Q07     R54     S18     T31
                U45     V11     W71     X76     Y32
     
              (i) Write a SAS program to randomly assign 8 of these individuals to a treatment group and the remaining 17 individuals to a control group. (Hint: See the sample program randsamp.sas on the Math475 Web site, and perhaps the use of @@ as in the sample program ctable.sas.)

              (ii) Which 8 individuals did you (or your program) assign to the treatment group? List their (confidential) identifiers in alphabetical order.
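
    A minimal sketch of one way to make the random assignment follows. It attaches a uniform random number to each identifier, sorts by that number, and takes the first 8 records as the treatment group. This is not necessarily the method used in randsamp.sas, and the seed 475 and the data set and variable names are arbitrary choices made for illustration.

```sas
title2 'Problem 2: random assignment';

data volunteers;
    input id $ @@;
    u = ranuni(475);   /* uniform(0,1) random number; the seed 475 is arbitrary */
    datalines;
A10 B33 C22 D61 E88 F91 G21 H42 I37 J19
K90 L30 M98 N48 O11 P77 Q07 R54 S18 T31
U45 V11 W71 X76 Y32
;
run;

proc sort data=volunteers;
    by u;              /* sorting by u puts the identifiers in random order */
run;

data assigned;
    set volunteers;
    length group $ 9;
    if _n_ <= 8 then group = 'Treatment';   /* first 8 in random order */
    else group = 'Control';                 /* remaining 17 */
run;

proc sort data=assigned;
    by group id;       /* identifiers in alphabetical order within each group */
run;

proc print data=assigned;
    var group id;
run;
```

    A different seed gives a different assignment, so the answer to part (ii) depends on the seed used.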

    
    

    3.  The following data were gathered for the 47 current employees of Vaporlock Computer Services. The individuals working at this company are considered to be odd in some respects.

              Table 2.   Height (inches),  Weight (pounds),  and Sex (Gender) for 47 employees:

         67  123  F       67  143  M       69  174  M       64  127  F
         61  116  F       70  159  M       71  142  M       66  146  F
         61  128  F       59  139  F       65  127  F       69  172  M
         64  166  M       63  120  F       69  166  M       67  152  F
         62  153  F       60  152  F       66  168  M       66  155  M
         71  145  M       64  164  M       72  168  M       64  123  F
         64  135  F       68  158  M       63  159  M       71  177  M
         65  158  M       63  169  M       60  139  F       71  177  M
         65  150  F       63  145  M       62  141  F       64  118  F
         64  168  M       66  151  F       68  171  M       63  158  M
         63  146  M       68  149  M       66  162  M       68  144  F
         61  131  F       72  179  M       62  142  F
     
              (i)  Construct a scatter plot of heights (Y-variable) by weights (X-variable) using sex as the plotting symbol. Do the heights and weights appear to be correlated? (That is, do taller individuals appear also to be heavier?) Do heights and weights appear to be correlated within each sex; that is, for Fs only or for Ms only?
              (Hint: See the code using proc plot; in the sample program List1.sas on the Math475 Web site. The scatter plot will look better if you precede the code with options ps=40;, as in List1.sas.)

              (ii)  Are heights and weights significantly correlated for the 47 employees? What is the P-value? Are heights and weights significantly correlated within each sex? What are the two P-values?
              (Hint: You can use proc corr; to find the Pearson correlation coefficient between two SAS variables as well as the P-value for H_0:rho=0, where rho is the true correlation coefficient. Use proc corr; by sex; ....; to find the same information within strata defined by sex. (WARNING: If you use proc corr; with ``by sex;'' the data set MUST FIRST BE SORTED by sex. See the example program randsamps.sas on the Math475 Web site for an example of sorting.) See either Chapter 5 in the text or the SAS documentation for proc corr;. The P-values in the SAS output are based on the fact that sqrt(n-2) r/sqrt(1-r^2) has a Student-t distribution with n-2 degrees of freedom if r is the Pearson correlation coefficient between two independent normal samples.)
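
    The steps in parts (i) and (ii) could be sketched as follows. The data set name employees and the variable names height, weight, and sex are illustrative assumptions; the data lines are copied from Table 2.

```sas
title2 'Problem 3: height and weight by sex';
options ps=40;   /* shorter pages make the printer plot look better */

data employees;
    input height weight sex $ @@;
    datalines;
67 123 F  67 143 M  69 174 M  64 127 F
61 116 F  70 159 M  71 142 M  66 146 F
61 128 F  59 139 F  65 127 F  69 172 M
64 166 M  63 120 F  69 166 M  67 152 F
62 153 F  60 152 F  66 168 M  66 155 M
71 145 M  64 164 M  72 168 M  64 123 F
64 135 F  68 158 M  63 159 M  71 177 M
65 158 M  63 169 M  60 139 F  71 177 M
65 150 F  63 145 M  62 141 F  64 118 F
64 168 M  66 151 F  68 171 M  63 158 M
63 146 M  68 149 M  66 162 M  68 144 F
61 131 F  72 179 M  62 142 F
;
run;

proc plot data=employees;
    plot height*weight=sex;   /* vertical*horizontal, with the value of sex as plotting symbol */
run;
quit;

proc corr data=employees;     /* all 47 employees together */
    var height weight;
run;

proc sort data=employees;     /* required before any by sex statement */
    by sex;
run;

proc corr data=employees;     /* one correlation table for each sex */
    by sex;
    var height weight;
run;
```

    In the proc corr output, the P-value for H_0:rho=0 is printed directly beneath each correlation coefficient.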

    
    

    4.   A total of 2000 observations are made of individuals that can have any of three different levels of Zubricity (A,B,C) and any of four different levels of Income. The counts are

                               Income
                               1     2     3     4
                        A     66    98   127   180
         Zubricity      B    111   136   170   228
                        C    168   193   240   283
     
              (i) Is there an association between Zubricity and Income in this table? Have SAS do the Pearson chi-square test to find out. What is the number of degrees of freedom? What is the P-value? How did SAS calculate the number of degrees of freedom?

              (ii) What is the P-value for the Mantel-Haenszel (trend) chi-square test? What is its number of degrees of freedom? Why is the P-value different from part (i)? What is this test designed to detect? That is, what alternative H_1 is the test sensitive to?
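
    A sketch of one way to enter the 3 by 4 table and request both tests is given below. The data set name zub and the variable names zubricity, income, and count are assumptions made for illustration.

```sas
title2 'Problem 4: Zubricity by Income';

data zub;
    /* one (zubricity, income, count) triple per cell of the table */
    input zubricity $ income count @@;
    datalines;
A 1  66  A 2  98  A 3 127  A 4 180
B 1 111  B 2 136  B 3 170  B 4 228
C 1 168  C 2 193  C 3 240  C 4 283
;
run;

proc freq data=zub;
    weight count;
    /* chisq prints the Pearson chi-square and, in the same table, */
    /* the Mantel-Haenszel chi-square. */
    tables zubricity*income / chisq;
run;
```

    Both statistics appear in the same block of output, each with its own degrees of freedom and P-value, so parts (i) and (ii) can be answered from a single run.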

    
    

    5.   Observations are made of individuals that can have any of three different levels of Ablativeness (A,B,C) and five different ranges of height, which are referred to as the Height Index. The counts are

                  
                                Height Index
                                1     2     3     4     5
                          --------------------------------
                          A    29    27    25    39    24  
         Ablativeness     B    20    20    21    21    22  
                          C    27    21    34    10    11
    
     
              (i) Is there an association between Ablativeness and Height Index? Have SAS do the Pearson chi-square test on the 3 by 5 table to find out. What is the P-value? What is the number of degrees of freedom? How did SAS calculate the number of degrees of freedom?

              (ii) If the P-value for the Pearson chi-square test in part (i) is significant, is the significance due to deviations from independence among all 15 cells in the table, or does the departure from independence appear to be due to deviations at only two or three cells? If the latter, which cells appear to be responsible for the lack of independence?
              (Hint: The Pearson chi-square statistic Q_P is the sum of (Obs-Expec)^2/Expec over 3x5=15 cells. If Q_P is large, it could be because all 15 summands are large, or it could be due to a few large terms with the remainder less than 2-3 or so. Note that if the underlying probabilities are consistent with independence of Ablativeness and Height Index, then each of the summands (Obs-Expec)^2/Expec should have a distribution that is approximately chi-square with one degree of freedom.
              To have SAS display the values of (Obs-Expec)^2/Expec for each cell, use the option cellchi2 in a table statement in proc freq (as in table A*B / chisq cellchi2).)
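
    The cellchi2 hint might be applied as in the sketch below. The data set name abl and the variable names ablativeness, height_index, and count are illustrative assumptions, and the extra options expected, nopercent, norow, and nocol are optional additions that only change what is displayed in each cell.

```sas
title2 'Problem 5: Ablativeness by Height Index';

data abl;
    /* one (ablativeness, height_index, count) triple per cell of the table */
    input ablativeness $ height_index count @@;
    datalines;
A 1 29  A 2 27  A 3 25  A 4 39  A 5 24
B 1 20  B 2 20  B 3 21  B 4 21  B 5 22
C 1 27  C 2 21  C 3 34  C 4 10  C 5 11
;
run;

proc freq data=abl;
    weight count;
    /* cellchi2 shows each cell contribution (Obs-Expec)^2/Expec; */
    /* expected shows the expected count under independence;      */
    /* nopercent, norow, and nocol suppress the percentages.      */
    tables ablativeness*height_index / chisq cellchi2 expected nopercent norow nocol;
run;
```

    With the percentages suppressed, each cell of the printed table shows only the observed count, the expected count, and the cell chi-square, which makes the large summands easier to spot.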