*****************************************************;
* Examples of Tests Based on Contingency Tables:
*   (i) Tests for 2x2 tables
*   (ii) Tests for 2x5 tables, including Mantel-Haenszel trend test
*   (iii) Clever ways to read a 2x5 table
*   (iv) A two-sample t-test for tabled data
*****************************************************;

title 'Examples with contingency tables - YOURNAME';
options ls=75 ps=60 nocenter pageno=1;

*****************************************************;
* Read a 2x2 table and do a number of statistical tests;
* Note the use of @@ to read triples of numbers.       ;
*****************************************************;

data table1;
  input sex$ income$ num @@;
  datalines;
M Rich 20   M Poor 10   F Rich 80   F Poor 90
;

* Display the table;
proc print;
  title2 'A simple 2x2 table with 4 entries';
  title3 'The data as SAS sees it:';
run;

*********************************************************;
* See below for comments about the next procedure.
* The option / chisq tells SAS to perform a number of different
* statistical tests
*********************************************************;

proc freq order=data;
  title3 '(Two one-way tables followed by a 2x2 table)';
  tables sex income;
  table sex*income / chisq;
  weight num;
run;

*********************************************************;
* SAS's contingency-table analyses are based on `proc freq', which
* was originally designed to tabulate records in large databases.
* The command `weight num' tells SAS to count each record in
* `table1' `num' times. Otherwise, SAS would just tabulate the
* four individual records in `table1', which would lead to a very
* uninteresting table.
*
* The first `tables' statement tabulates counts for `sex' and
* `income' individually. The second `table' creates a 2x2
* contingency table. The option / chisq tells SAS to carry out
* a number of different statistical tests.
*
* By default, SAS displays rows and columns ordered alphabetically
* by row and column headings, instead of in the order encountered.
* This makes sense for a dataset with millions of entries and
* perhaps dozens of row and column types, since otherwise it might
* be difficult to find particular rows and columns.
*
* For small tables, you generally want to determine the row and
* column orders yourself in terms of the order in the data in your
* data step. The option `order=data' tells SAS to order rows and
* columns in the output in the order that they are first
* encountered.
*
* If `order=data' were left out, the displayed 2x2 table would be
* arranged with columns in the order `Poor Rich' and rows as
* `F M', since those are the correct alphabetical orders.
*
* The next example has the contingency table in the datalines block
* in a more intuitive form and uses `output' statements to write
* two dataset records for each row of input. The `output' command
* tells SAS to write a line to the output dataset immediately
* using the current values of the variables. See `twosamp.sas'
* for further discussion of the `output' command.
*
* In this example, SAS writes two records for each row that it
* reads from the `datalines' block, for a total of 4 records.
* The `sex' variable is the same for both records written
* from that row. The result is the same 4 records as before.
*********************************************************;

data table2;
  input sex$ num1 num2;
  income='Rich'; count=num1; output;
  income='Poor'; count=num2; output;
  datalines;
M 20 10
F 80 90
;

proc print data=table2;
  title2 'The data as SAS sees it:';
  title3 '2x2 table read using output commands';
run;

*********************************************************;
* A 2x5 Contingency Table:
*
* The notation num1-num5 is a handy general SAS abbreviation for
* num1 num2 num3 num4 num5
*
* Each `input' statement reads 6 values from the `datalines'
* block and then writes 5 output records.
* The resulting dataset will have 10 records from 2 rows of input.
*********************************************************;

data table3;
  input sex$ num1-num5;
  income=1; count=num1; output;
  income=2; count=num2; output;
  income=3; count=num3; output;
  income=4; count=num4; output;
  income=5; count=num5; output;
  datalines;
F 10 12 19 17 20
M 20 15 12 14 10
;

proc print;
  title2 'Data for a 2x5 table';
  title3 'The data as SAS sees it:';
run;

************************************************************;
* Note the use of " delimiters below instead of ' in the title3
* command in the next procedure. The usual ' delimiters would
* interfere with the quote characters in `Mantel-Haenszel'. In
* general, you can delimit text strings with either " or ' , in
* which case the other character is treated as any other text
* character.
*
* Can you see why the Mantel-Haenszel P-value is much smaller?
************************************************************;

proc freq order=data;
  title3 "NOTE that the P-value for the `Mantel-Haenszel' test";
  title4 " is MUCH MORE significant than the others.";
  title5 "Can you see why?";
  table sex*income / chisq;
  weight count;
run;

*********************************************************;
* USING SAS ARRAYS TO SIMPLIFY A DATA STEP:
*
* The repetition of 5 lines of the form
*     income= count=num output;
* suggests that we should be able to do this in terms of a loop of
* some kind.
*
* In fact, we can use `SAS arrays' within a SAS `do loop' to simplify
* code. SAS arrays are actually arrays of pointers to variables
* rather than arrays of variables themselves, so that arrays in SAS
* have to be handled slightly differently than in most other
* computer languages.
*
* The command `array nn(*) num1-num5' below defines nn(1) - nn(5)
* to be the same as num1 - num5, but with an index that can
* be used within a loop.
*
* The arguments after `array nn(*)' could be any list of SAS
* variables.
* For example
*
*     array nn(*) k num1-num5 wombat apple k1-k20;
*
* This defines nn(1)=k, nn(2)=num1, nn(3)=num2, ..., nn(7)=wombat,
* nn(8)=apple, ..., up to nn(28)=k20. This allows all of these
* variables to be referred to as nn(i) within a loop.
*
* You can define new variables in an array statement and also assign
* them values at the same time. For example,
*
*     array cc(*) $ zz1-zz3 ( 'Apple' 'Banana' 'Cabbage' );
*
* defines new text variables zz1,zz2,zz3, then assigns zz1='Apple',
* zz2='Banana', and zz3='Cabbage', and finally assigns array values
* cc(1)=zz1='Apple', cc(2)=zz2, etc. for use in a loop. Note the
* use of round parentheses ( ) to enclose initializers of
* variables. The $ after cc(*) means that the variables will be
* text variables. Alternatively, you can leave out the variable
* names zz1-zz3 and just say
*
*     array cc(3) $ ( 'Apple' 'Banana' 'Cabbage' );
*
* This syntax creates 3 new variables named cc1-cc3 and then
* assigns cc(1)=cc1='Apple', cc(2)=cc2='Banana' etc. You cannot
* say `cc(*)' in this syntax, since SAS cannot figure out in
* advance how many variables there will be.
*
* In the `do' loop below, SAS carries out all of the commands until
* the following `end' statement for each value of `inc'.
*
* In particular, `end' just means the end of the `do' loop, and does
* NOT say to end the data step. Later we will discuss a command
* `stop' that tells SAS to immediately exit the data step.
*********************************************************;

data table3a;
  input sex$ num1-num5;
  * Define array pointers for num1-num5 ;
  array nn(*) num1-num5;
  * Define array pointers for column headings ;
  array colhead(*) $ ch1-ch5 ( 'Lev1' 'Lev2' 'Lev3' 'Lev4' 'Lev5' );
  * Use a `do' loop to write 5 output lines.
  * Note that `inc' is now one of the SAS variables in the output dataset.;
  do inc=1 to 5;
    count=nn(inc);
    colh=colhead(inc);
    output;
  end;
  datalines;
F 10 12 19 17 20
M 20 15 12 14 10
;

proc print;
  title2 'The same 2x5 table using a SAS array';
  title3 ' with more explicit column headings';
  title4 'The data as SAS sees it:';
  title5 'Showing the variables sex colh count inc (income) only';
  var sex colh count inc;
run;

proc freq order=data;
  title4 'Table output:';
  table sex*colh / chisq;
  weight count;
run;

************************************************************;
* COMBINING a contingency test and a Student t-test:
* Testing H_0:E(X)=E(Y) for tabled values:
*
* Note that the table
*
*     F   10 12 19 17 20   Sum: 78
*     M   20 15 12 14 10   Sum: 71
*
* is also a TABLE of values: specifically, a table of income levels
* 1,2,3,4,5 for 78 individuals in one sample (F) and 71 individuals
* in a second sample (M).
*
* For the 78+71=149 values in the table, we can ask whether the
* (expected) mean of the F values differs significantly from the
* mean of the M values. More exactly, assuming that the tabled F
* values are samples from a generic random variable X and the
* tabled M values from Y, do we have enough evidence to reject
*
*     H_0:E(X)=E(Y) ?
*
* The example `twosamp.sas' considered the same question for two
* smaller samples using two Student t-tests and a nonparametric
* (Wilcoxon rank-sum) test. Can we apply the same tests here?
*
* Note that the traditional contingency-table test is for ANY
* DEVIATION from independence (meaning from sameness of
* proportions of the rows). This gives the test less power to
* detect ANY PARTICULAR DEVIATION from independence, such as
* a difference in mean values. This is analogous to the
* difference between the Pearson chi-square test and the
* Mantel-Haenszel trend chi-square test.
*
* To apply a Student's t-test to the 149 values that are implicit
* in the table, we must first convert the data to a dataset with
* 149 rows with data values 1-5 in one column and sample
* designators (M or F) in a second column. This is done in the
* following SAS data step. Note the use of `set table3' to
* tell SAS to read rows for `data table4' from `table3' instead
* of from a (nonexistent) `datalines' block. Given a count from
* the 2x5 table, a `do' loop is used to repeat a record with the
* same `sex' and `income' values that many times.
*
* The resulting dataset `table4' will have columns for
* `sex income count num1-num5 k'. All we need is `sex' and
* `income'. The remaining columns do no harm, but could take
* up large amounts of space if the dataset had millions of
* records rather than 149. To tidy up, we remove the unneeded
* columns `k count num1-num5' from `table4' by using a `drop'
* statement.
************************************************************;

data table4;
  set table3;   * Read Sex$ num1-num5 income count;
  do k=1 to count; output; end;   * Repeat the record;
  * The resulting dataset will have 149 rows, so it is ;
  * worthwhile to drop the columns that we do not need.;
  * The remaining columns will be Sex$ and income (=1,2,3,4,5);
  drop k count num1-num5;
run;

************************************************************;
* We now have the data in the form of two samples (78 Fs and
* 71 Ms), and can (i) draw a histogram, (ii) do a two-sample
* t-test, and (iii) do a Wilcoxon rank-sum test.
*
* Note that the Wilcoxon P-value is very close to the
* Mantel-Haenszel trend P-value.
************************************************************;

proc chart;
  title2 '2x5 table stored as 2 numerical samples';
  title3 'Ms and Fs for males and females';
  vbar income / subgroup=sex;
run;

proc ttest;
  title2 'Two-sample t-test for M versus F';
  class sex;
  var income;
run;

proc npar1way wilcoxon;
  title2 'Two-sample Wilcoxon rank-sum test for M vs F';
  class sex;
  var income;
run;
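
************************************************************;
* A hedged sketch (not part of the original analysis): one way to
* see where the Mantel-Haenszel trend statistic comes from.
* For scored rows and columns it equals (n-1)*r**2, where r is the
* Pearson correlation between the row and column scores and
* n=78+71=149 is the total count. It has 1 degree of freedom,
* while the Pearson chi-square for a 2x5 table has 4.
*
* The names table4n, sexnum, corrout and mhcheck are made up for
* this illustration. The sketch assumes that table4 from the data
* step above still exists, with one row per individual. Any two
* distinct numeric scores for sex give the same value of r**2.
************************************************************;

data table4n;
  set table4;
  sexnum = (sex='M');   * 1 for M and 0 for F;
run;

* Save the Pearson correlations to a dataset instead of printing them;
proc corr data=table4n noprint outp=corrout;
  var sexnum income;
run;

data mhcheck;
  set corrout;
  where _type_='CORR' and _name_='sexnum';
  r = income;                     * Correlation of sexnum with income;
  n = 149;                        * Total count from the 2x5 table;
  qmh = (n-1)*r**2;               * Trend chi-square with 1 df;
  pvalue = 1 - probchi(qmh, 1);   * Upper-tail chi-square probability;
  keep r n qmh pvalue;
run;

proc print data=mhcheck;
  title2 'Mantel-Haenszel trend statistic recomputed as (n-1)*r**2';
run;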