***************************************************;
* Two-sample tests: Suppose that for two samples
*
*     Xi   9.19 9.54 8.65 7.31 8.47 9.78
*     Yj   8.73 8.17 6.40 6.31 7.09 7.99 5.89 6.38 8.24
*
* we want to test whether or not there is a difference in the means.
* That is, if the Xi come from a source with generic distribution X
* and the Yj have generic distribution Y, we want to test
*
*     H0: E(X) = E(Y)
*         versus
*     H1: E(X) not equal to E(Y)
*
* Two commonly-used statistical tests for a difference in means or
* medians between two samples are
*     the two-sample Student's t-test and
*     the Wilcoxon rank sum test
*
* Both tests implicitly assume that the variances are the same
* and test if the samples differ by an additive shift.
*
* These two tests can be done by `proc ttest' and `proc npar1way'
* in SAS, respectively.
*
* Both procedures need a dataset with numerical values in one column
* and sample designators (like 'X' and 'Y' or 'Male' and 'Female')
* in another column. The following shows how to do this.
*
* NOTES ABOUT THE SAS CODE BELOW THAT DEFINES THE DATA:
*
* In a statistical comparison of Xs and Ys, it might seem more logical
* to have Xs in one column and Ys in another. However, this approach
* does not generalize easily to more complex settings in which
* observations may have several different attributes, such as
* `X or Y' or `Up or Down' or `Rich or Poor' or `Hot or Cold'.
*
* Nearly all SAS procedures require observed (or dependent) values in
* one column and sample designators or attributes or covariates in
* one or more other columns. Here we call the sample values `Dat',
* for both X and Y values. The sample designator `Samp' is a text
* variable that takes on the values `X' and `Y'.
*
* Normally, `input ...' in a SAS data step tells SAS to read one line
* of data, store the values in one or more variables, and then go on
* to the next line of data.
* This would be inefficient in our case
* since the `datalines' block would have to have 15 lines instead of
* three lines. The `trailing @@' in the statement
*
*     input Samp$ Dat @@
*
* tells SAS to use the first two values (words) on the line for the
* first record in the SAS dataset and keep the rest of that line in
* memory. The next `input Samp$ Dat @@' statement then reads the
* next two values for the second output record and so forth.
*
* If we left out the `trailing @@' in the input statement, the
* program would only read the first pair of values on each line in
* the datalines block, and we would end up with only one X value
* and two Y values.
*
***************************************************;

title 'Two-sample tests: YOUR NAME';
options ls=75 ps=60 pageno=1 nocenter;

data twosamp;
    * Samp stands for X or Y;
    input Samp$ Dat @@;
datalines;
X 9.19 X 9.54 X 8.65 X 7.31 X 8.47 X 9.78
Y 8.73 Y 8.17 Y 6.40 Y 6.31 Y 7.09 Y 7.99
Y 5.89 Y 6.38 Y 8.24
run;

title2 'The data as SAS sees it';
proc print;
run;

***************************************************;
* In general, one should always try to visualize data before carrying
* out any tests, to make sure that the data does not contain any
* unpleasant surprises. The following summarizes the data in two
* different bar charts.
***************************************************;

title2 'Visualizing the data so that we can see what is going on';
proc chart;
    vbar Dat / subgroup=Samp;
    vbar Dat / group=Samp;
run;

* Finally, carry out a two-sample t-test;
title2 'Two-sample t-test';
proc ttest;
    class Samp;
    var Dat;
run;

* and the Wilcoxon rank-sum test;
title2 'Two-sample Wilcoxon Rank Sum Test';
proc npar1way wilcoxon;
    class Samp;
    var Dat;
run;

***************************************************;
* In the example above, we repeated the sample designator (X or Y)
* before each value in the datalines block.
* This wasn't too much
* trouble in this case, but might have been if the sample names
* were (for example) `TestDrug#1117Alpha' and
* `Placebo_Rubinoflaxin_Type_7'.
*
* Later in the course we will have data sets where each data value
* has 4 or 5 different sample designators.
*
* We now show two different ways of writing a data step so that we
* do not have to repeat the sample designator for each value.
*
* NOTES FOR THE FIRST (`SWITCH') METHOD:
*
* The idea is as follows. We read from the datalines block one word
* at a time (`input zz$ @@'). If the word is a sample designator
* (X or Y), we use it to set a switch (i.e., Samp = zz = 'X' or
* 'Y'). The command `retain Samp' tells SAS to remember the value
* of Samp from line to line. SAS normally resets each variable to
* `missing' just before it reads a new line.
*
* If zz is not X or Y, then `zz' is a number, but has been read by
* the command `input zz$ @@' as a text string. The command
*     Dat = input(zz,12.0)
* converts the text in zz to the number `Dat'. SAS cannot use text
* and numerical variables interchangeably. If text is to be used as
* a number, it must first be stored in a numerical variable.
*
* The result in the output data set will be a series of records with
* two fields `Samp' and `Dat', suitable for proc ttest or
* proc npar1way. Each time that the word read is an X or Y, the
* record stored will have a missing value of `Dat'. However, most
* SAS procedures ignore records with missing values in key fields,
* so that records with Dat=`missing' are ignored in the output.
*
**********************************************************;

title2 'ALTERNATIVE DATA STEPS: Reading data using a switch';
data twosamp2;
    retain Samp;
    input zz$ @@;
    if zz='X' or zz='Y' then Samp=zz;
    else Dat = input(zz,12.0);
datalines;
X 9.19 9.54 8.65 7.31 8.47 9.78
Y 8.73 8.17 6.40 6.31 7.09 7.99 5.89 6.38 8.24
run;

title3 'What the data looks like';
title4 'The important fields are Samp and Dat';
title5 'Note the two records with Dat=Missing';
proc print;
run;

**********************************************************;
* Since `twosamp2' has the same data as `twosamp', it will have
* exactly the same output for `proc ttest' and `proc npar1way'.
*
* As a last example, suppose that the Xs are in one column of a
* datalines block and the Ys are in a second column. This is not
* the usual way to do things in SAS, but we can do it this way as
* an exercise. Since we have 3 more Y-values than X-values, we
* include explicit `missing values' (`.') for the X values
* that do not match.
*
* For each line of input, we write one `Dat' record for the X and
* one for the Y using explicit `output' commands. In general,
* `output' tells SAS to write a record to the output SAS dataset
* using the current values of all variables. (Unless you have used
* a `retain' statement, each variable is reset to `missing' before
* each line is read.)
*
* In general, if you do not say `output' explicitly in a data step,
* then SAS provides an `output' command implicitly at the end of
* the data step. However, if you ever say `output' explicitly,
* then SAS writes to the output dataset ONLY when you say `output'.
*
* The code below writes 18=2*9 records to store 6+9=15 usable
* records, as you can tell from the output of `proc print'.
*
**********************************************************;

title2 'READING DATA IN TWO COLUMNS';
data twosamp3;
    input xx yy;
    Samp='X'; Dat=xx; output;
    Samp='Y'; Dat=yy; output;
datalines;
9.19 8.73
9.54 8.17
8.65 6.40
7.31 6.31
8.47 7.09
9.78 7.99
.    5.89
.    6.38
.    8.24
run;

title3 'The records are alternating by sample designator';
title4 'Note the missing Dat values in the last three X records.';
proc print;
run;

**********************************************************;
* Since `twosamp3' has the same data as `twosamp', there is
* no need to rerun `proc ttest' and `proc npar1way'.
**********************************************************;
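
**********************************************************;
* A POSSIBLE VARIATION (a sketch, not one of the examples above):
*
* The two-column data step writes records with missing `Dat' and
* relies on the procedures to ignore them. If those records are
* unwanted, one option is to make each `output' conditional, so
* that a record is written only when a value was actually read.
*
* The dataset name `twosamp4' and the `keep' statement are
* assumptions made for this illustration only. Under these
* assumptions the dataset should hold just the 6+9=15 usable
* records, with only the fields `Samp' and `Dat'.
**********************************************************;

title2 'VARIATION: Skipping missing values with a conditional output';
data twosamp4;
    input xx yy;
    * Write the X record only if an X value was read on this line;
    Samp='X'; Dat=xx; if not missing(Dat) then output;
    * Same check for Y, although every line here has a Y value;
    Samp='Y'; Dat=yy; if not missing(Dat) then output;
    * Drop the helper variables xx and yy from the output dataset;
    keep Samp Dat;
datalines;
9.19 8.73
9.54 8.17
8.65 6.40
7.31 6.31
8.47 7.09
9.78 7.99
.    5.89
.    6.38
.    8.24
run;

title3 'Only records with a nonmissing Dat are written';
proc print;
run;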