/* Finding exact P-values for permutation tests: */ /* Wilcoxon signed rank and rank sum tests */ /* Random number generators (RNGs) on a computer: The simplest good RNGs are `Linear Congruential RNGs', which are based on a single integer: unsigned int rseed; One first STARTS or `seeds' the RNG by setting rseed=MyStartingValue; Thereafter, successive random numbers are generated by setting in sequence rseed = A*rseed + B for optimally chosen `magic constants' A and B, with overflows in the 32-bit unsigned integer `rseed' ignored. In other words, rseed = A*rseed + B (modulo 2^32) The RNG in statlib.c uses one of a set of constants suggested for 32-bit unsigned integers by Donald Knuth in his book Seminumerical Algorithms: A = 1664525 and B = 1 The idea is that, after a few random numbers have been generated, `rseed' attains and remains near 2^{32}. Thereafter, each successive choice of a number X_n = A*rseed + B overflows the 32-bit integer. This causes most of the previous high-order bits of `rseed' to be lost. Thus if we follow the rule that we only trust the higher-order bits of successive values of `rseed', then the sequence of `rseed' values obtained in this way should be random and uniformly distributed among values 0 <= rseed <= 2^{32} - 1. This in fact turns out to be the case, in the sense that these sequences pass a large number of reasonable statistical tests for joint independence and for having a uniform distribution. It is rare that random numbers are used in the form `rseed'. A function that is more useful is int nrand(int m); which returns a random integer X that is uniformly distributed in the range 0,1,2,...,m-1 for m>0. Successive calls to nrand() return independent values. The code for nrand() in statlib.c is essentially int nrand(int m) { if (m<=1) return 0; // Return 0 for errors in m rseed = A*rseed + B; return (m * (rseed >> 16)) >> 16; } The operation `(x >> 16)' for an integer x shifts all of its bits by 16 bit locations to the right (i.e. towards lower-order bits) while replacing higher-order bits by zero. Thus `m * (x >> 16)' yields an `overflow' integer in the range 0,1,...,m-1 in the highest 16 bits while changing or churning up the lowest-order 16 bits. The next operation `(m * (x >> 16)) >> 16' erases the previous lower-order 16 bits and leaves only a value in the range 0,1,2,...,m-1. Note that these operations return an integer 0,1,...,m-1 using only the high-order bits of `rseed'. In fact, the lower-order bits of `rseed' are not random with nice statistical properties. Since A ends with 5 and B=1, the integer rseed itself will always be equal to 1 or 6 modulo 10. However, the high-order bits of the successive values of `rseed' will cause successive calls to nrand(m) to pass all reasonable statistical tests for independent random variables with a uniform distribution among 0,1,2,...,m-1. Another useful function is double drand(void); whose code in statlib.c is essentially double drand(void) { rseed = A*rseed + B; return (double)rseed/(2^{32}); } This scales the unsigned integer `rseed' to a double-precision floating-point number that is uniformly distributed in the range 0<=X<=1.0. (Note that the definition guarantees that always X<1.0 for X=drand(), which is occasionally useful.) Successive random numbers based on `rseed' cannot be independent for arbitrarily long runs. In fact, updates to `rseed' will repeat themselves indefinitely once it sees a value of `rseed' that it has seen before. Thus successive values of `rseed' must cycle with a period that divides 2^{32}. The choice of (A,B) above has a cycle length of 2^{32} approx 4,000,000,000. In fact, long simulations show that `nrand()' begins to show statistically significant periodic behavior at around 300,000,000 successive values. The standard numerical recipes book Press etal (``Numerical Recipes in C'') suggests that simple LCRNGs (like the one described above) not be used for more than 100,000,000 successive random numbers. All RNG functions in statlib.c increase a count `rand_nums_used' that can be reported by the function void show_randnums_used(void); in statlib.c. During the last time that Math408 was taught, none of the sample programs used more than 10,000,000 random numbers, and most used less than 1,000,000. A REASON TO KEEP TRACK OF RANDOM-NUMBER SEEDS: Note that when rseed=X_0 is set (or `SEEDED') at the beginning of a program, then successive `random values' of rseed X_1, X_2, X_3, ..., X_n, ... are strictly determined by X_0 no matter how large n is. This is extremely useful in scientific programming: If you want to run a simulation again and have the program calculate more statistics and print out more information, you can start the RNG with exactly the same seed. The random numbers X_1, X_2, X_3, ... will then be identical to what was generated before and your new program will reproduce the results of the old program exactly along with your new output. This is very useful for testing new versions of a program, since even `random' output starting with the same seed will be identical to before. SEEDING YOUR RNG: The protect the innards of the RNG package from the user (and allow a new RNG to be swapped in with minimal change), all uses of `rseed' are done through functions instead of working with `rseed' directly. In fact, `rseed' and `rand_nums_used' are `static global' variables in statlib.c that can be accessed only in statlib.c. This means that these values can be changed or read only by functions in statlib.c. For example, the RNG is initialized at MyStartingValue by the function call startseed=setrand(MyStartingValue); The code for setrand() in statlib.c essentially sets rseed=MyStartingValue, but also does several other useful things. For example, if MyStartingValue=0, a value is requested from the computer's system clock and that is used as the seed, as a way of getting a `truly random' starting value. The actual seed used is returned as `startseed' above. The value startseed!=MyStartingValue if MyStartingValue=0 (in which case `startseed' is a clock value). If your program displays `startseed', then you can repeat the same simulation by using `startseed' even if your first run seeded the RNG from the system clock. WHAT IF YOU NEED MORE THAN 300,000,000 RANDOM NUMBERS? In that case, you can use a `long-cycle' RNG. Several excellent RNGs of this type can be found on the Gnu Software Library Web site. */ #include #include #include #include #include "statlib.h" #define PROG_DATE "February 8, 2007" #define OUTPUT_NAME "RanksSims.txt" #define STARTSEED 123456 /* Starting random-number seed */ #define NSIMS 10000 /* Number of permutation simulations */ /* Starting random-number seed for the RNG and the number of simulations */ /* These are stored in global variables for ease of customization: */ /* For example, you could improve the program to change these */ /* variables from the command line */ unsigned long int wseed=STARTSEED; int nsims=NSIMS; /* Prototypes of functions to evaluate the two tests */ void do_signedranks(void); void do_ranksums(void); int main(void) { printf ( "\nProgram to illustrate finding exact P-values for permutation tests\n" "Wilcoxon signed rank and rank sum tests\n" "(This is %s on the Ma408 Web site.)\n", OUTPUT_NAME); printf ("Written by XXXX\n"); do_signedranks(); do_ranksums(); /* Show how many random numbers were used */ show_randnums_used(); /* No errors to report */ return 0; } /* WILCOXON SIGNED-RANK DATA (from onesamp_ranks.c) */ int nnsr=9; double bef[20] = { 1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30 }; double aft[20] = { 0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29 }; /* Signed-rank differences, ranks, and signs (from onesamp_ranks.txt) */ double xdif[20] = { -0.952, 0.147, -1.022, -0.430, -0.620, -0.590, -0.490, 0.080, -0.010 }; double srank[20] = { 8, 3, 9, 4, 7, 6, 5, 2, 1 }; int srsign[20] = { 0, 1, 0, 0, 0, 0, 0, 1, 0 }; double wsr_exact_Pvalue=2*(double)10/512; /* Subroutine to display a sign */ char *showsign(int ss) { if (ss>0) return "+"; else return "-"; } void do_signedranks(void) { int i, ns,nsuc; unsigned long int startseed; double meanwsrank, wsrankobs; double pp,pwid; printf ("\nWILCOXON SIGNED-RANK TEST:\n"); printf ("\nSigned rank (After-Before) data:\n"); printf ("Data (n=%d):\n", nnsr); for (i=0; i0) wsrankobs=wsrankobs+srank[i]; /* Center wsrankobs at its expected mean for a 2-sided P-value */ /* `(double)' is a CAST: It forces the next expression to be */ /* a double, not an int */ /* This is necessary to divide by 4 without doing integer division, */ /* which would truncate a fraction */ meanwsrank=(double)(nnsr*(nnsr+1))/4; printf ("Observed signed-rank statistic: %g Expected value: %g\n", wsrankobs, meanwsrank); wsrankobs=fabs(wsrankobs-meanwsrank); printf ("Centered at mean for a 2-sided P-value:\n"); printf (" Observed (centered) Wilcoxon signed-rank statistic: %g\n", wsrankobs); printf ("Exact permutation-test P-value: P = %g (2-sided)\n", wsr_exact_Pvalue); printf ( "\nEstimating the true P-value by doing %d random permutations:\n", nsims); /* Initial seed of random-number generator: */ /* If wseed=0, then the RNG is seeded from the system clock, */ /* and startseed!=0 is the true seed */ /* Otherwise, startseed=wseed */ startseed=setrand(wseed); printf ("Starting random-number seed: %lu\n", startseed); printf ("Exact P-value: %g\n", wsr_exact_Pvalue); /* Do nsims simulations */ nsuc=0; /* The number with rand(Wsrank) >= Obs(Wsrank) */ for (ns=1; ns<=nsims; ns++) { double randwsr; /* Randomize the sign array */ for (i=0; i0) randwsr=randwsr+srank[i]; randwsr=fabs(randwsr-meanwsrank); /* Center wsrankobs at its expected mean for a 2-sided P-value */ /* A `success' is a more extreme (or same) value of wsrank */ if (randwsr>=wsrankobs) nsuc++; /* Go on to next simulation */ } printf ("%d `success(es)' in %d trials\n", nsuc,nsims); printf (" (That is, randomized centered scores >= %g)\n", wsrankobs); printf ("Simulated P-value: %d/%d = %g\n", nsuc,nsims, (double)nsuc/nsims); printf ("95%% CONFIDENCE INTERVAL FOR TRUE P-VALUE:\n"); printf ("(The middle value is the estimated P-value.)\n"); pp=(double)nsuc/nsims; pwid=1.960*sqrt(pp*(1-pp)/nsims); printf (" (%.4f, %.4f, %.4f) EXACT: %.4f\n", pp-pwid, pp, pp+pwid, wsr_exact_Pvalue); } /* End of signed-rank simulation function */ /* Perfect shuffles: A key step in simulating P-values for the Wilcoxon rank-sum statistic is the `perfect shuffle' function void shuffle_dub(double *deck, int num); Thus function permutes the `num' elements of array of doubles deck[] of length num uniformly and independently subject to the fact that it is a permutation. Equivalently, it picks a permutation uniformly from the group S(num) of all permuations of `num' letters, with successive shuffles independent of previous shuffles. The code in statlib.c void shuffle_dub(double *deck, int num) { while (--num > 0) { int j=nrand(num+1); double dtemp=deck[num]; deck[num]=deck[j]; deck[j]=dtemp; }} says that, for each k=num-1,num-2,...,1, the k-th element is swapped with a randomly chosen double with offset j=k,k-1,...,1, and is thereafter fixed. The option j=k allows the k-th element to be unchanged. The result is a `perfect shuffle' with any element being equally likely to end up in any position, any pair of elements equally likely to end up in any pair of positions, any triple of elements equally likely to end up in any three positions, and so forth. The function `shuffle_dub' applies to arrays of doubles. Other functions in statlib.c can be used to apply `perfect shuffles' to arrays of integers and to arrays of chars. */ /* WILCOXON RANK-SUM DATA (see twosamps_ranks.c) */ /* First sample */ int mm=9; int xx[] = { 23, 21, 22, 14, 23, 19, 15, 15, 16 }; /* Midranks of first sample sorted in increasing order */ /* (from twosamps_ranks.txt) */ double mrankx[20] = { 1, 2.5, 2.5, 4, 5.5, 9.5, 12.5, 15.5, 15.5 }; /* Second sample */ int nn=14; int yy[] = { 27, 27, 22, 21, 27, 28, 19, 20, 31, 24, 21, 23, 23, 21 }; /* Midranks of second sample sorted in increasing order */ /* (from twosamps_ranks.txt) */ double mranky[20] = { 5.5, 7, 9.5, 9.5, 9.5, 12.5, 15.5, 15.5, 18, 20, 20, 20, 22, 23 }; /* Approximate 2-sided P-value from the large-sample */ /* normal approximation (from twosamps_ranks.txt) */ double wrsum_norm_approx_Pvalue=0.0118; /* Midranks for both samples in a single array */ double mrank[40]; void do_ranksums(void) { int i,r, ns,nsuc, nmn; unsigned long int startseed; double meanwrsumrank, wrsumobs; double pp,pwid; printf ("\nWILCOXON RANK-SUM TEST:\n"); /* Display both samples */ printf ("\nSample I (m=%d):\n", mm); for (i=0; i0 && (r%10)==0) printf ("\n"); printf (" %g", mrank[r]); } printf("\n"); /* The statistic W_X is the sum of the first mm midranks only */ wrsumobs=0.0; for (r=0; r= Obs(Wsrank) */ for (ns=1; ns<=nsims; ns++) { double simscore; /* Do a `perfect' random shuffle of mrank[] array */ /* Since a `perfect shuffle' of a perfect shuffle is also a */ /* perfect shuffle, we can keep reshuffling the same array */ shuffle_dub(mrank,nmn); /* Recompute the (uncentered) WSRank score */ simscore=0.0; for (i=0; i=wrsumobs) nsuc++; /* Go on to next simulation */ } printf ("%d `success(es)' in %d trials\n", nsuc,nsims); printf (" (That is, randomized centered scores >= %g)\n", wrsumobs); printf ("Simulated P-value: %d/%d = %g\n", nsuc,nsims, (double)nsuc/nsims); printf ("95%% CONFIDENCE INTERVAL FOR TRUE P-VALUE:\n"); printf ("(The middle value is the estimated P-value.)\n"); pp=(double)nsuc/nsims; pwid=1.960*sqrt(pp*(1-pp)/nsims); printf (" (%.4f, %.4f, %.4f) NORMAL APPROXIMATION: %.4f\n", pp-pwid, pp, pp+pwid, wrsum_norm_approx_Pvalue); } /* End of rank-sum simulation function */