/* Finding exact P-values for permutation tests: */
/* Wilcoxon signed rank and rank sum tests */

/* Random number generators (RNGs) on a computer:
   The simplest good RNGs are `Linear Congruential RNGs', which
   are based on a single integer:

       unsigned int rseed;

   One first STARTS or `seeds' the RNG by setting

       rseed=MyStartingValue;

   Thereafter, successive random numbers are generated by setting
   in sequence

       rseed = A*rseed + B

   for optimally chosen `magic constants' A and B, with overflows
   in the 32-bit unsigned integer `rseed' ignored. In other words,
   
       rseed = A*rseed + B  (modulo 2^32)
   
   The RNG in statlib.c uses one of a set of constants suggested for
   32-bit unsigned integers by Donald Knuth in his book Seminumerical
   Algorithms:

       A = 1664525   and   B = 1

   The idea is that, after a few random numbers have been generated,
   `rseed' attains and remains near 2^{32}. Thereafter, each successive
   choice of a number X_n = A*rseed + B  overflows the 32-bit integer.
   This causes most of the previous high-order bits of `rseed' to be lost.

   Thus if we follow the rule that we only trust the higher-order bits
   of successive values of `rseed', then the sequence of `rseed' values
   obtained in this way should be random and uniformly distributed
   among values 0 <= rseed <= 2^{32} - 1. This in fact turns out to be
   the case, in the sense that these sequences pass a large number of
   reasonable statistical tests for joint independence and for having
   a uniform distribution.

   It is rare that random numbers are used in the form `rseed'. A
   function that is more useful is

         int nrand(int m);

   which returns a random integer X that is uniformly distributed in
   the range 0,1,2,...,m-1 for m>0.  Successive calls to nrand()
   return independent values. The code for nrand() in statlib.c is
   essentially

         int nrand(int m)
	  { if (m<=1) return 0;	   // Return 0 for errors in m
	    rseed = A*rseed + B;
	    return  (m * (rseed >> 16)) >> 16; }

   The operation `(x >> 16)' for an integer x shifts all of its bits
   by 16 bit locations to the right (i.e. towards lower-order bits) while
   replacing higher-order bits by zero. Thus `m * (x >> 16)' yields an
   `overflow' integer in the range 0,1,...,m-1 in the highest 16 bits
   while changing or churning up the lowest-order 16 bits. The next
   operation `(m * (x >> 16)) >> 16' erases the previous lower-order
   16 bits and leaves only a value in the range 0,1,2,...,m-1.
   
   Note that these operations return an integer 0,1,...,m-1 using only
   the high-order bits of `rseed'. In fact, the lower-order bits of
   `rseed' are not random with nice statistical properties. Since A
   ends with 5 and B=1, the integer rseed itself will always be equal
   to 1 or 6 modulo 10. However, the high-order bits of the successive
   values of `rseed' will cause successive calls to nrand(m) to pass
   all reasonable statistical tests for independent random variables
   with a uniform distribution among 0,1,2,...,m-1.

   Another useful function is

           double drand(void);

   whose code in statlib.c is essentially

           double drand(void)
	    { rseed = A*rseed + B;
	      return (double)rseed/(2^{32}); }

   This scales the unsigned integer `rseed' to a double-precision
   floating-point number that is uniformly distributed in the range
   0<=X<=1.0. (Note that the definition guarantees that always X<1.0
   for X=drand(), which is occasionally useful.)

   Successive random numbers based on `rseed' cannot be independent
   for arbitrarily long runs. In fact, updates to `rseed' will repeat
   themselves indefinitely once it sees a value of `rseed' that it has
   seen before. Thus successive values of `rseed' must cycle with a
   period that divides 2^{32}. The choice of (A,B) above has a cycle
   length of 2^{32} approx 4,000,000,000.
   
   In fact, long simulations show that `nrand()' begins to show
   statistically significant periodic behavior at around 300,000,000
   successive values. The standard numerical recipes book Press etal
   (``Numerical Recipes in C'') suggests that simple LCRNGs (like the one
   described above) not be used for more than 100,000,000 successive
   random numbers.

   All RNG functions in statlib.c increase a count  `rand_nums_used'
   that can be reported by the function 

        void show_randnums_used(void);

   in statlib.c. During the last time that Math408 was taught, none
   of the sample programs used more than 10,000,000 random numbers,
   and most used less than 1,000,000.

   A REASON TO KEEP TRACK OF RANDOM-NUMBER SEEDS:

   Note that when rseed=X_0 is set (or `SEEDED') at the beginning of
   a program, then successive `random values' of rseed

       X_1, X_2, X_3, ..., X_n, ...

   are strictly determined by X_0 no matter how large n is. This is
   extremely useful in scientific programming: If you want to run
   a simulation again and have the program calculate more statistics
   and print out more information, you can start the RNG with exactly
   the same seed. The random numbers X_1, X_2, X_3, ... will then be
   identical to what was generated before and your new program will
   reproduce the results of the old program exactly along with your
   new output. This is very useful for testing new versions of a
   program, since even `random' output starting with the same seed
   will be identical to before.

   SEEDING YOUR RNG:

   The protect the innards of the RNG package from the user (and allow
   a new RNG to be swapped in with minimal change), all uses of `rseed'
   are done through functions instead of working with `rseed' directly.
   In fact, `rseed' and `rand_nums_used' are `static global' variables
   in statlib.c that can be accessed only in statlib.c. This means that
   these values can be changed or read only by functions in statlib.c.

   For example, the RNG is initialized at MyStartingValue by the
   function call

         startseed=setrand(MyStartingValue);

   The code for setrand() in statlib.c essentially sets
   rseed=MyStartingValue, but also does several other useful things.
   For example, if MyStartingValue=0, a value is requested from the
   computer's system clock and that is used as the seed, as a way
   of getting a `truly random' starting value. The actual seed used
   is returned as `startseed' above. The value startseed!=MyStartingValue
   if MyStartingValue=0 (in which case `startseed' is a clock value).
   If your program displays `startseed', then you can repeat the same
   simulation by using `startseed' even if your first run seeded the
   RNG from the system clock.

   WHAT IF YOU NEED MORE THAN 300,000,000 RANDOM NUMBERS?

   In that case, you can use a `long-cycle' RNG. Several excellent RNGs
   of this type can be found on the Gnu Software Library Web site.
				*/


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include "statlib.h"

#define PROG_DATE   "February 8, 2007"
#define OUTPUT_NAME "RanksSims.txt"

#define STARTSEED 123456  /* Starting random-number seed */
#define NSIMS 10000       /* Number of permutation simulations */

/* Starting random-number seed for the RNG and the number of simulations */
/* These are stored in global variables for ease of customization: */
/*   For example, you could improve the program to change these */
/*   variables from the command line */

unsigned long int wseed=STARTSEED;
int nsims=NSIMS;


/* Prototypes of functions to evaluate the two tests */
void do_signedranks(void);
void do_ranksums(void);

int main(void)
 { printf (
   "\nProgram to illustrate finding exact P-values for permutation tests\n"
   "Wilcoxon signed rank and rank sum tests\n"
   "(This is  %s  on the Ma408 Web site.)\n", OUTPUT_NAME);
   printf ("Written by XXXX\n");
   do_signedranks();
   do_ranksums();
   /* Show how many random numbers were used */  
   show_randnums_used();
   /* No errors to report */
   return 0; }


/* WILCOXON SIGNED-RANK DATA (from onesamp_ranks.c) */

int nnsr=9;
double bef[20] =
 { 1.83,  0.50,  1.62,  2.48, 1.68, 1.88, 1.55, 3.06, 1.30 };
double aft[20] =
 { 0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29 };

/* Signed-rank differences, ranks, and signs (from onesamp_ranks.txt) */
double xdif[20] =
 { -0.952, 0.147, -1.022, -0.430, -0.620, -0.590, -0.490, 0.080, -0.010 };
double srank[20] =
 { 8, 3, 9, 4, 7, 6, 5, 2, 1 };
int srsign[20] =
 { 0, 1, 0, 0, 0, 0, 0, 1, 0 };
double wsr_exact_Pvalue=2*(double)10/512;  

/* Subroutine to display a sign */
char *showsign(int ss)
 { if (ss>0) return "+";
   else return "-"; }


void do_signedranks(void)
 { int i, ns,nsuc;
   unsigned long int startseed;
   double meanwsrank, wsrankobs;
   double pp,pwid;
   printf ("\nWILCOXON SIGNED-RANK TEST:\n");
   printf ("\nSigned rank (After-Before) data:\n");
   printf ("Data (n=%d):\n", nnsr);
   for (i=0; i<nnsr; i++) printf ("  %g", xdif[i]);  printf("\n");
   printf ("Signed ranks with signs (n=%d):\n", nnsr);
   for (i=0; i<nnsr; i++) printf ("  %s%g", 
     showsign(srsign[i]),srank[i]);  printf("\n");
   /* Find observed Wilcoxon signed-rank value */
   wsrankobs=0.0;
   for (i=0; i<nnsr; i++)
     if (srsign[i]>0) wsrankobs=wsrankobs+srank[i];
   /* Center wsrankobs at its expected mean for a 2-sided P-value */
   /* `(double)' is a CAST: It forces the next expression to be */
   /*   a double, not an int */
   /* This is necessary to divide by 4 without doing integer division, */
   /*   which would truncate a fraction */
   meanwsrank=(double)(nnsr*(nnsr+1))/4;
   printf ("Observed signed-rank statistic:  %g   Expected value: %g\n", 
     wsrankobs, meanwsrank);
   wsrankobs=fabs(wsrankobs-meanwsrank);
   printf ("Centered at mean for a 2-sided P-value:\n");
   printf ("  Observed (centered) Wilcoxon signed-rank statistic:  %g\n", 
     wsrankobs);
   printf ("Exact permutation-test P-value: P = %g (2-sided)\n",
     wsr_exact_Pvalue);

   printf (
   "\nEstimating the true P-value by doing %d random permutations:\n",
     nsims);
   /* Initial seed of random-number generator: */
   /* If wseed=0, then the RNG is seeded from the system clock, */
   /*   and startseed!=0 is the true seed */
   /* Otherwise, startseed=wseed */
   startseed=setrand(wseed);
   printf ("Starting random-number seed: %lu\n", startseed);
   printf ("Exact P-value:  %g\n", wsr_exact_Pvalue);
   /* Do nsims simulations */
   nsuc=0;   /* The number with rand(Wsrank) >= Obs(Wsrank) */
   for (ns=1; ns<=nsims; ns++)
    { double randwsr;
      /* Randomize the sign array */
      for (i=0; i<nnsr; i++)
       { if (nrand(2)==1) srsign[i]=1;
         else srsign[i]=0; }
      /* Recompute the (uncentered) WSRank score */
      /* Note that srsign[] varies but srank[] is fixed */
      randwsr=0.0;
      for (i=0; i<nnsr; i++)
        if (srsign[i]>0) randwsr=randwsr+srank[i];
      randwsr=fabs(randwsr-meanwsrank);
      /* Center wsrankobs at its expected mean for a 2-sided P-value */
      /* A `success' is a more extreme (or same) value of wsrank */
      if (randwsr>=wsrankobs) nsuc++;
      /* Go on to next simulation */ }
   printf ("%d `success(es)' in %d trials\n", nsuc,nsims);
   printf ("  (That is, randomized centered scores >= %g)\n", wsrankobs);
   printf ("Simulated P-value:  %d/%d = %g\n",
     nsuc,nsims, (double)nsuc/nsims);
   printf ("95%% CONFIDENCE INTERVAL FOR TRUE P-VALUE:\n");
   printf ("(The middle value is the estimated P-value.)\n");
   pp=(double)nsuc/nsims;
   pwid=1.960*sqrt(pp*(1-pp)/nsims);
   printf ("  (%.4f, %.4f, %.4f)   EXACT: %.4f\n", 
     pp-pwid, pp, pp+pwid, wsr_exact_Pvalue);
     }   /* End of signed-rank simulation function */


/* Perfect shuffles:
   A key step in simulating P-values for the Wilcoxon rank-sum statistic
   is the `perfect shuffle' function

       void shuffle_dub(double *deck, int num);

   Thus function permutes the `num' elements of array of doubles deck[]
   of length num uniformly and independently subject to the fact that
   it is a permutation. Equivalently, it picks a permutation uniformly
   from the group S(num) of all permuations of `num' letters, with
   successive shuffles independent of previous shuffles. The code
   in statlib.c

       void shuffle_dub(double *deck, int num)
        { while (--num > 0)
           { int j=nrand(num+1);
             double dtemp=deck[num];  deck[num]=deck[j];  deck[j]=dtemp; }}

   says that, for each k=num-1,num-2,...,1, the k-th element is swapped
   with a randomly chosen double with offset j=k,k-1,...,1, and
   is thereafter fixed. The option j=k allows the k-th element to be
   unchanged. The result is a `perfect shuffle' with any element being
   equally likely to end up in any position, any pair of elements equally
   likely to end up in any pair of positions, any triple of elements
   equally likely to end up in any three positions, and so forth.

   The function `shuffle_dub' applies to arrays of doubles. Other
   functions in statlib.c can be used to apply `perfect shuffles'
   to arrays of integers and to arrays of chars.  */


/* WILCOXON RANK-SUM DATA (see twosamps_ranks.c) */

/* First sample */
int mm=9;
int xx[] = { 23,  21,  22,  14,  23,  19,  15,  15,  16 };
/* Midranks of first sample sorted in increasing order */
/*   (from twosamps_ranks.txt) */
double mrankx[20] =
 { 1, 2.5, 2.5, 4, 5.5, 9.5, 12.5, 15.5, 15.5 };

/* Second sample */
int nn=14;
int yy[] = { 27,  27,  22,  21,  27,  28,  19,  20,  31,  24,
   21,  23,  23,  21 };
/* Midranks of second sample sorted in increasing order */
/*   (from twosamps_ranks.txt) */
double mranky[20] =
 { 5.5, 7, 9.5, 9.5, 9.5, 12.5, 15.5, 15.5, 18, 20, 20, 20, 22, 23 };

/* Approximate 2-sided P-value from the large-sample */
/*   normal approximation (from twosamps_ranks.txt) */
double wrsum_norm_approx_Pvalue=0.0118;

/* Midranks for both samples in a single array */
double mrank[40];


void do_ranksums(void)
 { int i,r, ns,nsuc, nmn;
   unsigned long int startseed;
   double meanwrsumrank, wrsumobs;
   double pp,pwid;
   printf ("\nWILCOXON RANK-SUM TEST:\n");
   /* Display both samples */
   printf ("\nSample I (m=%d):\n", mm);
   for (i=0; i<mm; i++) 
     printf ("  %d", xx[i]);  printf("\n");
   printf ("Sample I midranks (m=%d):\n", mm);
   for (i=0; i<mm; i++) 
     printf ("  %g", mrankx[i]);  printf("\n");
   printf ("\nSample II (n=%d):\n", nn);
   for (i=0; i<nn; i++) 
     printf ("  %d", yy[i]);  printf("\n");
   printf ("Sample II midranks (n=%d):\n", nn);
   for (i=0; i<nn; i++) 
     printf ("  %g", mranky[i]);  printf("\n");

   /* Combine midranks together into one array */
   for (r=i=0; i<mm; i++)  mrank[r++]=mrankx[i];
   for (i=0; i<nn; i++)  mrank[r++]=mranky[i];
   nmn=mm+nn;
   printf ("\nJoint midranks in one array (n=%d):\n", nmn);
   for (r=0; r<nmn; r++)
    { /* Restrict to 10 midranks per row on the screen */
      if (r>0 && (r%10)==0) printf ("\n");
      printf ("  %g", mrank[r]); } printf("\n");

   /* The statistic W_X is the sum of the first mm midranks only */
   wrsumobs=0.0;
   for (r=0; r<mm; r++) wrsumobs=wrsumobs+mrank[r];
   meanwrsumrank=mm*(double)(mm+nn+1)/2;
   printf ("\nObserved W_X score: %g    Expected value: %g\n",
     wrsumobs, meanwrsumrank);
   wrsumobs=fabs(wrsumobs-meanwrsumrank);
   printf ("  Centered at expected value for 2-sided test:  %g\n", wrsumobs);
   printf ("Normal-approx permutation-test P-value: P = %g (2-sided)\n",
     wrsum_norm_approx_Pvalue);

   /* Initialize random-number generator at a starting seed */
   printf (
   "\nEstimating the true P-value by doing %d random permutations\n",
     nsims);
   printf ("  (specifically, by doing %d random shuffles):\n", nsims);
   /* Initial seed of random-number generator: */
   /* If wseed=0, then the RNG is seeded from the system clock, */
   /*   and startseed!=0 is the true seed */
   /* Otherwise, startseed=wseed */
   startseed=setrand(wseed);
   printf ("Starting random-number seed: %lu\n", startseed);
   printf ("Exact P-value:  %g\n", wrsum_norm_approx_Pvalue);

   /* Do nsims simulations */
   nsuc=0;   /* The number with rand(Wsrank) >= Obs(Wsrank) */
   for (ns=1; ns<=nsims; ns++)
    { double simscore;
      /* Do a `perfect' random shuffle of mrank[] array */
      /* Since a `perfect shuffle' of a perfect shuffle is also a */
      /*   perfect shuffle, we can keep reshuffling the same array */
      shuffle_dub(mrank,nmn);
      /* Recompute the (uncentered) WSRank score */
      simscore=0.0;
      for (i=0; i<mm; i++) simscore=simscore+mrank[i];
      /* Center at E(W_X) */
      simscore=fabs(simscore-meanwrsumrank);
      /* A `success' is a more extreme (or same) value of wranksum */
      if (simscore>=wrsumobs) nsuc++;
      /* Go on to next simulation */ }
   printf ("%d `success(es)' in %d trials\n", nsuc,nsims);
   printf ("  (That is, randomized centered scores >= %g)\n", wrsumobs);
   printf ("Simulated P-value:  %d/%d = %g\n",
     nsuc,nsims, (double)nsuc/nsims);
   printf ("95%% CONFIDENCE INTERVAL FOR TRUE P-VALUE:\n");
   printf ("(The middle value is the estimated P-value.)\n");
   pp=(double)nsuc/nsims;
   pwid=1.960*sqrt(pp*(1-pp)/nsims);
   printf ("  (%.4f, %.4f, %.4f)   NORMAL APPROXIMATION: %.4f\n", 
     pp-pwid, pp, pp+pwid, wrsum_norm_approx_Pvalue);
     }   /* End of rank-sum simulation function */