/* This program carries out a simple calculation. The programs after
     this will begin to do more substantial heavy lifting.

   Specifically, this program considers two samples of numbers:
  
     Sample I (Xs):   9.19  9.54  8.65  7.31  8.47  9.78
     Sample II (Ys):  8.73  8.17  6.40  6.31  7.09  7.99  5.89  6.38  8.24
  
   and carries out the classical two-sample t-test for the hypothesis
  
     H0: E(X)=E(Y)     

   The two non-comment lines below tell the C compiler to include standard C
     ``header'' files that have prototypes for standard functions.
     #include <stdio.h> is needed for printf().  #include <math.h> is needed
     to introduce common math functions like sine (sin(x)), cosine (cos(x)),
     log (log(x) for base e, log10(x) for base 10), sqrt (square root),
     and many others. Of these, we use sqrt(x) and fabs(x), the latter
     of which gives the absolute value of a floating-point number. */

#include <stdio.h>
#include <math.h>

/* Variables in C:  C has many different types of variables, and also has 
     ways to create additional variable types of your own. This program uses
                                             
     integer variables   for integer values  and     
     `doubles'           for real numbers that might have fractional parts 
                                             
   `Double' is short for `double-precision floating-point' number. 
     These are the most common floating-point variables in C. 

   Variables in C must be ``declared'' before they are used, so that 
     the compiler will know what kind of code to generate for them. 
     They can also be ``initialized'' (set equal to a starting value) 
     at the same time. Two typical variable declarations are 
  
        int i;  double ff;
  
     which tell the compiler that `i' will be an integer variable and that
     `ff' is a double (that is, a double-precision floating-point variable).
  
   The syntax of variable names in C is that their names can be arbitrary
     strings of letters (a-z, A-Z), digits (0-9), and underscores (_), such
     that a digit does not come first. For example,  x, ytop, xval, Xval,
     x2, x22, x_top, and f2g37zz are legal variable names. 21xtop is not.
     Case is significant, so that  xval, Xval, and XVAL are three different
     variables.
  
   Examples of declarations with initializations are
  
        int i=5;  double ff=37.1371;
  
     These mean that the integer variable  i  starts out with the 
     value 5 and  ff  starts out with the value 37.1371.
  
   In C, the statement i=5 (or x=y) means that the value on the right
     (5 or y) is stored in the variable on the left (i or x). In contrast,
     i==5 and x==y are comparisons that are either true or false, as in
  
       if (i==5) x=y;
  
     Here the statement `x=y' is only carried out if i==5 beforehand.
     The value of  i  (whether it was equal to 5 or not) is unchanged.
     Confusing = with == is a common cause of programming errors. Many
     modern C compilers can warn about a likely mistake of this kind if
     warnings are turned on.
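
   As a minimal sketch of the difference (the function names
     compare_only() and assign_by_mistake() are made up for this
     illustration and are not part of this program; the sketch uses
     // comments so that it can sit inside this comment):

```c
// Comparison: returns 1 if i equals 5 and 0 otherwise.
// The value of i is only looked at, never changed.
int compare_only(int i)
{
    if (i == 5) return 1;
    return 0;
}

// Assignment by mistake: (i = 5) stores 5 in i, and the value of the
// expression is 5, which counts as true. The result is 1 for every input.
int assign_by_mistake(int i)
{
    if ((i = 5)) return 1;   // the extra parentheses silence the warning
    return 0;
}
```

     For example, compare_only(4) is 0, while assign_by_mistake(4) is 1.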
  
   An ARRAY of variables is a list of variables that are kept together
     to make them easier to work with.  The declarations
  
        int ii[5];  double gg[10];
  
     define ``ii'' as a collection of 5 integer-value variables and
     ``gg'' as a collection of 10 doubles. An oddity of C is that array
     indices begin with 0 and not 1. That is, the first integer in the
     array `ii' is ii[0], the second is ii[1], and so forth.
  
   Examples of declarations of arrays with initializations are
  
        int ii[5] = { 2, 12, 151, -1, -2 };
        double gg[10] = { 1.0, 2, 5.1, 37.1, 22 };
  
     After this initialization, the 5 integers in the array ii[] (with
     their starting values) are
  
        ii[0]=2,  ii[1]=12,  ii[2]=151,  ii[3]=-1,  and ii[4]=-2
  
   In particular, even though the array ii[] is of length 5, the value
     ii[5] is not defined, since it refers to one integer beyond the
     end of the array ii[]. Referring to a value outside of an array
     (like `ii[5]=7;') can silently overwrite other data or cause the
     program to crash with what is called a ``segmentation fault''.
     Segmentation faults are caused by references to memory locations
     that the operating system has not allocated to the program, and
     that thus may not exist (or else may exist but belong to some
     other program).
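
   One way to guard against this is to test the index before using
     it. The following is only a sketch; store_checked() is a
     hypothetical helper that is not used in this program:

```c
// Stores value in arr[index] only if index is a valid position in an
// array of length len. Returns 1 on success, or 0 if index is out of
// range (in which case nothing is written).
int store_checked(int arr[], int len, int index, int value)
{
    if (index < 0 || index >= len) return 0;
    arr[index] = value;
    return 1;
}
```

     With the array ii[] above, store_checked(ii, 5, 4, 7) sets ii[4]
     to 7, while store_checked(ii, 5, 5, 7) refuses and returns 0.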
  
   The first of the two samples of data is defined by   */
                                              

int mm=6;
double xx[20] = { 9.19, 9.54, 8.65, 7.31, 8.47, 9.78 };

                                              
/* These statements define and initialize an integer mm, which is the 
     number of values in the first sample, and the first sample itself, 
     stored in the array xx[]. For simplicity, we declare more memory 
     for xx[] than we will actually need. The unused entries
     xx[6],...,xx[19] are set to zero by C.
  
   The second sample size (nn) and the second sample (yy[]) are  */
                                              

int nn=9;
double yy[20] = { 8.73, 8.17, 6.40, 6.31, 7.09, 7.99, 5.89, 6.38, 8.24 };

/* The main() (starting) function of this program is:  */

int main(void)
 { /* main() begins with declarations of two integer variables: */
   int i, degfree;
   /* and follows with several declarations of doubles, most of which 
        will be scratch variables in the computations to follow: */
   double sum1,sum2, xmean,ymean, ss1,ss2, xss,yss;
   double hmn,pooledss,tt;
   /*   This is a declaration of a double with an initial value 
        that we will need below. */
   double pval=0.00985;

   /* By C convention, adjacent literal text strings "xxxxx" "yyyy" are 
        combined to form one text string "xxxxxyyyy", even if the two 
        text strings are on different lines. Thus we can print (display) 
        the following lines of text with one printf statement. We do it
        this way to make the source file (this program) easier to read.  */
   printf (
   "\nThis is the program TWOSAMP written by XXXXX.\n"
   "We want to test the hypothesis  H_0:mu_X=mu_Y  using data from\n"
     "  the two samples:\n\n");

/*    We now display the two samples before carrying out the test, so 
        that someone running the program will know what two samples 
        we were talking about.   

      The first argument in a printf() statement with variables is the 
        `format string'. In the format string, `%d' means that printf() 
        expects an integer value (here mm) at this point in the argument
        list of printf() following the format string, and will substitute
        the value of that integer variable for %d. `%f' and `%g' mean
        doubles. `\n' is the EOL character (for end-of-line), which starts
        a new line. 
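
      As a small sketch of how format strings work, the hypothetical
        helper format_summary() below (not part of this program) uses
        snprintf(), which formats exactly like printf() but writes
        the result into a character array instead of displaying it:

```c
#include <stdio.h>

// Writes a one-line summary into buf (which has room for bufsize
// characters) and returns the number of characters in the summary.
int format_summary(char *buf, size_t bufsize, int n, double mean, double first)
{
    // %d takes the int n, %.2f shows mean with 2 digits after the
    // decimal point, and %g shows first in a compact form.
    return snprintf(buf, bufsize, "n=%d mean=%.2f first=%g\n", n, mean, first);
}
```

        With n=6, mean=8.8233, and first=9.19, the array buf receives
        the text  n=6 mean=8.82 first=9.19  followed by a newline.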

      The `for' loop `for(i=0; i<mm; i++)' below traverses the first mm 
        values of the array xx[].  That is, for values of i in the range 
        i=0,1,2,...,mm-1, we carry out an operation with that value of i. 
        The statements `printf("\n");', which tell the computer to start 
        a new line, are NOT in the for(..) loops and are each executed only 
        once, so that each sample is displayed on one line.  */

   printf ("    X (m=%d):", mm);
   for (i=0; i<mm; i++)  printf ("  %g",xx[i]);  printf("\n");
   printf ("    Y (n=%d):", nn);
   for (i=0; i<nn; i++)  printf ("  %g",yy[i]);  printf("\n");

/*    In general, the syntax for a `for' loop is EITHER 

        for (INITIALIZE; TEST; UPDATE)  
           ONE_PROGRAM_STATEMENT;     

      or else   

        for (INITIALIZE; TEST; UPDATE)  
         { STATEMENT1;  STATEMENT2; ...; } 

        using curly braces {} to define a block of statements.  

      The PROGRAM BODY (ONE_PROGRAM_STATEMENT or { STATEMENT1; ...;} ) 
        for a for loop is carried out with variable values (usually changed 
        by the UPDATE step) until the TEST is false. If TEST is false on the 
        first try, then the program body is never executed. 
      With few exceptions (such as #include lines), line endings in the
        source file are not important in C syntax. 
        Thus the `printf("\n");' statements are ``out of the loop'' in 
        the for(..) statements above and are executed only when the loop 
        is done. 
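
      The two forms of the `for' loop can be sketched as follows. The
        helper functions sum_values() and count_above() are made up
        for this illustration and are not used in this program:

```c
// Single-statement body: adds up the first n values of a[].
double sum_values(double a[], int n)
{
    int i;
    double sum = 0.0;
    for (i = 0; i < n; i++)
        sum = sum + a[i];          // the one statement in the loop body
    return sum;                    // carried out once, after the loop
}

// Braced body with two statements: counts the values among the first
// n values of a[] that are larger than cutoff.
int count_above(double a[], int n, double cutoff)
{
    int i, above, count = 0;
    for (i = 0; i < n; i++)
    {
        above = (a[i] > cutoff);   // 1 if true, 0 if false
        count = count + above;
    }
    return count;
}
```

        If n is 0, the TEST i<n is false on the first try, so that
        neither loop body is executed and both functions return zero.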

      We now start doing something useful: First, we compute the sample 
        mean (xmean) and sample variance (xss) for the first sample. 
        The `for' loop computes sum1=xx[0]+xx[1]+...+xx[mm-1] by adding
        one term at a time, so that  xmean=sum1/mm  is the sample mean. */

   sum1=0.0;           /*  So that sum1=sum1+xx[0];  means sum1=xx[0]; */
   for (i=0; i<mm; i++) sum1=sum1+xx[i];   /* This uses the integer i; */
   xmean=sum1/mm;

   /* We now compute the sample variance  xss=s_X^2  in the same way: */
   ss1=0.0;
   for (i=0; i<mm; i++) ss1 = ss1 + (xx[i]-xmean)*(xx[i]-xmean);
   xss=ss1/(mm-1);   /* Divide by mm-1, not mm, for the sample variance */

   /* The sample mean (ymean) and sample variance (yss) for the second 
        sample are */
   sum2=0.0;
   for (i=0; i<nn; i++) sum2=sum2+yy[i];
   ymean=sum2/nn;

   ss2=0.0;
   for (i=0; i<nn; i++) ss2=ss2+(yy[i]-ymean)*(yy[i]-ymean);
   yss=ss2/(nn-1);

   /* The two-sample t-statistic assumes that the two samples come from 
        two statistical sources that have the same variance. This common 
        variance is estimated by the POOLED SAMPLE VARIANCE estimator 
        for the two samples, which is by definition     
                                                        
         s_P^2 = ((mm-1)s_X^2 + (nn-1)s_Y^2)/(mm+nn-2)  
                                                        
      We can compute this here more easily since
                                                        
         ss1=(mm-1)*s_X^2    and    ss2=(nn-1)*s_Y^2    */

   pooledss=(ss1+ss2)/(mm+nn-2);

   /* By definition, the sample standard deviations are the square 
        roots of the sample variances, which C allows us to compute 
        on the fly within the printf() argument list: */

   printf("\nThe sample means and sample standard deviations are:\n\n");
   printf("    X:  (m=%d)   Xbar=%.2f   s_X=%.2f\n", mm, xmean, sqrt(xss));
   printf("    Y:  (n=%d)   Ybar=%.2f   s_Y=%.2f\n", nn, ymean, sqrt(yss));

   printf ("\nThe pooled estimate of the assumed common theoretical\n");
   printf ("  standard deviation is:  s_P=%.2f\n",  sqrt(pooledss));

   /* We now compute the two-sample T statistic: 

      The programming code below and above computes more-or-less 
        straightforwardly the formula for the two-sample t-statistic that 
        can be found in most beginning statistics books. More complicated 
        formulas can be calculated in the same way. */

   hmn=(1.0/mm)+(1.0/nn);   /*   Needed in denominator of T statistic */

   /* WARNING:  Had we written 1/mm above, the result would be 0, since
        a/b for integers a,b in C always means the integer part of the
        ratio, which is 0 here. In contrast, 1.0/mm tells C that we want
        to use floating-point arithmetic, which gives us the right answer,
        for example 0.25 if mm==4. 
        See below for more details about integer division.   */

   /* The t-statistic itself and its degrees of freedom:  */
   tt = (xmean-ymean)/sqrt(hmn*pooledss);
   degfree=mm+nn-2;

   printf ("\nThe two-sample t-statistic for H_0 is:\n\n");
   printf ("    T=%.2f    deg.free. = %d + %d - 2 = %d\n", 
     tt, mm,nn, degfree);
   /* fabs(..) below is another standard math function, which 
        computes the absolute value. Its prototype is also in  math.h */
   printf ("\nA table of Student-t P-values tells us that\n\n");
   printf ("    P = Prob(|t(%d)| >= %g) = %g  (two-sided)\n\n",
     degfree, fabs(tt), pval);
   printf ("so that we reject H_0 and conclude mu_X != mu_Y.\n");

   /* In this example, we found `pval' separately and stored its value 
        in the double variable `pval'. In most of this course we will
        use C computer code that computes Student-t P-values directly
        without having to refer to tables or other computers. 

      INTEGER DIVISION:  In C, ``integer division'' is quite different
        from  ``floating-point division''. If you divide two integers, you
        get another integer with the remainder thrown away, so that 
                                                         
          3/3=1,  4/3=1,  5/3=1,  6/3=2,  7/3=2,  etc.   
                                                         
      This may look strange, but is useful for computing array indices 
      and other purposes. Floating-point division works as expected, 
      so that in contrast                                  
                                                           
          3.0/3=1.0,  4.0/3=1.33333...,  5.0/3=1.66666....,  etc
          
      If either value in a binary expression (like x*y or x/y) is a
        floating point, the other value is ``promoted'' to a floating
        point value and C uses floating point operations.     
                                                           
      By C convention, a literal number with a decimal point is a 
        floating-point constant (like 3.0 or 3.) and a number without 
        a decimal point (like 3 or 33) is an integer. Integers in 
        expressions with floating-point values are first automatically
        converted to floating point, so that  3.0/2 = 3.0/2.0 = 1.50
        (a double).
                                                                
      In particular, if we had said `hmn=(1/mm)+(1/nn);' above, 
        then we would have gotten `hmn=0.0', since the compiler 
        would have used integer division on the right-hand side
        of the assignment, and `x=0' for floating point
        x and integer 0 means that x=0.0 (i.e., floating point 0).
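
      The two kinds of division can be compared side by side in a
        short sketch. All four function names below are hypothetical
        and are not used in this program:

```c
// Integer division: the remainder is thrown away.
int int_quotient(int a, int b)
{
    return a / b;
}

// Floating-point division of the same two integers. The cast
// (double) converts a to a double first, so that b is promoted
// as well and the division is done in floating point.
double real_quotient(int a, int b)
{
    return (double) a / b;
}

// The sum 1/m + 1/n computed the wrong way (integer division)
// and the right way (floating-point division).
double hmn_wrong(int m, int n) { return (1 / m) + (1 / n); }
double hmn_right(int m, int n) { return (1.0 / m) + (1.0 / n); }
```

        Here int_quotient(7,3) is 2 and real_quotient(7,3) is
        2.33333..., while hmn_wrong(6,9) is 0.0 and hmn_right(6,9)
        is 0.27777...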

      The program has now done its task, and is free to go home. */

   return 0; }