R version 2.12.1 (2010-12-16) Copyright (C) 2010 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: i386-pc-mingw32/i386 (32-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > # Chi-squared tests > #################### > # Documentation on chi-squared tests: > help("chisq.test") > # Chi squared test of made-up cat food preference data > data<- c(28,40,40,60,52,92) > chisq.test(data) Chi-squared test for given probabilities data: data X-squared = 48.6154, df = 5, p-value = 2.659e-09 > > > # GenBank database of DNA sequences > # Install the "ape" package: > install.packages("ape") > # note: the USA(MO) server didn't have it, but the USA(OH) server did trying URL 'http://cran.case.edu/bin/windows/contrib/2.12/gee_4.13-16.zip' Content type 'application/zip' length 62519 bytes (61 Kb) opened URL downloaded 61 Kb trying URL 'http://cran.case.edu/bin/windows/contrib/2.12/ape_2.7-1.zip' Content type 'application/zip' length 1063476 bytes (1.0 Mb) opened URL downloaded 1.0 Mb package 'gee' successfully unpacked and MD5 sums checked package 'ape' successfully unpacked and MD5 sums checked The downloaded packages are in... > # Load the package of APE functions into this R session: > require("ape") Loading required package: ape Warning message: package 'ape' was built under R version 2.12.2 > # Get a particular sequence > ref<- c("NM_005369") > data<-read.GenBank(ref) > data 1 DNA sequence in binary format stored in a list. Sequence length: 3801 Label: NM_005369 Base composition: a c g t 0.327 0.168 0.214 0.291 > # It is wrong to use probabilities, or normalized frequencies, > # in a chi-squared test: > freqs<-c(0.327, 0.168, 0.214, 0.291) > chisq.test(freqs) Chi-squared test for given probabilities data: freqs X-squared = 0.0625, df = 3, p-value = 0.996 Warning message: In chisq.test(freqs) : Chi-squared approximation may be incorrect > # Get the raw frequency data (counts) from the GenBank data > counts<-base.freq(data,freq=TRUE) > counts a c g t 1244 639 812 1106 > # Now "counts" contains info on the degrees of freedom > # and produces a properly normalized chi-squared statistic > chisq.test(counts) Chi-squared test for given probabilities data: counts X-squared = 238.397, df = 3, p-value < 2.2e-16 > # Conclusion: reject H0:the distribution of a,c,g,t is uniform. > > # You could also get the counts through multiplying by the total > # number of bases, but then round-off error will give nonintegers: > freqs*3801 [1] 1242.927 638.568 813.414 1106.091 > # Re-read data with an updated method to get the sequence as a,c,g,t: > data<-read.GenBank(ref, as.character=TRUE) > data $NM_005369 [1] "a" "g" "g" "a" "c" "a" "t" "g" "c" "c" "t" "g" "a" "t" "c" "t" "t" "c" [19] "t" "t" "g" "a" "g" "c" "a" "g" "t" "c" "c" "a" "g" "c" "a" "c" "a" "a" ... [3781] "t" "a" "a" "a" "a" "a" "t" "t" "a" "a" "t" "t" "c" "t" "t" "c" "a" "g" [3799] "g" "t" "t" attr(,"species") [1] "Homo_sapiens" > # Count the a,c,g,t frequencies with the "table()" command > table(data) data a c g t 1244 639 812 1106 > # Now you can perform a chi-squared test. > chisq.test(table(data)) Chi-squared test for given probabilities data: table(data) X-squared = 238.397, df = 3, p-value < 2.2e-16 > # Contingency Tables > #################### > # Count 2x2 contingency tables and their probabilities: > # The row and column sums are marginal frequencies: > r<-c(4,6); c<-c(5,5) > # Possible tables are indexed by f11 (upper left corner) > f11<- c(0,1,2,3,4) > # Null hypothesis H0: X,Y are independent > # with the specified row and column sums > # implies a probability for each possible 2x2 table. > # Calculate the probs of these 5 2x2 tables with the > # hypergeometric pdf: > dhyper( f11, c[1], c[2], r[1]) [1] 0.02380952 0.23809524 0.47619048 0.23809524 0.02380952 > # Get the same result after switching r and c: > dhyper( f11, r[1], r[2], c[1]) [1] 0.02380952 0.23809524 0.47619048 0.23809524 0.02380952 > # Sum the probs of the "as bad or worse than f11=1" tables > # to get the p-value of the f11=1 experimental result. > sum(dhyper( c(1,0,3,4), r[1], r[2], c[1])) [1] 0.5238095 > fisher.test( matrix(c(1,3,4,2),nrow=2) ) Fisher's Exact Test for Count Data data: matrix(c(1, 3, 4, 2), nrow = 2) p-value = 0.5238 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.002560407 4.586187647 sample estimates: odds ratio 0.2033268 > > # Null hypothesis H0: X,Y are independent > # is equivalent to H0: fij = ri*cj/N > # We may also use a chi-squared test with these expected frequencies: > chisq.test( matrix(c(1,3,4,2),nrow=2) ) Pearson's Chi-squared test with Yates' continuity correction data: matrix(c(1, 3, 4, 2), nrow = 2) X-squared = 0.4167, df = 1, p-value = 0.5186 Warning message: In chisq.test(matrix(c(1, 3, 4, 2), nrow = 2)) : Chi-squared approximation may be incorrect