R version 2.12.1 (2010-12-16)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i386-pc-mingw32/i386 (32-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> # Chi-squared tests
> ####################
> # Documentation on chi-squared tests:
> help("chisq.test")
> # Chi squared test of made-up cat food preference data
> data<- c(28,40,40,60,52,92)
> chisq.test(data)
Chi-squared test for given probabilities
data: data
X-squared = 48.6154, df = 5, p-value = 2.659e-09
>
>
> # GenBank database of DNA sequences
> # Install the "ape" package:
> install.packages("ape")
> # note: the USA(MO) server didn't have it, but the USA(OH) server did
trying URL 'http://cran.case.edu/bin/windows/contrib/2.12/gee_4.13-16.zip'
Content type 'application/zip' length 62519 bytes (61 Kb)
opened URL
downloaded 61 Kb
trying URL 'http://cran.case.edu/bin/windows/contrib/2.12/ape_2.7-1.zip'
Content type 'application/zip' length 1063476 bytes (1.0 Mb)
opened URL
downloaded 1.0 Mb
package 'gee' successfully unpacked and MD5 sums checked
package 'ape' successfully unpacked and MD5 sums checked
The downloaded packages are in...
> # Load the package of APE functions into this R session:
> require("ape")
Loading required package: ape
Warning message:
package 'ape' was built under R version 2.12.2
> # Get a particular sequence
> ref<- c("NM_005369")
> data<-read.GenBank(ref)
> data
1 DNA sequence in binary format stored in a list.
Sequence length: 3801
Label: NM_005369
Base composition:
a c g t
0.327 0.168 0.214 0.291
> # It is wrong to use probabilities, or normalized frequencies,
> # in a chi-squared test:
> freqs<-c(0.327, 0.168, 0.214, 0.291)
> chisq.test(freqs)
Chi-squared test for given probabilities
data: freqs
X-squared = 0.0625, df = 3, p-value = 0.996
Warning message:
In chisq.test(freqs) : Chi-squared approximation may be incorrect
> # Get the raw frequency data (counts) from the GenBank data
> counts<-base.freq(data,freq=TRUE)
> counts
a c g t
1244 639 812 1106
> # Now "counts" contains info on the degrees of freedom
> # and produces a properly normalized chi-squared statistic
> chisq.test(counts)
Chi-squared test for given probabilities
data: counts
X-squared = 238.397, df = 3, p-value < 2.2e-16
> # Conclusion: reject H0:the distribution of a,c,g,t is uniform.
>
> # You could also get the counts through multiplying by the total
> # number of bases, but then round-off error will give nonintegers:
> freqs*3801
[1] 1242.927 638.568 813.414 1106.091
> # Re-read data with an updated method to get the sequence as a,c,g,t:
> data<-read.GenBank(ref, as.character=TRUE)
> data
$NM_005369
[1] "a" "g" "g" "a" "c" "a" "t" "g" "c" "c" "t" "g" "a" "t" "c" "t" "t" "c"
[19] "t" "t" "g" "a" "g" "c" "a" "g" "t" "c" "c" "a" "g" "c" "a" "c" "a" "a"
...
[3781] "t" "a" "a" "a" "a" "a" "t" "t" "a" "a" "t" "t" "c" "t" "t" "c" "a" "g"
[3799] "g" "t" "t"
attr(,"species")
[1] "Homo_sapiens"
> # Count the a,c,g,t frequencies with the "table()" command
> table(data)
data
a c g t
1244 639 812 1106
> # Now you can perform a chi-squared test.
> chisq.test(table(data))
Chi-squared test for given probabilities
data: table(data)
X-squared = 238.397, df = 3, p-value < 2.2e-16
> # Contingency Tables
> ####################
> # Count 2x2 contingency tables and their probabilities:
> # The row and column sums are marginal frequencies:
> r<-c(4,6); c<-c(5,5)
> # Possible tables are indexed by f11 (upper left corner)
> f11<- c(0,1,2,3,4)
> # Null hypothesis H0: X,Y are independent
> # with the specified row and column sums
> # implies a probability for each possible 2x2 table.
> # Calculate the probs of these 5 2x2 tables with the
> # hypergeometric pdf:
> dhyper( f11, c[1], c[2], r[1])
[1] 0.02380952 0.23809524 0.47619048 0.23809524 0.02380952
> # Get the same result after switching r and c:
> dhyper( f11, r[1], r[2], c[1])
[1] 0.02380952 0.23809524 0.47619048 0.23809524 0.02380952
> # Sum the probs of the "as bad or worse than f11=1" tables
> # to get the p-value of the f11=1 experimental result.
> sum(dhyper( c(1,0,3,4), r[1], r[2], c[1]))
[1] 0.5238095
> fisher.test( matrix(c(1,3,4,2),nrow=2) )
Fisher's Exact Test for Count Data
data: matrix(c(1, 3, 4, 2), nrow = 2)
p-value = 0.5238
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.002560407 4.586187647
sample estimates:
odds ratio
0.2033268
>
> # Null hypothesis H0: X,Y are independent
> # is equivalent to H0: fij = ri*cj/N
> # We may also use a chi-squared test with these expected frequencies:
> chisq.test( matrix(c(1,3,4,2),nrow=2) )
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(c(1, 3, 4, 2), nrow = 2)
X-squared = 0.4167, df = 1, p-value = 0.5186
Warning message:
In chisq.test(matrix(c(1, 3, 4, 2), nrow = 2)) :
Chi-squared approximation may be incorrect