Math 322 Biostatistics
Professor Wickerhauser

NEWS

Article on tests of randomness.

Confusingly, there is also a function named rpart.plot(),
which needs
install.packages("rpart.plot")
library(rpart.plot)
It may be used instead of plot.rpart() and has the same inputs.

There is an updated reg27.txt example
for tree classification. It includes more information about computing
misclassification rates with printcp() and predict().
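
As a minimal sketch of the misclassification computation mentioned above (not the contents of reg27.txt), the following fits a classification tree with rpart(), prints its complexity table with printcp(), and compares predict() output with the true labels. The built-in iris data and all variable names are assumptions chosen for illustration.

```r
library(rpart)

# Assumption: iris stands in for the course's gene expression data.
fit <- rpart(Species ~ ., data = iris, method = "class")

# Complexity-parameter table; "rel error" and "xerror" are expressed
# relative to the root node error printed above the table.
printcp(fit)

# Predicted classes on the training data, then a confusion matrix.
pred <- predict(fit, newdata = iris, type = "class")
confusion <- table(predicted = pred, actual = iris$Species)
print(confusion)

# Resubstitution (training-set) misclassification rate.
misclass_rate <- mean(pred != iris$Species)
print(misclass_rate)
```

The rate computed this way is measured on the training data, so it is optimistic; the xerror column of printcp() gives a cross-validated counterpart on the same relative scale.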

More tree
documentation: Zach's example code for building and pruning Classification and
Regression Trees (CART) in R.

HW 11 Solutions are now available.

HW 10 Solutions are now available.

Plotting the output of rpart() should be done with plot.rpart(),
which is a method in the rpart package itself and so needs only
install.packages("rpart")
library(rpart)
and which only works with R versions 4 and higher.

The tree() function in R does not work in R versions 4.0 and
higher, that is, in the latest versions. Use rpart() instead, after
install.packages("rpart")
library(rpart)
Example file reg27.txt has been
updated to include rpart. Be sure to reload the page in your browser.
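
A minimal sketch of the rpart() workflow described above, assuming the kyphosis data frame that ships with the rpart package in place of the course data; the object name fit is hypothetical. Calling plot() on the fitted tree dispatches to the plot.rpart method.

```r
library(rpart)

# Assumption: kyphosis (bundled with rpart) stands in for the course data.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

print(fit)               # text summary of the splits

plot(fit, margin = 0.1)  # draw the tree skeleton, with room for labels
text(fit, use.n = TRUE)  # add split labels and per-leaf class counts
```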

R EXAMPLES
Example R commands for:
 mean and median reg01.txt.
 histograms and samples reg02.txt.
 deviation and diversity reg03.txt.
 combinations and permutations reg04.txt.
 box plots and confidence intervals reg05.txt.
 use, power and sample size in t-tests reg06.txt.
 Student-t test power demo reg33.txt.
 Student-t tests from reduced data reg34.txt.
 multinomial pdf calculation and sampling reg30.txt.
 Dirichlet pdf plotting in 3 variables reg35.txt.
 Matrix entry; Mann-Whitney, Wilcoxon, McNemar, and median
comparison tests reg07.txt.
 single-factor analysis of variance with unequal
replication reg08.txt.
 multiple comparison of means reg09.txt.
 homoscedasticity tests reg10.txt.
 two- and three-factor ANOVA reg11.txt.
 MANOVA demo reg12.txt.
 bivariate normal density estimation, sampling, and plotting with
persp() and contour() reg13.txt.
 simple linear regression reg14.txt.
 simple correlation reg15.txt.
 multiple linear regression and prediction reg16.txt.
 Kendall's W reg17.txt.
 partial correlation coefficients reg18.txt.
 goodness of fit reg19.txt.
 tests of independence in contingency tables reg20.txt.
 Fisher's exact test reg21.txt.
 binomial and hypergeometric densities reg22.txt.
 Poisson density and one randomness test reg23.txt.
 Tests of serial randomness reg24.txt.
 Installing the contributed package "snpar" reg25.txt.
 Gibbs sampling ma3220228Rcode.txt (from the February
28th, 2011 lecture).
 rejection, Metropolis, and Metropolis-Hastings sampling
metroph.txt (from the March 2nd, 2011 lecture).
 GenBank data and goodness of fit
322Rsession20110330.txt
(from the March 30th, 2011 lecture).
 multivariate visualization, principal components, Mahalanobis
distance, and linear discriminant analysis reg26.txt.
 classification trees, with gene data cleanup reg27.txt.
 clustering by means, medoids, agglomerative and divisive trees reg28.txt.
 multidimensional scaling by IsoMap reg29.txt.
 bootstrap method to estimate sampling error in non-normal PDFs reg32.txt.
 How normal random variables combine to give chi-squared and F densities reg36.txt.

reg37.txt is available to illustrate how the
minimum of k i.i.d. uniforms is distributed like Beta(1,k).
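
The claim follows from P(min > x) = (1 - x)^k for 0 <= x <= 1, so the minimum has CDF 1 - (1 - x)^k, which is the Beta(1,k) CDF. A small simulation sketch (not the contents of reg37.txt; the seed, k, and sample size are arbitrary choices) can check this numerically:

```r
set.seed(322)          # arbitrary seed, for reproducibility only

k <- 5                 # number of uniforms per sample (illustrative)
n_rep <- 10000         # number of simulated minima (illustrative)

# Each column holds k i.i.d. Uniform(0,1) draws; take the column minima.
u <- matrix(runif(k * n_rep), nrow = k)
mins <- apply(u, 2, min)

# Beta(1, k) has mean 1/(k + 1); compare with the simulated mean.
print(c(simulated = mean(mins), theoretical = 1 / (k + 1)))

# Compare the empirical CDF with the Beta(1, k) CDF at a few points.
q <- c(0.05, 0.1, 0.2, 0.4)
print(rbind(empirical = ecdf(mins)(q), beta = pbeta(q, 1, k)))

# Histogram of the minima with the Beta(1, k) density overlaid.
hist(mins, breaks = 40, freq = FALSE, main = "Minimum of k uniforms")
curve(dbeta(x, 1, k), add = TRUE, lwd = 2)
```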

LINKS
 Zoom recordings links.
 Example Midterm (2016).
 Example Final (2020).
 Open-source software R
for statistical computing, and its manual.
 Download R from WUStL's
software archive.

 Download RStudio from its developer's website.
 Download a precompiled executable Maxima, for Windows,
from SourceForge.
 Maxima project home page,
for sources, documentation, links, and precompiled binary downloads for Linux,
Macintosh and other systems.
 Download old free MatLab (for
Windows or Linux PCs) from my website.
 There is an online Octave
to R dictionary, useful
for those who know MatLab or Octave well and want to learn the
corresponding R commands.
 R program in file deduct.R to solve HW 1's DNA
sequence counting problem.
 R program in file faker.R. Once the file is loaded, "faker(n, mu, sd)"
generates n>1 samples with exactly the prescribed mean mu and standard
deviation sd.
 Notes (brillouin.pdf) on Brillouin
and Shannon diversity, for HW 1.
 Notes (condprob.pdf) on conditional
probabilities and continuous densities, for HW 3.
 R program in file bvnpdf.R,
to compute bivariate normal pdfs. Read the code for usage instructions.
 R program in file dagopear.R to perform
the D'Agostino-Pearson test of normality and compute the associated
statistics.
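
One way a function like faker(n, mu, sd), described earlier in this list, could produce an exact sample mean and standard deviation is to standardize a random draw and then rescale it. The sketch below is an assumption about the idea, not the code in faker.R, and the name exact_sample is hypothetical.

```r
# Hypothetical stand-in for faker(n, mu, sd); not the code in faker.R.
exact_sample <- function(n, mu, sd) {
  stopifnot(n > 1, sd >= 0)
  x <- rnorm(n)                       # any non-constant draw will do
  z <- (x - mean(x)) / stats::sd(x)   # now sample mean 0 and sample sd 1
  mu + sd * z                         # shift and scale to the targets
}

set.seed(1)                           # arbitrary seed
y <- exact_sample(10, mu = 50, sd = 4)
print(c(mean = mean(y), sd = sd(y)))
```

The requirement n>1 matters because the sample standard deviation is undefined for a single value.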

R program in file cochran.R for Cochran's test of a
dichotomous variable without replication.

R program in file kendall.w.R to compute Kendall's
coefficient of concordance (Kendall's W). Call it using
kendall.w(tab)
where tab is a matrix with scores (or ranks) along its rows.
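
For orientation, here is a minimal sketch of the textbook formula for Kendall's W without a tie correction. It is not the code in kendall.w.R: the name kendall_w_sketch is hypothetical, and it assumes each row of tab holds one judge's scores for the same set of items, with no ties within a row.

```r
# Hypothetical sketch, not the code in kendall.w.R.
# Assumptions: each row of tab is one judge, each column one item,
# and there are no tied scores within a row (no tie correction).
kendall_w_sketch <- function(tab) {
  m <- nrow(tab)                       # number of judges
  n <- ncol(tab)                       # number of items ranked
  ranks <- t(apply(tab, 1, rank))      # rank the scores within each row
  R <- colSums(ranks)                  # rank sum for each item
  S <- sum((R - mean(R))^2)            # spread of the rank sums
  12 * S / (m^2 * (n^3 - n))
}

agree <- rbind(1:5, 1:5, 1:5)          # three judges in full agreement
oppose <- rbind(1:4, 4:1)              # two judges in full disagreement
print(kendall_w_sketch(agree))         # 1
print(kendall_w_sketch(oppose))        # 0
```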
 NCI microarray data on 14 cancers:
 nci.info: some information on the data
 nci.names: just the 64 names identifying
14 cancers, to label the 64 rows of gene expression data.
 nci.data: gene expression data, 64 rows
of 6830 gene expression values. HINT: Save this to a file.
 Cleaned-up NCI microarray data, 57 samples of 8 cancers with top
12 expressed genes: nci57x13.R, to be saved
into your R folder and read into the R session with
load("nci57x13.R"). The result is a data frame named "nci12".
 Download nci57x7.R and
load("nci57x7.R") to get the top 6 genes data. The result is a
data frame named "nci6".
 Download nci57x6831.R and
load("nci57x6831.R") to get the full gene expression data frame,
named simply "nci".
 Article and Table
1 on ABO blood types and cancer in Northern India, for the term project.
 WinBUGS and tutorials:
 Saed Sayad's notes on classifier evaluation:

Zach's example code
for building and pruning Classification and
Regression Trees (CART).

Article feller.pdf on tests of randomness.

Syllabus
Topics. This is a second course in applied statistics with
examples from biology and medicine. Topics include Bayes rule, Markov
chains, maximum likelihood estimation with MCMC, classical statistical
inference, ANOVA and MANOVA, multivariate visualization, multiple
regression, correlation, and classification.
Each student will be
required to perform and write a report on a
data analysis project.
Prerequisites. Math 3200, or Math 2200 and the permission of
the instructor.
Time. Classes meet Mondays, Wednesdays and Fridays, 3:00pm
to 3:50pm, in Steinberg Hall, room 105 (the Auditorium). Live and
recorded video of the lectures will also be available.
Text. The lectures will follow
Statistics Using R with Biological
Examples by Kim Seefeld and Ernst Linder, an e-text that you
may download freely. (Alternative
local link.)
If you desire a paper copy, you may have it
printed and bound at any copy shop from this PDF file.
Supplementary readings and software may be found in the "LINKS" column above.
Homework. You are encouraged to collaborate on homework,
although each student must turn in solutions individually. Please
complete your solutions on CrowdMark by 11pm on the due date.
For full credit, homework solutions should be clearly legible with the
answers properly labeled. For computations, include the R commands
used, the input provided, and the output with
labels indicating which part of the solution is thus computed.
Suggestion: copy and paste your R session into a text editing
program and delete unnecessary text and space, then comment and
annotate as needed. Hand in homework as you would like to
get it if you were the grader.
Problem sets
will be assigned as follows:
 HW #1, due Mon, Feb 1 (Solutions)
 HW #2, due Mon, Feb 8 (Solutions)
 HW #3, due Mon, Feb 15 (Solutions)
 HW #4, due Mon, Feb 22 (Solutions)
 HW #5, due Mon, Mar 1 (Solutions)
 HW #6, due Mon, Mar 15 (Solutions)
 HW #7, due Wed, Mar 24 (Solutions)




Solutions, via CrowdMark, are due at 11:00pm on the due date. Late homework
will not be accepted.
Tests. There will be one Midterm Examination due on CrowdMark on
Monday, March 8th, 2021.
There will be one cumulative take-home Final
Examination, emphasizing the later material. It is due on
Friday, May 7th, 2021, on CrowdMark.
Project. There will be one data analysis project due
by Monday, May 3rd, 2021, on CrowdMark. A one-page Project Outline is
due Friday, April 16th, 2021.
Late projects will not
be accepted. Projects may be
selected from this list, or chosen by the
student with the prior approval of the instructor.
Grading. One score will be assigned for homework, one for the
midterm examination, one for the final examination, and one for the project. These four
will contribute in respective shares of 40%, 20%, 20%, and 20% to the
course score. Letter grades, computed from the course score,
will be at least the following:
Course score at least:   90%   80%   70%   60%
Letter grade at least:    A     B     C     D
Students taking the Cr/NCr or P/F options will need a grade of D or better to
pass. Students taking the Audit option will need to attend 36 of the
40 class meetings to obtain a Successful Audit grade.
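
The grading arithmetic above can be sketched in R as follows. The function names and the example scores are hypothetical, and because the table gives only minimum guarantees, the letter returned is the lowest grade the syllabus promises for a given course score.

```r
# Weights in percent, in the order given in the syllabus:
# homework, midterm, final, project.
weights <- c(homework = 40, midterm = 20, final = 20, project = 20)

course_score <- function(scores) {
  # scores: four percentages in the same order as weights
  sum(weights * scores) / 100
}

# Lowest letter grade guaranteed by the table. The syllabus does not
# name a grade below the D cutoff, so that case is labeled descriptively.
guaranteed_letter <- function(score) {
  cutoffs <- c(60, 70, 80, 90)
  labels <- c("below D", "D", "C", "B", "A")
  labels[findInterval(score, cutoffs) + 1]
}

# Hypothetical student: 95% homework, 80% midterm, 85% final, 90% project.
s <- course_score(c(95, 80, 85, 90))
print(s)                       # 89
print(guaranteed_letter(s))    # "B"
```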
Computing. Students are encouraged to use R on their own computers or on
the computers available in the Arts and Sciences Computing
Center for both symbolic and numerical computations.
Office Hours. Mondays and Wednesdays 4:00pm-5:00pm (after
class), Fridays 10:00am-11:00pm, or by appointment. All office
hours will be held by remote videoconferencing.
Questions? Return to
M. Victor Wickerhauser's home page for contact information.