Math 460: Multivariate Statistical Analysis
Spring 2016

Instructor: Todd Kuffner (kuffner@math.wustl.edu)

Lecture: 2:30-4:00pm, Tuesday and Thursday, Cupples I, Room 218

Office Hours: Monday 3:00-4:00pm, Tuesday/Thursday 1:05-2:00pm in Room 18, Cupples I

Course Overview: This course introduces multivariate statistical analysis. The material will be presented at a level suitable for advanced undergraduate and master's degree students. Topics include: review of some important concepts (likelihood, quadratic forms, random vectors and matrices, multiple regression and variable selection), an overview of classical multivariate statistics, multivariate regression, dimensionality reduction, discriminant analysis and classification. Additional topics will be selected from modern statistical learning methodology. Emphasis will be given to applications using R.

Prerequisite: It is assumed that students are already familiar with probability at the level of Math 493, and have taken a course in linear models, such as Math 439. Familiarity with R is essential. A course in computer programming would be helpful. Knowledge of multivariate calculus and matrix algebra at the level of Math 233 and Math 309, respectively, is assumed.

Piazza: Make sure to enroll in this course on Piazza.

Textbook: There is no required textbook for the course, but I do recommend using additional references. Some good open-access books:
Another good (but not free) reference is: Applied Multivariate Statistical Analysis (Sixth Edition) by Johnson and Wichern

Computing: Familiarity with R is required. You can find many tutorials by clicking here. On the left side under Documentation, select Contributed to see a list of tutorials. Paul Hewson has compiled a wonderful resource page for R packages relevant for multivariate statistical analysis: click here. Also see his textbook link above, which includes material on matrices.

List of Topics (tentative):
• Types of multivariate data, visualizing multivariate data
• Review of matrix algebra, multivariate normal, distributions of quadratic forms, likelihood methods
• Review of variable selection in multiple regression (stepwise, least-angle regression, lasso)
• Review of tools for model assessment and inference (bootstrap, cross-validation)
• Multivariate regression
• Dimensionality reduction (principal component analysis, canonical correlations)
• Discriminant analysis (binary classification and multiclass linear discriminant analysis)
• Classification trees and regression trees
• Neural networks
• Support vector machines
• Cluster analysis
• Committee machines (bagging, boosting and random forests)

Grades: 30% Homework, 35% Midterm, 35% Final

Exams: 1 midterm and 1 final.

Homework: The lowest homework grade will be dropped. Homework is due at the beginning of class on the specified due date.

Final Course Grade: The letter grades for the course will be determined according to the following numerical grades on a 0-100 scale.
A+ [98, 100]   B+ [87, 90)   C+ [77, 80)   D+ [67, 70)   F [0, 60)
A  [93, 98)    B  [83, 87)   C  [73, 77)   D  [63, 67)
A- [90, 93)    B- [80, 83)   C- [70, 73)   D- [60, 63)
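
As an unofficial illustration, the grading scheme above can be sketched in a few lines of Python. The weights and interval cutoffs are taken from this syllabus. The function names are hypothetical, and the sketch assumes that all homework assignments count equally after the lowest is dropped and that no rounding is applied, neither of which is stated here.

```python
# Unofficial sketch of the grading arithmetic; helper names are hypothetical.
# Weights: 30% homework, 35% midterm, 35% final (from the syllabus).
WEIGHTS = {"homework": 0.30, "midterm": 0.35, "final": 0.35}

# Lower bound of each letter-grade interval, highest first (from the syllabus).
CUTOFFS = [
    (98, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def homework_average(scores):
    """Average the homework scores after dropping the single lowest one.

    Assumption: every assignment carries equal weight.
    """
    if len(scores) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


def course_score(homework_scores, midterm, final):
    """Weighted course score on a 0-100 scale."""
    return (WEIGHTS["homework"] * homework_average(homework_scores)
            + WEIGHTS["midterm"] * midterm
            + WEIGHTS["final"] * final)


def letter_grade(score):
    """Map a numerical score in [0, 100] to a letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter
    return "F"  # anything below 60
```

For example, homework scores of 100, 80 and 90 with a midterm of 80 and a final of 90 give a homework average of 95 and a course score of about 88, which falls in the B+ interval.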

Course Schedule: This will be updated regularly. Future assignment due dates are tentative and subject to change.

Week 1, 01/18-01/22. Theme: Review
Types and visualizations of multivariate data; introduction to classical multivariate analysis; random vectors and multivariate normal; matrix decompositions; matrix norms; basics of numerical analysis: error sources (data, truncation, rounding); machine precision; ill-conditioning and condition numbers of matrices; examples in R

Week 2, 01/25-01/29. Theme: Random Matrices
Random matrices; sample covariance matrix; Wishart distribution; Hotelling's T-squared; maximum likelihood estimation; application to distribution of eigenvalues

Week 3, 02/01-02/05. Theme: Principal Components Analysis
Dimensionality reduction; biplots; scree plots; geometric interpretation; image compression; applications in R

Week 4, 02/08-02/12. Theme: Acquiring Multivariate Data and Canonical Correlation Analysis
Web scraping; applications to Twitter; sentiment analysis; R package twitteR
Canonical variate and canonical correlation analysis; examples in R

Week 5, 02/15-02/19. Theme: Linear Models Review
Example in R; matrix calculus; the hat matrix; review of vector spaces; geometric interpretation of least squares; decompositions of sums of squares (using orthogonal complements, and using projections); consistency of the normal equations; generalized inverses; projection matrices; Gauss-Markov theorem; properties of idempotent matrices; distributions of quadratic forms (Cochran's theorem); hypothesis testing and confidence intervals
Common problems: collinearity; transformations; omitted variables; non-constant variance; p>n

Week 6, 02/22-02/26. Theme: Introduction to High-Dimensional Statistics
Curse of dimensionality and failure of local averaging; geometry of high-dimensional spaces; vanishing volumes of high-dimensional balls (and crust concentration); false positive control in linear regression; poor properties of empirical covariance matrix; computational complexity; inadequacy of classical asymptotics
Gaussian concentration inequality; Lipschitz functions; flattening of multivariate normal density in high dimensions

Week 7, 02/29-03/04. Theme: Model Selection in High-Dimensional Linear Regression
Sparsity; Akaike Information Criterion; optimality and decision theory; oracle risk bounds; minimax risk bounds

Week 8, 03/07-03/11. Theme: Variable Selection
Convex optimization; Karush-Kuhn-Tucker conditions; Lagrangian duality; subgradients and gradient descent; examples of estimators and convex programs (lasso, elastic net)
Algorithms; gradient descent; least angle regression; SCAD and nonconvex programs; examples in R
R packages: lars, glmnet, flare

Week 9, 03/14-03/18. Spring Break

Week 10, 03/21-03/25. Theme: Practical Issues
Tuning parameters; cross-validation; nonparametric bootstrap; bootstrap confidence intervals; more examples (Dantzig selector, square root lasso); dimension reduction for regression

Week 11, 03/28-04/01. Theme: Post-Selection Inference and Multiple Testing
Selective inference, simultaneous inference; covariance test, spacing test; stability selection; polyhedral lemma; review of multiple testing; FDR, FWER, FCR; Benjamini-Hochberg procedure; sequential testing; ForwardStop
R package: selectiveInference

Week 12, 04/04-04/08. Theme: Post-Selection Inference
High-dimensional inference; multi sample splitting; de-sparsified lasso; ridge projection
R packages: hdi, PoSI

Week 13, 04/11-04/15. Theme: Multivariate Regression and Classification
Concepts in multivariate regression; testing; linear discriminant analysis; support vector machines

Week 14, 04/18-04/22. Theme: Predictive Modeling
Classification and regression trees; bagging; boosting; AdaBoost

Week 15, 04/25-04/29. Theme: Predictive Modeling
Artificial neural networks; problems in statistical inference for predictive models

Reading Period, 05/02-05/04

Other Course Policies: Students are encouraged to look at the Faculty of Arts & Sciences policies.