Math 460: Multivariate Statistical Analysis

Spring 2016

Instructor: Todd Kuffner (kuffner@math.wustl.edu)

Lecture: 2:30-4:00pm, Tuesday and Thursday, Cupples I, Room 218

Office Hours: Monday 3:00-4:00pm, Tuesday/Thursday 1:05-2:00pm in Room 18, Cupples I

Course Overview: This course introduces multivariate statistical analysis. The material will be presented at a level suitable for advanced undergraduate and master's degree students. Topics include: review of some important concepts (likelihood, quadratic forms, random vectors and matrices, multiple regression and variable selection), an overview of classical multivariate statistics, multivariate regression, dimensionality reduction, discriminant analysis and classification. Additional topics will be selected from modern statistical learning methodology. Emphasis will be given to applications using R.

Prerequisite: It is assumed that students are already familiar with probability at the level of Math 493, and have taken a course in linear models, such as Math 439. Familiarity with R is essential. A course in computer programming would be helpful. Knowledge of multivariate calculus and matrix algebra at the level of Math 233 and Math 309, respectively, is assumed.

Piazza: Make sure to enroll in this course on Piazza.

Textbook: There is no required textbook for the course, but I do recommend using additional references. Some good open-access books:

- Multivariate Statistics with R by Paul Hewson; a link to the free E-book is on his website

- The Elements of Statistical Learning (Second Edition) by Hastie, Tibshirani and Friedman; available from http://statweb.stanford.edu/~tibs/ElemStatLearn/
- Multivariate Statistics: Old School by John Marden; available from http://stat.istics.net/Multivariate/

Computing: Familiarity with R is required. You can find many tutorials by clicking here. On the left side under Documentation, select Contributed to see a list of tutorials. Paul Hewson has compiled a wonderful resource page for R packages relevant for multivariate statistical analysis: click here. Also see his textbook link above, which includes material on matrices.

List of Topics (tentative):

- Types of multivariate data, visualizing multivariate data

- Review of matrix algebra, multivariate normal, distributions of quadratic forms, likelihood methods

- Review of variable selection in multiple regression (stepwise, least-angle regression, lasso)
- Review of tools for model assessment and inference (bootstrap, cross-validation)
- Multivariate regression
- Dimensionality reduction (principal component analysis, canonical correlations)
- Discriminant analysis (binary classification and multiclass linear discriminant analysis)
- Classification trees and regression trees
- Neural networks

- Support vector machines
- Cluster analysis
- Committee machines (bagging, boosting and random forests)

Exams: 1 midterm and 1 final.

Homework: The lowest homework grade will be dropped. Homework is due at the beginning of class on the specified due date.

Final Course Grade: The letter grades for the course will be determined according to the following numerical grades on a 0-100 scale.

Grade | Range
A+ | [98, 100]
A | [93, 98)
A- | [90, 93)
B+ | [87, 90)
B | [83, 87)
B- | [80, 83)
C+ | [77, 80)
C | [73, 77)
C- | [70, 73)
D+ | [67, 70)
D | [63, 67)
D- | [60, 63)
F | [0, 60)
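As an illustration only, the grade boundaries listed above can be encoded as a small lookup. This is a minimal sketch in Python (chosen purely for illustration; the course itself uses R). The names `GRADE_FLOORS` and `letter_grade` are hypothetical, and the sketch assumes that no rounding is applied to the numerical grade before the lookup, which the syllabus does not specify.

```python
# Lower bound of each letter-grade interval, taken from the scale above.
# Every interval is closed on the left and open on the right, except A+,
# which also includes 100. Entries are ordered from highest to lowest.
GRADE_FLOORS = [
    (98, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
    (0, "F"),
]


def letter_grade(score: float) -> str:
    """Return the letter grade for a numerical grade on the 0-100 scale.

    Assumption: the score is used as-is, with no rounding.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    # The first floor that the score reaches determines the letter.
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    raise AssertionError("unreachable: 0 is the lowest floor")
```

For example, `letter_grade(89.99)` falls in [87, 90) and so maps to B+, while `letter_grade(90)` maps to A-.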

Course Schedule: This will be updated regularly. Future assignment due dates are tentative and subject to change.

Week 1 01/18-01/22 | Theme: Review. Types and visualizations of multivariate data; introduction to classical multivariate analysis; random vectors and multivariate normal; matrix decompositions; matrix norms; basics of numerical analysis: error sources (data, truncation, rounding); machine precision; ill-conditioning and condition numbers of matrices; examples in R.

Week 2 01/25-01/29 | Theme: Random Matrices. Random matrices; sample covariance matrix; Wishart distribution; Hotelling's T-squared; maximum likelihood estimation; application to distribution of eigenvalues.

Week 3 02/01-02/05 | Theme: Principal Components Analysis. Dimensionality reduction; biplots; scree plots; geometric interpretation; image compression; applications in R.

Week 4 02/08-02/12 | Theme: Acquiring Multivariate Data and Canonical Correlation Analysis. Web scraping; applications to Twitter; sentiment analysis; R package twitteR. Canonical variate and canonical correlation analysis; examples in R.

Week 5 02/15-02/19 | Theme: Linear Models Review. Example in R; matrix calculus; the hat matrix; review of vector spaces; geometric interpretation of least squares; decompositions of sums of squares (using orthogonal complements, and using projections); consistency of the normal equations; generalized inverses; projection matrices; Gauss-Markov theorem; properties of idempotent matrices; distributions of quadratic forms (Cochran's theorem); hypothesis testing and confidence intervals. Common problems: collinearity; transformations; omitted variables; non-constant variance; p>n.

Week 6 02/22-02/26 | Theme: Introduction to High-Dimensional Statistics. Curse of dimensionality and failure of local averaging; geometry of high-dimensional spaces; vanishing volumes of high-dimensional balls (and crust concentration); false positive control in linear regression; poor properties of empirical covariance matrix; computational complexity; inadequacy of classical asymptotics. Gaussian concentration inequality; Lipschitz functions; flattening of multivariate normal density in high dimensions.

Week 7 02/29-03/04 | Theme: Model Selection in High-Dimensional Linear Regression. Sparsity; Akaike Information Criterion; optimality and decision theory; oracle risk bounds; minimax risk bounds.

Week 8 03/07-03/11 | Theme: Variable Selection. Convex optimization; Karush-Kuhn-Tucker conditions; Lagrangian duality; subgradients and gradient descent; examples of estimators and convex programs (lasso, elastic net). Algorithms; gradient descent; least angle regression; SCAD and nonconvex programs; examples in R. R packages: lars, glmnet, flare.

Week 9 03/14-03/18 | Spring Break

Week 10 03/21-03/25 | Theme: Practical Issues. Tuning parameters; cross-validation; nonparametric bootstrap; bootstrap confidence intervals; more examples (Dantzig selector, square root lasso); dimension reduction for regression.

Week 11 03/28-04/01 | Theme: Post-Selection Inference and Multiple Testing. Selective inference, simultaneous inference; covariance test, spacing test; stability selection; polyhedral lemma; review of multiple testing; FDR, FWER, FCR; Benjamini-Hochberg procedure; sequential testing; ForwardStop. R package: selectiveInference.

Week 12 04/04-04/08 | Theme: Post-Selection Inference. High-dimensional inference; multi sample splitting; de-sparsified lasso; ridge projection. R packages: hdi, PoSI.

Week 13 04/11-04/15 | Theme: Multivariate Regression and Classification. Concepts in multivariate regression; testing; linear discriminant analysis; support vector machines.

Week 14 04/18-04/22 | Theme: Predictive Modeling. Classification and regression trees; bagging; boosting; AdaBoost.

Week 15 04/25-04/29 | Theme: Predictive Modeling. Artificial neural networks; problems in statistical inference for predictive models.

Reading Period 05/02-05/04

Other Course Policies: Students are encouraged to look at the Faculty of Arts & Sciences policies.

- Academic integrity: students are expected to adhere to the University's policy on academic integrity.

- Auditing: There is an option to audit, but this still involves enrolling in the course. See the Faculty of Arts & Sciences policy on auditing. Auditing students will still be expected to attend all lectures and complete all required coursework and exams.

- Collaboration: students are encouraged to discuss homework with one another, but each student must submit separate solutions, and these must be the original work of the student. This also applies to any R code.
- Exam conflicts: Read the University policy.

- Late homework: only by prior arrangement.

- Missed exams: there are no make-up exams. For valid excused absences from midterm exams (such as medical, family, transportation and weather-related emergencies), the contribution of that midterm to the final course grade will be redistributed equally to the other midterm exam and final exam. Students missing both midterm exams and/or the final exam cannot earn a passing grade for the course.