Prof. William Shannon
Division of Biostatistics, School of Medicine
Washington University in St. Louis

Title:
Computational Discrete Mathematics and Statistics for Molecular Array Data

Abstract:
     A major stumbling block in analyzing high-throughput molecular technology data is that there are many more genes (probes) than samples. Datasets with a small number of samples (N) relative to the number of measurements or variables (P) face the large-P, small-N problem, otherwise known as the "curse of dimensionality" (Bellman, 1961). The "curse of dimensionality" refers to the fact that fitting statistical models and making predictions becomes very difficult as the number of variables (dimensions) increases. Technically, Bellman showed that the mean integrated squared error grows faster than linearly in the number of dimensions; in other words, the inaccuracy, or error, of any mathematical or statistical model grows rapidly as the dimensionality of the data increases. The implication for high-throughput molecular data is that, because of the large number of variables, we can expect any standard model fit to these data to be highly inaccurate.
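
As a concrete illustration (not from the talk itself), the following Python sketch shows one standard symptom of the curse: as the dimensionality P grows, pairwise distances between random points concentrate, so the contrast between the nearest and farthest neighbor collapses and neighborhood-based models lose their footing. The sample size and the dimensions tried are arbitrary choices for the demonstration.

    import numpy as np

    rng = np.random.default_rng(0)

    # As dimensionality P grows, pairwise distances concentrate: the
    # relative gap between the nearest and farthest neighbor shrinks,
    # which is one face of the curse of dimensionality.
    n_samples = 50
    for p in (2, 10, 100, 1000):
        x = rng.standard_normal((n_samples, p))
        # Euclidean distances from the first point to all the others.
        d = np.linalg.norm(x[1:] - x[0], axis=1)
        contrast = (d.max() - d.min()) / d.min()
        print(f"P={p:5d}  relative distance contrast: {contrast:.3f}")

Running this shows the contrast falling toward zero as P increases, which is why standard distance- or density-based methods degrade so quickly on data with thousands of probes and only a handful of samples.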

This talk will describe non-parametric computational and statistical data mining reduction tools we are developing for gene (probe) selection from high throughput molecular technologies. By reducing the number of genes to a subset most likely associated with the phenotype of interest, the analysis can be more focused and therefore more likely to identify true signal genes.
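
The abstract does not spell out the specific tools, so the following is only a hypothetical sketch of the general filtering idea it describes: rank every gene by a nonparametric two-sample test (here, Wilcoxon rank-sum via scipy.stats.mannwhitneyu) and keep the top-scoring subset for downstream modeling. The matrix sizes, group labels, and cutoff are illustrative, not the speaker's method.

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(1)

    # Toy expression matrix: P genes (rows) x N samples (columns),
    # two phenotype groups, with the first 25 genes carrying signal.
    P, N = 5000, 20
    labels = np.array([0] * 10 + [1] * 10)
    expr = rng.standard_normal((P, N))
    expr[:25, labels == 1] += 2.0  # simulated differential expression

    # Score each gene with a nonparametric two-sample test and keep
    # the top_k genes with the smallest p-values as the candidate set.
    pvals = np.array([
        mannwhitneyu(expr[g, labels == 0], expr[g, labels == 1]).pvalue
        for g in range(P)
    ])
    top_k = 25
    selected = np.argsort(pvals)[:top_k]
    print("selected gene indices:", np.sort(selected))

A rank-based test is used here because, like the non-parametric tools the abstract mentions, it makes no distributional assumptions about the expression values, which matters when N is far too small to check such assumptions.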

Talk Slides