Title:
Computational Discrete Mathematics and Statistics for Molecular
Array Data
Abstract:
This talk will describe non-parametric computational and statistical
data-mining tools that we are developing for data reduction, specifically
gene (probe) selection, in data from high-throughput molecular
technologies. Reducing the number of genes to the subset most likely
associated with the phenotype of interest focuses the analysis and makes
it more likely to identify true signal genes.
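
As a rough illustration of what non-parametric gene selection can look
like, the sketch below ranks genes by a rank-based two-group score (the
Mann-Whitney U statistic scaled to [0, 1]) and keeps the top k. This is a
generic filter chosen for illustration, not the tools described in this
talk; the function names, the toy expression values, and the choice of k
are all hypothetical.

```python
def rank_sum_score(values, labels):
    """Mann-Whitney U statistic for group 1 vs. group 0, scaled to [0, 1].

    0.5 means no separation between the groups; values near 0 or 1 mean
    strong separation. Ties count as one half.
    """
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    u = 0.0
    for a in group1:
        for b in group0:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u / (len(group1) * len(group0))


def select_genes(expression, labels, k):
    """Return the indices of the k genes whose score is farthest from 0.5.

    expression: one list per gene, holding one value per sample.
    """
    scored = [
        (abs(rank_sum_score(row, labels) - 0.5), gene)
        for gene, row in enumerate(expression)
    ]
    scored.sort(reverse=True)
    return sorted(gene for _, gene in scored[:k])


# Hypothetical toy data: 4 genes measured on 6 samples (3 per phenotype).
labels = [0, 0, 0, 1, 1, 1]
expression = [
    [5.1, 4.8, 5.0, 5.2, 4.9, 5.0],  # gene 0: little separation
    [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],  # gene 1: higher in phenotype 1
    [7.0, 6.5, 7.2, 2.0, 2.4, 1.9],  # gene 2: lower in phenotype 1
    [2.0, 3.0, 2.5, 2.2, 2.9, 2.6],  # gene 3: little separation
]
selected = select_genes(expression, labels, k=2)
print(selected)
```

Because the score depends only on the ordering of the values, it makes no
assumption about the distribution of expression levels. A real analysis
would also need to account for multiple testing across thousands of genes.
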
A major stumbling block in analyzing high-throughput molecular technology
data is that there are many more genes (probes) than samples. Datasets
with a small number of samples (N) relative to the number of measurements
or variables (P) face the "large P, small N" problem, otherwise known as
the "curse of dimensionality" (Bellman, 1961). The "curse of
dimensionality" refers to the fact that fitting statistical models and
making predictions become very hard as
the number of variables (dimensions) increases. Technically, Bellman
showed that the mean integrated squared error increases faster than
linearly in the number of dimensions. In other words, the error of any
mathematical or statistical model grows very quickly as the
dimensionality of the data increases. The implication for high-throughput
molecular data is that, because of the large number of variables, we can
expect standard models fit to these data to be highly inaccurate.
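
One symptom of the large P, small N setting can be seen in a small
simulation. The sketch below draws a random response for N = 20 samples
and then a growing number of pure-noise variables, tracking the largest
absolute Pearson correlation found so far. It does not reproduce the mean
integrated squared error result cited above; it only illustrates that,
with few samples, some noise variables will look strongly associated with
the response by chance. The sample size, the checkpoints, and the function
names are assumptions made for illustration.

```python
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def max_spurious_correlation(n_samples, checkpoints, seed=0):
    """Running maximum of |r| between a random response and noise variables.

    Returns a dict mapping each checkpoint P to the largest |r| seen among
    the first P noise variables. No variable is truly related to y.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    results = {}
    for p in range(1, max(checkpoints) + 1):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(x, y)))
        if p in checkpoints:
            results[p] = best
    return results


results = max_spurious_correlation(n_samples=20, checkpoints=(10, 100, 1000, 5000))
for p, r in results.items():
    print(f"P = {p:5d}  max |r| = {r:.2f}")
```

The running maximum can only grow as more variables are screened, which is
why selecting genes by their apparent association with the phenotype,
without accounting for the number of genes examined, tends to pick up
noise.
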