Titles and Abstracts

Titles and abstracts for the Third Workshop on Higher-Order Asymptotics and Post-Selection Inference (WHOA-PSI). See the main conference page for more information. Contact: Todd Kuffner, email: kuffner@wustl.edu

Talks 

Karim Abadir, Imperial College London / American University in Cairo
Title: Link of moments before and after transformations, with an application to resampling from fat-tailed distributions
Abstract: Let x be a transformation of y, whose distribution is unknown. We derive an expansion formulating the expectations of x in terms of the expectations of y. Apart from the intrinsic interest in such a fundamental relation, our results can be applied to calculating E(x) from the low-order moments of a transformation that can be chosen to give a good approximation for E(x). To do so, we generalize the approach of bounding the terms in expansions of characteristic functions, and use our result to derive an explicit and accurate bound on the remainder when a finite number of terms are taken. We illustrate one of the implications of our method by providing accurate naive bootstrap confidence intervals for the mean of a fat-tailed distribution with an infinite variance, in which case currently-available bootstrap methods are asymptotically invalid and unreliable in finite samples.
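The "naive bootstrap" interval mentioned above is the percentile bootstrap applied to the sample mean; a minimal sketch (illustrative code only, not the authors' corrected procedure; function names are our own):

```python
import numpy as np

def percentile_bootstrap_ci(x, level=0.95, B=2000, seed=0):
    """Naive percentile-bootstrap confidence interval for the mean:
    resample with replacement, recompute the mean, take quantiles."""
    rng = np.random.default_rng(seed)
    boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                           for _ in range(B)])
    lo, hi = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# Heavy-tailed example: Student's t with 1.5 degrees of freedom has
# infinite variance, the regime where the naive interval misbehaves.
x = np.random.default_rng(1).standard_t(1.5, size=500)
lo, hi = percentile_bootstrap_ci(x)
```

The talk's point is precisely that this interval is asymptotically invalid under infinite variance; the proposed moment-expansion approach is what repairs it.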

Genevera Allen, Rice University
Title: Inference, Computation, and Visualization for Convex Clustering and Biclustering
Abstract: Hierarchical clustering enjoys wide popularity because of its fast computation, ease of interpretation, and appealing visualizations via the dendrogram and cluster heatmap. Recently, several authors have proposed and studied convex clustering and biclustering which, similar in spirit to hierarchical clustering, achieve cluster merges via convex fusion penalties. While these techniques enjoy superior statistical performance, they suffer from slower computation and are not generally conducive to representation as a dendrogram. In the first part of the talk, we present new convex (bi)clustering methods and fast algorithms that inherit all of the advantages of hierarchical clustering. Specifically, we develop a new fast approximation and variation of the convex (bi)clustering solution path that can be represented as a dendrogram or cluster heatmap. Also, as one tuning parameter indexes the sequence of convex (bi)clustering solutions, we can use these to develop interactive and dynamic visualization strategies that allow one to watch data form groups as the tuning parameter varies. In the second part of this talk, we consider how to conduct inference for convex clustering solutions that addresses questions like: Are there clusters in my data set? Or, should two clusters be merged into one? To achieve this, we develop a new data decomposition in terms of Hotelling's T^2-test that allows us to use the selective inference paradigm to test multivariate hypotheses for the first time. We can use this approach to test hypotheses and calculate confidence ellipsoids on the cluster means resulting from convex clustering. We apply these techniques to examples from text mining and cancer genomics. This is joint work with John Nagorski, Michael Weylandt, and Frederick Campbell.

Rina Foygel Barber, University of Chicago
Title: Robust inference with the knockoff filter
Abstract: In this talk, I will present ongoing work on the knockoff filter for inference in regression. In a high-dimensional model selection problem, we would like to select relevant features without too many false positives. The knockoff filter provides a tool for model selection by creating a knockoff copy of each feature; these copies act as controls, calibrating the selection algorithm's ability to distinguish true from false covariates and thereby controlling the false positives. In practice, the modeling assumptions that underlie the construction of the knockoffs may be violated, as we cannot know the exact dependence structure between the various features. Our ongoing work aims to determine and improve the robustness properties of the knockoff framework in this setting. We find that when knockoff features are constructed using estimated feature distributions whose errors are small in a KL divergence type measure, the knockoff filter provably controls the false discovery rate at only a slightly higher level. This work is joint with Emmanuel Candes and Richard Samworth.
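For readers unfamiliar with the filter itself, its data-dependent selection threshold can be sketched as follows (an illustrative simplification of the knockoff+ rule of Barber and Candes, 2015; the per-feature statistics W and the construction of the knockoffs are taken as given):

```python
import numpy as np

def knockoff_threshold(W, q=0.1):
    """Knockoff+ selection threshold for feature statistics W.

    A large positive W_j suggests feature j is a true signal; a large
    negative W_j suggests its knockoff copy "won", so j is likely null.
    The ratio below estimates the false discovery proportion at level t.
    """
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold achieves the target: select nothing

# Toy example: signals tend to have large positive W, nulls are symmetric.
rng = np.random.default_rng(0)
W = np.concatenate([rng.normal(3, 1, 20), rng.normal(0, 1, 80)])
tau = knockoff_threshold(W, q=0.2)
selected = np.where(W >= tau)[0]
```

The robustness question in the abstract concerns what happens to this FDR guarantee when the knockoffs behind W are built from an estimated, rather than exact, feature distribution.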

Heather Battey, Imperial College London
Title: Large numbers of explanatory variables
Abstract: The lasso and its variants are powerful methods for regression analysis when there are a small number of study individuals and a large number of potential explanatory variables. The result is a single model, although several models may be equally compatible with the data. I will outline a different approach, whose aim is essentially a confidence set of effective simple representations. A probabilistic assessment of the method is given and post-selection inference is discussed in connection with the resulting `confidence set' of models.

Pierre Bellec, Rutgers University
Title: Model selection, model averaging?
Abstract: TBD

Jelena Bradic, UC San Diego
Title: Semi-supervised high-dimensional learning: in search of optimal inference
Abstract: to follow

Andreas Buja, University of Pennsylvania
Title: PoSI under Misspecification in High Dimensions and Construction of PoSI Statistics
Abstract: Berk et al. (2013) provided valid post-selection inference under a classical Gaussian linear model. In this talk, I will first present some recent advances for PoSI under misspecification as well as a diverging number of covariates. After this discussion, we present some deficiencies of the ``max-|t|'' PoSI statistic and provide some remedies. From this, three different PoSI confidence regions arise, which will be compared. Joint work with Lawrence D. Brown, Arun K. Kuchibhotla, Ed George, Linda Zhao, Junhui Cai.

Emmanuel Candes, Stanford University
Title: What do we really know about logistic regression? A modern maximum-likelihood theory
Abstract: Logistic regression is the most popular model in statistics and machine learning to fit binary outcomes and assess the statistical significance of explanatory variables. Alongside, there is a classical theory of maximum likelihood (ML) estimation, which is used by all statistical software packages to produce inference. In the common modern setting where the number of explanatory variables is not negligible compared to the sample size, we show that this theory leads to inferential conclusions that cannot be trusted. We develop a new theory that provides expressions for the bias and variance of the ML estimate and characterizes the asymptotic distribution of the likelihood-ratio statistic under some assumptions regarding the distribution of the explanatory variables. This novel theory can be used to provide valid inference. If time allows, we will also explain how our theory can deal with regularized logistic regression such as the logistic ridge or the logistic LASSO.

Hongyuan Cao, Florida State University
Title: Statistical Methods for Integrative Analysis of Multi-Omics Data
Abstract: Genome-wide complex trait analysis (GCTA) was developed and applied to heritability analyses of complex traits and more recently extended to mental disorders. However, besides the intensive computation, previous literature also limits the scope to a univariate phenotype, ignoring mutually informative but partially independent pieces of information provided by other phenotypes. Our goal is to use such auxiliary information to improve power. We show that the proposed method leads to a large power increase while controlling the false discovery rate, both empirically and theoretically. Extensive simulations demonstrate the advantage of the proposed method over several state-of-the-art methods. We illustrate our methods on a dataset from a schizophrenia study.

Yunjin Choi, National University of Singapore
Title: Community detection via fused penalty
Abstract: In recent years, community detection has been an active research area in various fields including machine learning and statistics. While a plethora of works has been published over the past few years, most existing methods depend on a predetermined number of communities. Consequently, determining the proper number of communities is directly related to the performance of these methods. Currently, there is no golden rule for choosing the ideal number, and practitioners usually rely on background knowledge of the domain to make their choices. To address this issue, we propose a community detection method that also finds the number of underlying communities. Central to our method is a fused $\ell_1$ penalty applied to a graph induced from the given data. This method yields hierarchically structured communities. At each level, we use a hypothesis test based on the post-selection inference framework to investigate whether the detected community at that level correctly captures the true population-level community.

Will Fithian, UC Berkeley
Title: AdaPT: An interactive procedure for multiple testing with side information
Abstract: We consider the problem of multiple hypothesis testing with generic side information: for each hypothesis we observe both a p-value and some predictor encoding contextual information about the hypothesis. For large-scale problems, adaptively focusing power on the more promising hypotheses (those more likely to yield discoveries) can lead to much more powerful multiple testing procedures. We propose a general iterative framework for this problem, called the Adaptive p-value Thresholding (AdaPT) procedure, which adaptively estimates a Bayes-optimal p-value rejection threshold and controls the false discovery rate (FDR) in finite samples. At each iteration of the procedure, the analyst proposes a rejection threshold and observes partially censored p-values, estimates the false discovery proportion (FDP) below the threshold, and either stops to reject or proposes another threshold, until the estimated FDP is below $\alpha$. Our procedure is adaptive in an unusually strong sense, permitting the analyst to use any statistical or machine learning method she chooses to estimate the optimal threshold, and to switch between different models at each iteration as information accrues.
This is joint work with Lihua Lei.
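Stripped of covariates and model updating, the iteration described above reduces to shrinking a single threshold until the FDP estimate falls below the target level; a minimal sketch (our simplification, not the full AdaPT procedure, which updates covariate-dependent thresholds):

```python
import numpy as np

def adapt_constant_threshold(p, alpha=0.05):
    """Skeleton of the AdaPT loop with one constant threshold s and no
    covariate model: estimate the FDP below s using the "mirror"
    p-values above 1 - s, and shrink s until the estimate is <= alpha.
    """
    # Candidate thresholds: observed p-values below 1/2, largest first.
    for s in np.sort(p[p < 0.5])[::-1]:
        rejections = np.sum(p <= s)
        mirror = np.sum(p >= 1 - s)  # proxy count of false discoveries
        fdp_hat = (1 + mirror) / max(1, rejections)
        if fdp_hat <= alpha:
            return s, np.where(p <= s)[0]
    return 0.0, np.array([], dtype=int)
```

The masking of p-values near 0 and 1 is what lets the analyst fit arbitrary models at each step without breaking the finite-sample FDR guarantee.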


Jan Hannig, UNC Chapel Hill
Title: Model Selection without penalty using Generalized Fiducial Inference
Abstract: R. A. Fisher, the father of modern statistics, developed the idea of fiducial inference during the first half of the 20th century.  While his proposal led to interesting methods for quantifying uncertainty, other prominent statisticians of the time did not accept Fisher's approach as it became apparent that some of Fisher's bold claims about the properties of fiducial distribution did not hold up for multi-parameter problems.  Beginning around the year 2000, the authors and collaborators started to re-investigate the idea of fiducial inference and discovered that Fisher's approach, when properly generalized, would open doors to solve many important and difficult inference problems.  They termed their generalization of Fisher's idea as generalized fiducial inference (GFI). The main idea of GFI is to carefully transfer randomness from the data to the parameter space using an inverse of a data generating equation without the use of Bayes theorem. The resulting generalized fiducial distribution (GFD) can then be used for inference. After more than a decade of investigations, the authors and collaborators have developed a unifying theory for GFI, and provided GFI solutions to many challenging practical problems in different fields of science and industry.  Overall, they have demonstrated that GFI is a valid, useful, and promising approach for conducting statistical inference.  
    Standard penalized methods of variable selection and parameter estimation rely on the magnitude of coefficient estimates to decide which variables to include in the final model.  However, coefficient estimates are unreliable when the design matrix is collinear.  To overcome this challenge, an entirely new perspective on variable selection is presented within a generalized fiducial inference framework.  This new procedure is able to effectively account for linear dependencies among subsets of covariates in a high-dimensional setting where $p$ can grow almost exponentially in $n$, as well as in the classical setting where $p \le n$.  It is shown that the procedure very naturally assigns small probabilities to subsets of covariates which include redundancies by way of explicit $L_{0}$ minimization.  Furthermore, with a typical sparsity assumption, it is shown that the proposed method is consistent in the sense that the probability assigned to the true sparse subset of covariates converges in probability to 1 as $n \to \infty$, or as $n \to \infty$ and $p \to \infty$.  Very reasonable conditions are needed, and little restriction is placed on the class of possible subsets of covariates to achieve this consistency result.

(Joint work with Jonathan Williams)


Lucas Janson, Harvard University
Title: Should We Model X in High-Dimensional Inference?
Abstract: For answering questions about the relationship between a response variable Y and a set of explanatory variables X, most statistical methods focus their assumptions on the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the joint distribution of X, especially when X is high-dimensional. First, assuming a model for X can often more closely match available domain knowledge, and allows for model checking and robustness that is unavailable when modeling Y | X. Second, there are substantial methodological payoffs in terms of interpretability, flexibility of models, and adaptability of algorithms for quantifying a hypothesized effect, all while being guaranteed exact (non-asymptotic) inference. I will briefly mention some of my recent and ongoing work on methods for high-dimensional inference that model X instead of Y | X, as well as some challenges and interesting directions for the future in this area.

Jessie Jeng, NC State University
Title: Efficient Signal Inclusion in Large-Scale Data Analysis
Abstract: This work addresses the challenge of efficiently capturing a high proportion of true signals for subsequent data analyses when signals are detectable but not identifiable. We develop a new analytic framework focusing on false negative control under dependence. We propose the signal missing rate as a new measure to account for the variability of false negative proportion. Novel data-adaptive procedures are developed to control signal missing rate without incurring unnecessary false positives under dependence. The proposed methods are applied to GWAS on human heights to effectively remove irrelevant SNPs while retaining a high proportion of relevant SNPs for subsequent polygenic analysis.

Tracy Ke, Harvard University
Title: Covariate assisted variable ranking
Abstract: TBD

Eric Laber, NC State University
Title: Sample Size Calculations for SMARTs
Abstract: Sequential Multiple Assignment Randomized Trials (SMARTs) are considered the gold standard for estimation and evaluation of treatment regimes. SMARTs are typically sized to ensure sufficient power for a simple comparison, e.g., the comparison of two fixed and non-overlapping treatment sequences.  Estimation of an optimal treatment regime is conducted as part of a secondary and hypothesis-generating analysis, with formal evaluation of the estimated optimal regime deferred to a follow-up trial. However, running a follow-up trial to evaluate an estimated optimal treatment regime is costly and time-consuming; furthermore, the estimated optimal regime that is to be evaluated in such a follow-up trial may be far from optimal if the original trial was underpowered for estimation of an optimal regime.  We derive sample size procedures for a SMART that ensure: (i) sufficient power for comparing the optimal treatment regime with standard of care; and (ii) that the estimated optimal regime is within a given tolerance of the true optimal regime with high probability. We establish asymptotic validity of the proposed procedures and demonstrate their finite sample performance in a series of simulation experiments.

Soumendra Lahiri, NC State University
Title: On limit horizons in high dimensional inference
Abstract: We consider a common situation arising in many high dimensional statistical inference problems where the dimension $d$ diverges with the sample size $n$ and the statistic of interest is given by a function of component-wise summary statistics. The limit distribution of the statistic of interest is often influenced by an intricate interplay of underlying dependence structure of the component-wise summary statistics. Here, we introduce a new concept, called limit horizon (L.H.) that gives the boundary of the growth rate of $d$ as a function of $n$ where the natural approach to deriving the limit law by iterated limits works. Further, for $d$ growing at a faster rate beyond the L.H., the natural approach breaks down. We investigate the L.H. in some specific high dimensional problems.

Liza Levina, University of Michigan
Title: Matrix completion in network analysis
Abstract: Matrix completion is an active area of research in itself, and a natural tool to apply to network data, since many real networks are observed incompletely and/or with noise.  However, developing effective matrix completion algorithms for networks requires taking into account network- and task-specific missing data patterns.  This talk will discuss three examples of matrix completion used for network tasks.   First, we discuss the use of matrix completion for cross-validation on networks, a long-standing problem in network analysis. Two other examples focus on reconstructing incompletely observed networks, with structured missingness resulting from network sampling mechanisms.   One scenario we consider is egocentric sampling, where a set of nodes is selected first and then their connections to the entire network are observed.   Another scenario focuses on data from surveys, where people are asked to name a given number of friends.    We show that matrix completion can generally be very helpful in solving network problems, as long as the network structure is taken into account. 

This talk is based on joint work with Tianxi Li, Yun-Jhong Wu, and Ji Zhu.


Joshua Loftus, New York University
Title: Model selection bias invalidates goodness of fit tests
Abstract: We study goodness of fit tests in a variety of model selection settings and find that selection bias generally makes such tests conservative. Since selection methods choose the "best" model, a goodness of fit test will usually fail to reject, even when the incorrect model has been chosen. This is troubling, as it implies these tests in practice do not actually provide evidence in favor of the chosen model. We also explore post selection inference methods for adjusting goodness of fit tests analytically for simple examples and with simulations in more realistic settings.

Po-Ling Loh, University of Wisconsin
Title: Scale calibration for high-dimensional robust regression
Abstract: We present a new method for high-dimensional linear regression when a scale parameter of the error is unknown. The proposed estimator is based on a penalized Huber M-estimator, for which theoretical results on estimation error have recently been established in the high-dimensional statistics literature. However, the variance of the error term in the linear model is intricately connected to the parameter governing the shape of the Huber loss. The main idea is to use an adaptive technique, based on Lepski's method, to overcome the difficulties in solving a joint nonconvex optimization problem with respect to the location and scale parameters.

Taps Maiti, Michigan State University
Title: High Dimensional Discriminant Analysis for Spatially Dependent Data
Abstract: Linear discriminant analysis (LDA) is one of the most classical and popular classification techniques. However, it performs poorly in high-dimensional classification. Many sparse discriminant methods have been proposed to make LDA applicable in the high-dimensional case. One issue with those methods is that the covariance structure among features is ignored. We propose a new procedure for high-dimensional discriminant analysis for spatially correlated data. Penalized maximum likelihood estimation (PMLE) is developed for feature selection and parameter estimation. A tapering technique is applied to reduce the computational load. The theory shows that the proposed method can achieve consistent parameter estimation, feature selection, and an asymptotically optimal misclassification rate. An extensive simulation study shows a significant improvement in classification performance under spatial dependence.


Xiao-Li Meng, Harvard University
Title: Was there ever a pre-selection inference?
Abstract: This talk is dedicated to the memory of Larry Brown. Post-selection inference has become a buzz word in statistics, which seems to imply that there was an era of pre-selection inference. But statistical inference has always been post-selection, even in the narrow sense of model selection. Any goodness-of-fit test, for example, restricts our model class by empirical data, and hence it alters the relevant replications for generating our inferential statements. In general practice, we have at least seven S(ins) to worry about: selection in hypotheses; selection in data; selection in methodologies; selection in due diligence and debugging; selection in publications; selection in reporting and summary; and selection in understanding and interpretation. Any such selection, if not accounted for, threatens the reproducibility and replicability of the inferential findings. Yet none of them can be reasonably quantified for the purposes of making post-selection adjustments. One way to combat this seemingly hopeless problem is to adopt the "expiration date" mentality. The expiration date of a medication has to be set as a lower bound on the duration of efficacy, not some average duration, in order to guarantee the quality of the treatment. Hence using bounds is not so much about being conservative as about ensuring our procedures deliver what they promise, e.g., verifiably realizing their claimed confidence coverage, as in Berk, Brown, Buja, Zhang, and Zhao (2013, Annals of Statistics). The simple strategy of doubling the variance will be used to illustrate this emphasis on quality assurance, in the context of guarding against model misspecification when constructing confidence intervals, as well as uncongeniality in multiple imputation (Xie and Meng, 2017, Statistica Sinica).
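For a normal-theory interval, doubling the variance amounts to inflating the reported standard error by a factor of sqrt(2); a toy sketch (illustrative only, not the talk's formal treatment):

```python
import math

def doubled_variance_ci(estimate, se, z=1.96):
    """Return the nominal interval and the 'doubled variance' interval:
    doubling the variance multiplies the standard error by sqrt(2),
    widening the interval as a guard against misspecification."""
    nominal = (estimate - z * se, estimate + z * se)
    inflated_se = se * math.sqrt(2)
    doubled = (estimate - z * inflated_se, estimate + z * inflated_se)
    return nominal, doubled
```

The doubled interval is roughly 41% wider, trading efficiency for coverage that holds up under the unquantifiable selections listed above.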

Aaditya Ramdas, Carnegie Mellon University
Title: Towards "simultaneous selective inference" : a new framework for multiple testing
Abstract: Modern data science is often exploratory in nature, with hundreds or thousands of hypotheses being regularly tested on scientific datasets. The false discovery rate (FDR) has emerged as a dominant error metric in multiple hypothesis testing over the last two decades. I will argue that both (a) the FDR error metric, as well as (b) the current framework of multiple testing, where the scientist picks an arbitrary target error level (like 0.05) and the algorithm returns a set of rejected null hypotheses, may be rather inappropriate for exploratory data analysis.
   I will show that, luckily, most existing FDR algorithms (BH, STAR, LORD, AdaPT, Knockoffs, and several others) naturally satisfy a more uniform notion of error, yielding simultaneous confidence bands for the false discovery proportion through the entire path of the algorithm. This makes it possible to flip the traditional roles of the algorithm and the scientist, allowing the scientist to make post-hoc decisions after seeing the realization of an algorithm on the data. For example, the scientist can instead achieve an error guarantee for all target error levels simultaneously (and hence for any data-dependent error level). Remarkably, there is a relatively small price for this added flexibility, the analogous guarantees being less than a factor of 2 looser than if the error level was prespecified. The theoretical basis for this advance is founded in the theory of martingales: we move from optional stopping (used in FDR proofs) to optional spotting by proving uniform concentration bounds on relevant exponential supermartingales. This is joint work with Eugene Katsevich.


Nancy Reid, University of Toronto
Title: A new look at F-tests
Abstract: Directional inference for vector parameters based on higher order approximations in likelihood inference is discussed in Davison et al. (JASA, 2014) and Fraser et al. (Biometrika, 2016). Here we explore examples of directional inference where the calculations can be simplified, and find that in several classical situations the directional test is equivalent to the usual F-test. This is joint work with Andrew McCormack, Nicola Sartori and Sri-Amirthan Theivendran.

Alessandro Rinaldo, Carnegie Mellon University
Title: Optimal Rates For Density-Based Clustering Using DBSCAN
Abstract: We study the problem of optimal estimation of the density cluster tree under various assumptions on the underlying density. We formulate a new notion of clustering consistency which is better suited to smooth densities, and derive minimax rates of consistency for cluster tree estimation for Hölder smooth densities. We present a computationally efficient, rate-optimal cluster tree estimator based on a straightforward extension of the popular density-based clustering algorithm DBSCAN. The resulting optimal rates for cluster tree estimation depend on the degree of smoothness of the underlying density and, interestingly, match minimax rates for density estimation under the supremum norm. We also consider level set estimation and cluster consistency for densities with jump discontinuities, where the sizes of the jumps and the distance among clusters are allowed to vanish as the sample size increases. We demonstrate that our DBSCAN-based algorithm remains minimax rate optimal in this setting as well. Joint work with Daren Wang and Xinyang Lu.

Richard Samworth, University of Cambridge
Title: Classification with imperfect training labels
Abstract: We study the effect of imperfect training data labels on the performance of classification methods. In a general setting, where the probability that an observation in the training dataset is mislabelled may depend on both the feature vector and the true label, we bound the excess risk of an arbitrary classifier trained with imperfect labels in terms of its excess risk for predicting a noisy label. This reveals conditions under which a classifier trained with imperfect labels remains consistent for classifying uncorrupted test data points. Furthermore, under stronger conditions, we derive detailed asymptotic properties for the popular $k$-nearest neighbour (knn), Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA) classifiers. One consequence of these results is that the knn and SVM classifiers are robust to imperfect training labels, in the sense that the rate of convergence of the excess risks of these classifiers remains unchanged; in fact, it even turns out that in some cases, imperfect labels may improve the performance of these methods. On the other hand, the LDA classifier is shown to be typically inconsistent in the presence of label noise unless the prior probabilities of each class are equal.

Joint work with Tim Cannings and Yingying Fan.


Ana-Maria Staicu, NC State University
Title: Variable selection in functional linear model with varying smooth effects
Abstract: State-of-the-art robotic hand prosthetics generate finger and wrist movement through pattern recognition (PR) algorithms using features of forearm electromyogram (EMG) signals, but require extensive training and are prone to poor predictions for conditions outside the training data (Scheme et al., 2010; Peerdeman et al., 2011). We propose a novel approach to develop a dynamic robotic limb by utilizing the recent history of EMG signals in a model that accounts for physiological features of hand movement which are ignored by PR algorithms. We do this by viewing EMG signals as functional covariates and develop a functional linear model that quantifies the effect of the EMG signals on finger/wrist velocity through a bivariate coefficient function that is allowed to vary with current finger/wrist position. The model is made parsimonious and interpretable through a two-step variable selection procedure, called Sequential Adaptive Functional Empirical group LASSO (SAFE-gLASSO). Numerical studies show excellent selection and prediction properties of SAFE-gLASSO compared to popular alternatives. For our motivating dataset, the method correctly identifies the few EMG signals that are known to be important for an able-bodied subject with negligible false positives, and the model can be directly implemented in a robotic prosthetic.

Jonathan Taylor, Stanford University
Title: Approximate selective inference via maximum likelihood
Abstract: We consider an approximate version of the conditional approach to selective inference (after randomization). Approximation is used to bypass potentially expensive MCMC sampling in moderate dimensions. We use a large-deviations approximation from arxiv.org/1703.06176 (Panigrahi and Taylor), which leads to tractable estimating equations for the (approximate) maximum likelihood estimator and observed Fisher information. Through simulations we investigate the promise of this approach in low and higher dimensions. One clear upside of this approximation is that it allows the data analyst to pose several questions of the data before forming a target of interest, with questions being derived from convex problems as described in arxiv.org/1609.05609 (Tian et al.)

In terms of downside, theoretical justification seems difficult, particularly due to the lack of parameters to tweak having already reached asymptopia.

This is joint work with Snigdha Panigrahi.


Rob Tibshirani, Stanford University
Title: Some new ideas for post selection inference and model assessment
Abstract: TBD

Ryan Tibshirani, Carnegie Mellon University
Title: The LOCO parameter: the good, the bad, and the ugly (or: How I learned to stop worrying and love prediction)
Abstract: Assumption-free or assumption-lean inference has been gaining more and more attention these days.  A unsettled question, at least in the presenter's mind, is: what is an interesting parameter to study, when no real model is assumed to be correct?  This will be a mostly non-technical talk, discussing different approaches for selective inference, and what parameters (and "models", if any) they are centered around.  A focus will be the LOCO parameter proposed in Lei et al. (2018), its strengths, weaknesses, and a natural population-level analog.

Lan Wang, University of Minnesota
Title: A Tuning-free Approach to High-dimensional Regression
Abstract: We introduce a new tuning-free approach for high-dimensional regression with theoretical guarantees. The new procedure possesses several appealing properties simultaneously. Computationally, it can be efficiently solved via linear programming with an easily simulated tuning parameter, which automatically adapts to both the unknown random error distribution and the correlation structure of the design matrix. It is robust, with substantial efficiency gains for heavy-tailed random errors, while maintaining high efficiency for normal random errors. It enjoys an essential scale-equivariance property that permits coherent interpretation when the response variable undergoes a scale transformation, a desirable property possessed by the classical least squares estimator but lost by the Lasso and its variants. Under weak conditions on the random error distribution, we establish a finite-sample error bound with a near-oracle rate for the new estimator with the simulated tuning parameter. (Joint work with Bo Peng, Jelena Bradic, Runze Li and Yunan Wu)

Cun-Hui Zhang, Rutgers University
Title: Higher Criticism, SPRT and Test of Power One
Abstract: We develop a one-sided sequential probability ratio test for multiple null hypotheses with nearly optimal power in detecting the presence of signals which are rare and weak. This makes an interesting connection between tests of power one and higher criticism, both of which involve the law of the iterated logarithm. The sequential test guarantees the prescribed probability of type I error. Nonlinear renewal theory is applied to show that the test is not overly conservative. This is joint work with Wenhua Jiang.
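As a toy illustration of the "test of power one" idea (not the paper's procedure, which aggregates many such statistics across hypotheses), a one-sided SPRT for a unit-variance normal mean can be sketched as follows; the alternative mean mu1 and the threshold are illustrative choices:

```python
import numpy as np

def one_sided_sprt(xs, mu1=1.0, threshold=np.log(20.0)):
    """Toy one-sided SPRT for H0: N(0,1) vs H1: N(mu1,1).

    Reject H0 the first time the log likelihood ratio crosses the
    threshold; never accept H0 (a test of power one stops only under
    the alternative).  Sketch only, under assumed unit variance.
    """
    llr = 0.0
    for n, x in enumerate(xs, start=1):
        llr += mu1 * x - 0.5 * mu1**2  # log f1(x)/f0(x) for unit-variance normals
        if llr >= threshold:
            return n  # stopping time at rejection
    return None  # no rejection within the observed sample
```

Under a strong signal the test stops quickly; under the null the drift of the log likelihood ratio is negative and the test rarely stops, which is the sense in which the type I error probability can be held at a prescribed level.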

Hao Helen Zhang, University of Arizona
Title: Oracle P-value and Variable Screening
Abstract: The P-value, first proposed by Fisher to measure the inconsistency of data with a specified null hypothesis, plays a central role in statistical inference. For classical linear regression analysis, it is standard procedure to calculate P-values for regression coefficients based on the least squares estimator (LSE) to determine their significance. However, for high-dimensional data, when the number of predictors exceeds the sample size, ordinary least squares is no longer proper and there is no valid definition of P-values based on the LSE. It is also challenging to define sensible P-values for other high-dimensional regression methods such as penalization and resampling methods. In this paper, we introduce a new concept, the oracle P-value, to generalize traditional P-values based on the LSE to high-dimensional sparse regression models. We then propose several estimation procedures to approximate oracle P-values for real data analysis. We show that the oracle P-value framework is useful for developing new tools in high-dimensional data analysis, including variable ranking, variable selection, and screening procedures with false discovery rate (FDR) control. Numerical examples are then presented to demonstrate the performance of the proposed methods. This is joint work with Ning Hao.

Kai Zhang, UNC Chapel Hill
Title: BET on Independence
Abstract: We study the problem of nonparametric dependence detection. Many existing methods suffer severe power loss due to non-uniform consistency,  which we illustrate with a paradox. To avoid such power loss, we approach the nonparametric test of independence through the new framework of binary expansion statistics (BEStat) and binary expansion testing (BET), which examine dependence through a novel binary expansion filtration approximation of the copula. Through a Hadamard transform, we find that the cross interactions of binary variables in the filtration are complete sufficient statistics for dependence. These interactions are also uncorrelated under the null. By utilizing these interactions, the BET avoids the problem of non-uniform consistency and improves upon a wide class of commonly used methods (a) by achieving the minimax rate in sample size requirement for reliable power and (b) by providing clear interpretations of global relationships upon rejection of independence. The binary expansion approach also connects the test statistics with the current computing system to facilitate efficient bitwise implementation. We illustrate the BET with a study of the distribution of stars in the night sky and with an exploratory data analysis of the TCGA breast cancer data.
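A depth-one flavor of the binary expansion idea can be sketched with the sign bits of each variable after centering at the medians; this toy statistic is not the full BET (which uses deeper expansions and all cross interactions) but shows the kind of cross interaction it examines, a quantity that is mean-zero under independence:

```python
import numpy as np

def bet_depth1(x, y):
    """Toy depth-1 cross-interaction statistic in the spirit of BET.

    The sign bits of x and y (centered at their medians) are multiplied;
    their sum counts quadrant agreement and is mean-zero under
    independence.  Rough normal standardization; sketch only.
    """
    a = np.sign(x - np.median(x))
    b = np.sign(y - np.median(y))
    s = np.sum(a * b)           # cross interaction of the two sign bits
    return s / np.sqrt(len(x))  # approximately N(0,1) under independence
```

Strong monotone dependence drives the statistic far from zero in either direction, while under independence it concentrates near zero; BET's deeper filtration catches dependence that this depth-one bit misses.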


Linda Zhao, University of Pennsylvania
Title: Generalized CP (GCp) in a model lean framework
Abstract: Linear models as working models have performed very well in practice, but most often their theoretical properties are obtained under the usual linear model assumptions such as linearity, homoscedasticity and normality. Using least squares estimators, we justify their desirable properties under much broader model assumptions, namely a model lean framework. Generalized CP (GCp) is proposed to estimate the prediction (testing) errors and is asymptotically unbiased. We study its properties, especially the distribution of the difference between two sub-models.
 
Joint work with L. Brown, J. Cai, A. Kuchibhotla and the Wharton group



Posters

Mona Azadkia, Stanford University
Title: Matrix denoising with unknown noise variance
Abstract: Click here

Stephen Bates, Stanford University
Title: Model-X Knockoffs for Graphical Models
Abstract: Modern scientific applications require statistical methods for identifying relevant explanatory variables from a large number of possible explanatory variables with statistical guarantees that the number of spurious discoveries is controlled. The model-X knockoff framework provides false discovery rate guarantees for the selected features for any conditional distribution of the response on the features. The procedure requires a known distribution of the covariates X and a knockoff sampling mechanism for this distribution. In this work, we greatly expand the class of distributions for which model-X knockoffs can be sampled by introducing a knockoff sampler for arbitrary graphical models. Our proposed sampler is computationally tractable for graphs that have low treewidth, i.e. graphs that are not too complex. Furthermore, we show that our sampler is able to generate knockoffs from any valid knockoff distribution, which means that the sampler can generate knockoffs with higher power than those from previously known samplers.

Thomas Berrett, University of Cambridge
Title: Efficient integral functional estimation via k-nearest neighbour distances
Abstract: Click here

Ran Dai, University of Chicago
Title: Post-selection inference on high-dimensional varying-coefficient quantile regression model
Abstract: Quantile regression has been successfully used to study heterogeneous and heavy-tailed data. In this work, we study a high-dimensional varying-coefficient quantile regression model that allows us to capture non-stationary effects of the input variables across time. We develop new tools for statistical inference that allow us to construct valid confidence bands and honest tests for nonparametric coefficient functions of time and quantile. Our focus is on inference in a high-dimensional setting where the number of input variables exceeds the sample size. Performing statistical inference in this regime is challenging due to the use of model selection techniques in estimation. Nevertheless, we are able to develop valid inferential tools that are applicable to a wide range of data-generating processes and do not suffer from biases introduced by model selection. The statistical framework allows us to construct a confidence interval at a fixed point in time and a fixed quantile based on a normal approximation, as well as a uniform confidence band for the nonparametric coefficient function based on a Gaussian process approximation. Joint work with Rina Foygel Barber and Mladen Kolar.

Eugene Katsevich, Stanford University
Title: Reconciling FDR control with post hoc filtering
Abstract: The false discovery rate (FDR) is a popular error criterion for large-scale multiple testing problems. A notable pitfall of the FDR is that filtering (i.e. subsetting) the rejection set post hoc might invalidate the FDR guarantee. In some applied settings, however, filtering is standard practice. For example, post hoc filtering is often employed in gene ontology enrichment analysis (where hypotheses have a directed acyclic graph structure) to remove redundancy among the set of rejected hypotheses. We propose Filtered BH, a filter-aware extension of the BH procedure. Assuming the filter can be specified in advance, Filtered BH takes as input this filter as well as a set of p-values and outputs a rejection set. This rejection set, when filtered, provably controls the FDR. Existing domain-specific filters can be easily integrated into Filtered BH, allowing scientists to continue the practice of filtering without sacrificing rigorous Type I error control.
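For reference, the classic BH step-up procedure that Filtered BH extends can be sketched as follows; this is only the unfiltered baseline (Filtered BH additionally takes the pre-specified filter as input, which this sketch omits):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Classic BH step-up procedure: return indices of rejected hypotheses.

    Sort the p-values, find the largest k with p_(k) <= alpha * k / m,
    and reject the k smallest p-values.  Background sketch only.
    """
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(p[order] <= thresholds)[0]
    if below.size == 0:
        return np.array([], dtype=int)  # nothing rejected
    k = below[-1]                       # step-up: largest index below the line
    return np.sort(order[: k + 1])

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.32, 0.9]))
```

Note the step-up character: the third p-value above (0.039) sits above its own threshold yet is still rejected because a later p-value clears its line, which is exactly the behavior that makes post hoc subsetting of the rejection set delicate.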

Byol Kim, University of Chicago
Title: Statistical Inference for High-Dimensional Differential Networks
Abstract: Click here

John Kolassa, Rutgers University
Title: Conditional Likelihood Techniques applied to Partial Likelihood Regression for Survival Data
Abstract: Proportional hazards regression shares the possibility of infinite parameter estimation with logistic and multinomial regression.  This poster demonstrates how to perform conditional inference on finite components of the proportional hazards regression model in the presence of infinite estimates for nuisance parameters, by employing optimization techniques to reduce the data set to one yielding conditional inference approximating that of the desired regression model.

Lihua Lei, UC Berkeley
Title: TBD
Abstract: TBD

Keith Levin, University of Michigan
Title: Inferring Low-Rank Population Structure from Multiple Network Samples
Abstract: In increasingly many settings, particularly in neuroscience, data sets consist of multiple samples from a population of networks, in which a notion of vertex correspondence across networks is present. For example, in the case of neuroimaging data, fMRI data yields graphs whose vertices correspond to brain regions that are common across subjects. The behavior of these vertices can thus be sensibly compared across graphs. We consider the problem of estimating parameters of the network population distribution under this setting. In particular, we consider the case where the observed networks share a low-rank structure, but may differ in the noise structure on their edges. Our approach exploits this shared low-rank structure to denoise edge-level measurements of the observed networks and estimate the desired population-level parameters. We also explore the extent to which complexity of the edge-level error structure influences estimation and downstream inference.

Haoyang Liu, University of Chicago
Title: Between hard and soft thresholding: optimal iterative thresholding algorithms
Abstract:  Iterative thresholding algorithms seek to optimize a differentiable objective function over a sparsity or rank constraint by alternating between gradient steps and thresholding steps. This work examines the choice of the thresholding operator. We develop the notion of relative concavity of a thresholding operator, a quantity that characterizes the convergence performance of any thresholding operator on the target optimization problem. Surprisingly, we find that commonly used thresholding operators, such as hard thresholding and soft thresholding, are suboptimal in terms of convergence guarantees. Instead, a general class of thresholding operators, lying between hard thresholding and soft thresholding, is shown to be optimal with the strongest possible convergence guarantee among all thresholding operators.
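The two endpoint operators the abstract compares, and the iterative scheme they plug into, can be sketched as follows; the step size and iteration count are illustrative, and the paper's optimal operators interpolate between the two endpoints shown here:

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def soft_threshold(x, lam):
    """Shrink every entry toward zero by lam (prox of the l1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def iht_least_squares(X, y, k, step, n_iter=200):
    """Iterative hard thresholding for sparse least squares:
    gradient step on 0.5*||y - X b||^2, then project onto k-sparse vectors."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = hard_threshold(b + step * X.T @ (y - X @ b), k)
    return b
```

Swapping `hard_threshold` for another operator in the last line is the design choice the abstract studies: the relative concavity of that operator governs the convergence guarantee of the whole loop.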

Miles Lopes, UC Davis
Title: Bootstrapping spectral statistics in high dimensions
Abstract:  Spectral statistics play a central role in many multivariate testing problems. It is therefore of interest to approximate the distribution of functions of the eigenvalues of sample covariance matrices. Although bootstrap methods are an established approach to approximating the laws of spectral statistics in low-dimensional problems, these methods are relatively unexplored in the high-dimensional setting. The aim of this paper is to focus on linear spectral statistics (LSS) as a class of "prototype statistics" for developing a new bootstrap method in the high-dimensional setting. In essence, the method originates from the parametric bootstrap, and is motivated by the notion that, in high dimensions, it is difficult to obtain a non-parametric approximation to the full data-generating distribution. From a practical standpoint, the method is easy to use, and allows the user to circumvent the difficulties of complex asymptotic formulas for LSS. In addition to proving the consistency of the proposed method, we provide encouraging empirical results in a variety of settings. Lastly, and perhaps most interestingly, we show through simulations that the method can be applied successfully to statistics outside the class of LSS, such as the largest sample eigenvalue and others. (Joint work with Alexander Aue and Andrew Blandino.)
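A minimal sketch of the parametric bootstrap idea, applied here to the largest sample eigenvalue; the paper's construction for linear spectral statistics is more refined than this toy, and the function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def largest_eig_stat(X):
    """Largest eigenvalue of the sample covariance matrix of X (n x p)."""
    S = np.cov(X, rowvar=False)
    return np.linalg.eigvalsh(S)[-1]  # eigvalsh returns ascending order

def parametric_bootstrap(X, stat, B=200):
    """Resample n x p Gaussian data with the estimated covariance and
    recompute the statistic B times -- a toy parametric bootstrap."""
    n, p = X.shape
    S_hat = np.cov(X, rowvar=False)
    L = np.linalg.cholesky(S_hat + 1e-10 * np.eye(p))  # jitter for stability
    boot = np.empty(B)
    for b in range(B):
        Z = rng.standard_normal((n, p)) @ L.T  # rows ~ N(0, S_hat)
        boot[b] = stat(Z)
    return boot
```

Percentiles of the returned array give bootstrap critical values or confidence limits for the statistic, sidestepping the complex asymptotic formulas the abstract mentions.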

Yet Nguyen, Old Dominion University
Title: Identifying relevant covariates in RNA-seq analysis by pseudo-variable augmentation
Abstract:  RNA-sequencing (RNA-seq) technology enables the detection of differentially expressed genes, i.e., genes whose mean transcript abundance levels vary across conditions. In practice, an RNA-seq dataset often contains some explanatory variables that will be included in analysis with certainty in addition to a set of covariates that are subject to selection. Some of the covariates may be relevant to gene expression levels, while others may be irrelevant. Either ignoring relevant covariates or attempting to adjust for the effect of irrelevant covariates can be detrimental to identifying differentially expressed genes. We address this issue by proposing a covariate selection method using pseudo-covariates to control the expected proportion of selected covariates that are irrelevant. We show that the proposed method can accurately choose the most relevant covariates while holding the false selection rate below a specified level. We also show that our method performs better than methods for detecting differentially expressed genes that do not take covariate selection into account, or methods that use surrogate variables instead of the available covariates.

Chathurangi Pathiravasan, SIU Carbondale
Title: Bootstrapping hypotheses tests
Abstract:  Click here

Cornelis Potgieter, Southern Methodist University
Title: Simulation-Selection-Extrapolation: Estimation for High Dimensional Errors-in-Variables Models
Abstract:  Errors-in-variables models in a high-dimensional setting present a two-fold challenge: The presence of measurement error in the covariates can result in severely biased parameter estimates, while the high-dimensional nature of the data can obscure the covariates that are relevant to the outcome of interest. A new estimation procedure called SIMSELEX (SIMulation-SELection-EXtrapolation) is proposed. This procedure augments the traditional SIMEX approach with a variable selection step based on the group lasso. The SIMSELEX approach is shown to perform well in variable selection and has significantly lower estimation error than the naive estimator that ignores measurement error. Furthermore, SIMSELEX can be applied in a variety of errors-in-variables settings, including linear regression, logistic regression, and the Cox proportional hazards model. The SIMSELEX procedure is compared to the matrix uncertainty selector and the conic programming estimator for a linear model, and to the generalized matrix uncertainty selector for a logistic regression model. Finally, the method is applied to analyze a microarray dataset that contains gene expression measurements of favorable histology Wilms tumors.
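The simulation-extrapolation idea underlying SIMEX can be sketched for a simple-regression slope; this is a toy version with illustrative lambda levels (SIMSELEX inserts a group-lasso selection step between the simulation and extrapolation stages, which this sketch omits):

```python
import numpy as np

rng = np.random.default_rng(1)

def simex_slope(w, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0), B=100):
    """SIMEX for a simple-regression slope when w = x + U, U ~ N(0, sigma_u^2).

    Simulation step: add extra measurement noise at each level lambda and
    refit the naive slope.  Extrapolation step: fit a quadratic in lambda
    and evaluate at lambda = -1, i.e. "no measurement error".  Toy sketch.
    """
    def slope(w_, y_):
        return np.cov(w_, y_)[0, 1] / np.var(w_, ddof=1)

    lams = np.array([0.0, *lambdas])
    est = []
    for lam in lams:
        if lam == 0.0:
            est.append(slope(w, y))  # the naive (attenuated) estimate
        else:
            sims = [slope(w + np.sqrt(lam) * sigma_u * rng.standard_normal(len(w)), y)
                    for _ in range(B)]
            est.append(np.mean(sims))
    coef = np.polyfit(lams, est, deg=2)
    return np.polyval(coef, -1.0)
```

Because extra noise attenuates the slope in a predictable way, extrapolating the fitted attenuation curve back to lambda = -1 largely undoes the bias that the naive estimator suffers.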

Martin Spindler, University of Hamburg
Title: Uniform Inference in High-Dimensional Gaussian Graphical Models
Abstract:  Graphical models have become a very popular tool for representing dependencies within a large set of variables and are key for representing causal structures. We provide results for uniform inference on high-dimensional graphical models, with the number of target parameters possibly much larger than the sample size. This is particularly important when certain features or structures of a causal model should be recovered. Our results highlight how, in high-dimensional settings, graphical models can be estimated and recovered with modern machine learning methods in complex data sets. We also demonstrate in a simulation study that our procedure has good small-sample properties. Joint work with Jannis Kuck and Sven Klaassen.


Lei Sun, University of Chicago
Title: Empirical Bayes Normal Means with Correlated Noise
Abstract:  Recent technological advances have allowed scientists to perform large-scale simultaneous inference on ever-growing massive data sets. Many of these pursuits can be formulated statistically as multiple testing in the classic high-dimensional normal means problem, and a variety of methods have been developed in the past decade, among which empirical Bayes is a commonly applied tool. However, like many other multiple testing methods, this approach is prone to distortion by correlation, which is ubiquitous in real-world statistical analysis. We develop Correlated Adaptive Shrinkage (CASH) to account for unknown correlation, detect elusive signals, and control false discoveries. Our methodology compares favorably in realistic simulations and real data analyses with popular multiple testing methods and sheds new light on the effect of correlation. Joint work with Matthew Stephens.
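For background, empirical Bayes shrinkage in the independent-noise normal means model can be sketched with a normal prior and a method-of-moments variance estimate; CASH itself targets the correlated-noise case, which this toy ignores:

```python
import numpy as np

def eb_normal_means(x, s=1.0):
    """Empirical Bayes posterior means for x_i ~ N(theta_i, s^2),
    theta_i ~ N(0, tau^2), with tau^2 estimated by method of moments.

    Background sketch for the independent-noise case only; the prior
    family and moment estimator here are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    tau2 = max(np.mean(x**2) - s**2, 0.0)  # E[x^2] = tau^2 + s^2
    shrink = tau2 / (tau2 + s**2)          # common shrinkage factor
    return shrink * x                      # posterior means
```

When the noise is correlated, the empirical distribution of the observations no longer reflects the marginal model above, which is exactly the distortion that motivates CASH.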

Zhipeng Wang, Genentech
Title: TBD
Abstract:  TBD

Andrew Womack, Indiana University
Title: Horseshoes with heavy tails
Abstract:  Locally adaptive shrinkage in the Bayesian framework is achieved through the use of local-global prior distributions that model both the global level of sparsity and individual shrinkage parameters for mean structure parameters. The most popular of these models are the Horseshoe prior and its variants, owing to their spike-and-slab behavior: an asymptote at the origin and heavy tails. In this paper, we present an alternative Horseshoe prior that exhibits both a sharper asymptote at the origin and heavier tails, which we call the Heavy-tailed Horseshoe prior. We prove that mixing over the shape parameters provides improved spike-and-slab behavior as well as better reconstruction properties than other Horseshoe variants. Joint work with Zikun Yang.

Chiao-Yu Yang, UC Berkeley
Title: TBD
Abstract:  TBD

Qiyiwen Zhang, Washington University in St. Louis
Title: Bayesian variable selection and frequentist post-selection inference
Abstract: TBD