107 - Jian Ma, Zengqi Sun (2019)
Dependence structure estimation is an important problem in machine learning, with applications across many scientific areas. In this paper, a theoretical framework for such estimation is proposed based on the copula and copula entropy, the probabilistic theory of representation and measurement of statistical dependence. Graphical models are treated as a special case of the copula framework. Within the framework, a method for estimating the maximum spanning copula is proposed. Because it operates on the copula, the method is invariant to the marginal properties of individual variables, insensitive to outliers, and able to handle non-Gaussianity. Experiments on both simulated data and a real dataset demonstrate the effectiveness of the proposed method.
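The estimation recipe implied here can be sketched in two steps: map each margin to pseudo-observations of the empirical copula via ranks, then estimate the entropy of those pseudo-observations nonparametrically. A minimal Python sketch, using a textbook Kozachenko-Leonenko kNN entropy estimator as a stand-in for the authors' exact estimator:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln
from scipy.stats import rankdata

def copula_entropy(X, k=3):
    """Copula entropy via rank transform + Kozachenko-Leonenko kNN entropy.

    Mutual information can be read off as the negative of the returned value.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Empirical copula (pseudo-observations): map each margin to ranks in (0, 1].
    U = np.column_stack([rankdata(X[:, j]) / n for j in range(d)])
    # Distance from each point to its k-th nearest neighbour.
    r = cKDTree(U).query(U, k=k + 1)[0][:, -1]
    log_unit_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    # Kozachenko-Leonenko estimate of the entropy of the copula density.
    return digamma(n) - digamma(k) + log_unit_ball + d * np.mean(np.log(r + 1e-12))
```

Since mutual information equals the negative of the copula entropy, dependence strength between variables can be read directly off the returned value.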
180 - Mikhail Langovoy (2017)
We propose and study a general method for constructing consistent statistical tests on the basis of possibly indirect, corrupted, or partially available observations. The class of tests devised in the paper contains Neyman's smooth tests, data-driven score tests, and some types of multi-sample tests as basic examples. Our tests are data-driven and incorporate model selection rules. The method admits a wide class of model selection rules based on the penalization idea; in particular, many of the optimal penalties derived in the statistical literature can be used in our tests. We establish the behavior of model selection rules and data-driven tests under both the null and the alternative hypothesis, derive an explicit detectability rule for alternative hypotheses, and prove a master consistency theorem for the tests in the class. The paper shows that the tests are applicable to a wide range of problems, including hypothesis testing in statistical inverse problems, multi-sample problems, and nonparametric hypothesis testing.
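As a concrete member of the basic class described here, the following sketch implements a data-driven Neyman smooth test of uniformity on [0, 1], with the dimension of the test chosen by a Schwarz (BIC-type) penalty, one of the penalized selection rules the framework admits; it is an illustration of the construction, not the paper's general procedure:

```python
import numpy as np
from numpy.polynomial import legendre

def data_driven_smooth_test(x, k_max=10):
    """Data-driven Neyman smooth test of uniformity on [0, 1].

    N_j is the sum of squared normalized scores of the first j shifted
    Legendre polynomials; the dimension is selected by a Schwarz penalty.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    scores = []
    for j in range(1, k_max + 1):
        coef = np.zeros(j + 1)
        coef[j] = 1.0
        # sqrt(2j+1) * P_j(2x - 1) is orthonormal on [0, 1].
        phi = np.sqrt(2 * j + 1) * legendre.legval(2 * x - 1, coef)
        scores.append(phi.sum() / np.sqrt(n))
    cum = np.cumsum(np.square(scores))                         # N_1, ..., N_kmax
    k_hat = int(np.argmax(cum - np.arange(1, k_max + 1) * np.log(n))) + 1
    return cum[k_hat - 1], k_hat
```

Large values of the returned statistic indicate rejection, and the selected dimension k_hat points to which smooth departure from the null was detected.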
A subjective expected utility policy-making centre, managing complex, dynamic systems, needs to draw on the expertise of a variety of disparate panels of experts and integrate this information coherently. To achieve this, diverse supporting probabilistic models need to be networked together, the output of one model providing the input to the next. In this paper we provide a technology for designing an integrating decision support system that enables the centre to explore and compare the efficiency of different candidate policies. We develop a formal statistical methodology to underpin this tool. In particular, we derive sufficient conditions that ensure inference remains coherent before and after relevant evidence is accommodated into the system. The methodology is illustrated throughout using examples drawn from two decision support systems: one designed for nuclear emergency crisis management and the other to support policy makers in addressing the complex challenges of food poverty in the UK.
In logistic regression, separation occurs when a linear combination of the predictors can perfectly classify part or all of the observations in the sample, and as a result, finite maximum likelihood estimates of the regression coefficients do not exist. Gelman et al. (2008) recommended independent Cauchy distributions as default priors for the regression coefficients in logistic regression, even in the case of separation, and reported posterior modes in their analyses. As the mean does not exist for the Cauchy prior, a natural question is whether the posterior means of the regression coefficients exist under separation. We prove theorems that provide necessary and sufficient conditions for the existence of posterior means under independent Cauchy priors for the logit link and a general family of link functions, including the probit link. We also study the existence of posterior means under multivariate Cauchy priors. For full Bayesian inference, we develop a Gibbs sampler based on Polya-Gamma data augmentation to sample from the posterior distribution under independent Student-t priors, including Cauchy priors, and provide a companion R package in the supplement. We demonstrate empirically that even when the posterior means of the regression coefficients exist under separation, the magnitude of the posterior samples for Cauchy priors may be unusually large, and the corresponding Gibbs sampler shows extremely slow mixing. While alternative algorithms such as the No-U-Turn Sampler in Stan can greatly improve mixing, in order to resolve the issue of extremely heavy-tailed posteriors for Cauchy priors under separation, one would need to consider lighter-tailed priors such as normal priors or Student-t priors with degrees of freedom larger than one.
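For illustration, here is a minimal Python sketch of a Polya-Gamma Gibbs sampler of the kind described above, written in Python rather than the paper's companion R package; it assumes the third-party `polyagamma` package for PG draws and uses the normal scale-mixture representation of the Student-t prior (df = 1 recovers the Cauchy):

```python
import numpy as np
from numpy.linalg import cholesky, solve
from polyagamma import random_polyagamma  # assumed external PG sampler

def gibbs_logit_t_prior(X, y, scale=2.5, df=1.0, n_iter=2000, seed=0):
    """Gibbs sampler for Bayesian logistic regression with independent
    Student-t(df, 0, scale) priors via Polya-Gamma augmentation.
    beta_j ~ N(0, scale^2 / lam_j) with lam_j ~ Gamma(df/2, rate=df/2)
    gives the t prior marginally; df = 1 is the Cauchy case."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, lam = np.zeros(p), np.ones(p)
    kappa = y - 0.5
    draws = np.empty((n_iter, p))
    for it in range(n_iter):
        omega = random_polyagamma(1, X @ beta, random_state=rng)  # PG(1, x_i'beta)
        prec = X.T @ (omega[:, None] * X) + np.diag(lam / scale**2)
        mean = solve(prec, X.T @ kappa)
        L = cholesky(prec)
        beta = mean + solve(L.T, rng.standard_normal(p))          # N(mean, prec^-1)
        lam = rng.gamma((df + 1) / 2, 2.0 / (df + beta**2 / scale**2))
        draws[it] = beta
    return draws
```

Inspecting trace plots of `draws` on a separated dataset is one way to observe the unusually large posterior samples and slow mixing the abstract describes.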
119 - Matthias Katzfuss (2015)
Automated sensing instruments on satellites and aircraft have enabled the collection of massive amounts of high-resolution observations of spatial fields over large spatial regions. If these datasets can be efficiently exploited, they can provide new insights into a wide variety of issues. However, traditional spatial-statistical techniques such as kriging are not computationally feasible for big datasets. We propose a multi-resolution approximation (M-RA) of Gaussian processes observed at irregular locations in space. The M-RA process is specified as a linear combination of basis functions at multiple levels of spatial resolution, which can capture spatial structure from very fine to very large scales. The basis functions are automatically chosen to approximate a given covariance function, which can be nonstationary. All computations involving the M-RA, including parameter inference and prediction, are highly scalable for massive datasets. Crucially, the inference algorithms can also be parallelized to take full advantage of large distributed-memory computing environments. In comparisons using simulated data and a large satellite dataset, the M-RA outperforms a related state-of-the-art method.
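To make the basis-function idea concrete, here is a minimal one-level, low-rank kriging sketch in 1-D, essentially the degenerate single-resolution case; the M-RA proper adds compactly supported basis functions at multiple resolutions (and distributed-memory parallelism), which this toy version does not attempt:

```python
import numpy as np

def expcov(a, b, scale=0.3):
    """Exponential covariance between 1-D location vectors a and b."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / scale)

def low_rank_krige(s_obs, y, s_pred, knots, nugget=0.05):
    """Kriging under a one-level low-rank basis approximation.

    The covariance is replaced by B K^{-1} B' built from r knots, so exact
    kriging's O(n^3) solve shrinks toward O(n r^2); dense solves are kept
    here for clarity. The M-RA refines this recursively across resolutions
    to recover the fine-scale structure a single low-rank level misses.
    """
    K = expcov(knots, knots) + 1e-10 * np.eye(len(knots))
    B, Bp = expcov(s_obs, knots), expcov(s_pred, knots)
    C = B @ np.linalg.solve(K, B.T) + nugget * np.eye(len(s_obs))
    return Bp @ np.linalg.solve(K, B.T @ np.linalg.solve(C, y))
```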
Combining matching and regression for causal inference provides double robustness in removing treatment-effect estimation bias due to confounding variables. In most real-world applications, however, treatment and control populations are not large enough for matching to achieve perfect or near-perfect balance on all confounding variables and their nonlinear/interaction functions, leading to trade-offs. Furthermore, in small samples variance contributes as much to total error as bias does, and must therefore be factored into methodological decisions. In this paper, we develop a mathematical framework for quantifying the combined impact of matching and linear regression on the bias and variance of treatment-effect estimation. The framework includes expressions for bias and variance in a misspecified linear regression, theorems regarding the impact of matching on bias and variance, and a constrained bias estimation approach for quantifying misspecification bias and combining it with variance to arrive at total error. Methodological decisions can thus be based on minimizing this total error, given the practitioner's assumption or belief about an intuitive parameter, which we call 'omitted R-squared'. The proposed methodology excludes the outcome variable from the analysis, thereby avoiding overfit creep and making it suitable for observational study designs. All core functions for bias and variance calculation, as well as diagnostic tools for bias-variance trade-off analysis, matching calibration, and power analysis, are made available to researchers and practitioners through an open-source R library, MatchLinReg.
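A minimal sketch of the matching-plus-regression combination under study, with greedy 1:1 Mahalanobis matching followed by OLS adjustment on the matched sample; the caliper and helper names here are illustrative, not MatchLinReg's interface:

```python
import numpy as np

def match_then_adjust(X, t, y, caliper=1.0):
    """Greedy 1:1 Mahalanobis matching, then OLS adjustment on the matched set.

    Returns the regression-adjusted treatment-effect estimate (coefficient
    on t) and the indices of the matched units.
    """
    Vinv = np.linalg.inv(np.cov(X, rowvar=False))
    treated, controls = np.where(t == 1)[0], list(np.where(t == 0)[0])
    matched = []
    for i in treated:
        # Squared Mahalanobis distance from treated unit i to each control.
        d = [(X[i] - X[j]) @ Vinv @ (X[i] - X[j]) for j in controls]
        if d and min(d) < caliper ** 2:
            matched += [i, controls.pop(int(np.argmin(d)))]
    # OLS on the matched sample: y ~ 1 + t + X.
    Z = np.column_stack([np.ones(len(matched)), t[matched], X[matched]])
    coef = np.linalg.lstsq(Z, y[matched], rcond=None)[0]
    return coef[1], matched
```

Note that the matching decisions use only the covariates and treatment indicator; the outcome enters only in the final adjustment, consistent with the outcome-free design emphasis above.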
Empirical Mode Decomposition (EMD) is an adaptive data analysis technique for analyzing nonlinear and nonstationary data [1]. EMD decomposes the original data into a number of Intrinsic Mode Functions (IMFs) [1] to give better physical insight into the data. Permutation Entropy (PE) is a complexity measure [3] widely used in complexity theory for analyzing the local complexity of time series. In this paper we combine the concepts of PE and EMD to resolve the mode-mixing problem observed in the determination of IMFs.
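Permutation entropy is compact enough to write out directly; here is a minimal sketch of the Bandt-Pompe ordinal-pattern construction (EMD itself is better taken from an existing implementation such as the PyEMD package):

```python
from math import factorial
import numpy as np

def permutation_entropy(x, order=3, delay=1):
    """Normalized permutation entropy in [0, 1] (Bandt-Pompe ordinal patterns)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (order - 1) * delay
    # Ordinal pattern (argsort) of each delay-embedded vector of length `order`.
    patterns = np.array([np.argsort(x[i:i + (order - 1) * delay + 1:delay])
                         for i in range(n)])
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    # Shannon entropy of the pattern distribution, normalized by log2(order!).
    return -np.sum(p * np.log2(p)) / np.log2(factorial(order))
```

One plausible use, in the spirit of this abstract, is to compute the PE of each IMF: an IMF whose complexity is anomalous relative to its neighbours is a candidate for mode mixing.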
137 - Qi Zheng, Limin Peng, Xuming He (2015)
Quantile regression has become a valuable tool for analyzing heterogeneous covariate-response associations that are often encountered in practice. The development of quantile regression methodology for high-dimensional covariates has primarily focused on examining model sparsity at a single quantile level or at multiple levels that are typically pre-specified ad hoc by the user. The resulting models may be sensitive to the specific choices of quantile levels, leading to difficulties in interpretation and erosion of confidence in the results. In this article, we propose a new penalization framework for quantile regression in the high-dimensional setting. We employ adaptive L1 penalties and, more importantly, propose a uniform selector of the tuning parameter for a set of quantile levels to avoid some of the potential problems with model selection at individual levels. Our proposed approach achieves consistent shrinkage of regression quantile estimates across a continuous range of quantile levels, enhancing the flexibility and robustness of existing penalized quantile regression methods. Our theoretical results include the oracle rate of uniform convergence and weak convergence of the parameter estimators. We also use numerical studies to confirm our theoretical findings and illustrate the practical utility of our proposal.
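At a single quantile level, the adaptive-L1 estimator solves a convex program; here is a minimal sketch using cvxpy (an assumed convenience, not the authors' implementation), with the check loss written in its convex form rho_tau(u) = 0.5|u| + (tau - 0.5)u:

```python
import numpy as np
import cvxpy as cp

def adaptive_l1_quantreg(X, y, tau, lam, pilot=None):
    """Adaptive-L1 penalized quantile regression at level tau (a sketch).

    Adaptive weights w_j = 1/|pilot_j| follow the usual adaptive-lasso
    construction, with a least-squares pilot by default.
    """
    n, p = X.shape
    pilot = np.linalg.lstsq(X, y, rcond=None)[0] if pilot is None else pilot
    w = 1.0 / (np.abs(pilot) + 1e-6)
    beta = cp.Variable(p)
    r = y - X @ beta
    check = cp.sum(0.5 * cp.abs(r) + (tau - 0.5) * r) / n   # check loss
    penalty = lam * cp.sum(cp.multiply(w, cp.abs(beta)))
    cp.Problem(cp.Minimize(check + penalty)).solve()
    return beta.value
```

The uniform selector described above would then choose one value of `lam` that works simultaneously across a grid of tau values, rather than re-tuning at each level.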
Single Index Models (SIMs) are simple yet flexible semi-parametric models for classification and regression. Response variables are modeled as a nonlinear, monotonic function of a linear combination of features. Estimation in this context requires learning both the feature weights and the nonlinear function. While methods have been described to learn SIMs in the low-dimensional regime, a method that can efficiently learn SIMs in high dimensions has not been forthcoming. We propose three variants of a computationally and statistically efficient algorithm for SIM inference in high dimensions. We establish excess risk bounds for the proposed algorithms and experimentally validate the advantages that our SIM learning methods provide relative to Generalized Linear Model (GLM) and low-dimensional SIM based learning methods.
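A hedged sketch of the alternating scheme such algorithms build on: fit the monotone link by isotonic regression given the current index, update the weights along the residual, and hard-threshold to enforce sparsity. This is in the spirit of the paper's variants, not a reproduction of them:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def sparse_sim(X, y, n_iter=100, k=10, lr=1.0, seed=0):
    """Alternating SIM estimation with hard thresholding (Isotron-style sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = rng.standard_normal(p) / np.sqrt(p)      # small random initial index
    iso = IsotonicRegression(out_of_bounds="clip")
    for _ in range(n_iter):
        g = iso.fit(X @ w, y).predict(X @ w)     # monotone link given the index
        w = w + (lr / n) * (X.T @ (y - g))       # move weights along the residual
        w[np.argsort(np.abs(w))[:-k]] = 0.0      # keep only the k largest weights
    iso.fit(X @ w, y)                            # final link fit for the final index
    return w, iso
```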
The Wright-Fisher family of diffusion processes is a widely used class of evolutionary models. However, simulation is difficult because there is no known closed-form formula for its transition function. In this article we demonstrate that it is in fact possible to simulate exactly from a broad class of Wright-Fisher diffusion processes and their bridges. For those diffusions corresponding to reversible, neutral evolution, our key idea is to exploit an eigenfunction expansion of the transition function; this approach even applies to its infinite-dimensional analogue, the Fleming-Viot process. We then develop an exact rejection algorithm for processes with more general drift functions, including those modelling natural selection, using ideas from retrospective simulation. Our approach also yields methods for exact simulation of the moment dual of the Wright-Fisher diffusion, the ancestral process of an infinite-leaf Kingman coalescent tree. We believe our new perspective on diffusion simulation holds promise for other models admitting a transition eigenfunction expansion.
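The moment dual mentioned at the end is the easiest piece to sketch: for a finite number of leaves, the block-counting process of the coalescent is a pure-death process and can be simulated exactly with exponential waiting times (the paper's series method is what makes the infinite-leaf case, and hence the diffusion transition itself, tractable):

```python
import numpy as np

def ancestral_lineages(n, t, theta=0.0, rng=None):
    """Lineages surviving to time t in the coalescent dual, simulated exactly.

    With mutation rate theta, the block-counting process is a pure-death
    process with rate k(k + theta - 1)/2 in state k. Finite n only; the
    paper's series method handles the infinite-leaf case.
    """
    rng = np.random.default_rng() if rng is None else rng
    k, elapsed = n, 0.0
    while k > 0:
        rate = k * (k + theta - 1) / 2.0
        if rate <= 0:                  # absorbed (k = 1 with theta = 0)
            break
        elapsed += rng.exponential(1.0 / rate)
        if elapsed > t:
            break
        k -= 1
    return k
```

Given M = m surviving lineages, the reversible neutral case then draws the diffusion's transition as L ~ Binomial(m, x) followed by Y ~ Beta(theta_1 + L, theta_2 + m - L), which is where the eigenfunction expansion of the transition function enters.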