ترغب بنشر مسار تعليمي؟ اضغط هنا

Consistent Estimation for Partition-wise Regression and Classification Models

91   0   0.0 ( 0 )
 نشر من قبل Rex Cheung
 تاريخ النشر 2016
  مجال البحث الاحصاء الرياضي
والبحث باللغة English




اسأل ChatGPT حول البحث

Partition-wise models offer a flexible approach for modeling complex and multidimensional data that are capable of producing interpretable results. They are based on partitioning the observed data into regions, each of which is modeled with a simple submodel. The success of this approach highly depends on the quality of the partition, as too large a region could lead to a non-simple submodel, while too small a region could inflate estimation variance. This paper proposes an automatic procedure for choosing the partition (i.e., the number of regions and the boundaries between regions) as well as the submodels for the regions. It is shown that, under the assumption of the existence of a true partition, the proposed partition estimator is statistically consistent. The methodology is demonstrated for both regression and classification problems.



قيم البحث

اقرأ أيضاً

Parameter estimation of mixture regression model using the expectation maximization (EM) algorithm is highly sensitive to outliers. Here we propose a fast and efficient robust mixture regression algorithm, called Component-wise Adaptive Trimming (CAT ) method. We consider simultaneous outlier detection and robust parameter estimation to minimize the effect of outlier contamination. Robust mixture regression has many important applications including in human cancer genomics data, where the population often displays strong heterogeneity added by unwanted technological perturbations. Existing robust mixture regression methods suffer from outliers as they either conduct parameter estimation in the presence of outliers, or rely on prior knowledge of the level of outlier contamination. CAT was implemented in the framework of classification expectation maximization, under which a natural definition of outliers could be derived. It implements a least trimmed squares (LTS) approach within each exclusive mixing component, where the robustness issue could be transformed from the mixture case to simple linear regression case. The high breakdown point of the LTS approach allows us to avoid the pre-specification of trimming parameter. Compared with multiple existing algorithms, CAT is the most competitive one that can handle and adaptively trim off outliers as well as heavy tailed noise, in different scenarios of simulated data and real genomic data. CAT has been implemented in an R package `RobMixReg available in CRAN.
Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resoluti on, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points -- the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of entity resolution, where we provide a simulation study and real experiments on survey panel data.
88 - Ray Bai 2020
We study estimation and variable selection in non-Gaussian Bayesian generalized additive models (GAMs) under a spike-and-slab prior for grouped variables. Our framework subsumes GAMs for logistic regression, Poisson regression, negative binomial regr ession, and gamma regression, and encompasses both canonical and non-canonical link functions. Under mild conditions, we establish posterior contraction rates and model selection consistency when $p gg n$. For computation, we propose an EM algorithm for obtaining MAP estimates in our model, which is available in the R package sparseGAM. We illustrate our method on both synthetic and real data sets.
The dual problem of testing the predictive significance of a particular covariate, and identification of the set of relevant covariates is common in applied research and methodological investigations. To study this problem in the context of functiona l linear regression models with predictor variables observed over a grid and a scalar response, we consider basis expansions of the functional covariates and apply the likelihood ratio test. Based on p-values from testing each predictor, we propose a new variable selection method, which is consistent in selecting the relevant predictors from set of available predictors that is allowed to grow with the sample size n. Numerical simulations suggest that the proposed variable selection procedure outperforms existing methods found in the literature. A real dataset from weather stations in Japan is analyzed.
Field observations form the basis of many scientific studies, especially in ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. The removal of syst ematic variability introduced by the observation process, if possible, can greatly increase the value of this data. Existing non-parametric techniques for correcting such errors assume linear additive noise models. This leads to biased estimates when applied to generalized linear models (GLM). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show it reduces systematic detection variability in moth surveys.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا