Discriminant Analysis of Distributional Data viaFractional Programming

64 0 0.0 ( 0 )

Download Cite

Added by Sonia Dias Mrs

Publication date 2020

fields Mathematical Statistics

and research's language is English

Authors S. Dias - P. Brito - P. Amaral

Methodology

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We address classification of distributional data, where units are described by histogram or interval-valued variables. The proposed approach uses a linear discriminant function where distributions or intervals are represented by quantile functions, under specific assumptions. This discriminant function allows defining a score for each unit, in the form of a quantile function, which is used to classify the units in two a priori groups, using the Mallows distance. There is a diversity of application areas for the proposed linear discriminant method. In this work we classify the airline companies operating in NY airports based on air time and arrival/departure delays, using a full year fights.

rate research

Quadratic Discriminant Analysis by Projection

118 - Ruiyang Wu , Ning Hao 2021

Discriminant analysis, including linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), is a popular approach to classification problems. It is well known that LDA is suboptimal to analyze heteroscedastic data, for which QDA would be an ideal tool. However, QDA is less helpful when the number of features in a data set is moderate or high, and LDA and its variants often perform better due to their robustness against dimensionality. In this work, we introduce a new dimension reduction and classification method based on QDA. In particular, we define and estimate the optimal one-dimensional (1D) subspace for QDA, which is a novel hybrid approach to discriminant analysis. The new method can handle data heteroscedasticity with number of parameters equal to that of LDA. Therefore, it is more stable than the standard QDA and works well for data in moderate dimensions. We show an estimation consistency property of our method, and compare it with LDA, QDA, regularized discriminant analysis (RDA) and a few other competitors by simulated and real data examples.

Methodology

Distributional Representation of Longitudinal Data: Visualization, Regression and Prediction

265 - Alvaro Gajardo , Xiongtao Dai , Hans-Georg Muller 2021

We develop a representation of Gaussian distributed sparsely sampled longitudinal data whereby the data for each subject are mapped to a multivariate Gaussian distribution; this map is entirely data-driven. The proposed method utilizes functional principal component analysis and is nonparametric, assuming no prior knowledge of the covariance or mean structure of the longitudinal data. This approach naturally connects with a deeper investigation of the behavior of the functional principal component scores obtained for longitudinal data, as the number of observations per subject increases from sparse to dense. We show how this is reflected in the shrinkage of the distribution of the conditional scores given noisy longitudinal observations towards a point mass located at the true but unobservable FPCs. Mapping each subjects sparse observations to the corresponding conditional score distribution leads to useful visualizations and representations of sparse longitudinal data. Asymptotic rates of convergence as sample size increases are obtained for the 2-Wasserstein metric between the true and estimated conditional score distributions, both for a $K$-truncated functional principal component representation as well as for the case when $K=K(n)$ diverges with sample size $ntoinfty$. We apply these ideas to construct predictive distributions aimed at predicting outcomes given sparse longitudinal data.

Methodology

Unobserved classes and extra variables in high-dimensional discriminant analysis

69 - Michael Fop , Pierre-Alexandre Mattei , Charles Bouveyron 2021

In supervised classification problems, the test set may contain data points belonging to classes not observed in the learning phase. Moreover, the same units in the test data may be measured on a set of additional variables recorded at a subsequent stage with respect to when the learning sample was collected. In this situation, the classifier built in the learning phase needs to adapt to handle potential unknown classes and the extra dimensions. We introduce a model-based discriminant approach, Dimension-Adaptive Mixture Discriminant Analysis (D-AMDA), which can detect unobserved classes and adapt to the increasing dimensionality. Model estimation is carried out via a full inductive approach based on an EM algorithm. The method is then embedded in a more general framework for adaptive variable selection and classification suitable for data of large dimensions. A simulation study and an artificial experiment related to classification of adulterated honey samples are used to validate the ability of the proposed framework to deal with complex situations.

Methodology Computation Machine Learning

Distributional data analysis via quantile functions and its application to modelling digital biomarkers of gait in Alzheimers Disease

59 - Rahul Ghosal , Vijay R. Varma , Dmitri Volfson 2021

With the advent of continuous health monitoring via wearable devices, users now generate their unique streams of continuous data such as minute-level physical activity or heart rate. Aggregating these streams into scalar summaries ignores the distributional nature of data and often leads to the loss of critical information. We propose to capture the distributional properties of wearable data via user-specific quantile functions that are further used in functional regression and multi-modal distributional modelling. In addition, we propose to encode user-specific distributional information with user-specific L-moments, robust rank-based analogs of traditional moments. Importantly, this L-moment encoding results in mutually consistent functional and distributional interpretation of the results of scalar-on-function regression. We also demonstrate how L-moments can be flexibly employed for analyzing joint and individual sources of variation in multi-modal distributional data. The proposed methods are illustrated in a study of association of accelerometry-derived digital gait biomarkers with Alzheimers disease (AD) and in people with normal cognitive function. Our analysis shows that the proposed quantile-based representation results in a much higher predictive performance compared to simple distributional summaries and attains much stronger associations with clinical cognitive scales.

Methodology Applications

Beyond unidimensional poverty analysis using distributional copula models for mixed ordered-continuous outcomes

138 - Maike Hohberg , Francesco Donat , Giampiero Marra 2020

Poverty is a multidimensional concept often comprising a monetary outcome and other welfare dimensions such as education, subjective well-being or health, that are measured on an ordinal scale. In applied research, multidimensional poverty is ubiquitously assessed by studying each poverty dimension independently in univariate regression models or by combining several poverty dimensions into a scalar index. This inhibits a thorough analysis of the potentially varying interdependence between the poverty dimensions. We propose a multivariate copula generalized additive model for location, scale and shape (copula GAMLSS or distributional copula model) to tackle this challenge. By relating the copula parameter to covariates, we specifically examine if certain factors determine the dependence between poverty dimensions. Furthermore, specifying the full conditional bivariate distribution, allows us to derive several features such as poverty risks and dependence measures coherently from one model for different individuals. We demonstrate the approach by studying two important poverty dimensions: income and education. Since the level of education is measured on an ordinal scale while income is continuous, we extend the bivariate copula GAMLSS to the case of mixed ordered-continuous outcomes. The new model is integrated into the GJRM package in R and applied to data from Indonesia. Particular emphasis is given to the spatial variation of the income-education dependence and groups of individuals at risk of being simultaneously poor in both education and income dimensions.

Methodology Applications