Because of its mathematical tractability, the Gaussian mixture model holds a special place in the literature for clustering and classification. For all its benefits, however, the Gaussian mixture model poses problems when the data are skewed or contain outliers. Methods for handling skewed data have therefore been developed over the years, and they fall into two general categories. The first considers a mixture of more flexible skewed distributions, and the second incorporates a transformation to near normality. Although these methods have been compared in their respective papers, there has yet to be a detailed comparison to determine when one method might be more suitable than the other. Herein, we provide a detailed comparison on many benchmarking datasets and describe a novel method to assess cluster separation.
In a recent paper [\textit{M. Cristelli, A. Zaccaria and L. Pietronero, Phys. Rev. E 85, 066108 (2012)}], Cristelli \textit{et al.} analysed the relation between skewness and kurtosis for complex dynamical systems and identified two power-law regimes of non-Gaussianity, one of which scales with an exponent of 2 and the other with an exponent of $4/3$. The authors concluded that the observed relation is a universal fact in complex dynamical systems. Here, we test the proposed universal relation between skewness and kurtosis with a large number of synthetic data sets and show that it is in fact not universal and originates only from the small number of data points in the data sets considered. The proposed relation is tested using two different non-Gaussian distributions, namely $q$-Gaussian and Lévy distributions. We clearly show that this relation disappears for sufficiently large data sets provided that the second moment of the distribution is finite. We find that, contrary to the claims of Cristelli \textit{et al.} regarding a power-law scaling regime, kurtosis saturates to a single value, which is of course different from the Gaussian case ($K=3$), as the number of data points is increased. On the other hand, if the second moment of the distribution is infinite, then the kurtosis seems never to converge to a single value. The converged kurtosis value for the finite second moment distributions and the number of data points needed to reach this value depend on the deviation of the original distribution from the Gaussian case. We also argue that using kurtosis to decide which of two distributions deviates more from the Gaussian can lead to incorrect results for small data sets even when the second moment is finite, whereas it is totally misleading for infinite second moment distributions, where the difference depends on $N$ for all finite $N$.
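The dependence of sample kurtosis on the number of data points $N$ can be illustrated with a minimal sketch. It assumes moment-based estimators and uses two stand-in distributions chosen here for convenience, not the ones used in the paper: a Student-$t$ with 10 degrees of freedom (finite second and fourth moments) and a standard Cauchy (infinite second moment). The helper names `sample_skewness`, `sample_kurtosis`, and `kurtosis_vs_n` are hypothetical.

```python
import numpy as np


def sample_skewness(x):
    """Moment-based sample skewness: m3 / m2**1.5."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    m2 = np.mean(d**2)
    return np.mean(d**3) / m2**1.5


def sample_kurtosis(x):
    """Moment-based sample kurtosis: m4 / m2**2 (equals 3 for a Gaussian)."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    m2 = np.mean(d**2)
    return np.mean(d**4) / m2**2


def kurtosis_vs_n(sampler, sizes):
    """Draw one sample of each size and record its sample kurtosis."""
    return {n: sample_kurtosis(sampler(n)) for n in sizes}


rng = np.random.default_rng(0)
sizes = [10**2, 10**3, 10**4, 10**5]

# Stand-in with finite second and fourth moments (assumption, not the paper's choice).
finite_case = kurtosis_vs_n(lambda n: rng.standard_t(10, size=n), sizes)

# Stand-in with infinite second moment (assumption, not the paper's choice).
infinite_case = kurtosis_vs_n(lambda n: rng.standard_cauchy(size=n), sizes)

for n in sizes:
    print(n, finite_case[n], infinite_case[n])
```

Under these assumptions, the first column of kurtosis values should settle as $N$ grows, while the second should keep changing with $N$; the exact numbers depend on the random seed.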
This paper aims to enhance our understanding of substantive questions regarding self-reported happiness and well-being through the specification and use of multilevel models. To date, there have been numerous quantitative research studies of the happiness of individuals, based on single-level regression models, where typically a happiness index is related to a set of explanatory variables. There are also several single-level studies comparing aggregate happiness levels between countries. Nevertheless, there have been very few studies that attempt to simultaneously take into account variations in happiness and well-being at several different levels, such as individual, household, and area. Here, multilevel models are used with data from the British Household Panel Survey to assess the nature and extent of variations in happiness and well-being and to determine the relative importance of area (district, region), household, and individual characteristics for these outcomes. Moreover, having taken into account the characteristics at these different levels in the multilevel models, the paper shows how it is possible to identify any areas that are associated with especially positive or negative feelings of happiness and well-being.
With the rise of the big data phenomenon in recent years, data are arriving in many different complex forms. One example is multi-way data that come in the form of higher-order tensors, such as coloured images and movie clips. Although there has been a recent rise in models for the simple case of three-way data in the form of matrices, there is a relative paucity of higher-order tensor variate methods. The most common tensor distribution in the literature is the tensor variate normal distribution; however, its use can be problematic if the data exhibit skewness or outliers. Herein, we develop four skewed tensor variate distributions, which, to our knowledge, are the first skewed tensor distributions to be proposed in the literature and which are able to parameterize both skewness and tail weight. Properties and parameter estimation are discussed, and real and simulated data are used for illustration.
In this paper, we propose obtaining a skewed version of a unimodal symmetric density using a skewing mechanism that is not based on a cumulative distribution function. We then disturb the unimodality of the resulting skewed density. To introduce skewness, we use a general method that transforms any continuous unimodal and symmetric distribution into a skewed one by changing the scale on each side of the mode.
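The idea of changing the scale on each side of the mode can be sketched as follows. This is an illustration under stated assumptions, not necessarily the parametrization used in the paper: it assumes the common two-piece form $f(x) = \frac{2}{\sigma_1+\sigma_2}\, f_0\!\left(\frac{x-\mu}{\sigma_1}\right)$ for $x<\mu$ and $\frac{2}{\sigma_1+\sigma_2}\, f_0\!\left(\frac{x-\mu}{\sigma_2}\right)$ for $x\ge\mu$, with a standard normal base density $f_0$. The names `normal_pdf` and `two_piece_pdf` are hypothetical.

```python
import numpy as np


def normal_pdf(z):
    """Standard normal density, used here as the symmetric base density f0."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)


def two_piece_pdf(x, mu, sigma1, sigma2, base_pdf=normal_pdf):
    """Two-piece density: scale sigma1 left of the mode mu, sigma2 right of it.

    The factor 2 / (sigma1 + sigma2) makes the density continuous at mu
    and makes it integrate to one.
    """
    x = np.asarray(x, dtype=float)
    scale = np.where(x < mu, sigma1, sigma2)
    return 2.0 / (sigma1 + sigma2) * base_pdf((x - mu) / scale)


# Numerical check on a fine grid using a simple Riemann sum.
grid = np.linspace(-40.0, 40.0, 400001)
dx = grid[1] - grid[0]
dens = two_piece_pdf(grid, mu=0.0, sigma1=1.0, sigma2=3.0)

total_mass = np.sum(dens) * dx              # should be close to 1
left_mass = np.sum(dens[grid < 0.0]) * dx   # close to sigma1 / (sigma1 + sigma2)
print(total_mass, left_mass)
```

With equal scales the construction reduces to the symmetric base density; with unequal scales, the probability mass to the left of the mode is $\sigma_1/(\sigma_1+\sigma_2)$, which is what produces the skewness in this form.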
We introduce the univariate two-piece sinh-arcsinh distribution, which contains two shape parameters that separately control skewness and kurtosis. We show that this new model can capture higher levels of asymmetry than the original sinh-arcsinh distribution (Jones and Pewsey, 2009), in terms of some asymmetry measures, while keeping flexibility of the tails and tractability. We illustrate the performance of the proposed model with real data, and compare it to appropriate alternatives. Although we focus on the study of the univariat