
New clustering approach for symbolic polygonal data: application to the clustering of entrepreneurial regimes

Added by Andrej Srakar
Publication date: 2020
Language: English
Authors: Andrej Srakar





Entrepreneurial regimes are a topic receiving ever more research attention. Existing studies of entrepreneurial regimes mainly use common methods from multivariate analysis and some form of institution-related analysis. In our analysis, entrepreneurial regimes are analyzed by applying a novel polygonal symbolic data cluster analysis approach. Among the diverse data structures in Symbolic Data Analysis (SDA), interval-valued data is the most popular, yet it requires assuming the equidistribution hypothesis (values uniformly distributed within each interval). We use a novel polygonal cluster analysis approach to address this limitation, with additional advantages: it stores more information, significantly reduces large data sets while preserving the classical variability through the polygon radius, and opens new possibilities in symbolic data analysis. We construct a dynamic cluster analysis algorithm for this type of data and prove the main theorems and lemmata that justify its use. In the empirical part we use the 2015 Global Entrepreneurship Monitor (GEM) dataset to construct typologies of countries based on responses to the main entrepreneurial questions. The article presents a novel approach to clustering in statistical theory (with a type of variable not previously accounted for) and an application, with novel results, to a pressing issue in entrepreneurship.
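The abstract does not spell out the algorithm, but a dynamic (k-means-style) clustering of polygon-valued observations can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the regular-polygon construction from a center and radius, the vertex-wise squared distance, and all function names are choices made for the sketch.

```python
import numpy as np

def polygon_vertices(center, radius, n_vertices=8):
    """Regular polygon built from an (x, y) center and a radius (assumed representation)."""
    angles = 2 * np.pi * np.arange(n_vertices) / n_vertices
    return np.stack([center[0] + radius * np.cos(angles),
                     center[1] + radius * np.sin(angles)], axis=1)

def polygon_distance(p, q):
    """Sum of squared Euclidean distances between matched vertices."""
    return np.sum((p - q) ** 2)

def dynamic_polygon_clustering(polygons, k, n_iter=50, seed=0):
    """k-means-style dynamic clustering on polygon-valued observations."""
    rng = np.random.default_rng(seed)
    prototypes = [polygons[i] for i in rng.choice(len(polygons), k, replace=False)]
    labels = np.zeros(len(polygons), dtype=int)
    for _ in range(n_iter):
        # allocation step: assign each polygon to its nearest prototype
        for i, p in enumerate(polygons):
            labels[i] = np.argmin([polygon_distance(p, c) for c in prototypes])
        # representation step: vertex-wise mean of the member polygons
        for j in range(k):
            members = [polygons[i] for i in range(len(polygons)) if labels[i] == j]
            if members:
                prototypes[j] = np.mean(members, axis=0)
    return labels, prototypes

# toy example: two groups of polygons with distinct centers and radii
polys = [polygon_vertices(c, r) for c, r in
         [((0, 0), 1.0), ((0.2, 0.1), 1.2), ((5, 5), 0.5), ((5.1, 4.9), 0.6)]]
labels, _ = dynamic_polygon_clustering(polys, k=2)
print(labels)
```

The alternation of an allocation step and a representation step is the defining feature of dynamic cluster algorithms; the theorems and lemmata mentioned in the abstract are what justify applying such a scheme to polygonal symbolic data.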




Read More

Fionn Murtagh, 2008
An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying the extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We also discuss applications to very high frequency time series segmentation and modeling.
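As an illustration of the kind of measurement described above, the sketch below estimates a degree of ultrametricity by sampling triples of points and checking how often the two largest pairwise distances nearly coincide, which is the defining property of an ultrametric (isosceles-with-small-base) triangle. It is a simplified stand-in, not Murtagh's exact coefficient; the tolerance threshold and sample size are assumptions.

```python
import numpy as np

def ultrametricity_degree(X, n_triples=2000, tol=0.05, seed=0):
    """Fraction of sampled point triples whose two largest pairwise
    distances are nearly equal (approximately ultrametric triangles)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_triples):
        i, j, k = rng.choice(len(X), size=3, replace=False)
        d = np.sort([np.linalg.norm(X[i] - X[j]),
                     np.linalg.norm(X[i] - X[k]),
                     np.linalg.norm(X[j] - X[k])])
        if d[2] > 0 and (d[2] - d[1]) / d[2] < tol:
            hits += 1
    return hits / n_triples

# sparse, high-dimensional Gaussian data tends to score close to 1
X = np.random.default_rng(1).normal(size=(200, 500))
print(ultrametricity_degree(X))
```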
We propose a new method for clustering of functional data using a $k$-means framework. We work within the elastic functional data analysis framework, which allows for decomposition of the overall variation in functional data into amplitude and phase components. We use the amplitude component to partition functions into shape clusters using an automated approach. To select an appropriate number of clusters, we additionally propose a novel Bayesian Information Criterion defined using a mixture model on principal components estimated using functional Principal Component Analysis. The proposed method is motivated by the problem of posterior exploration, wherein samples obtained from Markov chain Monte Carlo algorithms are naturally represented as functions. We evaluate our approach using a simulated dataset, and apply it to a study of acute respiratory infection dynamics in San Luis Potosí, Mexico.
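A minimal sketch of the k-means step on functions sampled on a common grid follows; it deliberately omits the elastic amplitude-phase registration and the FPCA-based Bayesian Information Criterion that the abstract describes, and all names are illustrative.

```python
import numpy as np

def kmeans_functions(curves, k, n_iter=100, seed=0):
    """Plain k-means on functions evaluated on a shared grid.
    The paper clusters the amplitude component after elastic
    registration; this sketch skips that registration step."""
    rng = np.random.default_rng(seed)
    centers = curves[rng.choice(len(curves), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((curves[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = curves[labels == j].mean(axis=0)
    return labels, centers

# toy example: noisy sine versus cosine curves
t = np.linspace(0, 2 * np.pi, 50)
rng = np.random.default_rng(2)
curves = np.vstack([np.sin(t) + 0.1 * rng.normal(size=(10, 50)),
                    np.cos(t) + 0.1 * rng.normal(size=(10, 50))])
labels, _ = kmeans_functions(curves, k=2)
print(labels)
```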
Clustering mixed data is a challenging task. In a probabilistic framework, the main difficulty is the shortage of conventional distributions for such data. In this paper, we propose to achieve mixed data clustering with a Gaussian copula mixture model, since copulas, and in particular Gaussian ones, are powerful tools for easily modelling the distribution of multivariate variables. Indeed, considering a mix of continuous, integer and ordinal variables (thus all having a cumulative distribution function), this copula mixture model defines intra-component dependencies similar to a Gaussian mixture, and therefore with the classical correlation interpretation. Simultaneously, it preserves the standard margins associated with continuous, integer and ordered features, namely the Gaussian, Poisson and ordered multinomial distributions. As an interesting by-product, the proposed mixture model generalizes many well-known models and also provides visualization tools based on its parameters. At a practical level, Bayesian inference is retained and achieved with a Metropolis-within-Gibbs sampler. Experiments on simulated and real data sets finally illustrate the expected advantages of the proposed model for mixed data: a flexible and meaningful parametrization combined with visualization features.
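To make the copula construction concrete, the sketch below draws mixed-type data from a single Gaussian-copula component with Gaussian, Poisson and ordered-categorical margins: correlated latent normals are mapped to uniform scores and pushed through each margin's inverse CDF. It illustrates only the data-generating idea; it is not the paper's Metropolis-within-Gibbs inference, and the correlation matrix, margin parameters and ordinal cut-points are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

def sample_copula_component(n, corr, seed=0):
    """Mixed-type draws from one Gaussian-copula component:
    latent correlated normals -> uniform scores -> assumed margins."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(3), corr, size=n)
    u = stats.norm.cdf(z)                                # copula-coupled uniforms
    x_cont = stats.norm.ppf(u[:, 0], loc=5, scale=2)     # continuous (Gaussian) margin
    x_count = stats.poisson.ppf(u[:, 1], mu=3)           # integer (Poisson) margin
    x_ord = np.digitize(u[:, 2], [0.3, 0.7])             # 3-level ordinal margin
    return np.column_stack([x_cont, x_count, x_ord])

corr = np.array([[1.0, 0.6, 0.4],
                 [0.6, 1.0, 0.5],
                 [0.4, 0.5, 1.0]])
print(sample_copula_component(5, corr))
```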
Clustering of misaligned functional data has drawn much attention in the last decade. Most methods perform the clustering only after the functional data have been registered, and there has been little research using both functional and scalar variables. In this paper, we propose a simultaneous registration and clustering (SRC) model via two-level models, allowing the use of both types of variables and also allowing simultaneous registration and clustering. For data collected from subjects in different unknown groups, a Gaussian process functional regression model with time warping is used as the first-level model; an allocation model depending on scalar variables is used as the second-level model, providing further information about the groups. The former carries out registration and modeling for the multi-dimensional functional data (2D or 3D curves) at the same time. This methodology is implemented using an EM algorithm and is examined on both simulated and real data.
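The sketch below conveys the flavor of simultaneous registration and clustering by alternating a crude integer-shift registration with k-means reassignment. It is a toy stand-in under assumptions, not the paper's EM algorithm over a warped Gaussian-process two-level model, and it ignores the scalar-variable allocation model entirely.

```python
import numpy as np

def align_to(reference, curve, max_shift=10):
    """Best integer time shift of `curve` against `reference` (toy registration)."""
    shifts = list(range(-max_shift, max_shift + 1))
    errors = [np.sum((np.roll(curve, s) - reference) ** 2) for s in shifts]
    return np.roll(curve, shifts[int(np.argmin(errors))])

def register_and_cluster(curves, k, n_iter=10, seed=0):
    """Alternate registration toward cluster centers with k-means reassignment."""
    rng = np.random.default_rng(seed)
    centers = curves[rng.choice(len(curves), k, replace=False)].copy()
    aligned = curves.copy()
    for _ in range(n_iter):
        labels = np.argmin(((aligned[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        aligned = np.array([align_to(centers[labels[i]], curves[i])
                            for i in range(len(curves))])
        for j in range(k):
            if np.any(labels == j):
                centers[j] = aligned[labels == j].mean(axis=0)
    return labels, aligned

# toy example: phase-shifted sine and negated-sine curves
t = np.linspace(0, 1, 60, endpoint=False)
rng = np.random.default_rng(3)
curves = np.array([sign * np.sin(2 * np.pi * (t - s))
                   for sign in (1.0, -1.0) for s in rng.uniform(0, 0.2, 6)])
labels, _ = register_and_cluster(curves, k=2)
print(labels)
```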
In 2015, Driemel, Krivošija and Sohler introduced the $(k,\ell)$-median problem for clustering polygonal curves under the Fréchet distance. Given a set of input curves, the problem asks to find $k$ median curves of at most $\ell$ vertices each that minimize the sum of Fréchet distances over all input curves to their closest median curve. A major shortcoming of their algorithm is that the input curves are restricted to lie on the real line. In this paper, we present a randomized bicriteria-approximation algorithm that works for polygonal curves in $\mathbb{R}^d$ and achieves approximation factor $(1+\epsilon)$ with respect to the clustering costs. The algorithm has worst-case running time linear in the number of curves, polynomial in the maximum number of vertices per curve, i.e., their complexity, and exponential in $d$, $\ell$, $\epsilon$ and $\delta$, i.e., the failure probability. We achieve this result through a shortcutting lemma, which guarantees the existence of a polygonal curve with similar cost to an optimal median curve of complexity $\ell$, but of complexity at most $2\ell-2$, and whose vertices can be computed efficiently. We combine this lemma with the superset-sampling technique by Kumar et al. to derive our clustering result. In doing so, we describe and analyze a generalization of the algorithm by Ackermann et al., which may be of independent interest.
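As background for the distance used above, the discrete Fréchet distance between two vertex sequences can be computed with a short dynamic program, sketched below. The paper itself works with the continuous Fréchet distance and with median curves of bounded complexity; this sketch covers neither, and serves only to make the clustering objective's building block tangible.

```python
import numpy as np
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Frechet distance between two polygonal curves given as
    vertex sequences (the continuous variant also considers edge points)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)

    @lru_cache(maxsize=None)
    def c(i, j):
        d = np.linalg.norm(P[i] - Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)

    return c(len(P) - 1, len(Q) - 1)

# two nearby zig-zag curves in the plane
print(discrete_frechet([(0, 0), (1, 1), (2, 0)], [(0, 0.2), (1, 1.3), (2, 0.1)]))
```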