The identification of precipitation regimes is important for many purposes such as agricultural planning, water resource management, and return period estimation. Since precipitation and other related meteorological data typically exhibit spatial dependency and different characteristics at different time scales, clustering such data presents unique challenges. In this paper, we develop a flexible model-based approach to cluster multi-scale spatial functional data to address such problems. The underlying clustering model is a functional linear model, and the cluster memberships are assumed to be a realization from a Markov random field with geographic covariates. The methodology is applied to precipitation data from China to identify precipitation regimes.
There is increasing appetite for analysing multiple network data. This differs from analysing traditional data sets in that each observation in the data is itself a network. Recent technological advancements have allowed the collection of this type of data in a range of different applications. This has inspired researchers to develop statistical models that accurately describe the probabilistic mechanism that generates a network population and to use them to make inferences about the underlying structure of the network data. Only a few studies to date consider the heterogeneity that can exist in a network population. We propose a Mixture of Measurement Error Models for identifying clusters of networks in a network population, with respect to similarities detected in the connectivity patterns among the networks' nodes. Extensive simulation studies show that our model performs well in both clustering multiple network data and inferring the model parameters. We further apply our model to two real-world multiple network data sets from the fields of Computing (Human Tracking Systems) and Neuroscience.
Understanding electrical energy demand at the consumer level plays an important role in planning the distribution of electrical networks and in offering off-peak tariffs, but observing individual consumption patterns is still expensive. On the other hand, aggregated load curves are normally available at the substation level. The proposed methodology separates substation aggregated loads into estimated mean consumption curves, called typical curves, including information given by explanatory variables. In addition, a model-based clustering approach for substations is proposed based on the similarity of their consumers' typical curves and covariance structures. The methodology is applied to a real substation load monitoring dataset from the United Kingdom and tested in eight simulated scenarios.
Aiming to generate realistic synthetic time series of the bivariate process of daily mean temperature and precipitation, we introduce a non-homogeneous hidden Markov model. The non-homogeneity lies in periodic transition probabilities between the hidden states, and time-dependent emission distributions. This enables the model to account for the non-stationary behaviour of weather variables. By carefully choosing the emission distributions, it is also possible to model the dependence structure between the two variables. The model is applied to several weather stations in Europe with various climates, and we show that it is able to simulate realistic bivariate time series.
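The idea of periodic transition probabilities and time-dependent emissions can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two hidden states ("dry" and "wet"), the logistic form of the transition probabilities, the 365-day period, all numeric coefficients, and the Gaussian and exponential emission distributions.

```python
import math
import random

PERIOD = 365  # assumed annual cycle, in days


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def transition_matrix(t):
    """2x2 transition matrix whose entries vary periodically with day t."""
    season = math.cos(2.0 * math.pi * t / PERIOD)
    stay_dry = logistic(1.0 + 0.8 * season)  # hypothetical coefficients
    stay_wet = logistic(0.5 - 0.8 * season)  # hypothetical coefficients
    return [[stay_dry, 1.0 - stay_dry],
            [1.0 - stay_wet, stay_wet]]


def emit(state, t, rng):
    """Draw (temperature, precipitation) given the hidden state and day t."""
    seasonal_mean = 10.0 - 8.0 * math.cos(2.0 * math.pi * t / PERIOD)
    if state == 0:
        # "Dry" state: slightly warmer, no precipitation.
        return rng.gauss(seasonal_mean + 1.0, 3.0), 0.0
    # "Wet" state: slightly cooler, exponential rainfall with mean 5.
    return rng.gauss(seasonal_mean - 1.0, 3.0), rng.expovariate(1.0 / 5.0)


def simulate(n_days, seed=0):
    """Simulate a list of (state, temperature, precipitation) tuples."""
    rng = random.Random(seed)
    state = 0
    series = []
    for t in range(n_days):
        temp, precip = emit(state, t, rng)
        series.append((state, temp, precip))
        # The next state is drawn from the row of the day-t transition matrix.
        row = transition_matrix(t)[state]
        state = 0 if rng.random() < row[0] else 1
    return series


series = simulate(730)
```

In this sketch the dependence between temperature and precipitation arises only through the shared hidden state; a model of the kind the abstract describes could also build dependence into the emission distributions themselves.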
One of the classic concerns in statistics is determining if two samples come from the same population, i.e. homogeneity testing. In this paper, we propose a homogeneity test in the context of Functional Data Analysis, adopting an idea from multivariate data analysis: the data depth plot (DD-plot). This DD-plot is a generalization of the univariate Q-Q plot (quantile-quantile plot). We propose some statistics based on these DD-plots, and we use bootstrapping techniques to estimate their distributions. We estimate the finite-sample size and power of our test via simulation, obtaining better results than other homogeneity tests proposed in the literature. Finally, we illustrate the procedure on samples of real heterogeneous data and get consistent results.
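To make the DD-plot concrete, the following minimal sketch computes its points for two samples of curves observed on a common grid. The abstract does not say which functional depth is used; this sketch assumes a simple integrated depth in the style of Fraiman and Muniz, and the function names are hypothetical. The test statistics and the bootstrap step are not shown.

```python
def empirical_cdf(values, x):
    """Fraction of values less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


def integrated_depth(curve, sample):
    """Average over grid points of 1 - |1/2 - F_t(curve(t))|.

    F_t is the empirical CDF of the sample curves at grid point t.
    The result lies in [0.5, 1]; central curves score higher.
    """
    total = 0.0
    for j, value in enumerate(curve):
        column = [c[j] for c in sample]
        total += 1.0 - abs(0.5 - empirical_cdf(column, value))
    return total / len(curve)


def dd_plot_points(sample_x, sample_y):
    """One point per pooled curve: (depth w.r.t. X, depth w.r.t. Y)."""
    pooled = sample_x + sample_y
    return [(integrated_depth(c, sample_x), integrated_depth(c, sample_y))
            for c in pooled]


# Toy curves on a two-point grid (illustrative data only).
sample_x = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
sample_y = [[10.0, 10.0], [11.0, 11.0], [12.0, 12.0]]
points = dd_plot_points(sample_x, sample_y)
```

When both samples come from the same population, the points are expected to concentrate near the diagonal of the unit square; systematic departures from the diagonal, as in the shifted toy samples above, suggest heterogeneity.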
We compare two major approaches to variable selection in clustering: model selection and regularization. Based on previous results, we select the method of Maugis et al. (2009b), which modified the method of Raftery and Dean (2006), as a current state-of-the-art model selection method. We select the method of Witten and Tibshirani (2010) as a current state-of-the-art regularization method. We compared the methods by simulation in terms of their accuracy in both classification and variable selection. In the first simulation experiment, all the variables were conditionally independent given cluster membership. We found that variable selection (of either kind) yielded substantial gains in classification accuracy when the clusters were well separated, but little gain when the clusters were close together. We found that the two variable selection methods had comparable classification accuracy, but that the model selection approach had substantially better accuracy in selecting variables. In our second simulation experiment, there were correlations among the variables given the cluster memberships. We found that the model selection approach was substantially more accurate in terms of both classification and variable selection than the regularization approach, and that both gave more accurate classifications than $K$-means without variable selection.