Correlation and similarity measures are widely used across the sciences and social sciences. Often the variables are not numbers but qualitative descriptors, known as categorical data. We define and study the similarity matrix as a measure of similarity for categorical data. This is of interest due to the deluge of categorical data in the public domain, such as movie ratings, top-10 rankings and data from social media, that require analysis. We show that the statistical properties of the spectra of similarity matrices constructed from categorical data follow those of random matrix theory. We demonstrate this approach by applying it to data from Indian general elections and sea-level pressures in the North Atlantic Ocean.
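As a minimal numerical sketch of the construction (the precise similarity measure used in the paper may differ; the simple-matching coefficient and the synthetic data below are assumptions made for illustration), one can build a similarity matrix from categorical data and examine its spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic categorical data: n_items observations of n_vars categorical variables.
n_items, n_vars, n_levels = 50, 200, 5
data = rng.integers(0, n_levels, size=(n_items, n_vars))

# Simple-matching similarity: fraction of variables on which two items take
# the same category.  S is symmetric with unit diagonal.
S = np.zeros((n_items, n_items))
for i in range(n_items):
    S[i] = (data == data[i]).mean(axis=1)

# Spectrum of the similarity matrix; its fluctuation statistics are what the
# abstract compares with random matrix theory.
eigenvalues = np.linalg.eigvalsh(S)
print(eigenvalues[-5:])   # the largest eigenvalues
```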
In high-energy physics, with the search for ever smaller signals in ever larger data sets, it has become essential to extract the maximum available information from the data. Multivariate classification methods based on machine learning techniques have become a fundamental ingredient of most analyses. The multivariate classifiers themselves have also evolved significantly in recent years, and statisticians have found new ways to tune and combine classifiers to gain further performance. Integrated into the analysis framework ROOT, TMVA is a toolkit that hosts a large variety of multivariate classification algorithms. Training, testing, performance evaluation and application of all available classifiers are carried out simultaneously via user-friendly interfaces. With version 4, TMVA has been extended to multivariate regression of a real-valued target vector. Regression is invoked through the same user interfaces as classification. TMVA 4 also features more flexible data handling, allowing one to form arbitrarily combined MVA methods. A generalised boosting method is the first realisation benefiting from the new framework.
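A minimal sketch of this workflow through the PyROOT bindings is shown below; the input file, tree and variable names are placeholders, and the classic Factory interface of TMVA 4 is assumed (newer ROOT releases add a separate DataLoader object):

```python
# Sketch of booking and training a TMVA classifier from Python.
import ROOT
from ROOT import TMVA, TFile, TCut

TMVA.Tools.Instance()

output = TFile.Open("TMVAOutput.root", "RECREATE")
factory = TMVA.Factory("TMVAClassification", output,
                       "!V:AnalysisType=Classification")

# Input variables and trees (assumed to exist in input.root).
infile = TFile.Open("input.root")
factory.AddVariable("var1", "F")
factory.AddVariable("var2", "F")
factory.AddSignalTree(infile.Get("signal"), 1.0)
factory.AddBackgroundTree(infile.Get("background"), 1.0)
factory.PrepareTrainingAndTestTree(TCut(""), "SplitMode=Random:!V")

# Book a boosted decision tree; other methods are booked the same way.
factory.BookMethod(TMVA.Types.kBDT, "BDT", "NTrees=200")

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
output.Close()
```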
A spectral fitter based on the graphics processing unit (GPU) has been developed for Borexino solar neutrino analysis. It shortens the fitting time dramatically compared with the CPU-based fitting procedure. In Borexino solar neutrino spectral analysis, a fit usually requires around one hour to converge, since it includes time-consuming convolutions to account for the detector response and pile-up effects. Moreover, the convergence time increases to more than two days when extra computations for the discrimination of $^{11}$C and external $\gamma$s are included. In sharp contrast, the GPU-based fitter takes less than 10 seconds and less than four minutes, respectively. The fitter is built on the GooFit project with customized likelihoods, PDFs and infrastructure supporting the relevant analysis methods. In these proceedings the design of the package, the features developed and a comparison with the original CPU fitter are presented.
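As a rough sketch of how a GooFit-based fit is set up (this is not the Borexino fitter itself: the Gaussian model, observable range and toy data are placeholders, and the GooFit 2.x Python bindings are assumed; the GPU is used transparently when GooFit is built with CUDA support):

```python
import numpy as np
from goofit import Observable, Variable, UnbinnedDataSet, GaussianPdf, FitManager

# Observable and data set (toy events, not Borexino data).
energy = Observable("energy", 0.0, 10.0)
data = UnbinnedDataSet(energy)
for value in np.random.normal(5.0, 1.0, 10000):
    if 0.0 < value < 10.0:
        energy.value = value
        data.addEvent()

# Placeholder model: a single Gaussian PDF with free mean and width.
mean = Variable("mean", 4.5, 0.0, 10.0)
sigma = Variable("sigma", 1.5, 0.1, 5.0)
model = GaussianPdf("model", energy, mean, sigma)

# Attach the data and run the (GPU-accelerated) likelihood fit.
model.setData(data)
FitManager(model).fit()
print(mean.value, sigma.value)
```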
Partial Wave Analysis has traditionally been carried out using a set of tools handcrafted for each experiment. By taking an object-oriented approach, the design presented in this paper attempts to create a more generally useful, and easily extensible, environment for analyzing many different types of data.
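The abstract does not specify the class structure; purely as an illustration of what an extensible, object-oriented partial-wave interface might look like (all class and method names below are hypothetical, not the paper's design):

```python
# Hypothetical sketch: experiments plug in their own amplitude subclasses.
from abc import ABC, abstractmethod

class Amplitude(ABC):
    """One partial wave; subclasses implement the experiment-specific form."""
    @abstractmethod
    def evaluate(self, event):
        """Return the complex amplitude for a single event."""

class BreitWignerWave(Amplitude):
    def __init__(self, mass, width):
        self.mass, self.width = mass, width

    def evaluate(self, event):
        m = event["mass"]
        return 1.0 / complex(self.mass**2 - m**2, -self.mass * self.width)

def intensity(waves, coefficients, event):
    """Coherent sum |sum_i c_i A_i(event)|^2 over the fitted waves."""
    total = sum(c * w.evaluate(event) for c, w in zip(coefficients, waves))
    return abs(total) ** 2

# Example: one wave evaluated on one event.
print(intensity([BreitWignerWave(0.775, 0.149)], [1.0], {"mass": 0.770}))
```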
Inspired by quantum computing algorithms for linear algebra problems [HHL, TaShma], we study how a classical simulation of this type of Phase Estimation algorithm performs when applied to the Eigen-Problem for Hermitian matrices. The result is a completely new, efficient and stable parallel algorithm to compute an approximate spectral decomposition of any Hermitian matrix. The algorithm can be implemented by Boolean circuits in $O(\log^2 n)$ parallel time with a total cost of $O(n^{\omega+1})$ Boolean operations. This Boolean complexity matches the best known rigorous $O(\log^2 n)$ parallel time algorithms, but unlike those algorithms ours is (logarithmically) stable, so further improvements may lead to practical implementations. All previous efficient and rigorous approaches to the Eigen-Problem use randomization to avoid ill-conditioning, as we do too. Our algorithm makes further use of randomization in a completely new way, taking random powers of a unitary matrix to randomize the phases of its eigenvalues. Proving that a tiny Gaussian perturbation and a random polynomial power are sufficient to ensure almost pairwise independence of the phases $\pmod{2\pi}$ is the main technical contribution of this work. This randomization enables us, given a Hermitian matrix with well separated eigenvalues, to sample a random eigenvalue and produce an approximate eigenvector in $O(\log^2 n)$ parallel time and $O(n^{\omega})$ Boolean complexity. We conjecture that further improvements of our method can provide a stable solution to the full approximate spectral decomposition problem with complexity similar (up to a logarithmic factor) to that of sampling a single eigenvector.
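A small numerical sketch of the central randomization step (dense linear algebra stands in for the Boolean-circuit implementation; the perturbation scale and the range of random powers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

# Random Hermitian matrix, rescaled so its eigenvalues lie well inside (-1/2, 1/2).
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = (A + A.conj().T) / 2
H = H / (4 * np.linalg.norm(H, 2))

# Tiny Gaussian perturbation to separate nearly degenerate eigenvalues.
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = H + 1e-6 * (G + G.conj().T) / 2

# Unitary whose eigenphases encode the spectrum: U = exp(2*pi*i*H).
vals, vecs = np.linalg.eigh(H)
U = (vecs * np.exp(2j * np.pi * vals)) @ vecs.conj().T

# A random power U^p multiplies every eigenphase by p (mod 2*pi),
# which is the phase-randomization step described in the abstract.
p = rng.integers(1, 1000)
phases = np.angle(np.linalg.eigvals(np.linalg.matrix_power(U, p)))
print(np.sort((p * 2 * np.pi * vals) % (2 * np.pi)))  # predicted phases
print(np.sort(phases % (2 * np.pi)))                  # phases of U^p
```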
When dealing with non-stationary systems for which many time series are available, it is common to divide time into epochs, i.e. smaller time intervals, and to work with the resulting short time series in the hope of having some form of approximate stationarity on that time scale. We can then study time evolution by looking at properties as a function of the epochs. This, however, leads to singular correlation matrices and thus to poor statistics. In the present paper, we propose an ensemble technique to deal with a large set of short time series without any consideration of non-stationarity. We randomly select subsets of time series and thus create an ensemble of non-singular correlation matrices. As the number of possible selections is binomially large, we obtain good statistics for the eigenvalues of the correlation matrices, which are typically not independent. Once we have defined the ensemble, we analyze its behavior for constant and block-diagonal correlations and compare numerics with analytic results for the corresponding correlated Wishart ensembles. We discuss differences resulting from spurious correlations due to the repeated use of time series. The usefulness of this technique should extend beyond the stationary case if, on the time scale of the epochs, we have quasi-stationarity at least for most epochs.
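A minimal numerical sketch of the subset-sampling construction (the data here are synthetic with a constant cross-correlation; the subset size and ensemble size are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)

K, T = 100, 30                   # K time series, epoch length T < K: full matrix is singular
c = 0.3                          # constant cross-correlation of the synthetic data
n_subset, n_samples = 10, 2000   # subset size < T keeps each sampled matrix non-singular

# Synthetic correlated data: common factor plus idiosyncratic noise.
common = rng.standard_normal(T)
data = np.sqrt(c) * common + np.sqrt(1 - c) * rng.standard_normal((K, T))

eigenvalues = []
for _ in range(n_samples):
    subset = rng.choice(K, size=n_subset, replace=False)
    block = data[subset]
    # Standardize each series, then form the n_subset x n_subset correlation matrix.
    block = (block - block.mean(axis=1, keepdims=True)) / block.std(axis=1, keepdims=True)
    C = block @ block.T / T
    eigenvalues.extend(np.linalg.eigvalsh(C))

eigenvalues = np.array(eigenvalues)
print(eigenvalues.mean(), eigenvalues.max())   # spectral statistics of the ensemble
```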