This paper introduces a new way to calculate distance-based statistics, particularly when the data are multivariate. The main idea is to pre-calculate the optimal projection directions given the variable dimension, and to project multidimensional variables onto these pre-specified directions; by subsequently utilizing the fast algorithm developed in Huo and Szekely [2016] for univariate variables, the computational complexity can be improved from $O(m^2)$ to $O(nm\log(m))$, where $n$ is the number of projection directions and $m$ is the sample size. When $n \ll m/\log(m)$, computational savings can be achieved. The key challenge is how to find the optimal pre-specified projection directions. They can be obtained by minimizing the worst-case difference between the true distance and the approximated distance, which can be formulated as a nonconvex optimization problem in a general setting. In this paper, we show that the exact solution of the nonconvex optimization problem can be derived in two special cases: when the dimension of the data equals either $2$ or the number of projection directions. In generic settings, we propose an algorithm to find approximate solutions. Simulations confirm the advantage of our method over the pure Monte Carlo approach, in which the directions are randomly selected rather than pre-calculated.
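As a rough illustration of the projection idea (not the paper's optimized directions), the following Python sketch approximates the sum of pairwise Euclidean distances by projecting onto randomly drawn unit directions and applying the $O(m\log m)$ sorted-sum identity for univariate data; the constant $c_p = \Gamma(p/2)/(\sqrt{\pi}\,\Gamma((p+1)/2))$ rescales $E|u'v|$ back to $|v|$. All function names and the random-direction choice are my own assumptions for the demo.

```python
import numpy as np
from math import gamma, sqrt, pi
from scipy.spatial.distance import pdist

def pairwise_sum_1d(z):
    """Sum of |z_i - z_j| over all pairs i < j in O(m log m) via sorting:
    for sorted z, the sum equals sum_i (2i - m - 1) * z_(i) (1-based i)."""
    z = np.sort(z)
    m = len(z)
    return (2 * np.arange(1, m + 1) - m - 1) @ z

def projected_distance_sum(X, U):
    """Approximate the sum of pairwise Euclidean distances of the rows of X
    by averaging 1-D pairwise sums over the unit directions in U (n x p);
    c_p rescales E|u'v| back to |v| for u uniform on the unit sphere."""
    p = X.shape[1]
    c_p = gamma(p / 2) / (sqrt(pi) * gamma((p + 1) / 2))
    return np.mean([pairwise_sum_1d(X @ u) for u in U]) / c_p

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
U = rng.normal(size=(50, 3))
U /= np.linalg.norm(U, axis=1, keepdims=True)  # Monte Carlo directions; the paper pre-computes optimal ones
print(projected_distance_sum(X, U), pdist(X).sum())  # approximation vs. O(m^2) baseline
```

With $n=50$ directions the per-pair cost drops from $O(m)$ to $O(n\log m)$ amortized, which is exactly the regime $n \ll m/\log(m)$ described above.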
In some cases, computational benefit can be gained by exploring the hyperparameter space using a deterministic set of grid points instead of a Markov chain. We view this as a numerical integration problem and make three unique contributions. First, we explore the space using low-discrepancy point sets instead of a grid. This allows for accurate estimation of marginals of any shape at a much lower computational cost than a grid-based approach and thus makes it possible to extend the computational benefit to a hyperparameter space with higher dimensionality (10 or more). Second, we propose a new, quick, and easy method to estimate the marginal using a least-squares polynomial, and we prove the conditions under which this polynomial converges to the true marginal. Our results are valid for a wide range of point sets, including grids, random points, and low-discrepancy points. Third, we show that further accuracy and efficiency can be gained by taking into consideration the functional decomposition of the integrand, and we illustrate how this can be done using anchored f-ANOVA on weighted spaces.
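A minimal sketch of the first two ingredients, using a toy two-dimensional unnormalized posterior of my own choosing: scrambled Sobol points stand in for the low-discrepancy set, and a least-squares polynomial in one coordinate approximates the (unnormalized) marginal, in the spirit of the method described above. The function `post`, the polynomial degree, and the grid are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.stats import qmc

# Toy unnormalized posterior over two hyperparameters on [0,1]^2
def post(theta):
    t1, t2 = theta[:, 0], theta[:, 1]
    return np.exp(-0.5 * ((t1 - 0.4)**2 / 0.02 + (t2 - 0.6)**2 / 0.05))

# Scrambled Sobol points instead of a grid or a Markov chain
sampler = qmc.Sobol(d=2, scramble=True, seed=1)
pts = sampler.random_base2(m=12)          # 2^12 low-discrepancy points
vals = post(pts)

# Least-squares polynomial in theta_1: the fit approximates the
# conditional mean of the integrand over theta_2, i.e. the
# (unnormalized) marginal of theta_1
deg = 10
coef = np.polynomial.polynomial.polyfit(pts[:, 0], vals, deg)
grid = np.linspace(0, 1, 200)
marginal = np.polynomial.polynomial.polyval(grid, coef)
```

Because the Sobol points fill the space evenly, the same $2^{12}$ evaluations serve every marginal, whereas a grid of comparable resolution in 10+ dimensions would be infeasible.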
When a Genetic Algorithm (GA), or a stochastic algorithm in general, is employed in a statistical problem, the result obtained is affected both by sampling variability, which arises because only a sample is observed, and by variability due to the stochastic elements of the algorithm. This topic fits naturally into the framework of the statistical and computational tradeoff, a question crucial in recent problems, for which statisticians must carefully set up the statistical and computational parts of the analysis while taking account of resource or time constraints. In the present work we analyze estimation problems tackled by GAs, for which the variability of the estimates can be decomposed into these two sources of variability, under constraints in the form of cost functions related to both data acquisition and the runtime of the algorithm. Simulation studies are presented to discuss the statistical and computational tradeoff question.
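The decomposition can be made concrete with the law of total variance: rerun a GA across independent samples (sampling variability) and across independent seeds on the same sample (algorithmic variability). The toy GA below, with truncation selection and Gaussian mutation on a location parameter, is my own stand-in for the algorithms studied; every name and tuning constant is illustrative.

```python
import numpy as np

def ga_estimate(data, seed, n_gen=40, pop=30):
    """Toy GA (truncation selection + Gaussian mutation) estimating a
    location parameter by minimizing squared loss."""
    r = np.random.default_rng(seed)
    theta = r.uniform(data.min(), data.max(), pop)
    loss = lambda th: ((data[None, :] - th[:, None])**2).sum(axis=1)
    for _ in range(n_gen):
        parents = theta[np.argsort(loss(theta))[:pop // 2]]      # keep fittest half
        theta = r.choice(parents, pop) + r.normal(0, 0.1, pop)   # mutate offspring
    return theta[np.argmin(loss(theta))]

rng = np.random.default_rng(0)
S, R, n = 30, 20, 100        # S samples x R seeds; n is the sample size
est = np.empty((S, R))
for s in range(S):
    data = rng.normal(2.0, 1.0, n)            # fresh sample: sampling variability
    for r_ in range(R):
        est[s, r_] = ga_estimate(data, seed=r_)  # fresh seed: algorithmic variability

var_sampling = est.mean(axis=1).var()  # variance of seed-averaged estimates
var_algo = est.var(axis=1).mean()      # average within-sample variance across seeds
# law of total variance: est.var() ~ var_sampling + var_algo
```

Raising $n$ shrinks the first term while raising `n_gen` or `pop` shrinks the second, which is precisely the budget allocation question the cost functions formalize.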
Statistical Data Assimilation (SDA) is the transfer of information from field or laboratory observations to a user-selected model of the dynamical system producing those observations. The data are noisy and the model has errors; the information transfer addresses properties of the conditional probability distribution of the states of the model conditioned on the observations. The quantities of interest in SDA are the conditional expected values of functions of the model state, and these require the approximate evaluation of high-dimensional integrals. We introduce a conditional probability distribution and use the Laplace method with annealing to identify the maxima of the conditional probability distribution. The annealing method slowly increases the precision term of the model as it enters the Laplace method. In this paper, we extend the idea of precision annealing (PA) to Monte Carlo calculations of conditional expected values using Metropolis-Hastings methods.
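A minimal sketch of precision annealing with Metropolis-Hastings, on a toy twin experiment of my own (a logistic map observed with noise, not the paper's model): the action combines a fixed measurement precision with a model precision $R_f$ that is raised slowly, and the chain at each annealing level starts from the previous level's state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: scalar states x_t with logistic dynamics and noisy observations y_t
T, a = 20, 3.7
def f(x): return a * x * (1.0 - x)
x_true = np.empty(T); x_true[0] = 0.3
for t in range(T - 1): x_true[t + 1] = f(x_true[t])
y = x_true + rng.normal(0, 0.05, T)

Rm = 1.0 / 0.05**2           # measurement precision (held fixed)

def action(x, Rf):
    """Negative log of the conditional density, up to a constant."""
    meas = 0.5 * Rm * np.sum((y - x)**2)
    model = 0.5 * Rf * np.sum((x[1:] - f(x[:-1]))**2)
    return meas + model

# Precision annealing: raise the model precision Rf slowly,
# carrying the Metropolis-Hastings chain from one level to the next
x = y.copy()                 # initialize the path at the data
step = 0.02
for Rf in 2.0**np.arange(0, 20):
    for _ in range(2000):    # MH sweeps at this annealing level
        prop = x + rng.normal(0, step, T)
        if np.log(rng.random()) < action(x, Rf) - action(prop, Rf):
            x = prop
# x now sits near a mode of the conditional distribution at large Rf
```

Starting at small $R_f$ keeps the surface nearly unimodal; the gradual increase lets the chain track a dominant maximum instead of being trapped by the many local maxima present at full model precision.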
Information-based optimal subdata selection (IBOSS) is a computationally efficient method for selecting informative data points from large data sets by processing the full data by columns. However, when the volume of a data set is too large to be processed in the available memory of a machine, it is infeasible to implement the IBOSS procedure. This paper develops a divide-and-conquer IBOSS approach to solve this problem, in which the full data set is divided into smaller partitions to be loaded into memory, and subsets of data are then selected from each partition using the IBOSS algorithm. We derive both finite-sample and asymptotic properties of the resulting estimator. Asymptotic results show that if the full data set is partitioned randomly and the number of partitions is not very large, then the resulting estimator has the same estimation efficiency as the original IBOSS estimator. We also carry out numerical experiments to evaluate the empirical performance of the proposed method.
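A compact sketch of the scheme, assuming the linear-regression version of IBOSS (per column, keep the points with the most extreme values) and a random partition; budgets, names, and the full in-memory arrays are simplifications for illustration, since in practice each block would be streamed from disk.

```python
import numpy as np

def iboss(X, y, k):
    """IBOSS for linear regression: for each column, take the r smallest
    and r largest remaining points, r = k / (2d); assumes k >= 2d."""
    n, d = X.shape
    r = k // (2 * d)
    avail = np.ones(n, dtype=bool)
    chosen = []
    for j in range(d):
        idx = np.flatnonzero(avail)
        order = np.argsort(X[idx, j])          # partial sort suffices in practice
        pick = idx[np.r_[order[:r], order[-r:]]]
        chosen.append(pick)
        avail[pick] = False                    # process the data by columns
    sel = np.concatenate(chosen)
    return X[sel], y[sel]

def dc_iboss(X, y, k, n_blocks):
    """Divide-and-conquer IBOSS: random partition, IBOSS within each
    block, then one least-squares fit on the pooled subdata."""
    perm = np.random.default_rng(0).permutation(X.shape[0])
    parts = np.array_split(perm, n_blocks)
    Xs, ys = zip(*(iboss(X[p], y[p], k // n_blocks) for p in parts))
    Xs, ys = np.vstack(Xs), np.concatenate(ys)
    Xd = np.column_stack([np.ones(len(ys)), Xs])   # add intercept
    beta, *_ = np.linalg.lstsq(Xd, ys, rcond=None)
    return beta
```

Only one block of rows needs to reside in memory at a time, while the pooled subdata of size $k$ remains small enough for an ordinary least-squares fit.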
It is useful to have mathematical criteria for evaluating errors in map projections. The Chebyshev criterion for minimizing rms (root mean square) local scale factor errors for conformal maps has been useful in developing conformal map projections of continents. Any local error criterion will be minimized ultimately by map projections with multiple interruptions, on which some pairs of points that are close on the globe are far apart on the map. Since it is as bad to have two points on the map at twice their proper separation as at half their proper separation, it is the rms logarithmic distance error, s, between random points in the mapped region that we minimize. The best previously known projection of the entire sphere for distances is the Lambert equal-area azimuthal, with an rms logarithmic distance error of s=0.343. For comparison, the Mercator has s=0.444 and the Mollweide has s=0.390. We present new projections: the Gott equal-area elliptical with perfect shapes on the central meridian, the Gott-Mugnolo equal-area elliptical, and the Gott-Mugnolo azimuthal, with rms logarithmic distance errors of s=0.365, s=0.348, and s=0.341, respectively, each improving on previous projections of its type. The Gott-Mugnolo azimuthal has the lowest distance errors of any map and is produced by a new technique using forces between pairs of points on a map, which move the points so as to minimize s. The Gott equal-area elliptical projection produces a particularly attractive map of Mars, and the Gott-Mugnolo azimuthal projection produces an interesting map of the Moon.
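As a rough numerical check of the criterion, the sketch below estimates s for the Lambert equal-area azimuthal by Monte Carlo: random point pairs on the sphere, great-circle distances on the globe versus Euclidean distances on the map. I assume an optimal overall map scale is allowed (the mean log ratio is subtracted before taking the rms), which may differ from the authors' exact normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def sphere_points(n):
    """Points uniform on the unit sphere."""
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

p, q = sphere_points(N), sphere_points(N)
d_globe = np.arccos(np.clip(np.sum(p * q, axis=1), -1, 1))  # great-circle distance

def lambert(v):
    """Lambert azimuthal equal-area, centered on the north pole:
    a point at angular distance c from the center maps to radius 2*sin(c/2)."""
    c = np.arccos(np.clip(v[:, 2], -1, 1))
    lam = np.arctan2(v[:, 1], v[:, 0])
    r = 2.0 * np.sin(c / 2.0)
    return np.column_stack([r * np.cos(lam), r * np.sin(lam)])

d_map = np.linalg.norm(lambert(p) - lambert(q), axis=1)
log_ratio = np.log(d_map / d_globe)
log_ratio -= log_ratio.mean()        # allow an optimal overall map scale
s = np.sqrt(np.mean(log_ratio**2))   # should land near the quoted s = 0.343
```

The force-based technique described above amounts to iteratively nudging mapped points along the gradient of this same objective, so the estimator doubles as the inner loop of that optimization.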