We use the $k$-nearest neighbor probability distribution function ($k$NN-PDF; Banerjee & Abel 2021) to assess convergence in a scale-free $N$-body simulation. Compared to our previous two-point analysis, the $k$NN-PDF allows us to quantify our results in the language of halos and numbers of particles, while also incorporating non-Gaussian information. We find good convergence at 32 particles and above for densities typical of halos, while 16 particles and fewer appear unconverged. Halving the softening length extends convergence to higher densities, but not to fewer particles. Our analysis is less sensitive to voids, but over the limited range of underdensities we probe, we find evidence for convergence at 16 particles and greater even in sparse voids.
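To make the measurement concrete, here is a minimal sketch of a $k$NN-PDF estimator under stated assumptions: query points are drawn uniformly in a periodic box, distances to the $k$-th nearest particle are found with `scipy.spatial.cKDTree`, and those distances are histogrammed. The function name and the `positions`/`boxsize` inputs are illustrative, not the authors' actual pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_pdf(positions, boxsize, k=8, n_query=100_000, bins=64, seed=0):
    """Estimate the kNN-PDF: the distribution of distances from
    volume-filling random query points to their k-th nearest particle."""
    rng = np.random.default_rng(seed)
    queries = rng.uniform(0.0, boxsize, size=(n_query, 3))
    tree = cKDTree(positions, boxsize=boxsize)   # periodic neighbor search
    dist, _ = tree.query(queries, k=k)           # distances to the 1st..k-th NN
    d_k = dist[:, -1] if k > 1 else dist         # keep the k-th neighbor only
    pdf, edges = np.histogram(d_k, bins=bins, density=True)
    return 0.5 * (edges[:-1] + edges[1:]), pdf   # bin centers and PDF values
```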
Cross-correlations between datasets are used in many different contexts in cosmological analyses. Recently, $k$-Nearest Neighbor Cumulative Distribution Functions ($k{\rm NN}$-${\rm CDF}$s) were shown to be sensitive probes of cosmological (auto) clustering. In this paper, we extend the framework of nearest neighbor measurements to describe joint distributions of, and correlations between, two datasets. We describe the measurement of joint $k{\rm NN}$-${\rm CDF}$s and show that these measurements are sensitive to all possible connected $N$-point functions that can be defined in terms of the two datasets. We describe how the cross-correlations can be isolated by combining measurements of the joint $k{\rm NN}$-${\rm CDF}$s with those measured from the individual datasets. We demonstrate the application of these measurements in the context of Gaussian density fields, as well as for fully nonlinear cosmological datasets. Using a Fisher analysis, we show that halo-matter cross-correlations, as measured through nearest neighbor statistics, are more sensitive to the underlying cosmological parameters than traditional two-point cross-correlation measurements over the same range of scales. Finally, we demonstrate how the nearest neighbor cross-correlations can robustly detect correlations between sparse samples -- the same regime where two-point cross-correlation measurements are dominated by noise.
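The joint measurement described above can be sketched as follows, under the same illustrative assumptions as before (uniform random query points, periodic box, `scipy.spatial.cKDTree`): for each query point one records the $k$-th nearest neighbor distance in each dataset, the joint CDF is the fraction of queries for which both distances fall below $r$, and subtracting the product of the individual CDFs isolates the excess over the uncorrelated expectation. The function and argument names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def joint_knn_cdf(data_a, data_b, boxsize, k=1, radii=None,
                  n_query=100_000, seed=0):
    """Joint kNN-CDF of two tracer sets: the fraction of random query
    points whose k-th nearest neighbor in *both* datasets lies within r."""
    rng = np.random.default_rng(seed)
    queries = rng.uniform(0.0, boxsize, size=(n_query, 3))
    d_a = cKDTree(data_a, boxsize=boxsize).query(queries, k=k)[0]
    d_b = cKDTree(data_b, boxsize=boxsize).query(queries, k=k)[0]
    if k > 1:
        d_a, d_b = d_a[:, -1], d_b[:, -1]        # keep k-th neighbor distances
    if radii is None:
        radii = np.linspace(0.0, boxsize / 4, 50)
    joint = np.array([np.mean((d_a < r) & (d_b < r)) for r in radii])
    cdf_a = np.array([np.mean(d_a < r) for r in radii])
    cdf_b = np.array([np.mean(d_b < r) for r in radii])
    return radii, joint, joint - cdf_a * cdf_b   # excess over independence
```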
The use of summary statistics beyond the two-point correlation function to analyze non-Gaussian clustering on small scales is an active field of research in cosmology. In this paper, we explore a new set of summary statistics -- the $k$-Nearest Neighbor Cumulative Distribution Functions ($k{\rm NN}$-${\rm CDF}$s). This is the empirical cumulative distribution function of distances from a set of volume-filling, Poisson-distributed random points to the $k$-nearest data points, and it is sensitive to all connected $N$-point correlations in the data. The $k{\rm NN}$-${\rm CDF}$s can be used to measure counts-in-cells, void probability distributions, and higher $N$-point correlation functions, all within the same formalism, exploiting fast searches with spatial tree data structures. We demonstrate how they can be computed efficiently from various datasets -- both discrete points and, through a generalization, continuous fields. We use data from a large suite of $N$-body simulations to explore the sensitivity of this new statistic to various cosmological parameters, compared to the two-point correlation function over the same range of scales. We demonstrate that the use of $k{\rm NN}$-${\rm CDF}$s improves the constraints on the cosmological parameters by more than a factor of $2$ when applied to the clustering of dark matter on scales between $10\,h^{-1}{\rm Mpc}$ and $40\,h^{-1}{\rm Mpc}$. We also show that the relative improvement is even greater when applied on the same scales to the clustering of halos in the simulations at a fixed number density, both in real space and in redshift space. Since the $k{\rm NN}$-${\rm CDF}$s are sensitive to all higher order connected correlation functions in the data, the gains over traditional two-point analyses are expected to grow as progressively smaller scales are included in the analysis of cosmological data.
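As the abstract notes, the same distances also yield counts-in-cells and the void probability function, since ${\rm CDF}_k(r)$ is the probability that a sphere of radius $r$ around a random point contains at least $k$ data points, so $P(N = k\,|\,r) = {\rm CDF}_k(r) - {\rm CDF}_{k+1}(r)$. A minimal sketch of this connection, with illustrative names and defaults:

```python
import numpy as np
from scipy.spatial import cKDTree

def cic_from_knn(positions, boxsize, r, kmax=8, n_query=100_000, seed=0):
    """Counts-in-cells and the void probability from kNN distances alone:
    CDF_k(r) = P(>= k points within r), so P(exactly k points within r)
    = CDF_k(r) - CDF_{k+1}(r), and the VPF is P(0 points) = 1 - CDF_1(r)."""
    rng = np.random.default_rng(seed)
    queries = rng.uniform(0.0, boxsize, size=(n_query, 3))
    dist, _ = cKDTree(positions, boxsize=boxsize).query(queries, k=kmax + 1)
    cdf = np.mean(dist < r, axis=0)        # cdf[j] = CDF_{j+1}(r)
    p_counts = np.empty(kmax + 1)
    p_counts[0] = 1.0 - cdf[0]             # void probability at radius r
    p_counts[1:] = cdf[:-1] - cdf[1:]      # P(N = k) for k = 1..kmax
    return p_counts
```

A single tree search to the $(k_{\rm max}{+}1)$-th neighbor supplies every ${\rm CDF}_k$ at once, which is why the formalism is computationally cheap.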
Random walks that include non-nearest-neighbor jumps arise in many real situations, such as the diffusion of adatoms, and have found numerous applications, including the PageRank search algorithm; however, theoretical results for this dynamical process remain comparatively scarce. In this paper, we present a study of mixed random walks in a family of fractal scale-free networks, where both nearest-neighbor and next-nearest-neighbor jumps are included. We focus on the trapping problem in this network family, a particular case of random walks with a perfect trap fixed at the central high-degree node. We derive analytical expressions for the average trapping time (ATT), a quantitative indicator of the efficiency of the trapping process, using two different methods whose results are consistent with each other. Furthermore, we analytically determine all the eigenvalues and their multiplicities for the fundamental matrix characterizing the dynamical process. Our results show that although next-nearest-neighbor jumps have no effect on the leading scaling of the trapping efficiency, they can strongly affect the prefactor of the ATT, providing insight into random-walk processes in complex systems.
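A minimal numerical sketch of the trapping setup described above, for a generic graph rather than the paper's fractal scale-free family: a row-stochastic transition matrix mixes nearest-neighbor and next-nearest-neighbor jumps, and the trapping times follow exactly from the fundamental matrix $(I - Q)^{-1}$ of the absorbing chain. The mixing parameter `p_next` and the fallback behavior for nodes without next-nearest neighbors are assumptions of this sketch.

```python
import numpy as np

def average_trapping_time(adj, trap, p_next=0.3):
    """Mean first-passage time to an absorbing trap node for a mixed walk:
    with probability 1 - p_next hop to a uniform nearest neighbor, with
    probability p_next to a next-nearest neighbor (two-hop contact)."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    adj2 = ((adj @ adj) > 0).astype(float)       # two-hop reachability
    np.fill_diagonal(adj2, 0.0)
    adj2[adj > 0] = 0.0                          # strictly next-nearest pairs
    deg, deg2 = adj.sum(1, keepdims=True), adj2.sum(1, keepdims=True)
    alpha = np.where(deg2 > 0, p_next, 0.0)      # NN-only if no two-hop contacts
    W2 = np.zeros_like(adj2)
    np.divide(adj2, deg2, out=W2, where=deg2 > 0)
    W = (1 - alpha) * adj / deg + alpha * W2     # row-stochastic transitions
    keep = np.arange(n) != trap
    Q = W[np.ix_(keep, keep)]                    # dynamics among non-trap nodes
    t = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))  # trapping times
    return t.mean()                              # ATT over uniform start nodes
```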
We investigate the application of Hybrid Effective Field Theory (HEFT) -- which combines a Lagrangian bias expansion with subsequent particle dynamics from $N$-body simulations -- to the modeling of $k$-Nearest Neighbor Cumulative Distribution Functions ($k{\rm NN}$-${\rm CDF}$s) of biased tracers of the cosmological matter field. The $k{\rm NN}$-${\rm CDF}$s are sensitive to all higher order connected $N$-point functions in the data, but are computationally cheap to compute. We develop the formalism to predict the $k{\rm NN}$-${\rm CDF}$s of discrete tracers of a continuous field from the statistics of the continuous field itself. Using this formalism, we demonstrate how $k{\rm NN}$-${\rm CDF}$ statistics of a set of biased tracers, such as halos or galaxies, of the cosmological matter field can be modeled given a set of low-redshift HEFT component fields and bias parameter values. These are the same ingredients needed to predict the two-point clustering. For a specific sample of halos, we show that both the two-point clustering \textit{and} the $k{\rm NN}$-${\rm CDF}$s can be well-fit on quasi-linear scales ($\gtrsim 20\,h^{-1}{\rm Mpc}$) by the second-order HEFT formalism with the \textit{same values} of the bias parameters, implying that joint modeling of the two is possible. Finally, using a Fisher matrix analysis, we show that including $k{\rm NN}$-${\rm CDF}$ measurements over the range of allowed scales in the HEFT framework can improve the constraints on $\sigma_8$ by roughly a factor of $3$, compared to the case where only two-point measurements are considered. Combining the statistical power of $k{\rm NN}$ measurements with the modeling power of HEFT, therefore, represents an exciting prospect for extracting greater information from small-scale cosmological clustering.
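The Fisher forecast mentioned above can be sketched generically: given a model mapping parameters to a data vector (e.g. binned $k{\rm NN}$-${\rm CDF}$ values, two-point measurements, or both) and a covariance estimated from simulations, the Fisher matrix follows from finite-difference derivatives. This is the standard construction, not the authors' specific pipeline; all names are illustrative.

```python
import numpy as np

def fisher_errors(model, theta0, cov, step=1e-3):
    """Fisher forecast F_ij = d_i mu^T C^{-1} d_j mu for a data vector
    mu(theta); returns the marginalized 1-sigma parameter errors."""
    theta0 = np.asarray(theta0, dtype=float)
    cinv = np.linalg.inv(cov)
    derivs = []
    for i in range(theta0.size):
        dt = np.zeros_like(theta0)
        dt[i] = step * max(abs(theta0[i]), 1.0)      # central difference step
        derivs.append((model(theta0 + dt) - model(theta0 - dt)) / (2 * dt[i]))
    D = np.array(derivs)                             # (n_params, n_data)
    F = D @ cinv @ D.T
    return np.sqrt(np.diag(np.linalg.inv(F)))
```

Enlarging the data vector from two-point measurements alone to two-point plus $k{\rm NN}$-${\rm CDF}$ bins, at fixed covariance treatment, is how improvements like the factor of $3$ on $\sigma_8$ quoted above are assessed.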
The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples. This raises a major question: which distance measure should be used for the KNN classifier, given the large number of distance and similarity measures available? This review attempts to answer that question by evaluating the performance (measured by accuracy, precision, and recall) of KNN using a large number of distance measures, tested on a number of real-world datasets, with and without different levels of added noise. The experimental results show that the performance of the KNN classifier depends significantly on the distance used, with large gaps between the performances of different distances. We found that a recently proposed non-convex distance performed best on most datasets compared to the other tested distances. In addition, the performance of KNN with this top-performing distance degraded by only about $20\%$ even when the noise level reached $90\%$; this holds for most of the other distances as well. This means that the KNN classifier using any of the top $10$ distances tolerates noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise than others.
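For reference, a minimal KNN classifier with a pluggable distance measure looks like the sketch below; swapping the `dist` callable is all that is needed to compare measures as the review does. The implementation and names are illustrative, not the review's benchmarking code.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5, dist=None):
    """Plain KNN classifier: label each test example by majority vote
    among its k nearest training examples under the chosen distance."""
    if dist is None:                         # default: Euclidean distance
        dist = lambda A, x: np.sqrt(((A - x) ** 2).sum(-1))
    preds = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(X_test):
        d = dist(X_train, x)                 # distances to all training points
        nearest = np.argsort(d)[:k]          # indices of the k closest examples
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds[i] = labels[np.argmax(counts)] # majority vote over neighbors
    return preds

# Swapping in another measure, e.g. Manhattan (cityblock) distance:
# preds = knn_predict(Xtr, ytr, Xte, k=5,
#                     dist=lambda A, x: np.abs(A - x).sum(-1))
```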