Several approaches to testing the hypothesis that two histograms are drawn from the same distribution are investigated. We note that single-sample continuous distribution tests may be adapted to this two-sample grouped-data situation. The difficulty of not having a fully specified null hypothesis is an important consideration in the general case, and care is required in estimating probabilities with "toy" Monte Carlo simulations. The performance of several common tests is compared; no single test performs best in all situations.
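As a concrete, hedged illustration of the kind of comparison discussed above (not the paper's own procedure), the sketch below computes a standard two-sample chi^2 statistic for two histograms and estimates its p-value with "toy" Monte Carlo resampling from the pooled bin fractions; the bin contents, number of toys, and resampling scheme are illustrative assumptions.

```python
# Minimal sketch: two-sample chi^2 statistic for histograms with a common
# binning, with the p-value estimated by "toy" Monte Carlo under the pooled
# null. All numerical inputs are made up for the example.
import numpy as np

def two_sample_chi2(u, v):
    """Chi^2 statistic for two histograms u, v with the same binning."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    N, M = u.sum(), v.sum()
    mask = (u + v) > 0                      # skip empty bins
    t = (u + v)[mask]
    return np.sum((M * u[mask] - N * v[mask]) ** 2 / (N * M * t))

def toy_p_value(u, v, n_toys=10_000, rng=np.random.default_rng(0)):
    """Estimate P(chi^2 >= observed) by resampling both histograms from the
    pooled bin fractions (multinomial toys with the observed totals)."""
    obs = two_sample_chi2(u, v)
    pooled = (np.asarray(u) + np.asarray(v)) / (sum(u) + sum(v))
    count = 0
    for _ in range(n_toys):
        tu = rng.multinomial(int(sum(u)), pooled)
        tv = rng.multinomial(int(sum(v)), pooled)
        count += two_sample_chi2(tu, tv) >= obs
    return count / n_toys

# Example: two 8-bin histograms
u = [5, 12, 30, 41, 38, 22, 9, 3]
v = [8, 15, 25, 45, 33, 25, 12, 2]
print(two_sample_chi2(u, v), toy_p_value(u, v))
```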
Straightforward methods for adapting the familiar chi^2 statistic to histograms of discrete events and other Poisson-distributed data generally yield biased estimates of the parameters of a model. The bias can be important even when the total number of events is large. For the case of estimating a microcalorimeter's energy resolution at 6 keV from the observed shape of the Mn K-alpha fluorescence spectrum, a poor choice of chi^2 can lead to biases of at least 10% in the estimated resolution when up to thousands of photons are observed. The best remedy is a Poisson maximum-likelihood fit, which can be performed through a simple modification of the standard Levenberg-Marquardt algorithm for chi^2 minimization. Where this modification is not possible, another approach allows iterative approximation of the maximum-likelihood fit.
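The following sketch illustrates the contrast described above under simple assumptions: a Gaussian peak is fitted to Poisson-distributed bin counts, once by a "Neyman" chi^2 with data-based variances and once by minimizing the Poisson maximum-likelihood (Cash) statistic. It is not the paper's modified Levenberg-Marquardt implementation; the model, binning, and counts are invented for the example.

```python
# Illustrative sketch: fit the width of a Gaussian peak to Poisson bin counts
# with two different objective functions and compare the fitted widths.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
edges = np.linspace(-4.0, 4.0, 41)
n_events, true_sigma = 500, 1.0

def expected(params):
    """Expected counts per bin for a Gaussian peak centred at zero."""
    amp, sigma = params
    sigma = abs(sigma)                       # guard against negative trial widths
    return amp * np.diff(norm.cdf(edges, loc=0.0, scale=sigma))

data = rng.poisson(expected([n_events, true_sigma]))

def neyman_chi2(params):
    m = expected(params)
    d = np.where(data > 0, data, 1.0)        # common (biased) fix-up for empty bins
    return np.sum((data - m) ** 2 / d)

def cash(params):
    m = np.clip(expected(params), 1e-12, None)
    # 2*(m - d + d*ln(d/m)); bins with d = 0 contribute 2*m
    term = np.where(data > 0, data * np.log(data / m) - (data - m), m)
    return 2.0 * np.sum(term)

for name, fun in [("Neyman chi^2", neyman_chi2), ("Poisson ML", cash)]:
    res = minimize(fun, x0=[n_events, 1.2], method="Nelder-Mead")
    print(f"{name:>12}: sigma_hat = {abs(res.x[1]):.3f}")
```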
We present the asymptotic distribution for two-sided tests based on the profile likelihood ratio with lower and upper boundaries on the parameter of interest. This situation is relevant for branching ratios and the elements of unitary matrices such as the CKM matrix.
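As a toy illustration of why boundaries matter (this is not the paper's derivation), the sketch below builds the profile-likelihood-ratio statistic for a branching ratio restricted to $[0, 1]$ and measured with Gaussian resolution, then compares its toy distribution with the unbounded chi^2 with one degree of freedom; all numerical values are arbitrary.

```python
# Toy illustration: profile-likelihood-ratio statistic for a bounded parameter.
# The unconditional maximum is clipped to the physical range, which distorts
# the usual chi^2_1 asymptotics near the boundaries.
import numpy as np
from scipy.stats import chi2

sigma, mu_true, mu_test = 0.1, 0.05, 0.05     # resolution, true and tested value
rng = np.random.default_rng(2)

def t_mu(x, mu):
    """-2 ln lambda(mu) for a single Gaussian measurement x, with the
    maximum-likelihood estimate restricted to the physical interval [0, 1]."""
    mu_hat = np.clip(x, 0.0, 1.0)
    return ((x - mu) ** 2 - (x - mu_hat) ** 2) / sigma ** 2

toys = t_mu(rng.normal(mu_true, sigma, size=100_000), mu_test)
threshold = 3.84                               # 95% quantile of chi^2_1
print("P(t_mu > 3.84), toys    :", np.mean(toys > threshold))
print("P(t_mu > 3.84), chi^2_1 :", chi2.sf(threshold, df=1))
```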
We investigate the problem of identity testing for multidimensional histogram distributions. A distribution $p: D \rightarrow \mathbb{R}_+$, where $D \subseteq \mathbb{R}^d$, is called a $k$-histogram if there exists a partition of the domain into $k$ axis-aligned rectangles such that $p$ is constant within each such rectangle. Histograms are one of the most fundamental nonparametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with \emph{sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, let $q$ be an unknown $d$-dimensional $k$-histogram distribution in fixed dimension $d$, and $p$ be an explicitly given $d$-dimensional $k$-histogram. We want to correctly distinguish, with probability at least $2/3$, between the case that $p = q$ versus $\|p-q\|_1 \geq \epsilon$. We design an algorithm for this hypothesis testing problem with sample complexity $O((\sqrt{k}/\epsilon^2) 2^{d/2} \log^{2.5 d}(k/\epsilon))$ that runs in sample-polynomial time. Our algorithm is robust to model misspecification, i.e., it succeeds even if $q$ is only promised to be \emph{close} to a $k$-histogram. Moreover, for $k = 2^{\Omega(d)}$, we show a sample complexity lower bound of $(\sqrt{k}/\epsilon^2) \cdot \Omega(\log(k)/d)^{d-1}$ when $d \geq 2$. That is, for any fixed dimension $d$, our upper and lower bounds are nearly matching. Prior to our work, the sample complexity of the $d=1$ case was well understood, but no algorithm with sub-learning sample complexity was known, even for $d=2$. Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.
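To make the problem setup concrete (this is only a naive baseline, not the sub-learning tester described above), the sketch below fixes an explicit two-dimensional $k$-histogram $p$, draws samples from an unknown $q$, and compares empirical and expected rectangle masses in $L_1$; such a plug-in comparison needs on the order of $k$ samples rather than the roughly $\sqrt{k}$ achieved by the tester. Rectangles, masses, and the acceptance threshold are arbitrary.

```python
# Naive plug-in identity check against an explicit 2-D k-histogram p.
import numpy as np

# Explicit p: (x_lo, x_hi, y_lo, y_hi, probability mass); masses sum to 1.
rects = [(0.0, 0.5, 0.0, 0.5, 0.4), (0.5, 1.0, 0.0, 0.5, 0.1),
         (0.0, 0.5, 0.5, 1.0, 0.1), (0.5, 1.0, 0.5, 1.0, 0.4)]

def rect_index(pts):
    """Assign each sample to the rectangle of p that contains it."""
    idx = np.full(len(pts), -1)
    for j, (x0, x1, y0, y1, _) in enumerate(rects):
        inside = ((pts[:, 0] >= x0) & (pts[:, 0] < x1) &
                  (pts[:, 1] >= y0) & (pts[:, 1] < y1))
        idx[inside] = j
    return idx

def naive_identity_test(samples, threshold=0.1):
    """Accept 'q = p' if the empirical rectangle masses are L1-close to p's."""
    counts = np.bincount(rect_index(samples), minlength=len(rects))
    emp = counts / len(samples)
    l1 = np.sum(np.abs(emp - np.array([r[4] for r in rects])))
    return l1 <= threshold

# Draw samples from p itself (mass-weighted rectangle, then uniform inside it).
rng = np.random.default_rng(3)
choice = rng.choice(len(rects), size=2000, p=[r[4] for r in rects])
pts = np.column_stack([
    rng.uniform([rects[c][0] for c in choice], [rects[c][1] for c in choice]),
    rng.uniform([rects[c][2] for c in choice], [rects[c][3] for c in choice]),
])
print(naive_identity_test(pts))
```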
Identifying frequencies with low signal-to-noise ratios in time series of stellar photometry and spectroscopy, and measuring their amplitude ratios and peak widths accurately, are critical goals for asteroseismology. They remain challenging for time series that contain gaps or are not sampled at a constant rate, even with modern Discrete Fourier Transform (DFT) software. In addition, the False-Alarm Probability introduced by Lomb and Scargle is an approximation that becomes less reliable in time series with longer data gaps. A rigorous statistical treatment of how to determine the significance of a peak in a DFT, called SigSpec, is presented here. SigSpec is based on an analytical solution of the probability that a DFT peak of a given amplitude does not arise from white noise in a non-equally spaced data set. The underlying Probability Density Function (PDF) of the amplitude spectrum generated by white noise can be derived explicitly if both frequency and phase are incorporated into the solution. In this paper, I define and evaluate an unbiased statistical estimator, the spectral significance, which depends on frequency, amplitude, and phase in the DFT, and which takes into account the time-domain sampling. I also compare this estimator to results from other well-established techniques and demonstrate the effectiveness of SigSpec with a few examples of ground- and space-based photometric data, illustrating how SigSpec deals with the effects of noise and time-domain sampling in determining significant frequencies.
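For orientation, the sketch below computes the DFT amplitude spectrum of an unevenly sampled time series and converts each amplitude into a naive significance $-\log_{10} P$ using the equal-sampling white-noise approximation $P(A) = \exp(-N A^2 / 4\sigma^2)$; it is this kind of approximation that SigSpec replaces with an analytical solution accounting for the actual time sampling and phase. The sampling times, injected signal, and trial frequencies are invented.

```python
# Sketch only, not the SigSpec estimator: DFT amplitudes of unevenly sampled
# data, turned into a naive white-noise significance.
import numpy as np

rng = np.random.default_rng(4)
t = np.sort(rng.uniform(0.0, 30.0, size=400))          # uneven sampling (days)
x = 0.02 * np.sin(2 * np.pi * 2.7 * t) + rng.normal(0.0, 0.05, size=t.size)
x -= x.mean()

freqs = np.linspace(0.1, 5.0, 2000)                     # trial frequencies (1/day)
N, var = t.size, np.var(x)

def dft_amplitude(f):
    """Amplitude of the discrete Fourier transform at frequency f."""
    return 2.0 / N * np.abs(np.sum(x * np.exp(-2j * np.pi * f * t)))

amps = np.array([dft_amplitude(f) for f in freqs])
sig = (N * amps ** 2 / (4.0 * var)) / np.log(10.0)      # -log10 of the naive FAP
best = np.argmax(sig)
print(f"most significant peak: f = {freqs[best]:.3f}, sig = {sig[best]:.1f}")
```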
Differential measurements of particle collisions or decays can provide stringent constraints on physics beyond the Standard Model of particle physics. In particular, the distributions of the kinematical and angular variables that characterise heavy meson multibody decays are non-trivial and carry the signature of the underlying interaction physics. In the era of high luminosity opened by the advent of the Large Hadron Collider and of Flavor Factories, differential measurements are less and less dominated by statistical precision and require a precise determination of efficiencies that depend simultaneously on several variables and do not factorise in these variables. This document is a reflection on the potential of multivariate techniques for the determination of such multidimensional efficiencies. We carried out two case studies showing that multilayer perceptron neural networks can determine and correct for the distortions introduced by reconstruction and selection criteria in the multidimensional phase space of the decays $B^{0} \rightarrow K^{*0}(\rightarrow K^{+}\pi^{-}) \mu^{+}\mu^{-}$ and $D^{0} \rightarrow K^{-}\pi^{+}\pi^{+}\pi^{-}$, at the cost of minimal analysis effort. We conclude that this method can already be used for measurements whose statistical precision does not yet reach the percent level and that, with more sophisticated machine learning methods, the aforementioned potential is very promising.
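A minimal sketch of the general idea, assuming a toy phase space and acceptance rather than the decays and selections studied in the document: a multilayer perceptron is trained on generated events labelled as accepted or rejected, so that its output estimates the multidimensional efficiency, which is then used to reweight the accepted sample. Hyper-parameters and the efficiency shape are invented for the example.

```python
# Hedged sketch: estimate a non-factorising selection efficiency eps(x) with an
# MLP and correct the accepted sample with per-event weights 1/eps(x).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(5)
n_gen = 50_000
# Toy phase space: three kinematic/angular variables in [-1, 1].
x_gen = rng.uniform(-1.0, 1.0, size=(n_gen, 3))

def true_eff(x):
    """Toy detector acceptance that does not factorise in the variables."""
    return 0.3 + 0.5 * (1 + x[:, 0] * x[:, 1]) / 2 * (1 - 0.4 * x[:, 2] ** 2)

accepted = rng.uniform(size=n_gen) < true_eff(x_gen)

# Train the MLP to regress P(accepted | x); predict_proba gives eps_hat(x).
clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
clf.fit(x_gen, accepted)

eps_hat = clf.predict_proba(x_gen[accepted])[:, 1]
weights = 1.0 / np.clip(eps_hat, 1e-3, None)

# The weighted accepted sample should reproduce the flat generated distribution.
print("mean weight:", weights.mean(),
      "effective events:", weights.sum() ** 2 / np.sum(weights ** 2))
```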