Weak lensing by large-scale structure is a powerful probe of cosmology if the apparent alignments in the shapes of distant galaxies can be accurately measured. We study the performance of a fully data-driven approach, based on MetaDetection, focusing on the more realistic case of observations with an anisotropic PSF. Under the assumption that PSF anisotropy is the only source of additive shear bias, we show how unbiased shear estimates can be obtained from the observed data alone. To do so, we exploit the finding that the multiplicative shear bias obtained with MetaDetection is nearly insensitive to the PSF ellipticity. In practice, this assumption can be validated by comparing the empirical corrections obtained from observations to those from simulated data. We show that our data-driven approach meets the stringent requirements of upcoming space- and ground-based surveys, although further optimisation is possible.
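Below is a minimal sketch of the kind of data-driven additive-bias correction described above, assuming the measured shear responds linearly to the local PSF ellipticity. The leakage coefficient `alpha`, the synthetic catalogue, and all numbers are illustrative assumptions, not the actual MetaDetection pipeline.

```python
import numpy as np

# Illustrative catalogue: per-galaxy measured shear g_obs and the local PSF
# ellipticity e_psf (one component each, for brevity). In the data-driven
# scheme the additive term is assumed to trace the PSF anisotropy only.
rng = np.random.default_rng(42)
e_psf = rng.normal(0.0, 0.02, size=100_000)      # PSF ellipticity per galaxy
g_true = rng.normal(0.0, 0.26, size=e_psf.size)  # intrinsic + cosmic shear
alpha_true = 0.03                                # hypothetical PSF leakage
g_obs = g_true + alpha_true * e_psf              # observed shear estimates

# Estimate the leakage coefficient directly from the data by regressing the
# measured shear against the PSF ellipticity ...
alpha_hat, _ = np.polyfit(e_psf, g_obs, 1)

# ... and subtract the inferred additive term from each shear estimate.
g_corr = g_obs - alpha_hat * e_psf

print(f"estimated leakage alpha = {alpha_hat:.4f}")
print(f"residual additive bias  = {np.mean(g_corr):.2e}")
```

Because the regression uses only the observed catalogue, the correction can be recomputed on simulations with the same selection, which is the comparison the abstract proposes for validating the underlying assumption.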
In recent years, there has been growing interest in using Precipitable Water Vapor (PWV) derived from Global Positioning System (GPS) signal delays to predict rainfall. However, the occurrence of rainfall depends on a myriad of atmospheric parameters. This paper proposes a systematic approach to analyzing the various parameters that affect precipitation in the atmosphere. Different ground-based weather features, such as Temperature, Relative Humidity, Dew Point, Solar Radiation, and PWV, along with Seasonal and Diurnal variables, are identified, and a detailed feature correlation study is presented. While all features play a significant role in rainfall classification, only a few of them, such as PWV, Solar Radiation, and the Seasonal and Diurnal features, stand out for rainfall prediction. Based on these findings, an optimum set of features is used in a data-driven machine learning algorithm for rainfall prediction. The experimental evaluation using a four-year (2012-2015) database shows a true detection rate of 80.4%, a false alarm rate of 20.3%, and an overall accuracy of 79.6%. Compared to the existing literature, our method significantly reduces the false alarm rate.
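A hedged sketch of the feature-based rainfall classifier described above, using scikit-learn. The column names (`pwv`, `solar_rad`, `season`, `hour`), the synthetic labels, and the random-forest choice are illustrative assumptions rather than the paper's exact model; the detection-rate, false-alarm, and accuracy metrics are computed from the confusion matrix as quoted in the abstract.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Illustrative weather table; in practice this would be the 2012-2015 record
# of PWV, solar radiation, and seasonal/diurnal variables with a rain label.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "pwv": rng.uniform(20, 60, n),         # precipitable water vapor [mm]
    "solar_rad": rng.uniform(0, 1000, n),  # solar radiation [W/m^2]
    "season": rng.integers(0, 4, n),       # encoded season
    "hour": rng.integers(0, 24, n),        # diurnal variable
})
# Toy label: rain is more likely for high PWV and low solar radiation.
df["rain"] = ((df.pwv > 45) & (df.solar_rad < 400)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="rain"), df["rain"], test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# True detection rate, false alarm rate, and accuracy from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
print(f"true detection rate: {tp / (tp + fn):.3f}")
print(f"false alarm rate:    {fp / (fp + tn):.3f}")
print(f"accuracy:            {(tp + tn) / (tp + tn + fp + fn):.3f}")
```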
Data-driven evolutionary optimization has witnessed great success in solving complex real-world optimization problems. However, existing data-driven optimization algorithms require that all data be stored centrally, which is not always practical and may be vulnerable to privacy leakage and security threats if the data must be collected from different devices. To address this issue, this paper proposes a federated data-driven evolutionary optimization framework that is able to perform data-driven optimization when the data are distributed across multiple devices. On the basis of federated learning, a sorted model aggregation method is developed for aggregating local surrogates based on radial-basis-function networks. In addition, a federated surrogate management strategy is suggested by designing an acquisition function that takes into account the information of both the global and local surrogate models. Empirical studies on a set of widely used benchmark functions in the presence of various data distributions demonstrate the effectiveness of the proposed framework.
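A minimal sketch of the federated surrogate idea under simplifying assumptions: each device fits a local radial-basis-function surrogate on its own evaluated samples, and only the surrogate weights are shared. Plain weight averaging stands in here for the paper's sorted model aggregation, and the function names, shared centers, and toy objective are illustrative.

```python
import numpy as np

def fit_rbf(X, y, centers, width=1.0):
    """Fit a Gaussian radial-basis-function surrogate by least squares."""
    Phi = np.exp(-np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
                 / (2 * width ** 2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def predict_rbf(X, centers, w, width=1.0):
    Phi = np.exp(-np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
                 / (2 * width ** 2))
    return Phi @ w

# Each device holds its own evaluated samples of the (expensive) objective.
rng = np.random.default_rng(1)
objective = lambda X: np.sum(X ** 2, axis=1)        # toy benchmark function
devices = [rng.uniform(-5, 5, size=(40, 2)) for _ in range(3)]
centers = rng.uniform(-5, 5, size=(10, 2))          # shared RBF centers

# Local surrogates are trained on-device; only their weights leave the device.
local_weights = [fit_rbf(X, objective(X), centers) for X in devices]

# Plain averaging of the weights stands in for sorted model aggregation.
global_w = np.mean(local_weights, axis=0)

X_new = rng.uniform(-5, 5, size=(5, 2))
print("global surrogate prediction:", predict_rbf(X_new, centers, global_w))
print("true objective:             ", objective(X_new))
```

In a full surrogate-managed loop, candidates proposed by the evolutionary search would be ranked by an acquisition function combining this global surrogate with the local ones, and only the selected candidates would be evaluated on the true objective.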
Data-driven algorithm design is an important aspect of modern data science and algorithm design. Rather than using off-the-shelf algorithms that only have worst-case performance guarantees, practitioners often optimize over large families of parametrized algorithms and tune the parameters of these algorithms using a training set of problem instances from their domain, in order to determine a configuration with high expected performance over future instances. However, most of this work comes with no performance guarantees. The challenge is that for many combinatorial problems of significant importance, including partitioning, subset selection, and alignment problems, a small tweak to the parameters can cause a cascade of changes in the algorithm's behavior, so the algorithm's performance is a discontinuous function of its parameters. In this chapter, we survey recent work that helps put data-driven combinatorial algorithm design on firm foundations. We provide strong computational and statistical performance guarantees, both for the batch and online scenarios, where a collection of typical problem instances from the given application are presented either all at once or in an online fashion, respectively.
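A small sketch of the batch tuning setup described above, under illustrative assumptions: a one-parameter greedy knapsack heuristic (scoring items by value / weight**rho, a standard example in this line of work) is evaluated over a training set of instances and the empirically best configuration is kept. The instance generator and parameter grid are made up; the point is only that per-instance performance is a piecewise-constant, hence discontinuous, function of the parameter.

```python
import numpy as np

def greedy_knapsack(values, weights, capacity, rho):
    """Greedy knapsack heuristic: take items in decreasing value / weight**rho order."""
    order = np.argsort(-values / weights ** rho)
    total_v, total_w = 0.0, 0.0
    for i in order:
        if total_w + weights[i] <= capacity:
            total_v += values[i]
            total_w += weights[i]
    return total_v

# Training set of problem instances drawn from the application domain (toy here).
rng = np.random.default_rng(7)
instances = [(rng.uniform(1, 10, 50), rng.uniform(1, 5, 50), 40.0) for _ in range(200)]

# Batch tuning: evaluate a grid of parameter values on the training instances
# and keep the configuration with the best average performance.
rhos = np.linspace(0.0, 2.0, 41)
avg_value = [np.mean([greedy_knapsack(v, w, c, rho) for v, w, c in instances])
             for rho in rhos]
best_rho = rhos[int(np.argmax(avg_value))]
print(f"best rho on training set: {best_rho:.2f}")
# For a single instance the objective jumps whenever the greedy ordering changes
# as rho varies, which is why standard smoothness-based learning arguments do
# not apply directly and dedicated generalization guarantees are needed.
```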
We study a data analyst's problem of acquiring data from self-interested individuals to obtain an accurate estimate of some statistic of a population, subject to an expected budget constraint. Each data holder incurs a cost, unknown to the data analyst, to acquire and report his data. The cost can be arbitrarily correlated with the data. The data analyst has an expected budget that she can use to incentivize individuals to provide their data. The goal is to design a joint acquisition-estimation mechanism that optimizes the performance of the produced estimator, without any prior information on the underlying distribution of cost and data. We investigate two types of estimation: unbiased point estimation and confidence interval estimation. Unbiased estimators: We design a truthful, individually rational, online mechanism to acquire data from individuals and output an unbiased estimator of the population mean when the data analyst has no prior information on the cost-data distribution and individuals arrive in a random order. The performance of this mechanism matches that of the optimal mechanism, which knows the true cost distribution, within a constant factor. The performance of an estimator is evaluated by its variance under the worst-case cost-data correlation. Confidence intervals: We characterize an approximately optimal (within a factor of $2$) mechanism for obtaining a confidence interval of the population mean when the data analyst knows the true cost distribution at the beginning. This mechanism is efficiently computable. We then design a truthful, individually rational, online algorithm that is only worse than the approximately optimal mechanism by a constant factor. The performance of an estimator is evaluated by its expected length under the worst-case cost-data correlation.
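A heavily simplified, offline stand-in for the acquisition-estimation problem described above (the paper's mechanisms are online and prior-free): reported costs determine a purchase probability that decreases in the cost, a Myerson-style payment keeps reporting truthful and individually rational, and Horvitz-Thompson reweighting keeps the mean estimator unbiased even when cost and data are correlated. The allocation rule A(c) = 1 - c/c_max and all numbers are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(3)
n, c_max = 10_000, 1.0

# Costs correlated with the data: individuals with larger values are costlier.
data = rng.normal(5.0, 1.0, n)
costs = np.clip((data - 2.0) / 6.0, 0.0, 0.99)      # reported costs in [0, c_max)

# Monotone allocation rule: buy with probability decreasing in the reported cost.
def alloc(c):
    return 1.0 - c / c_max

# Myerson-style payment for this allocation rule, paid unconditionally; it makes
# truthful cost reporting a dominant strategy and gives non-negative utility.
def payment(c):
    return c * alloc(c) + (c_max - c) ** 2 / (2 * c_max)

selected = rng.random(n) < alloc(costs)

# Horvitz-Thompson reweighting: each acquired point is weighted by 1/alloc(c),
# so the mean estimator stays unbiased despite cost-dependent acquisition.
ht_estimate = np.sum(data[selected] / alloc(costs[selected])) / n

print(f"true sample mean:     {data.mean():.3f}")
print(f"HT estimate:          {ht_estimate:.3f}")
print(f"naive acquired mean:  {data[selected].mean():.3f}  (biased)")
print(f"total payment:        {payment(costs).sum():.1f}")
```

The naive mean over acquired points is biased because costly (here, high-value) individuals are bought less often; the reweighted estimator removes that bias at the price of higher variance, which is exactly the quantity the paper's mechanisms are designed to control under a budget.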
We present a new method to discriminate periodic from non-periodic irregularly sampled lightcurves. We introduce a periodic kernel and maximize a similarity measure derived from information theory to estimate the periods and a discriminator factor. We tested the method on a dataset containing 100,000 synthetic periodic and non-periodic lightcurves with various periods, amplitudes, and shapes, generated using a multivariate generative model. We correctly identified periodic and non-periodic lightcurves with a completeness of 90% and a precision of 95%, for lightcurves with a signal-to-noise ratio (SNR) larger than 0.5. We characterize the efficiency and reliability of the method using these synthetic lightcurves and apply it to the EROS-2 dataset. A crucial consideration is the speed at which the method can be executed. Using a hierarchical search and some simplifications of the parameter search, we were able to analyze 32.8 million lightcurves in 18 hours on a cluster of GPGPUs. Using the sensitivity analysis on the synthetic dataset, we infer that 0.42% of the sources in the LMC and 0.61% in the SMC show periodic behavior. The training set, the catalogs, and the source code are all available at http://timemachine.iic.harvard.edu.
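A toy sketch of period search with a periodic kernel, not the paper's exact information-theoretic similarity measure: pairwise similarity in brightness is multiplied by a periodic kernel in the time lag, and the trial period that maximizes the average product is taken as the estimate. The kernel widths, the synthetic lightcurve, and the period grid are illustrative assumptions.

```python
import numpy as np

def periodic_kernel_score(t, y, period, sigma_y=0.5, length=0.3):
    """Average pairwise similarity: a Gaussian kernel on brightness differences
    times a periodic (sin^2) kernel on time lags; peaks near the true period
    for periodic signals because in-phase pairs also have similar brightness."""
    dt = t[:, None] - t[None, :]
    dy = y[:, None] - y[None, :]
    k_time = np.exp(-np.sin(np.pi * dt / period) ** 2 / (2 * length ** 2))
    k_amp = np.exp(-dy ** 2 / (2 * sigma_y ** 2))
    return np.mean(k_time * k_amp)

# Irregularly sampled toy lightcurve with a true period of 2.5 days plus noise.
rng = np.random.default_rng(11)
t = np.sort(rng.uniform(0, 200, 300))
y = np.sin(2 * np.pi * t / 2.5) + rng.normal(0, 0.3, t.size)

trial_periods = np.linspace(0.5, 5.0, 1000)
scores = np.array([periodic_kernel_score(t, y, p) for p in trial_periods])
best = trial_periods[np.argmax(scores)]
print(f"estimated period: {best:.3f} d (true 2.5 d)")
# A discriminator can compare the peak score against the score distribution of
# scrambled (non-periodic) lightcurves to separate periodic from aperiodic sources.
```

A brute-force grid like this is what the hierarchical search and GPGPU implementation mentioned in the abstract are designed to avoid at survey scale.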