We present a practical implementation of a Monte Carlo method to estimate the significance of cross-correlations in unevenly sampled time series of data, whose statistical properties are modeled with a simple power-law power spectral density. This implementation builds on published methods; we introduce a number of improvements in the normalization of the cross-correlation function estimate and a bootstrap method for estimating the significance of the cross-correlations. A closely related matter is the estimation of a model for the light curves, which is critical for the significance estimates. We present a graphical and quantitative demonstration that uses simulations to show how common it is to obtain high cross-correlations for unrelated light curves with steep power spectral densities. This demonstration highlights the dangers of interpreting such correlations as signs of a physical connection. We show that by using interpolation and the Hanning sampling window function we are able to reduce the effects of red-noise leakage and to recover steep simple power-law power spectral densities. We also introduce the use of a Neyman construction for the estimation of the errors in the power-law index of the power spectral density. This method provides a consistent way to estimate the significance of cross-correlations in unevenly sampled time series of data.
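The core of such a Monte Carlo significance test can be illustrated with a short simulation. The sketch below assumes the Timmer & Koenig (1995) recipe for generating power-law noise, and the function names (`simulate_powerlaw_lc`, `max_abs_ccf`) are illustrative rather than from the paper; it shows how often two completely unrelated red-noise light curves produce a high peak cross-correlation.

```python
# Sketch under the above assumptions: how often do two *unrelated* light
# curves with a steep power-law PSD show a large peak cross-correlation?
import numpy as np

def simulate_powerlaw_lc(n, dt, beta, rng):
    """Simulate a light curve with PSD ~ f^(-beta) (Timmer & Koenig 1995 recipe)."""
    freqs = np.fft.rfftfreq(n, d=dt)[1:]            # skip the zero frequency
    amps = freqs ** (-beta / 2.0)
    spectrum = np.concatenate(([0.0],
                               rng.normal(size=freqs.size) * amps
                               + 1j * rng.normal(size=freqs.size) * amps))
    lc = np.fft.irfft(spectrum, n=n)
    return (lc - lc.mean()) / lc.std()

def max_abs_ccf(x, y, max_lag):
    """Peak absolute cross-correlation over lags of up to +/- max_lag samples."""
    return max(abs(np.corrcoef(x[max(0, k):len(x) + min(0, k)],
                               y[max(0, -k):len(y) + min(0, -k)])[0, 1])
               for k in range(-max_lag, max_lag + 1))

rng = np.random.default_rng(0)
n, dt, beta, n_sims = 512, 1.0, 2.0, 500
peaks = [max_abs_ccf(simulate_powerlaw_lc(n, dt, beta, rng),
                     simulate_powerlaw_lc(n, dt, beta, rng), max_lag=50)
         for _ in range(n_sims)]
print("95th percentile of peak |CCF| for unrelated pairs:", np.percentile(peaks, 95))
```

For steep spectral indices the 95th percentile of the peak cross-correlation is typically large, which is exactly why a red-noise-aware significance estimate is needed rather than a naive threshold.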
Astronomical surveys of celestial sources produce streams of noisy time series measuring flux versus time (light curves). Unlike in many other physical domains, however, large (and source-specific) temporal gaps in data arise naturally due to intranight cadence choices as well as diurnal and seasonal constraints. With nightly observations of millions of variable stars and transients from upcoming surveys, efficient and accurate discovery and classification techniques on noisy, irregularly sampled data must be employed with minimal human-in-the-loop involvement. Machine learning for inference tasks on such data traditionally requires the laborious hand-coding of domain-specific numerical summaries of raw data (features). Here we present a novel unsupervised autoencoding recurrent neural network (RNN) that makes explicit use of sampling times and known heteroskedastic noise properties. When trained on optical variable star catalogs, this network produces supervised classification models that rival other best-in-class approaches. We find that autoencoded features learned on one time-domain survey perform nearly as well when applied to another survey. These networks can continue to learn from new unlabeled observations and may be used in other unsupervised tasks such as forecasting and anomaly detection.
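As a concrete illustration of the general idea (not the authors' actual architecture), the PyTorch sketch below shows an autoencoding GRU that consumes (time gap, flux, flux uncertainty) triplets and weights the reconstruction loss by the known photometric errors; the layer sizes, class name, and loss form are all assumptions.

```python
# Minimal sketch (assumed architecture): a GRU autoencoder over
# (delta_t, flux, flux_err) triplets with an uncertainty-weighted loss.
import torch
import torch.nn as nn

class LightCurveAutoencoder(nn.Module):
    def __init__(self, hidden=64, bottleneck=16):
        super().__init__()
        self.encoder = nn.GRU(input_size=3, hidden_size=hidden, batch_first=True)
        self.to_code = nn.Linear(hidden, bottleneck)
        self.decoder = nn.GRU(input_size=bottleneck + 1, hidden_size=hidden,
                              batch_first=True)
        self.to_flux = nn.Linear(hidden, 1)

    def forward(self, x):
        # x: (batch, steps, 3) = (delta_t, flux, flux_err)
        _, h = self.encoder(x)
        code = self.to_code(h[-1])                          # fixed-length embedding
        steps = x.shape[1]
        # decode from the embedding plus the sampling-time information
        dec_in = torch.cat([code.unsqueeze(1).expand(-1, steps, -1),
                            x[:, :, :1]], dim=-1)
        out, _ = self.decoder(dec_in)
        return self.to_flux(out).squeeze(-1), code

model = LightCurveAutoencoder()
x = torch.randn(8, 50, 3)                                   # toy batch
recon, code = model(x)
# weight reconstruction errors by the known heteroskedastic uncertainties
loss = (((recon - x[:, :, 1]) / x[:, :, 2].abs().clamp(min=1e-3)) ** 2).mean()
loss.backward()
```

The fixed-length `code` vector plays the role of the learned features that can then be fed to a downstream supervised classifier.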
We introduce new methods for robust high-precision photometry from well-sampled images of a non-crowded field with a strongly varying point-spread function. For this work, we used archival imaging data of the open cluster M37 taken with the MMT 6.5m telescope. We find that the archival light curves from the original image subtraction procedure exhibit many unusual outliers, and more than 20% of the data were rejected by the simple filtering algorithm adopted in the earlier analysis. In order to achieve better photometric precision and to utilize all available data, the entire imaging database was re-analyzed with our time-series photometry technique (Multi-aperture Indexing Photometry) and a set of sophisticated calibration procedures. The merits of this approach are as follows: we find the optimal aperture for each star that maximizes the signal-to-noise ratio, and we handle peculiar situations where photometry returns misleading information by using a more suitable photometric index. We also adopt photometric de-trending based on a hierarchical clustering method, which is a very useful tool for removing systematics from light curves. Our method removes systematic variations that are shared by light curves of nearby stars, while true variability is preserved. Consequently, our method utilizes nearly 100% of the available data and reduces the rms scatter to several times smaller than that of the archival light curves for brighter stars. This new data set gives a rare opportunity to explore different types of variability on short (~minutes) and long (~1 month) time scales in open cluster stars.
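Two of the ingredients described above can be sketched in simplified form: choosing the per-star aperture that maximizes the signal-to-noise ratio, and removing a trend shared by a cluster of similar light curves. The helper names and the CCD noise model below are illustrative assumptions, not the pipeline's actual implementation.

```python
# Sketch under the above caveats.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def best_aperture(fluxes, npix, sky_var, gain=1.0, read_noise=5.0):
    """Index of the aperture with maximum S/N for one star.

    fluxes  : (n_apertures,) background-subtracted counts in each aperture
    npix    : (n_apertures,) number of pixels inside each aperture
    sky_var : per-pixel sky variance (counts^2)
    """
    fluxes = np.asarray(fluxes, dtype=float)
    npix = np.asarray(npix, dtype=float)
    noise = np.sqrt(fluxes / gain + npix * (sky_var + read_noise ** 2))
    return int(np.argmax(fluxes / noise))

def detrend_by_cluster(lcs, n_clusters=4):
    """Subtract the median light curve of each cluster of similar stars.

    lcs : (n_stars, n_epochs) array of mean-subtracted light curves
    """
    # cluster on correlation distance so stars sharing systematics group together
    dist = 1.0 - np.corrcoef(lcs)
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(z, t=n_clusters, criterion="maxclust")
    out = lcs.copy()
    for k in np.unique(labels):
        out[labels == k] -= np.median(lcs[labels == k], axis=0)
    return out
```

Subtracting only the median of each correlated cluster is what preserves genuine, star-specific variability while removing the shared systematics.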
The hunt for Earth analogue planets orbiting Sun-like stars has forced the introduction of novel methods to detect signals at, or below, the level of the intrinsic noise of the observations. We present a new global periodogram method that returns more information than the classic Lomb-Scargle periodogram method for radial velocity signal detection. Our method uses the Minimum Mean Squared Error as a framework to determine the optimal number of genuine signals present in a radial velocity time series using a global search algorithm, meaning we can discard noise spikes from the data before follow-up analysis. This method also allows us to determine the phase and amplitude of the signals we detect, so we can track these quantities as a function of time to test whether the signals are stationary or non-stationary. We apply our method to the radial velocity data for GJ876 as a test system to highlight how the phase information can be used to select against non-stationary sources of detected signals in radial velocity data, such as rotational modulation of star spots. Analysis of this system yields two new statistically significant signals in the combined Keck and HARPS velocities with periods of 10 and 15 days. Although a planet with a period of 15 days would correspond to a Laplace resonant chain configuration with three of the other planets (8:4:2:1), we stress that follow-up dynamical analyses are needed to test the reliability of such a six-planet system.
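The stationarity test described above can be illustrated with a plain least-squares sinusoid fit whose amplitude and phase are tracked in successive time chunks; this sketch is not the paper's MMSE global-search machinery, and the function names and toy data are illustrative only.

```python
# Sketch: track amplitude and phase of a candidate signal over time.
import numpy as np

def fit_sinusoid(t, rv, period):
    """Least-squares fit rv ~ A*sin(wt) + B*cos(wt) + C; return amplitude, phase."""
    w = 2.0 * np.pi / period
    design = np.column_stack([np.sin(w * t), np.cos(w * t), np.ones_like(t)])
    (a, b, _), *_ = np.linalg.lstsq(design, rv, rcond=None)
    return np.hypot(a, b), np.arctan2(b, a)

def track_phase(t, rv, period, n_chunks=5):
    """Amplitude and phase in successive chunks: a coherent (stationary) signal
    keeps roughly constant values; activity-driven signals drift."""
    edges = np.linspace(t.min(), t.max(), n_chunks + 1)
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (t >= lo) & (t <= hi)
        if m.sum() > 5:
            results.append(fit_sinusoid(t[m], rv[m], period))
    return results

# toy example: a coherent 15-day signal plus white noise
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 300, 200))
rv = 3.0 * np.sin(2 * np.pi * t / 15.0) + rng.normal(0, 1.0, t.size)
print(track_phase(t, rv, period=15.0))
```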
We compare the noise in interferometric measurements of the Vela pulsar from ground- and space-based antennas with theoretical predictions. The noise depends on both the flux density and the interferometric phase of the source. Because the Vela pulsar is bright and scintillating, these comparisons extend into both the low and high signal-to-noise regimes. Furthermore, our diversity of baselines explores the full range of variation in interferometric phase. We find excellent agreement between theoretical expectations and our estimates of noise among samples within the characteristic scintillation scales. Namely, the noise is drawn from an elliptical Gaussian distribution in the complex plane, centered on the signal. The major axis, aligned with the signal phase, varies quadratically with the signal, while the minor axis, at quadrature, varies with the same linear coefficients. For weak signal, the noise approaches a circular Gaussian distribution. Both the variance and covariance of the noise are also affected by artifacts of digitization and correlation. In particular, we show that gating introduces correlations between nearby spectral channels.
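The scaling described above can be reproduced with a toy Monte Carlo of a single baseline: a common complex Gaussian "signal" plus independent antenna noise, correlated and averaged. This idealized model (not the actual VLBI data path, and with illustrative function names) shows the along-signal variance growing quadratically with signal power while the quadrature variance grows only linearly.

```python
# Toy Monte Carlo under the above assumptions.
import numpy as np

rng = np.random.default_rng(42)

def complex_noise(power, n):
    return np.sqrt(power / 2.0) * (rng.normal(size=n) + 1j * rng.normal(size=n))

def visibility_scatter(signal_power, n_samples=1024, n_trials=2000):
    """Variance of the real (along-signal) and imaginary (quadrature) parts of a
    correlator estimate for a common signal buried in unit-power antenna noise."""
    estimates = np.empty(n_trials, dtype=complex)
    for i in range(n_trials):
        s = complex_noise(signal_power, n_samples)          # common scattered signal
        x = s + complex_noise(1.0, n_samples)               # antenna 1 voltage
        y = s + complex_noise(1.0, n_samples)               # antenna 2 voltage
        estimates[i] = np.mean(x * np.conj(y))              # visibility estimate
    return np.var(estimates.real), np.var(estimates.imag)

for sig in (0.1, 1.0, 4.0):
    vr, vi = visibility_scatter(sig)
    # expected: (sig**2 + sig + 0.5)/n along the signal, (sig + 0.5)/n at quadrature
    print(f"S={sig}: along-signal var={vr:.2e}, quadrature var={vi:.2e}")
```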
Irregularly sampled time series (ISTS) data have irregular temporal intervals between observations and different sampling rates between sequences. ISTS commonly appear in healthcare, economics, and geoscience. In the medical environment in particular, widely used Electronic Health Records (EHRs) contain abundant irregularly sampled medical time series (ISMTS) data. Developing deep learning methods on EHR data is critical for personalized treatment, precise diagnosis, and medical management. However, it is challenging to use deep learning models directly on ISMTS data. On the one hand, ISMTS data exhibit both intra-series and inter-series relations, so both local and global structures should be considered. On the other hand, methods should balance the trade-off between task accuracy and model complexity while retaining generality and interpretability. So far, many existing works have tried to solve these problems and have achieved good results. In this paper, we review these deep learning methods from the perspectives of technology and task. Under the technology-driven perspective, we summarize them into two categories - missing data-based methods and raw data-based methods. Under the task-driven perspective, we also summarize them into two categories - data imputation-oriented and downstream task-oriented. For each of them, we point out their advantages and disadvantages. Moreover, we implement some representative methods and compare them on four medical datasets with two tasks. Finally, we discuss the challenges and opportunities in this area.
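As a small illustration of the raw data-based family, many such models (for example GRU-D-style approaches) consume a (value, mask, time-delta) representation of each record; the helper below builds that representation and is illustrative rather than taken from any specific method in the survey.

```python
# Sketch: build the (value, mask, time-delta) arrays many ISMTS models expect.
import numpy as np

def to_value_mask_delta(times, values, n_features):
    """Convert irregular records into aligned (value, mask, delta) arrays.

    times  : (n_obs,) observation times
    values : (n_obs, n_features) with NaN where a variable was not measured
    """
    values = np.asarray(values, dtype=float)
    mask = ~np.isnan(values)
    filled = np.where(mask, values, 0.0)
    # delta[t, d] = time since feature d was last observed (0 at the start)
    delta = np.zeros_like(filled)
    last_seen = np.full(n_features, times[0], dtype=float)
    for t in range(len(times)):
        delta[t] = times[t] - last_seen
        last_seen[mask[t]] = times[t]
    return filled, mask.astype(float), delta

# toy record: 3 visits, 2 lab variables, with missing entries
times = np.array([0.0, 1.5, 4.0])
values = np.array([[7.1, np.nan],
                   [np.nan, 98.6],
                   [7.4, 99.1]])
print(to_value_mask_delta(times, values, n_features=2))
```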