Research papers, master's and doctoral theses published by Yao Xie

Multi-Resolution Spatio-Temporal Prediction with Application to Wind Power Generation

143 - Shixiang Zhu, Hanyu Zhang, Yao Xie 2021

This paper proposes a spatio-temporal model for wind speed prediction that can be run at different resolutions. The model assumes that the wind prediction of a cluster is correlated with its upstream influences in recent history, and the correlation between clusters is represented by a directed dynamic graph. A Bayesian approach is also described in which prior beliefs about the predictive errors at different data resolutions are represented in the form of Gaussian processes. The joint framework enhances predictive performance by combining predictions made at different data resolutions and provides reasonable uncertainty quantification. The model is evaluated on actual wind data from the Midwest U.S. and shows superior performance compared to traditional baselines.

Statistics Applications

Survival Analysis with Graph-Based Regularization for Predictors

123 - Xi He, Liyan Xie, Yao Xie 2021

We study the variable selection problem in survival analysis: identifying the most important factors affecting survival time when prior knowledge indicates that the variables are mutually correlated through a graph structure. We consider the Cox proportional hazards model with a graph-based regularizer for variable selection. A computationally efficient algorithm is developed to solve the graph-regularized maximum likelihood problem by connecting it to the group lasso. We provide theoretical guarantees on the recovery error and the asymptotic distribution of the proposed estimators. The good performance and benefit of the proposed approach compared with existing methods are demonstrated on both synthetic and real data examples.

Statistics Theory

Neural Tangent Kernel Maximum Mean Discrepancy

103 - Xiuyuan Cheng, Yao Xie 2021

We present a novel neural network Maximum Mean Discrepancy (MMD) statistic by identifying a connection between the neural tangent kernel (NTK) and the MMD statistic. This connection enables us to develop a computationally efficient and memory-efficient approach to compute the MMD statistic and perform neural network based two-sample tests. It addresses the long-standing challenge of the memory and computational complexity of the MMD statistic, which is essential for online implementations that assimilate new samples. Theoretically, the connection allows us to understand the properties of the new test statistic, such as the Type-I error and testing power of the two-sample test, by leveraging analysis tools for kernel MMD. Numerical experiments on synthetic and real-world datasets validate the theory and demonstrate the effectiveness of the proposed NTK-MMD statistic.

Machine Learning, Statistics Theory

Inferring Granger Causality from Irregularly Sampled Time Series

93 - Song Wei, Yao Xie, Christopher S. Josef 2021

Continuous, automated surveillance systems that incorporate machine learning models are becoming increasingly common in healthcare environments. These models can capture temporally dependent changes across multiple patient variables and can enhance a clinician's situational awareness by providing an early warning alarm of an impending adverse event such as sepsis. However, most commonly used methods, e.g., XGBoost, fail to provide an interpretable mechanism for understanding why a model produced a sepsis alarm at a given time. The black-box nature of many models is a severe limitation, as it prevents clinicians from independently corroborating the physiologic features that contributed to the sepsis alarm. To overcome this limitation, we propose a generalized linear model (GLM) approach to fit a Granger causal graph based on the physiology of several major sepsis-associated derangements (SADs). We adopt a recently developed stochastic monotone variational inequality-based estimator coupled with forward feature selection to learn the graph structure from both continuous and discrete-valued, as well as regularly and irregularly sampled, time series. Most importantly, we develop a non-asymptotic upper bound on the estimation error for any monotone link function in the GLM. We conduct real-data experiments and demonstrate that our proposed method achieves performance comparable to popular and powerful prediction methods such as XGBoost while maintaining a high level of interpretability.

Machine Learning, Statistics Theory, Statistics Applications

Conformal Anomaly Detection on Spatio-Temporal Observations with Missing Data

98 - Chen Xu, Yao Xie 2021

We develop a distribution-free, unsupervised anomaly detection method called ECAD, which wraps around any regression algorithm and sequentially detects anomalies. Rooted in conformal prediction, ECAD does not require data exchangeability but approximately controls the Type-I error when data are normal. Computationally, it involves no data-splitting and efficiently trains ensemble predictors to increase statistical power. We demonstrate the superior performance of ECAD in detecting anomalous spatio-temporal traffic flow.

Statistics Applications, Methodology, Machine Learning
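
The conformal idea behind the entry above can be illustrated with a generic conformal p-value computed from regression residuals. This is a minimal sketch of textbook conformal anomaly scoring, not the ECAD procedure itself (which avoids data-splitting and uses ensemble predictors). The function names, the residual values, and the `alpha` level are all assumptions made for illustration.

```python
def conformal_p_value(calibration_scores, new_score):
    """Share of scores (counting the new one) that are at least as large as new_score."""
    n_at_least = sum(1 for s in calibration_scores if s >= new_score)
    return (1 + n_at_least) / (len(calibration_scores) + 1)


def is_anomaly(calibration_scores, new_score, alpha=0.1):
    """Flag the new observation when its conformal p-value is at most alpha."""
    return conformal_p_value(calibration_scores, new_score) <= alpha


# Hypothetical absolute residuals |y - y_hat| of some regressor on normal data.
residuals = [i / 10 for i in range(1, 20)]  # 0.1, 0.2, ..., 1.9

print(conformal_p_value(residuals, 5.0))  # a residual larger than all past ones
print(is_anomaly(residuals, 5.0))
print(is_anomaly(residuals, 0.5))
```

A small p-value means the new residual is unusually large relative to past residuals, so thresholding at `alpha` bounds the false-alarm rate when the scores are exchangeable. The abstract states that ECAD relaxes that exchangeability requirement.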

Kernel MMD Two-Sample Tests for Manifold Data

128 - Xiuyuan Cheng, Yao Xie 2021

We present a study of kernel MMD two-sample test statistics in the manifold setting, assuming the high-dimensional observations are close to a low-dimensional manifold. We characterize the properties of the test (level and power) in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, we show that when data densities are supported on a $d$-dimensional sub-manifold $\mathcal{M}$ embedded in an $m$-dimensional space, the kernel MMD two-sample test for data sampled from a pair of distributions $(p, q)$ that are Hölder with order $\beta$ is consistent and powerful when the number of samples $n$ is greater than $\delta_2(p,q)^{-2-d/\beta}$ up to a certain constant, where $\delta_2$ is the squared $\ell_2$-divergence between two distributions on the manifold. Moreover, to achieve testing consistency under this scaling of $n$, our theory suggests that the kernel bandwidth $\gamma$ scales with $n^{-1/(d+2\beta)}$. These results indicate that the kernel MMD two-sample test does not suffer from a curse of dimensionality when the data lie on a low-dimensional manifold. We demonstrate the validity of our theory and the properties of the MMD test for manifold data using several numerical experiments.

Machine Learning, Statistics Theory
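
To make the kernel MMD statistic in the entry above concrete, here is a minimal sketch of the standard biased (V-statistic) estimate of squared MMD with a Gaussian kernel. It is evaluated on points lying on a circle, a one-dimensional manifold embedded in the plane. The sample sets, the bandwidth value of 0.5, and the function names are illustrative assumptions, not the paper's experimental setup.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth):
    """Biased estimate of squared MMD: mean k(x,x') + mean k(y,y') - 2 * mean k(x,y)."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical samples: one set spread over the whole unit circle,
# one set restricted to the upper half of the same circle.
full_circle = [(math.cos(2 * math.pi * i / 40), math.sin(2 * math.pi * i / 40))
               for i in range(40)]
half_circle = [(math.cos(math.pi * i / 20), math.sin(math.pi * i / 20))
               for i in range(20)]

print(mmd2_biased(full_circle, full_circle, bandwidth=0.5))  # identical samples
print(mmd2_biased(full_circle, half_circle, bandwidth=0.5))  # different distributions
```

In a real test, the statistic would be compared against a threshold calibrated by, for example, permutation, and the bandwidth would be tuned to the sample size as the abstract describes.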

Online High-Dimensional Change-Point Detection using Topological Data Analysis

97 - Xiaojun Zheng, Simon Mak, Yao Xie 2021

Topological Data Analysis (TDA) is a rapidly growing field, which studies methods for learning underlying topological structures present in complex data representations. TDA methods have found recent success in extracting useful geometric structures for a wide range of applications, including protein classification, neuroscience, and time-series analysis. However, in many such applications, one is also interested in sequentially detecting changes in this topological structure. We propose a new method called Persistence Diagram based Change-Point (PD-CP), which tackles this problem by integrating the widely-used persistence diagrams in TDA with recent developments in nonparametric change-point detection. The key novelty in PD-CP is that it leverages the distribution of points on persistence diagrams for online detection of topological changes. We demonstrate the effectiveness of PD-CP in an application to solar flare monitoring.

Methodology, Algebraic Topology, Statistics

Sequential change-point detection for mutually exciting point processes over networks

109 - Haoyun Wang, Liyan Xie, Yao Xie 2021

We present a new CUSUM procedure for sequentially detecting change-points in self- and mutually exciting processes, a.k.a. Hawkes networks, using discrete event data. Hawkes networks have become a popular model in statistics and machine learning due to their capability to model irregularly observed data in which the timing between events carries a lot of information. The problem of detecting abrupt changes in Hawkes networks arises in various applications, including neuronal imaging, sensor networks, and social network monitoring. Despite this, there has not been a computationally and memory-efficient online algorithm for detecting such changes from sequential data. We present an efficient online recursive implementation of the CUSUM statistic for Hawkes processes, both decentralized and memory-efficient, and establish the theoretical properties of this new CUSUM procedure. We then show that the proposed CUSUM method achieves better performance than existing methods, including the Shewhart procedure based on count data, the generalized likelihood ratio (GLR) in the existing literature, and the standard score statistic. We demonstrate this via a simulated example and an application to population code change-detection in neuronal networks.

Machine Learning
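
The recursive structure that makes CUSUM suitable for online use can be sketched with the classical version for a shift in the mean of unit-variance Gaussian observations. This is the textbook recursion, not the Hawkes-process statistic of the entry above. The pre-change mean of zero, the assumed post-change mean, the threshold, and the function name are illustrative assumptions.

```python
def cusum_stopping_time(samples, post_mean, threshold):
    """Return the first 0-based index at which the CUSUM statistic reaches threshold.

    Assumes N(0, 1) observations before the change and N(post_mean, 1) after it.
    Returns None if no alarm is raised.
    """
    stat = 0.0
    for t, x in enumerate(samples):
        # Log-likelihood ratio of N(post_mean, 1) against N(0, 1) for one sample.
        llr = post_mean * x - 0.5 * post_mean ** 2
        # Recursive update: only the previous statistic is stored.
        stat = max(0.0, stat + llr)
        if stat >= threshold:
            return t
    return None


# Hypothetical noiseless stream whose level jumps from 0 to 1 at index 50.
stream = [0.0] * 50 + [1.0] * 50
print(cusum_stopping_time(stream, post_mean=1.0, threshold=5.0))
```

Because each update needs only the previous value of the statistic, memory use stays constant in the length of the stream. This is the property the abstract emphasizes for its Hawkes-network variant.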

Inferring serial correlation with dynamic backgrounds

91 - Song Wei, Yao Xie, Dobromir Rahnev 2021

Sequential data with serial correlation and an unknown, unstructured, and dynamic background is ubiquitous in neuroscience, psychology, and econometrics. Inferring serial correlation for such data is a fundamental challenge in statistics. We propose a total variation constrained least-squares estimator coupled with hypothesis tests to infer the serial correlation in the presence of an unknown and unstructured dynamic background. The total variation constraint on the dynamic background encourages a piecewise constant structure, which can approximate a wide range of dynamic backgrounds. The tuning parameter is selected via the Ljung-Box test to control the bias-variance trade-off. We establish a non-asymptotic upper bound on the estimation error through variational inequalities. We also derive a lower error bound via Fano's method and show that the proposed method is near-optimal. Numerical simulation and a real study in psychology demonstrate the excellent performance of our proposed method compared with the state of the art.

Statistics Theory, Methodology
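
The Ljung-Box test mentioned in the entry above is built on the statistic Q = n(n+2) * sum over k = 1..h of r_k^2 / (n - k), where r_k is the lag-k sample autocorrelation. A minimal sketch of computing Q follows. How the paper applies the test to select its tuning parameter is not reproduced here; the alternating input series and the function names are illustrative assumptions.

```python
def autocorrelation(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    num = sum((xs[t] - mean) * (xs[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(xs, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_{k=1}^{h} r_k^2 / (n - k)."""
    n = len(xs)
    return n * (n + 2) * sum(
        autocorrelation(xs, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Hypothetical strongly (negatively) correlated series: 1, -1, 1, -1, ...
alternating = [1.0, -1.0] * 5
print(ljung_box_q(alternating, max_lag=1))
```

Under the null hypothesis of no serial correlation, Q is approximately chi-squared distributed with `max_lag` degrees of freedom when applied to raw data, so large values of Q indicate remaining serial correlation.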

Testing Rank of Incomplete Unimodal Matrices

58 - Rui Zhang, Junting Chen, Yao Xie 2021

Several statistics-based detectors, based on unimodal matrix models, are designed for determining the number of sources in a field. A new variance ratio statistic is proposed, and its asymptotic distribution is analyzed. The variance ratio detector is shown to outperform the alternatives. It is shown that further improvements are achievable via optimally selected rotations. Numerical experiments demonstrate the performance gains of our detection methods over the baseline approach.

Statistics Applications, Statistics Theory, Methodology
