Research papers and master's and doctoral theses published by Xin Xing

Model-based Sparse Coding beyond Gaussian Independent Model

106 - Xin Xing , Rui Xie , Wenxuan Zhong 2021

Sparse coding aims to model data vectors as sparse linear combinations of basis elements, but a majority of related studies are restricted to continuous data without spatial or temporal structure. A new model-based sparse coding (MSC) method is proposed to provide an effective and flexible framework for learning features from different data types: continuous, discrete, or categorical, and modeling different types of correlations: spatial or temporal. The specification of the sparsity level and how to adapt the estimation method to large-scale studies are also addressed. A fast EM algorithm is proposed for estimation, and its superior performance is demonstrated in simulation and multiple real applications such as image denoising, brain connectivity study, and spatial transcriptomic imaging.

Methodology

Unified analysis of finite-size error for periodic Hartree-Fock and second order Møller-Plesset perturbation theory

95 - Xin Xing , Xiaoxu Li , Lin Lin 2021

Despite decades of practice, finite-size errors in many widely used electronic structure theories for periodic systems remain poorly understood. For periodic systems using a general Monkhorst-Pack grid, there has been no rigorous analysis of the finite-size error in the Hartree-Fock theory (HF) and the second-order Møller-Plesset perturbation theory (MP2), which are, respectively, the simplest wavefunction-based method and the simplest post-Hartree-Fock method. Such calculations can be viewed as a multi-dimensional integral discretized with certain trapezoidal rules. Due to the Coulomb singularity, the integrand has many points of discontinuity in general, and standard error analysis based on the Euler-Maclaurin formula gives overly pessimistic results. The lack of analytic understanding of finite-size errors also impedes the development of effective finite-size correction schemes. We propose a unified method to obtain sharp convergence rates of finite-size errors for the periodic HF and MP2 theories. Our main technical advancement is a generalization of the result of [Lyness, 1976] for obtaining sharp convergence rates of the trapezoidal rule for a class of non-smooth integrands. Our result is applicable to three-dimensional bulk systems as well as low-dimensional systems (such as nanowires and 2D materials). Our unified analysis also allows us to prove the effectiveness of the Madelung-constant correction to the Fock exchange energy, and the effectiveness of a recently proposed staggered mesh method for periodic MP2 calculations [Xing, Li, Lin, 2021]. Our analysis connects the effectiveness of the staggered mesh method with integrands with removable singularities, and suggests a new staggered mesh method for reducing finite-size errors of periodic HF calculations.

Computational Physics, Numerical Analysis

Staggered mesh method for correlation energy calculations of solids: Second order Møller-Plesset perturbation theory

195 - Xin Xing , Xiaoxu Li , Lin Lin 2021

The calculation of the MP2 correlation energy for extended systems can be viewed as a multi-dimensional integral in the thermodynamic limit, and the standard method for evaluating the MP2 energy can be viewed as a trapezoidal quadrature scheme. We demonstrate that existing analysis neglects certain contributions due to the non-smoothness of the integrand, and may significantly underestimate finite-size errors. We propose a new staggered mesh method, which uses two staggered Monkhorst-Pack meshes for occupied and virtual orbitals, respectively, to compute the MP2 energy. The staggered mesh method circumvents a significant error source in the standard method, in which certain quadrature nodes are always placed on points where the integrand is discontinuous. One significant advantage of the proposed method is that there are no tunable parameters, and the additional numerical effort needed can be negligible compared to the standard MP2 calculation. Numerical results indicate that the staggered mesh method can be particularly advantageous for quasi-1D systems, as well as quasi-2D and 3D systems with certain symmetries.
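
The node-placement issue in this abstract can be illustrated with a toy one-dimensional analogue. This is not the MP2 integrand: the sawtooth function and all names below are illustrative assumptions. A uniform-grid rule is applied to a periodic function that jumps at x = 0. The standard grid always places a node on the jump, while a grid shifted by half a spacing avoids it.

```python
def sawtooth(x):
    """Toy periodic integrand f(x) = x mod 1, discontinuous at every integer."""
    return x % 1.0


def uniform_rule(f, n, shift=0.0):
    """Average f over the n equally spaced nodes (k + shift) / n in [0, 1)."""
    return sum(f((k + shift) / n) for k in range(n)) / n


exact = 0.5  # integral of the sawtooth over one period
n = 16

standard = uniform_rule(sawtooth, n)              # one node sits on the jump at x = 0
staggered = uniform_rule(sawtooth, n, shift=0.5)  # all nodes moved off the jump

standard_error = abs(standard - exact)    # 1 / (2 n) for this toy integrand
staggered_error = abs(staggered - exact)  # zero up to rounding for this toy integrand
```

In this toy case the shifted grid happens to be exact because the integrand is linear between jumps. The sketch only shows why moving nodes off a discontinuity can help, and says nothing about convergence rates for the actual MP2 integral.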

Computational Physics, Numerical Analysis

Dynamic Image for 3D MRI Image Alzheimer's Disease Classification

104 - Xin Xing , Gongbo Liang , Hunter Blanton 2020

We propose to apply a 2D CNN architecture to 3D MRI image Alzheimer's disease classification. Training a 3D convolutional neural network (CNN) is time-consuming and computationally expensive. We make use of approximate rank pooling to transform the 3D MRI image volume into a 2D image to use as input to a 2D CNN. We show our proposed CNN model achieves 9.5% better Alzheimer's disease classification accuracy than the baseline 3D models. We also show that our method allows for efficient training, requiring only 20% of the training time compared to 3D CNN models. The code is available online: https://github.com/UkyVision/alzheimer-project.

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

Efficient construction of an HSS preconditioner for symmetric positive definite $\mathcal{H}^2$ matrices

159 - Xin Xing , Hua Huang , Edmond Chow 2020

In an iterative approach for solving linear systems with ill-conditioned, symmetric positive definite (SPD) kernel matrices, both fast matrix-vector products and fast preconditioning operations are required. Fast (linear-scaling) matrix-vector products are available by expressing the kernel matrix in an $\mathcal{H}^2$ representation or an equivalent fast multipole method representation. Preconditioning such matrices, however, requires a structured matrix approximation that is more regular than the $\mathcal{H}^2$ representation, such as the hierarchically semiseparable (HSS) matrix representation, which provides fast solve operations. Previously, an algorithm was presented to construct an HSS approximation to an SPD kernel matrix that is guaranteed to be SPD. However, this algorithm has quadratic cost and was only designed for recursive binary partitionings of the points defining the kernel matrix. This paper presents a general algorithm for constructing an SPD HSS approximation. Importantly, the algorithm uses the $\mathcal{H}^2$ representation of the SPD matrix to reduce its computational complexity from quadratic to quasilinear. Numerical experiments illustrate how this SPD HSS approximation performs as a preconditioner for solving linear systems arising from a range of kernel functions.

Numerical Analysis

Neural Gaussian Mirror for Controlled Feature Selection in Neural Networks

265 - Xin Xing , Yu Gui , Chenguang Dai 2020

Deep neural networks (DNNs) have become increasingly popular and achieved outstanding performance in predictive tasks. However, the DNN framework itself cannot inform the user which features are more or less relevant for making the prediction, which limits its applicability in many scientific fields. We introduce neural Gaussian mirrors (NGMs), in which mirrored features are created, via a structured perturbation based on a kernel-based conditional dependence measure, to help evaluate feature importance. We design two modifications of the DNN architecture for incorporating mirrored features and providing mirror statistics to measure feature importance. As shown in simulated and real data examples, the proposed method controls the feature selection error rate at a predefined level and maintains a high selection power even with the presence of highly correlated features.

Machine Learning

A Scale-free Approach for False Discovery Rate Control in Generalized Linear Models

111 - Chenguang Dai , Buyu Lin , Xin Xing 2020

Generalized linear models (GLMs) have been widely used in practice to model non-Gaussian response variables. When the number of explanatory features is relatively large, researchers are often interested in performing controlled feature selection in order to simplify the downstream analysis. This paper introduces a new framework for feature selection in GLMs that can achieve false discovery rate (FDR) control in two asymptotic regimes. The key step is to construct a mirror statistic to measure the importance of each feature, which is based upon two (asymptotically) independent estimates of the corresponding true coefficient obtained via either the data-splitting method or the Gaussian mirror method. The FDR control is achieved by taking advantage of the mirror statistic's property that, for any null feature, its sampling distribution is (asymptotically) symmetric about 0. In the moderate-dimensional setting in which the ratio between the dimension (number of features) p and the sample size n converges to a fixed value, we construct the mirror statistic based on the maximum likelihood estimation. In the high-dimensional setting where p is much larger than n, we use the debiased Lasso to build the mirror statistic. Compared to the Benjamini-Hochberg procedure, which crucially relies on the asymptotic normality of the Z statistic, the proposed methodology is scale-free as it only hinges on the symmetry property, and thus is expected to be more robust in finite-sample cases. Both simulation results and a real data application show that the proposed methods are capable of controlling the FDR, and are often more powerful than existing methods including the Benjamini-Hochberg procedure and the knockoff filter.

Methodology

Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms

161 - Ping Ma , Xinlian Zhang , Xin Xing 2020

The statistical analysis of Randomized Numerical Linear Algebra (RandNLA) algorithms within the past few years has mostly focused on their performance as point estimators. However, this is insufficient for conducting statistical inference, e.g., constructing confidence intervals and hypothesis testing, since the distribution of the estimator is lacking. In this article, we develop an asymptotic analysis to derive the distribution of RandNLA sampling estimators for the least-squares problem. In particular, we derive the asymptotic distribution of a general sampling estimator with arbitrary sampling probabilities. The analysis is conducted in two complementary settings, i.e., when the objective of interest is to approximate the full sample estimator or is to infer the underlying ground truth model parameters. For each setting, we show that the sampling estimator is asymptotically normally distributed under mild regularity conditions. Moreover, the sampling estimator is asymptotically unbiased in both settings. Based on our asymptotic analysis, we use two criteria, the Asymptotic Mean Squared Error (AMSE) and the Expected Asymptotic Mean Squared Error (EAMSE), to identify optimal sampling probabilities. Several of these optimal sampling probability distributions are new to the literature, e.g., the root leverage sampling estimator and the predictor length sampling estimator. Our theoretical results clarify the role of leverage in the sampling process, and our empirical results demonstrate improvements over existing methods.
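
A sampling estimator with arbitrary sampling probabilities can be sketched for the simplest case, a single predictor with no intercept. All function names are hypothetical. Reweighting each sampled row by 1 / (r * p_i) is a common RandNLA convention assumed here. Reading "root leverage" as probabilities proportional to the square root of the leverage score is also an assumption about the paper's definition.

```python
import random


def leverage_probs(x):
    """Leverage-score probabilities; for one predictor these are x_i^2 / sum(x^2)."""
    total = sum(v * v for v in x)
    return [v * v / total for v in x]


def root_leverage_probs(x):
    """Probabilities proportional to sqrt(leverage), i.e. |x_i| for one predictor."""
    total = sum(abs(v) for v in x)
    return [abs(v) / total for v in x]


def sampling_estimator(x, y, probs, r, rng):
    """Draw r rows with replacement and solve the reweighted least-squares problem."""
    idx = rng.choices(range(len(x)), weights=probs, k=r)
    num = sum(x[i] * y[i] / (r * probs[i]) for i in idx)
    den = sum(x[i] * x[i] / (r * probs[i]) for i in idx)
    return num / den


rng = random.Random(0)
x = [0.5 + 0.1 * i for i in range(200)]
y = [2.0 * v for v in x]  # noiseless toy data with true slope 2

schemes = {
    "uniform": [1.0 / len(x)] * len(x),
    "leverage": leverage_probs(x),
    "root_leverage": root_leverage_probs(x),
}
estimates = {name: sampling_estimator(x, y, p, 20, rng) for name, p in schemes.items()}
```

On noiseless data every scheme recovers the slope, so the example only checks the mechanics. Comparing the schemes in the sense of the abstract would require noisy data and repeated draws.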

Statistics Theory, Machine Learning

False Discovery Rate Control via Data Splitting

81 - Chenguang Dai , Buyu Lin , Xin Xing 2020

Selecting relevant features associated with a given response variable is an important issue in many scientific fields. Quantifying quality and uncertainty of a selection result via false discovery rate (FDR) control has been of recent interest. This paper introduces a way of using data-splitting strategies to asymptotically control the FDR while maintaining a high power. For each feature, the method constructs a test statistic by estimating two independent regression coefficients via data splitting. FDR control is achieved by taking advantage of the statistic's property that, for any null feature, its sampling distribution is symmetric about zero. Furthermore, we propose Multiple Data Splitting (MDS) to stabilize the selection result and boost the power. Interestingly and surprisingly, with the FDR still under control, MDS not only helps overcome the power loss caused by sample splitting, but also results in a lower variance of the false discovery proportion (FDP) compared with all other methods in consideration. We prove that the proposed data-splitting methods can asymptotically control the FDR at any designated level for linear and Gaussian graphical models in both low and high dimensions. Through intensive simulation studies and a real-data application, we show that the proposed methods are robust to the unknown distribution of features, easy to implement and computationally efficient, and are often the most powerful among competitors, especially when the signals are weak and the correlations or partial correlations are high among features.
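
The selection step described in this abstract can be sketched as follows, assuming the two coefficient estimates from the two data halves are already available. The function names, the additive form of the statistic, and the treatment of ties at the threshold are illustrative assumptions, not necessarily the authors' exact choices. The idea is that a null feature's statistic is symmetric about zero, so the count of large negative values estimates the number of false positives among large positive ones.

```python
def mirror_statistics(beta1, beta2):
    """Sign agreement of the two estimates times their combined magnitude."""
    stats = []
    for a, b in zip(beta1, beta2):
        sign = 1 if a * b > 0 else (-1 if a * b < 0 else 0)
        stats.append(sign * (abs(a) + abs(b)))
    return stats


def select_features(stats, q):
    """Pick the smallest threshold whose estimated FDP is at most q."""
    for t in sorted(abs(m) for m in stats if m != 0):
        n_negative = sum(1 for m in stats if m <= -t)  # proxy for false positives
        n_selected = sum(1 for m in stats if m >= t)
        if n_negative / max(n_selected, 1) <= q:
            return [j for j, m in enumerate(stats) if m >= t]
    return []


# Hypothetical estimates from two independent halves of the data.
beta_half1 = [2.0, 1.5, 0.1, -0.2, 0.05, 1.8]
beta_half2 = [1.9, 1.7, -0.1, 0.1, 0.02, 2.1]

stats = mirror_statistics(beta_half1, beta_half2)
selected = select_features(stats, q=0.2)
```

Here features 0, 1, and 5 have large estimates that agree in sign across both halves, so they are the ones selected at the level q = 0.2.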

Methodology
