The behavior and skill of models in many geosciences, e.g., hydrology and ecosystem sciences, strongly depend on spatially varying parameters that need calibration. Here we propose a novel differentiable parameter learning (dPL) framework that solves a pattern recognition problem and learns a more robust, universal mapping. Crucially, dPL exhibits virtuous scaling curves not previously demonstrated to geoscientists: as the amount of training data increases, dPL achieves better performance, more physical coherence, and better generalization, all with orders-of-magnitude lower computational cost. We demonstrate examples of calibrating models to soil moisture and streamflow, where dPL drastically outperformed state-of-the-art evolutionary and regionalization methods, or required only ~12.5% of the training data to achieve similar performance. The generic scheme promotes the integration of deep learning and process-based models, without mandating reimplementation.
There is significant interest in learning and optimizing a complex system composed of multiple sub-components, where these components may be agents or autonomous sensors. Among the rich literature on this topic, agent-based and domain-specific simulations can capture complex dynamics and subgroup interaction, but optimizing over such simulations can be computationally and algorithmically challenging. Bayesian approaches, such as Gaussian processes (GPs), can be used to learn a computationally tractable approximation to the underlying dynamics but typically neglect the detailed information about subgroups in the complicated system. We attempt to find the best of both worlds by proposing the idea of decomposed feedback, which captures group-based heterogeneity and dynamics. We introduce a novel decomposed GP regression to incorporate the subgroup-decomposed feedback. Our modified regression has provably lower variance, and thus a more accurate posterior, than previous approaches; it also allows us to introduce a decomposed GP-UCB optimization algorithm that leverages subgroup feedback. The Bayesian nature of our method makes the optimization algorithm tractable, with theoretical guarantees of convergence and a no-regret property. To demonstrate the wide applicability of this work, we execute our algorithm on two disparate social problems: infectious disease control in a heterogeneous population and allocation of distributed weather sensors. Experimental results show that our new method provides significant improvement compared to the state-of-the-art.
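For orientation, the loop below is a minimal sketch of standard GP-UCB on a one-dimensional grid, not the decomposed variant proposed in the abstract. The kernel, its length scale, the constant exploration weight `beta`, and the toy `objective` are all assumptions chosen for illustration.

```python
import numpy as np

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_query)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))  # clip tiny negative variances

def objective(x):
    # Hypothetical black-box reward with its maximum at x = 0.3.
    return -(x - 0.3) ** 2

grid = np.linspace(0.0, 1.0, 101)   # candidate actions
x_obs = np.array([0.0])             # single initial observation
y_obs = objective(x_obs)
beta = 2.0                          # exploration weight (assumed constant)

for _ in range(20):
    mean, std = gp_posterior(x_obs, y_obs, grid)
    # Upper confidence bound: optimistic estimate of the reward.
    x_next = grid[np.argmax(mean + np.sqrt(beta) * std)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmax(y_obs)]
```

A decomposed version would presumably fit separate posteriors to the observed subgroup outputs and combine them before forming the bound; that step is not shown here because the abstract does not specify it.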
The information content of crystalline materials becomes astronomical when collective electronic behavior and its fluctuations are taken into account. In the past decade, improvements in source brightness and detector technology at modern x-ray facilities have allowed a dramatically increased fraction of this information to be captured. Now, the primary challenge is to understand and discover scientific principles from big data sets when a comprehensive analysis is beyond human reach. We report the development of a novel unsupervised machine learning approach, XRD Temperature Clustering (X-TEC), that can automatically extract charge density wave (CDW) order parameters and detect intra-unit cell (IUC) ordering and its fluctuations from a series of high-volume X-ray diffraction (XRD) measurements taken at multiple temperatures. We apply X-TEC to XRD data on a quasi-skutterudite family of materials, (Ca$_x$Sr$_{1-x}$)$_3$Rh$_4$Sn$_{13}$, where a quantum critical point arising from charge order is observed as a function of Ca concentration. We further apply X-TEC to XRD data on the pyrochlore metal, Cd$_2$Re$_2$O$_7$, to investigate its two much-debated structural phase transitions and uncover the Goldstone mode accompanying them. We demonstrate how unprecedented atomic-scale knowledge can be gained when human researchers connect the X-TEC results to physical principles. Specifically, we extract from the X-TEC-revealed selection rule that the Cd and Re displacements are approximately equal in amplitude, but out of phase. This discovery reveals a previously unknown involvement of $5d^2$ Re, supporting the idea of an electronic origin to the structural order. Our approach can radically transform XRD experiments by allowing in-operando data analysis and enabling researchers to refine experiments by discovering interesting regions of phase space on-the-fly.
This paper formulates and studies a novel algorithm for federated learning from large collections of local datasets. This algorithm capitalizes on an intrinsic network structure that relates the local datasets via an undirected empirical graph. We model such big data over networks using a networked linear regression model. Each local dataset has its own regression weights. The weights of close-knit sub-collections of local datasets are constrained to deviate only slightly from one another. This leads naturally to a network Lasso problem, which we solve using a primal-dual method. We obtain a distributed federated learning algorithm via a message-passing implementation of this primal-dual method. We provide a detailed analysis of the statistical and computational properties of the resulting federated learning algorithm.
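To make the optimization target concrete, the sketch below evaluates one common form of the network Lasso objective for networked linear regression: a sum of local squared-error losses plus a weighted total-variation penalty over the edges of the empirical graph. The exact loss scaling and the function name `network_lasso_objective` are assumptions; the primal-dual solver itself is not shown.

```python
import numpy as np

def network_lasso_objective(weights, datasets, edges, lam):
    """Local mean squared errors plus lam * sum_{(i,j)} a_ij * ||w_i - w_j||_2.

    weights  : list of per-node weight vectors w_i
    datasets : list of (X_i, y_i) pairs, one local dataset per node
    edges    : list of (i, j, a_ij) tuples for the undirected empirical graph
    lam      : regularization strength (assumed scalar)
    """
    loss = sum(np.mean((X @ w - y) ** 2) for w, (X, y) in zip(weights, datasets))
    total_variation = sum(a * np.linalg.norm(weights[i] - weights[j])
                          for i, j, a in edges)
    return loss + lam * total_variation

# Toy example: three nodes on a chain graph sharing the same true weights.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
datasets = []
for _ in range(3):
    X = rng.normal(size=(20, 2))
    datasets.append((X, X @ w_true))          # noiseless local labels
edges = [(0, 1, 1.0), (1, 2, 1.0)]

same = [w_true.copy() for _ in range(3)]
perturbed = [w_true.copy(), w_true + 0.5, w_true.copy()]

obj_same = network_lasso_objective(same, datasets, edges, lam=0.1)
obj_perturbed = network_lasso_objective(perturbed, datasets, edges, lam=0.1)
```

Because the penalty uses the unsquared norm of weight differences, it favors weights that are exactly equal within well-connected clusters while still allowing abrupt changes across a few edges.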
We present a supervised learning method to learn the propagator map of a dynamical system from partial and noisy observations. In our computationally cheap and easy-to-implement framework, a neural network consisting of random feature maps is trained sequentially on incoming observations within a data assimilation procedure. By employing Takens' embedding theorem, the network is trained on delay coordinates. We show that the combination of random feature maps and data assimilation, called RAFDA, outperforms standard random feature maps for which the dynamics are learned using batch data.
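The following is a minimal sketch of the batch baseline mentioned above: standard random feature maps trained by ridge regression on delay coordinates. It does not include the sequential data assimilation step that defines RAFDA. The toy series, feature count, weight ranges, and ridge parameter `beta` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar observations of a hypothetical system (a sampled oscillation).
series = np.sin(0.1 * np.arange(700))

def delay_embed(s, dim):
    """Stack rows (s[n], s[n-1], ..., s[n-dim+1]) for n = dim-1 .. len(s)-1."""
    return np.column_stack([s[dim - 1 - k: len(s) - k] for k in range(dim)])

Z = delay_embed(series, 2)          # delay vectors, shape (699, 2)
inputs, targets = Z[:-1], Z[1:]     # propagator maps z_n to z_{n+1}

# Random, fixed internal weights; only the outer weights are trained.
n_features = 100
W_in = rng.uniform(-1.0, 1.0, size=(n_features, 2))
b_in = rng.uniform(-1.0, 1.0, size=n_features)

def features(z):
    return np.tanh(z @ W_in.T + b_in)

# Batch ridge regression for the outer weights on the first 500 pairs.
n_train = 500
Phi = features(inputs[:n_train])
beta = 1e-4
W_out = np.linalg.solve(Phi.T @ Phi + beta * np.eye(n_features),
                        Phi.T @ targets[:n_train])

# One-step prediction error on the held-out pairs.
pred = features(inputs[n_train:]) @ W_out
rmse = np.sqrt(np.mean((pred - targets[n_train:]) ** 2))
```

In the sequential setting described in the abstract, the outer weights would instead be updated as each new observation arrives, which this batch solve does not attempt to reproduce.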
Regression problems that have closed-form solutions are well understood and can be easily implemented when the dataset is small enough to be loaded entirely into RAM. Challenges arise when the data is too big to be stored in RAM to compute the closed-form solutions. Many techniques have been proposed to overcome or alleviate this memory barrier, but their solutions are often only locally optimal. In addition, most approaches require accessing the raw data again when updating the models. Parallel computing clusters are also needed if multiple models are to be computed simultaneously. We propose multiple learning approaches that utilize an array of sufficient statistics (SS) to address this big data challenge. This memory-oblivious approach breaks the memory barrier when computing regressions with closed-form solutions, including but not limited to linear regression, weighted linear regression, linear regression with Box-Cox transformation (Box-Cox regression), and ridge regression models. The computation and update of the SS array can be handled at the per-row level or the per-mini-batch level, and updating a model is as easy as matrix addition and subtraction. Furthermore, multiple SS arrays for different models can be easily computed simultaneously to obtain multiple models in one pass through the dataset. We implemented our approaches on Spark and evaluated them on simulated datasets. Results showed that our approaches can achieve closed-form solutions of multiple models in half the training time that traditional methods need for a single model.
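As an illustration of the idea, the sketch below accumulates the sufficient statistics X^T X and X^T y for linear and ridge regression over mini-batches, so the full dataset never has to reside in memory at once. The class name `SufficientStats` and its methods are hypothetical and are not taken from the authors' Spark implementation.

```python
import numpy as np

class SufficientStats:
    """Hypothetical accumulator of X^T X and X^T y for least squares."""

    def __init__(self, n_features):
        self.xtx = np.zeros((n_features, n_features))
        self.xty = np.zeros(n_features)

    def update(self, X, y, sign=1.0):
        """Add (sign=+1) or remove (sign=-1) a row or mini-batch."""
        X = np.atleast_2d(X)
        y = np.atleast_1d(y)
        self.xtx += sign * (X.T @ X)
        self.xty += sign * (X.T @ y)

    def solve(self, ridge=0.0):
        """Closed-form solution; ridge > 0 gives ridge regression."""
        A = self.xtx + ridge * np.eye(self.xtx.shape[0])
        return np.linalg.solve(A, self.xty)

# Simulated data, processed in mini-batches of 100 rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

ss = SufficientStats(3)
for start in range(0, 1000, 100):
    ss.update(X[start:start + 100], y[start:start + 100])

w_stream = ss.solve()                              # from the SS array only
w_batch = np.linalg.lstsq(X, y, rcond=None)[0]     # reference, all data in RAM

# Removing the first mini-batch is a matrix subtraction, with no second pass.
ss.update(X[:100], y[:100], sign=-1.0)
w_removed = ss.solve()
w_removed_ref = np.linalg.lstsq(X[100:], y[100:], rcond=None)[0]
```

The same accumulated arrays serve ordinary and ridge regression, which is one way several models can be obtained from a single pass over the data.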