In this paper, we discuss multiscale methods for nonlinear problems. The main idea of these approaches is to use local constraints and solve problems in oversampled regions for constructing macroscopic equations. These techniques are intended for problems without scale separation and with high contrast, which often occur in applications. For linear problems, the local solutions with constraints are used as basis functions. This technique is called the Constraint Energy Minimizing Generalized Multiscale Finite Element Method (CEM-GMsFEM). CEM-GMsFEM identifies macroscopic quantities based on rigorous analysis. In the corresponding upscaling methods, the multiscale basis functions are selected such that the degrees of freedom have physical meanings, such as averages of the solution on each continuum. This paper extends these linear concepts to nonlinear problems, where the local problems are nonlinear. The main concept consists of: (1) identifying macroscopic quantities; (2) constructing appropriate oversampled local problems with coarse-grid constraints; (3) formulating macroscopic equations. We consider two types of approaches. In the first approach, the solutions of the local problems are used as basis functions (in a linear fashion) to solve the nonlinear problems. This approach is simple to implement; however, it lacks nonlinear interpolation, which we introduce in our second approach. There, the local solutions are used as a nonlinear forward map from local averages (constraints) of the solution in the oversampling region. This local fine-grid solution is further used to formulate the coarse-grid problem. Both approaches are illustrated on several examples and applied to single-phase and two-phase flow problems, which are challenging because of the convection-dominated nature of the concentration equation.
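For reference, in the linear case the constrained local problems mentioned above typically take a constrained energy-minimization form. The following is a generic schematic with illustrative notation (an oversampled region $K_i^{+}$, coarse blocks $K_m$, and auxiliary functions $\chi_n$), not the exact formulation of this paper:

% Schematic CEM-type local problem in an oversampled region (illustrative notation)
\begin{equation*}
  \psi_j \;=\; \arg\min \Big\{ \int_{K_i^{+}} \kappa\, |\nabla \psi|^{2}\,dx \;:\;
  \int_{K_m} \psi\, \chi_n \, dx = \delta_{jn}\,\delta_{im}
  \ \text{ for all coarse blocks } K_m \subset K_i^{+} \Big\}.
\end{equation*}

In the nonlinear extension described above, the analogous local problem is nonlinear, and its solution serves as a map from the prescribed coarse-grid averages to a local fine-grid field.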
In this paper, we develop a space-time upscaling framework that can be used for many challenging porous media applications without scale separation and with high contrast. Our main focus is on nonlinear differential equations with multiscale coefficients. The framework is built on the nonlinear nonlocal multi-continuum upscaling concept and significantly extends the results of the preceding paper. Our approach starts with a coarse space-time partition and identifies test functions for each partition, which play the role of multi-continua. The test functions are defined via optimization and play a crucial role in the nonlinear upscaling. In the second stage, we solve nonlinear local problems in oversampled regions with constraints defined via the test functions. These local solutions define a nonlinear map from the macroscopic variables, determined with the help of the test functions, to fine-grid fields. This map can be thought of as a downscaling map from macroscopic variables to the fine-grid solution. In the final stage, we seek macroscopic variables in the entire domain such that the downscaled field solves the global problem in a weak sense defined using the test functions. We present an analysis of our approach for an example nonlinear problem. Our unified framework plays an important role in designing various upscaled methods. Because the local problems are directly related to the fine-grid problems, the framework simplifies the process of finding local solutions with appropriate constraints. Using machine learning (ML), we identify the complex map from macroscopic variables to the fine-grid solution. We present numerical results for several porous media applications, including two-phase flow and transport.
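As an illustration of the ML step, the downscaling map can be approximated offline from sampled macroscopic variables and the corresponding local solutions. The sketch below is generic and not the authors' implementation; solve_local_problem is a hypothetical stand-in for the nonlinear constrained local solve, replaced here by a synthetic function so the example runs.

# Illustrative sketch (not the authors' implementation): approximating the
# downscaling map from macroscopic variables (the constraints defined via the
# test functions) to a local fine-grid field, using training pairs generated
# by nonlinear local solves in oversampled regions.
import numpy as np
from sklearn.neural_network import MLPRegressor

def solve_local_problem(macro_vars, n_fine=50):
    # Synthetic stand-in for the nonlinear constrained local solve in an
    # oversampled space-time region; in practice this is a fine-grid solver.
    x = np.linspace(0.0, 1.0, n_fine)
    return np.tanh(np.sum(macro_vars)) * np.sin(np.pi * x)

# Training data: sampled macroscopic variables and the corresponding local fields.
rng = np.random.default_rng(0)
n_samples, n_macro = 200, 4
X = rng.uniform(-1.0, 1.0, size=(n_samples, n_macro))
Y = np.stack([solve_local_problem(x) for x in X])    # shape (n_samples, n_fine)

# A small neural network serves as the learned downscaling map; it can replace
# repeated nonlinear local solves during the global macroscale iteration.
downscale = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
downscale.fit(X, Y)
fine_field = downscale.predict(X[:1])                # downscaled fine-grid field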
Recently, several approaches for multiscale simulations of problems with high contrast and no scale separation have been introduced. Among them is the nonlocal multicontinua (NLMC) method, which introduces multiple macroscopic variables in each computational grid block. These approaches explore the entire coarse-block resolution, and one can obtain optimal convergence results independent of contrast and scales. However, these approaches are not amenable to many multiscale simulations in which the subgrid effects occur at scales much smaller than the coarse-mesh resolution. For example, the molecular dynamics of shale gas occurs at length scales much smaller than the coarse-mesh size, which is of the order of meters. In this case, one cannot explore the entire coarse-grid resolution when evaluating effective properties. In this paper, we merge the concepts of nonlocal multicontinua methods and Representative Volume Element (RVE) concepts to treat problems with extreme scale separation. The first step of this approach is to use the sub-grid scales (below the RVE scale) to write a large-scale macroscopic system, which we call the intermediate-scale macroscale system. In the next step, we couple this intermediate macroscale system to the simulation-grid model used in simulations. This is done using RVE concepts, where we relate the intermediate macroscale variables to the macroscale variables defined on our simulation coarse grid. Our intermediate coarse model allows formulating the macroscale variables correctly and coupling them to the simulation grid. We present the general concept of our approach and give details for single-phase flow. Some numerical results are presented. For nonlinear examples, we use machine learning techniques to compute the macroscale parameters.
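For orientation, a standard RVE computation for single-phase flow determines effective coefficients from local cell problems; in the approach above, such RVE information is related to the intermediate-scale macroscale variables rather than used directly. A generic schematic in illustrative notation, not the paper's exact formulation:

% Standard RVE cell problems and effective coefficient (illustrative)
\begin{align*}
  -\nabla\cdot\big(\kappa(x)\,\nabla\phi_j\big) &= 0 \ \text{ in } R,
  \qquad \phi_j = x_j \ \text{ on } \partial R,\\
  \kappa^{*}_{ij} &= \frac{1}{|R|}\int_{R} \kappa(x)\,\partial_{x_i}\phi_j\,dx,
\end{align*}

where $R$ denotes the RVE and $\kappa^{*}$ the resulting effective coefficient.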
We consider fast deterministic algorithms to identify the best linearly independent terms in multivariate mixtures and use them to compute, up to a user-selected accuracy, an equivalent representation with fewer terms. One algorithm employs a pivoted Cholesky decomposition of the Gram matrix constructed from the terms of the mixture to select what we call skeleton terms, and the other uses orthogonalization for the same purpose. Importantly, the multivariate mixtures do not have to be a separated representation of a function. Both algorithms require $O(r^2 N + p(d)\, r N)$ operations, where $N$ is the initial number of terms in the multivariate mixture, $r$ is the number of selected linearly independent terms, and $p(d)$ is the cost of computing the inner product between two terms of a mixture in $d$ variables. For general Gaussian mixtures $p(d) \sim d^3$, since we need to diagonalize a $d \times d$ matrix, whereas for separated representations $p(d) \sim d$. Due to conditioning issues, the resulting accuracy is limited to about one half of the available significant digits for both algorithms. We also describe an alternative algorithm that is capable of achieving higher accuracy but is only applicable in low dimensions or to multivariate mixtures in separated form. We describe a number of initial applications of these algorithms to solve partial differential and integral equations and to address several problems in data science. For data science applications in high dimensions, we consider the kernel density estimation (KDE) approach for constructing a probability density function (PDF) of a cloud of points, a far-field kernel summation method, and the construction of equivalent sources for non-oscillatory kernels (used in both computational physics and data science); finally, we show how to use the new algorithm to produce seeds for subdividing a cloud of points into groups.
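The pivoted Cholesky selection described above can be sketched as follows. This is a generic illustration, not the authors' code: given the Gram matrix G[i, j] = <g_i, g_j> of the mixture terms, pivots are chosen greedily until the residual diagonal (the squared norm of the unexplained part of each term) falls below a tolerance, and the chosen pivots index the skeleton terms.

# Generic pivoted Cholesky skeleton selection on a Gram matrix (illustrative).
import numpy as np

def pivoted_cholesky_skeleton(G, tol=1e-8):
    n = G.shape[0]
    d = np.diag(G).astype(float).copy()        # residual diagonal
    L = np.zeros((n, 0))
    pivots = []
    while d.max() > tol * max(np.diag(G).max(), 1.0):
        k = int(np.argmax(d))                  # next skeleton term: largest residual
        pivots.append(k)
        # New Cholesky column restricted to the previously chosen pivots.
        col = (G[:, k] - L @ L[k, :]) / np.sqrt(d[k])
        L = np.column_stack([L, col])
        d = np.maximum(np.diag(G) - np.sum(L**2, axis=1), 0.0)
    return pivots, L                           # L approximates G ~ L @ L.T

# Toy example: a Gram matrix of nearly linearly dependent terms.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))                # 6 terms spanning a 3-dimensional space
G = A @ A.T
pivots, L = pivoted_cholesky_skeleton(G)
print(pivots)                                  # roughly 3 selected skeleton terms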
Bioinformatics pipelines depend on shared POSIX filesystems for their input, output, and intermediate data storage. Containerization makes it more difficult for such workloads to access shared file systems. In our previous study, we were able to run both ML and non-ML pipelines on Kubeflow successfully; however, the storage solutions were complex and suboptimal. This is because there are no established resource types to represent the concept of a data source on Kubernetes. As more and more applications run on Kubernetes for batch processing, end users are burdened with configuring and optimizing data access, which is what we have experienced before. In this article, we introduce the new concept of a Dataset and its corresponding resource as a native Kubernetes object. We leverage the Dataset Lifecycle Framework (DLF), which takes care of all the low-level details of data access in Kubernetes pods. Its pluggable architecture is designed for the development of caching, scheduling, and governance plugins. Together, they manage the entire lifecycle of the Dataset custom resource. We use the Dataset Lifecycle Framework to serve data from object stores to both ML and non-ML pipelines running on Kubeflow. With DLF, training data is fed into ML models directly without being downloaded to local disks, which makes the input scalable. We have enhanced the durability of training metadata by storing it in a dataset, which also simplifies the setup of TensorBoard, separated from the notebook server. For the non-ML pipeline, we have simplified the 1000 Genomes Project pipeline with datasets injected into the pipeline dynamically. In addition, our preliminary results indicate that the pluggable caching mechanism can improve performance significantly.
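As a rough illustration of how such a Dataset might be registered programmatically, the sketch below uses the standard Kubernetes Python client to create an object-store-backed Dataset custom resource. The API group, version, and spec field names shown are assumptions modeled on a typical DLF-style Dataset definition and should be checked against the framework's documentation; the endpoint, bucket, and credentials are placeholders.

# Illustrative sketch only: the CRD group/version and spec field names below are
# assumptions; consult the DLF documentation for the actual Dataset schema.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

dataset = {
    "apiVersion": "com.ie.ibm.hpsys/v1alpha1",   # assumed group/version
    "kind": "Dataset",
    "metadata": {"name": "genomes-input"},
    "spec": {
        "local": {
            "type": "COS",                        # object-store backing (assumed field names)
            "endpoint": "https://s3.example.com", # placeholder endpoint
            "bucket": "example-bucket",           # placeholder bucket
            "accessKeyID": "ACCESS_KEY",          # placeholder credentials
            "secretAccessKey": "SECRET_KEY",
        }
    },
}

# Pods in the pipeline then reference the Dataset by name, and DLF handles
# mounting or caching the data behind the scenes.
api.create_namespaced_custom_object(
    group="com.ie.ibm.hpsys", version="v1alpha1",
    namespace="default", plural="datasets", body=dataset,
)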
The amount of data in our society has been exploding in today's era of big data. In this paper, we address several open challenges of big data stream classification, including high volume, high velocity, high dimensionality, high sparsity, and high class imbalance. Many existing studies in the data mining literature solve data stream classification tasks in a batch learning setting, which suffers from poor efficiency and scalability when dealing with big data. To overcome these limitations, this paper investigates an online learning framework for big data stream classification tasks. Unlike some existing online data stream classification techniques that are often based on first-order online learning, we propose a framework of Sparse Online Classification (SOC) for data stream classification, which includes some state-of-the-art first-order sparse online learning algorithms as special cases and allows us to derive a new effective second-order online learning algorithm for data stream classification. In addition, we propose a new cost-sensitive sparse online learning algorithm by extending the framework, with application to online anomaly detection tasks where the class distribution of the data can be highly imbalanced. We also analyze the theoretical bounds of the proposed method, and finally conduct an extensive set of experiments, in which encouraging results validate the efficacy of the proposed algorithms in comparison to a family of state-of-the-art techniques on a variety of data stream classification tasks.
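To make the first-order case concrete, one representative sparse online update of the kind such a framework covers is an online gradient step on the logistic loss followed by L1 soft-thresholding. The sketch below is generic and illustrative, not the paper's exact algorithm, and the data stream is synthetic.

# Generic first-order sparse online classifier: online gradient step on the
# logistic loss followed by an L1 soft-thresholding (proximal) step.
import numpy as np

def soft_threshold(w, tau):
    # Proximal operator of the L1 norm: shrinks weights toward zero to induce sparsity.
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def sparse_online_logistic(stream, dim, eta=0.1, lam=0.01):
    w = np.zeros(dim)
    for x, y in stream:                            # labels y in {-1, +1}
        margin = y * (w @ x)
        grad = -y * x / (1.0 + np.exp(margin))     # gradient of log(1 + exp(-y w.x))
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

# Toy stream: two informative features among many irrelevant ones.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(1000))
w = sparse_online_logistic(zip(X, y), dim=50)
print(np.count_nonzero(w), "nonzero weights out of", w.size)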