This paper establishes unified frameworks of renewable weighted sums (RWS) for various online updating estimations in models with streaming data sets. The newly defined RWS lays the foundation for the online updating likelihood, online updating loss function, online updating estimating equation, and so on. The idea of RWS is intuitive and heuristic, and the algorithm is computationally simple. This paper chooses the nonparametric model as an exemplary setting. The RWS applies to various types of nonparametric estimators, including but not limited to nonparametric likelihood, quasi-likelihood and least squares. Furthermore, the method and theory can be extended to models containing both parametric and nonparametric components. The estimation consistency and asymptotic normality of the proposed renewable estimator are established, and the oracle property is obtained. Moreover, these properties hold without any constraint on the number of data batches, which means that the new method is adaptive to situations where streaming data sets arrive perpetually. The behavior of the method is further illustrated by various numerical examples from simulation experiments and a real data analysis.
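The RWS construction itself is not reproduced in the abstract, but its flavor in the simplest special case, least squares with streaming batches, can be sketched as below. This is a minimal illustration under our own assumptions (fixed design dimension, linear model); only low-dimensional summary statistics, not the raw batches, are retained across updates, and the class and function names are ours rather than the paper's.

```python
import numpy as np

class RenewableLeastSquares:
    """Online updating least squares: each batch contributes only
    low-dimensional summary statistics, never raw data."""

    def __init__(self, p):
        self.XtX = np.zeros((p, p))   # accumulated sum of X_b^T X_b
        self.Xty = np.zeros(p)        # accumulated sum of X_b^T y_b

    def update(self, X_b, y_b):
        # Renew the summaries with the new batch, then discard the batch.
        self.XtX += X_b.T @ X_b
        self.Xty += X_b.T @ y_b
        return self

    def estimate(self):
        # Identical to the estimator computed on the pooled data.
        return np.linalg.solve(self.XtX, self.Xty)

# Example: three batches arriving sequentially
rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5])
rls = RenewableLeastSquares(p=3)
for _ in range(3):
    X_b = rng.normal(size=(100, 3))
    y_b = X_b @ beta + rng.normal(scale=0.1, size=100)
    rls.update(X_b, y_b)
print(rls.estimate())   # close to beta, with no raw data stored
```

The least-squares case is special in that the accumulated summaries are exactly sufficient; the paper's RWS framework is what extends this updating pattern to likelihoods, loss functions, and estimating equations.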
In the environment of big data streams, it is common for the variable set of a model to change according to the condition of the data streams. In this paper, we propose a homogenization strategy to represent the heterogeneous models that are gradually updated in the process of data streams. With the homogenized representations, we can easily construct various online updating statistics such as the parameter estimator, residual sum of squares and $F$-statistic for the heterogeneous updating regression models. The main difference from the classical scenarios is that the artificial covariates in the homogenized models are not identically distributed as the natural covariates in the original models, and consequently the related theoretical properties are distinct from the classical ones. The asymptotic properties of the online updating statistics are established, which show that the new method can achieve estimation efficiency and the oracle property, without any constraint on the number of data batches. The behavior of the method is further illustrated by various numerical examples from simulation experiments.
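The abstract does not spell out the homogenized representation, so the sketch below uses one deliberately simple device of our own: embed each batch's design matrix into a common, enlarged variable set, filling the columns a batch does not observe with artificial values (zeros here), after which the usual online updating summaries can be accumulated on a fixed dimension. This is only an illustration of the difficulty being addressed, not necessarily the paper's construction; as the abstract notes, such artificial covariates are not identically distributed as the natural ones.

```python
import numpy as np

def homogenize(X_b, batch_vars, all_vars):
    """Embed a batch observing `batch_vars` into the common variable set
    `all_vars`, filling unobserved columns artificially (zeros here)."""
    X_full = np.zeros((X_b.shape[0], len(all_vars)))
    X_full[:, [all_vars.index(v) for v in batch_vars]] = X_b
    return X_full

# Batch 1 observes (x1, x2); batch 2 observes (x1, x3).
all_vars = ["x1", "x2", "x3"]
rng = np.random.default_rng(1)
XtX, Xty = np.zeros((3, 3)), np.zeros(3)
for batch_vars in (["x1", "x2"], ["x1", "x3"]):
    X_b = rng.normal(size=(50, len(batch_vars)))
    y_b = X_b.sum(axis=1) + rng.normal(scale=0.1, size=50)
    X_h = homogenize(X_b, batch_vars, all_vars)
    XtX += X_h.T @ X_h          # online updating summaries on the
    Xty += X_h.T @ y_b          # homogenized, common-dimension design
print(np.linalg.solve(XtX, Xty))   # estimates for the common variable set
```

Quantities such as the residual sum of squares or an $F$-statistic can be assembled from the same accumulated summaries, which is the kind of online updating statistic the paper studies for heterogeneous models.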
In the research field of big data, one of the important issues is how to recover the sequentially changing sets of true features when data sets arrive in a streaming fashion. This paper presents a general framework for online updating variable selection and parameter estimation in generalized linear models with streaming data sets. The framework is a type of online updating penalized likelihood with a differentiable or non-differentiable penalty function. An online updating coordinate descent algorithm is proposed to solve the online updating optimization problem. Moreover, a tuning parameter selection procedure is suggested in an online updating manner. The selection consistency, estimation consistency, and oracle property are established theoretically. Our methods are further examined and illustrated by various numerical examples from both simulation experiments and a real data analysis.
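As one hedged illustration of how an online updating penalized fit can be organized, the sketch below accumulates least-squares summary statistics over the batches and runs coordinate descent with a soft-thresholding step. This is the Gaussian working model with an L1 penalty, written in our own notation; the paper's framework covers general generalized linear models, other (possibly non-differentiable) penalties, and online tuning parameter selection, none of which are reproduced here.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def renew_summaries(XtX, Xty, n, X_b, y_b):
    """Fold a new batch into the accumulated summaries, then discard it."""
    return XtX + X_b.T @ X_b, Xty + X_b.T @ y_b, n + X_b.shape[0]

def lasso_from_summaries(XtX, Xty, n, lam, n_iter=200):
    """Coordinate descent for the L1-penalized least-squares fit,
    using only the accumulated summaries."""
    p = len(Xty)
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = Xty[j] - XtX[j] @ beta + XtX[j, j] * beta[j]
            beta[j] = soft_threshold(r_j / n, lam) / (XtX[j, j] / n)
    return beta

rng = np.random.default_rng(2)
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
XtX, Xty, n = np.zeros((5, 5)), np.zeros(5), 0
for _ in range(4):                       # streaming batches
    X_b = rng.normal(size=(200, 5))
    y_b = X_b @ beta_true + rng.normal(size=200)
    XtX, Xty, n = renew_summaries(XtX, Xty, n, X_b, y_b)
    beta_hat = lasso_from_summaries(XtX, Xty, n, lam=0.1)
print(beta_hat)   # the null coefficients are shrunk exactly to zero
```

Refitting from the summaries after each batch, as above, is what lets the selected variable set adapt to new data without revisiting earlier batches.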
In computational inverse problems, it is common that a detailed and accurate forward model is approximated by a computationally less challenging substitute. The model reduction may be necessary to meet constraints on computing time when optimization algorithms are used to find a single estimate, or to speed up Markov chain Monte Carlo (MCMC) calculations in the Bayesian framework. The use of an approximate model introduces a discrepancy, or modeling error, that may have a detrimental effect on the solution of the ill-posed inverse problem, or it may severely distort the estimate of the posterior distribution. In the Bayesian paradigm, the modeling error can be considered as a random variable, and by using an estimate of the probability distribution of the unknown, one may estimate the probability distribution of the modeling error and incorporate it into the inversion. We introduce an algorithm which iterates this idea to update the distribution of the model error, leading to a sequence of posterior distributions that are demonstrated empirically to capture the underlying truth with increasing accuracy. Since the algorithm is not based on rejections, it requires only a limited number of full model evaluations. We show analytically that, in the linear Gaussian case, the algorithm converges geometrically fast with respect to the number of iterations. For more general models, we introduce particle approximations of the iteratively generated sequence of distributions; we also prove that each element of the sequence converges in the large particle limit. We show numerically that, as in the linear case, rapid convergence occurs with respect to the number of iterations. Additionally, we show through computed examples that point estimates obtained from this iterative algorithm are superior to those obtained by neglecting the model error.
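For the linear Gaussian case the abstract refers to, the iteration can be made concrete in a small toy problem: an accurate forward operator A is replaced by a cheap operator A_r, the modeling error e = (A - A_r)x is modeled as Gaussian, and its mean and covariance are re-estimated from samples of the current posterior, which in turn defines the next posterior. The operators, dimensions and priors below are ours, chosen only to make the loop explicit; this is a sketch, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 20, 15
A = rng.normal(size=(m, n))                  # "accurate" forward model
A_r = A + 0.05 * rng.normal(size=(m, n))     # cheap approximate model
x_true = rng.normal(size=n)
sigma = 0.01
y = A @ x_true + sigma * rng.normal(size=m)  # data from the accurate model

prior_cov = np.eye(n)                        # x ~ N(0, I) prior
mean_e, cov_e = np.zeros(m), np.zeros((m, m))  # initial model error statistics

for it in range(10):
    # Gaussian posterior under y ~ A_r x + e + noise, with e ~ N(mean_e, cov_e)
    noise_cov = sigma**2 * np.eye(m) + cov_e
    K = prior_cov @ A_r.T @ np.linalg.inv(A_r @ prior_cov @ A_r.T + noise_cov)
    post_mean = K @ (y - mean_e)
    post_cov = prior_cov - K @ A_r @ prior_cov
    post_cov = 0.5 * (post_cov + post_cov.T)        # symmetrize numerically
    # Re-estimate the modeling error distribution from posterior samples of x
    xs = rng.multivariate_normal(post_mean, post_cov, size=500)
    es = xs @ (A - A_r).T
    mean_e, cov_e = es.mean(axis=0), np.cov(es, rowvar=False)
    print(it, np.linalg.norm(post_mean - x_true))   # error typically shrinks
```

In this linear Gaussian setting the abstract states that the iteration converges geometrically in the number of iterations; for nonlinear models the same loop is carried by particle approximations of the generated distributions.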
Online image hashing, which processes large-scale data in a streaming fashion to update the hash functions on the fly, has received increasing research attention recently. Most existing works study this problem in a supervised setting, i.e., using class labels to boost the hashing performance, which suffers from defects in both adaptivity and efficiency: First, large amounts of training batches are required to learn up-to-date hash functions, which leads to poor online adaptivity. Second, the training is time-consuming, which contradicts the core need of online learning. In this paper, a novel supervised online hashing scheme, termed Fast Class-wise Updating for Online Hashing (FCOH), is proposed to address the above two challenges by introducing a novel and efficient inner product operation. To achieve fast online adaptivity, a class-wise updating method is developed to decompose the binary code learning and alternately renew the hash functions in a class-wise fashion, which relieves the burden of requiring large amounts of training batches. Quantitatively, such a decomposition further leads to at least 75% storage savings. To further achieve online efficiency, we propose a semi-relaxation optimization, which accelerates the online training by treating different binary constraints independently. Without additional constraints and variables, the time complexity is significantly reduced. Such a scheme is also quantitatively shown to preserve past information well while updating the hash functions. We have quantitatively demonstrated that the collective effort of class-wise updating and semi-relaxation optimization provides superior performance compared to various state-of-the-art methods, as verified through extensive experiments on three widely used datasets.
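FCOH's specific inner-product formulation, class-wise decomposition, and semi-relaxation are not detailed in the abstract, so the snippet below shows only a generic, heavily simplified version of the underlying idea: keep class-wise accumulated feature statistics and refit a linear hash projection from those summaries whenever a batch arrives. The target-code construction and the ridge-regression refit are our own illustrative choices and should not be read as the FCOH algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
d, r, C = 32, 16, 10                    # feature dim, code length, classes
T = np.sign(rng.normal(size=(C, r)))    # fixed +/-1 target code per class
XtX = np.zeros((d, d))
S = np.zeros((C, d))                    # class-wise accumulated feature sums

for _ in range(3):                      # streaming batches
    X_b = rng.normal(size=(200, d))
    labels_b = rng.integers(0, C, size=200)
    XtX += X_b.T @ X_b
    for c in range(C):
        S[c] += X_b[labels_b == c].sum(axis=0)
    # Ridge-regression refit of the hash projection W from summaries only:
    # W minimizes ||X W - T[labels]||^2 + ||W||^2 over all data seen so far.
    W = np.linalg.solve(XtX + np.eye(d), S.T @ T)

codes = np.sign(rng.normal(size=(5, d)) @ W)   # hash a few new queries
```

The storage here stays O(Cd + d^2) no matter how many batches have arrived, which is the kind of saving class-wise accumulation buys; FCOH's reported gains come from its own decomposition and semi-relaxation steps.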
This paper discusses an alternative to conditioning that may be used when the probability distribution is not fully specified. It does not require any assumptions (such as CAR: coarsening at random) on the unknown distribution. The well-known Monty Hall problem is the simplest scenario where neither naive conditioning nor the CAR assumption suffices to determine an updated probability distribution. This paper thus addresses a generalization of that problem to arbitrary distributions on finite outcome spaces, arbitrary sets of messages, and (almost) arbitrary loss functions, and provides existence and characterization theorems for robust probability updating strategies. We find that for logarithmic loss, optimality is characterized by an elegant condition, which we call RCAR (reverse coarsening at random). Under certain conditions, the same condition also characterizes optimality for a much larger class of loss functions, and we obtain an objective and general answer to how one should update probabilities in the light of new information.
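To make the Monty Hall instance concrete, the worst-case calculation under logarithmic loss can be carried out explicitly. The notation below is ours (car behind door $X$, uniform on $\{1,2,3\}$; contestant holds door 1; the quizmaster opens door 2 or 3, never door 1 or the car's door) and only sketches the minimax problem that the paper treats in full generality.

```latex
% Messages: y_2 = \{1,3\} (door 2 opened) and y_3 = \{1,2\} (door 3 opened).
% The quizmaster opens door 2 with unknown probability p when X=1; the
% contestant announces Q(1 \mid y_2) = a and Q(1 \mid y_3) = b.  The expected
% logarithmic loss is
\[
  R(p;a,b) \;=\; \tfrac13\bigl[\,p(-\log a) + (1-p)(-\log b)\,\bigr]
           \;+\; \tfrac13\bigl(-\log(1-a)\bigr)
           \;+\; \tfrac13\bigl(-\log(1-b)\bigr).
\]
% The adversary's best response concentrates p on whichever of a, b is
% smaller, so the minimax strategy takes a = b; minimizing
% (1/3)(-\log a) + (2/3)(-\log(1-a)) then gives a = 1/3.
```

The robust update therefore puts probability 1/3 on the contestant's own door and 2/3 on the remaining closed door, irrespective of the quizmaster's unknown strategy; this is the kind of worst-case optimal update that the paper characterizes through the RCAR condition.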