
Computationally efficient univariate filtering for massive data

Published by Michail Tsagris
Publication date: 2020
Research field: Mathematical Statistics
Research language: English





The vast availability of large-scale, massive and big data has increased the computational cost of data analysis. One such case is the computational cost of univariate filtering, which typically involves fitting many univariate regression models and is essential for numerous variable selection algorithms to reduce the number of predictor variables. The paper shows how to dramatically reduce that computational cost by employing the score test or the simple Pearson correlation (or the t-test for binary responses). Extensive Monte Carlo simulation studies demonstrate their advantages and disadvantages compared to the likelihood ratio test, and examples with real data illustrate the performance of the score test and the log-likelihood ratio test under realistic scenarios. Depending on the regression model used, the score test is 30 to 60,000 times faster than the log-likelihood ratio test and produces nearly the same results. Hence this paper strongly recommends substituting the score test for the log-likelihood ratio test when coping with large-scale data, massive data, big data, or even data whose sample size is on the order of a few tens of thousands or higher.
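To make the idea of correlation-based filtering concrete, the following is a minimal sketch, not the paper's implementation. It assumes a continuous response and a linear model, where the score test of a single predictor against the intercept-only model reduces to the statistic n·r², referred to a chi-square distribution with one degree of freedom. The function name `correlation_filter`, the simulated data, and the 0.01 threshold are all illustrative assumptions. The point is that all predictors are screened with one matrix-vector product instead of one fitted regression per column.

```python
import math
import numpy as np

def correlation_filter(X, y):
    """Screen every column of X against y without fitting any regression.

    Returns the Pearson correlations, the statistics n * r^2 and their
    chi-square (1 df) p-values. Hypothetical helper for illustration only;
    constant columns are not handled (they would divide by zero).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # centre each predictor
    yc = y - y.mean()                # centre the response
    # One matrix-vector product gives all covariances at once.
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    stat = n * r ** 2
    # For 1 degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    pval = np.array([math.erfc(math.sqrt(s / 2.0)) for s in stat])
    return r, stat, pval

# Simulated example (assumed data): only the first predictor is relevant.
rng = np.random.default_rng(0)
n, p = 5000, 200
X = rng.standard_normal((n, p))
y = 0.5 * X[:, 0] + rng.standard_normal(n)

r, stat, pval = correlation_filter(X, y)
selected = np.flatnonzero(pval < 0.01)   # indices passing the filter
```

Because no iterative model fitting is involved, the cost is dominated by a single pass over the data matrix. For generalized linear models the score statistic takes a different form, and the abstract above does not specify it, so this sketch should not be read as covering those cases.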




Read also

Hua Liu, Jinhong You, Jiguo Cao (2021)
Massive data pose major memory and computation challenges for analysis. These challenges can be tackled by taking subsamples from the full data as a surrogate. For functional data, it is common to collect multiple measurements over their domains, which requires even more memory and computation time when the sample size is large. The computation becomes much more intensive when statistical inference is required through bootstrap samples. To the best of our knowledge, this article is the first attempt to study the subsampling method for the functional linear model. We propose an optimal subsampling method based on the functional L-optimality criterion. When the response is a discrete or categorical variable, we further extend our proposed functional L-optimality subsampling (FLoS) method to the functional generalized linear model. We establish the asymptotic properties of the estimators obtained by the FLoS method. The finite-sample performance of our proposed FLoS method is investigated by extensive simulation studies. The FLoS method is further demonstrated by analyzing two large-scale datasets: the global climate data and the kidney transplant data. The analysis results on these data show that the FLoS method is much better than the uniform subsampling approach and can closely approximate the results based on the full data while dramatically reducing the computation time and memory.
Spatio-temporal data sets are rapidly growing in size. For example, environmental variables are measured with ever-higher resolution by increasing numbers of automated sensors mounted on satellites and aircraft. Using such data, which are typically noisy and incomplete, the goal is to obtain complete maps of the spatio-temporal process, together with proper uncertainty quantification. We focus here on real-time filtering inference in linear Gaussian state-space models. At each time point, the state is a spatial field evaluated on a very large spatial grid, making exact inference using the Kalman filter computationally infeasible. Instead, we propose a multi-resolution filter (MRF), a highly scalable and fully probabilistic filtering method that resolves spatial features at all scales. We prove that the MRF matrices exhibit a particular block-sparse multi-resolution structure that is preserved under filtering operations through time. We also discuss inference on time-varying parameters using an approximate Rao-Blackwellized particle filter, in which the integrated likelihood is computed using the MRF. We compare the MRF to existing approaches in a simulation study and a real satellite-data application.
An emulator is a fast-to-evaluate statistical approximation of a detailed mathematical model (simulator). When used in lieu of simulators, emulators can expedite tasks that require many repeated evaluations, such as sensitivity analyses, policy optimization, model calibration, and value-of-information analyses. Emulators are developed using the output of simulators at specific input values (design points). Developing an emulator that closely approximates the simulator can require many design points, which becomes computationally expensive. We describe a self-terminating active learning algorithm to efficiently develop emulators tailored to a specific emulation task, and compare it with algorithms that optimize geometric criteria (random Latin hypercube sampling and maximum projection designs) and other active learning algorithms (treed Gaussian Processes that optimize typical active learning criteria). We compared the algorithms' root mean square error (RMSE) and maximum absolute deviation from the simulator (MAX) for seven benchmark functions and in a prostate cancer screening model. In the empirical analyses, in simulators with greatly varying smoothness over the input domain, active learning algorithms resulted in emulators with smaller RMSE and MAX for the same number of design points. In all other cases, all algorithms performed comparably. The proposed algorithm attained satisfactory performance in all analyses, had smaller variability than the treed Gaussian Processes (it is deterministic), and, on average, had performance similar to or better than the treed Gaussian Processes in 6 out of 7 benchmark functions and in the prostate cancer model.
In forecasting problems it is important to know whether or not recent events represent a regime change (low long-term predictive potential), or rather a local manifestation of longer-term effects (potentially higher predictive potential). Mathematically, a key question is whether the underlying stochastic process exhibits memory, and if so whether the memory is long in a precise sense. Being able to detect or rule out such effects can have a profound impact on speculative investment (e.g., in financial markets) and inform public policy (e.g., characterising the size and timescales of the Earth system's response to the anthropogenic CO2 perturbation). Most previous work on inference of long memory effects is frequentist in nature. Here we provide a systematic treatment of Bayesian inference for long memory processes via the Autoregressive Fractional Integrated Moving Average (ARFIMA) model. In particular, we provide a new approximate likelihood for efficient parameter inference, and show how nuisance parameters (e.g., short memory effects) can be integrated over in order to focus on long memory parameters and hypothesis testing more directly than ever before. We illustrate our new methodology on both synthetic and observational data, with favorable comparison to the standard estimators.
Matthias Katzfuss (2015)
Automated sensing instruments on satellites and aircraft have enabled the collection of massive amounts of high-resolution observations of spatial fields over large spatial regions. If these datasets can be efficiently exploited, they can provide new insights on a wide variety of issues. However, traditional spatial-statistical techniques such as kriging are not computationally feasible for big datasets. We propose a multi-resolution approximation (M-RA) of Gaussian processes observed at irregular locations in space. The M-RA process is specified as a linear combination of basis functions at multiple levels of spatial resolution, which can capture spatial structure from very fine to very large scales. The basis functions are automatically chosen to approximate a given covariance function, which can be nonstationary. All computations involving the M-RA, including parameter inference and prediction, are highly scalable for massive datasets. Crucially, the inference algorithms can also be parallelized to take full advantage of large distributed-memory computing environments. In comparisons using simulated data and a large satellite dataset, the M-RA outperforms a related state-of-the-art method.