ترغب بنشر مسار تعليمي؟ اضغط هنا

fplyr: the split-apply-combine strategy for big data in R

48   0   0.0 ( 0 )
 نشر من قبل Federico Marotta
 تاريخ النشر 2020
  مجال البحث الاحصاء الرياضي
والبحث باللغة English
 تأليف Federico Marotta




اسأل ChatGPT حول البحث

We present fplyr, a new package for the R language to deal with big files. It allows users to easily implement the split-apply-combine strategy for files that are too big to fit into the available memory, without relying on data bases nor introducing non-native R classes. A custom function can be applied independently to each group of observations, and the results may be either returned or directly printed to one or more output files.


قيم البحث

اقرأ أيضاً

120 - HaiYing Wang , Yanyuan Ma 2020
We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive tw
We present and describe the GPFDA package for R. The package provides flexible functionalities for dealing with Gaussian process regression (GPR) models for functional data. Multivariate functional data, functional data with multidimensional inputs, and nonseparable and/or nonstationary covariance structures can be modeled. In addition, the package fits functional regression models where the mean function depends on scalar and/or functional covariates and the covariance structure is modeled by a GPR model. In this paper, we present the versatility of GPFDA with respect to mean function and covariance function specifications and illustrate the implementation of estimation and prediction of some models through reproducible numerical examples.
Process data refer to data recorded in the log files of computer-based items. These data, represented as timestamped action sequences, keep track of respondents response processes of solving the items. Process data analysis aims at enhancing educatio nal assessment accuracy and serving other assessment purposes by utilizing the rich information contained in response processes. The R package ProcData presented in this article is designed to provide tools for processing, describing, and analyzing process data. We define an S3 class proc for organizing process data and extend generic methods summary and print for class proc. Two feature extraction methods for process data are implemented in the package for compressing information in the irregular response processes into regular numeric vectors. ProcData also provides functions for fitting and making predictions from a neural-network-based sequence model. These functions call relevant functions in package keras for constructing and training neural networks. In addition, several response process generators and a real dataset of response processes of the climate control item in the 2012 Programme for International Student Assessment are included in the package.
68 - Johannes Buchner 2017
The data torrent unleashed by current and upcoming astronomical surveys demands scalable analysis methods. Many machine learning approaches scale well, but separating the instrument measurement from the physical effects of interest, dealing with vari able errors, and deriving parameter uncertainties is often an after-thought. Classic forward-folding analyses with Markov Chain Monte Carlo or Nested Sampling enable parameter estimation and model comparison, even for complex and slow-to-evaluate physical models. However, these approaches require independent runs for each data set, implying an unfeasible number of model evaluations in the Big Data regime. Here I present a new algorithm, collaborative nested sampling, for deriving parameter probability distributions for each observation. Importantly, the number of physical model evaluations scales sub-linearly with the number of data sets, and no assumptions about homogeneous errors, Gaussianity, the form of the model or heterogeneity/completeness of the observations need to be made. Collaborative nested sampling has immediate application in speeding up analyses of large surveys, integral-field-unit observations, and Monte Carlo simulations.
We introduce the UPG package for highly efficient Bayesian inference in probit, logit, multinomial logit and binomial logit models. UPG offers a convenient estimation framework for balanced and imbalanced data settings where sampling efficiency is en sured through Markov chain Monte Carlo boosting methods. All sampling algorithms are implemented in C++, allowing for rapid parameter estimation. In addition, UPG provides several methods for fast production of output tables and summary plots that are easily accessible to a broad range of users.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا