Prediction rule ensembles (PREs) are a relatively new statistical learning method that aims to strike a balance between predictive accuracy and interpretability. Starting from a decision tree ensemble, such as a boosted tree ensemble or a random forest, PREs retain only a small subset of the tree nodes in the final predictive model. These nodes can be written as simple rules of the form if [condition] then [prediction]. As a result, PREs are often much less complex than full decision tree ensembles, while they have been found to provide similar predictive accuracy in many situations. The current paper introduces the methodology and shows how PREs can be fitted using the R package pre through several real-data examples from psychological research. The examples also illustrate a number of features of the pre package that may be particularly useful for applications in psychology: support for categorical, multivariate and count responses, application of (non-)negativity constraints, inclusion of confirmatory rules and standardized variable importance measures.
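As a rough illustration of this workflow, the following R sketch fits a PRE with the pre package and inspects the resulting rules, variable importances and predictions; the data frame depr and the response depression are hypothetical placeholders rather than data from the paper, and additional arguments (e.g. the family argument for categorical, multivariate or count responses) may need to be adapted to the installed package version.

    library("pre")

    ## Hypothetical data: a data frame 'depr' with a numeric response
    ## 'depression' and a set of candidate predictor variables.
    set.seed(42)                          # tree induction involves randomness
    fit <- pre(depression ~ ., data = depr)

    print(fit)                            # selected rules and their coefficients
    importance(fit)                       # standardized variable importance measures
    predict(fit, newdata = depr[1:5, ])   # predictions for new observations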
Prediction rule ensembles (PREs) are sparse collections of rules, offering highly interpretable regression and classification models. This paper presents the R package pre, which derives PREs using the methodology of Friedman and Popescu (2008). The implementation and functionality of package pre are described and illustrated through an application to a dataset on the prediction of depression. Furthermore, the accuracy and sparsity of PREs are compared with those of single trees, random forests and lasso regression on four benchmark datasets. Results indicate that pre derives ensembles with predictive accuracy comparable to that of random forests, while using a smaller number of variables for prediction.
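A minimal sketch of how such a comparison could be set up with the package's own k-fold cross-validation routine is given below; cvpre() is part of the pre package, the data frame depr is again a hypothetical placeholder, and the random forest and lasso baselines mentioned above would be fitted with their respective packages (not shown).

    library("pre")

    set.seed(42)
    fit <- pre(depression ~ ., data = depr)

    ## Cross-validated predictive accuracy of the fitted ensemble.
    cv <- cvpre(fit, k = 10)

    ## Sparsity: inspect the rules and linear terms retained in the
    ## final model (the exact format of the output depends on the
    ## package version).
    coef(fit)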
With the availability of massive amounts of data from electronic health records and registry databases, incorporating time-varying patient information to improve risk prediction has attracted great attention. To exploit the growing amount of predictor information over time, we develop a unified framework for landmark prediction using survival tree ensembles, in which an updated prediction can be performed whenever new information becomes available. Compared with conventional landmark prediction, our framework enjoys great flexibility in that the landmark times can be subject-specific and triggered by an intermediate clinical event. Moreover, the nonparametric approach circumvents the thorny issue of model incompatibility at different landmark times. When both the longitudinal predictors and the outcome event time are subject to right censoring, existing tree-based approaches cannot be applied directly. To tackle the analytical challenges, we consider a risk-set-based ensemble procedure that averages martingale estimating equations from individual trees. Extensive simulation studies are conducted to evaluate the performance of our methods. The methods are applied to the Cystic Fibrosis Foundation Patient Registry (CFFPR) data to perform dynamic prediction of lung disease in cystic fibrosis patients and to identify important prognostic factors.
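As a brief sketch of the prediction target (generic notation, assumed here rather than taken from the paper): for an event time T, a landmark time s and the covariate history \bar{Z}(s) accumulated up to s, a landmark prediction rule estimates the conditional survival probability

    \pi(t \mid s) = \Pr\bigl( T > t \mid T > s,\, \bar{Z}(s) \bigr), \qquad t > s,

which is re-estimated each time the landmark s advances, here possibly at subject-specific times or at the occurrence of an intermediate clinical event, so that newly accrued predictor information in \bar{Z}(s) is incorporated into the updated prediction.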
Prior to adjustment, accounting conditions between national accounts data sets are frequently violated. Benchmarking is the procedure used by economic agencies to make such data sets consistent. It typically involves adjusting a high-frequency time series (e.g. quarterly data) so that it becomes consistent with a lower-frequency version (e.g. annual data). Various methods have been developed to approach this problem of inconsistency between data sets. This paper introduces a new statistical procedure, namely wavelet benchmarking. Wavelet properties allow high- and low-frequency processes to be jointly analysed, and we show that benchmarking can be formulated and approached succinctly in the wavelet domain. Furthermore, the time and frequency localisation properties of wavelets are ideal for handling more complicated benchmarking problems. The versatility of the procedure is demonstrated using simulation studies in which we provide evidence that it substantially outperforms currently used methods. Finally, we apply this novel method of wavelet benchmarking to official Office for National Statistics (ONS) data.
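For concreteness, the generic benchmarking problem can be sketched as follows (standard notation, not the wavelet formulation of the paper): given a quarterly indicator series x_1, ..., x_{4n} and annual benchmarks a_1, ..., a_n, the adjusted series \hat{x}_t must satisfy the accounting constraints

    \sum_{t \in \text{year } y} \hat{x}_t = a_y, \qquad y = 1, \dots, n,

while preserving, as far as possible, the period-to-period movements of the original series; Denton-type procedures, for example, minimise a penalty on the revisions subject to these constraints, and the wavelet approach recasts this constrained adjustment in the wavelet domain.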
Cells process external and internal signals through chemical interactions. Cells that constitute the immune system (e.g., antigen-presenting cell, T-cell, B-cell, mast cell) can have different functions (e.g., adaptive memory, inflammatory response) depending on the type and number of receptor molecules on the cell surface and the specific intracellular signaling pathways activated by those receptors. Explicitly modeling and simulating kinetic interactions between molecules allows us to pose questions about the dynamics of a signaling network under various conditions. However, the application of chemical kinetics to biochemical signaling systems has been limited by the complexity of the systems under consideration. Rule-based modeling (BioNetGen, Kappa, Simmune, PySB) is an approach to address this complexity. In this chapter, by application to the FcεRI receptor system, we will explore the origins of complexity in macromolecular interactions, show how rule-based modeling can be used to address complexity, and demonstrate how to build a model in the BioNetGen framework. Open-source BioNetGen software and documentation are available at http://bionetgen.org.
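A simple counting argument illustrates the kind of complexity referred to above (a generic sketch, not the specific FcεRI model of the chapter): a single receptor with n sites that can each independently be in one of two states (e.g. unphosphorylated or phosphorylated) gives rise to

    \#\{\text{species}\} = 2^{n}, \qquad \#\{\text{rules}\} = n,

so an explicitly enumerated reaction network grows exponentially with the number of sites, whereas the rule-based description grows only linearly; BioNetGen generates the full network, or simulates it directly, from such a compact set of rules.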
We develop a new methodology for spatial regression of aggregated outputs on multi-resolution covariates. Such problems often occur with spatial data, for example in crop yield prediction, where the output is spatially aggregated over an area and the covariates may be observed at multiple resolutions. Building upon previous work on aggregated output regression, we propose a regression framework that synthesises the effects of the covariates at different resolutions on the output and provides uncertainty estimates. We show that, for a crop yield prediction problem, our approach is more scalable, via variational inference, than existing multi-resolution regression models. We also show that our framework yields good predictive performance compared with existing multi-resolution crop yield models, whilst being able to provide estimates of the underlying spatial effects.
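As a hedged sketch of the aggregated-output setting (generic notation, not necessarily that of the paper): if f denotes the latent spatial effect and region A_j yields a single aggregated output y_j, a typical model takes the form

    y_j = \frac{1}{|A_j|} \int_{A_j} f\bigl( x(s) \bigr)\, \mathrm{d}s + \varepsilon_j, \qquad \varepsilon_j \sim \mathcal{N}(0, \sigma^2),

where x(s) collects the covariates observed at location s, possibly at several resolutions, and placing a prior on f (e.g. a Gaussian process) allows approximate posterior inference, for instance via variational inference, to deliver both predictions and uncertainty estimates for the underlying spatial effects.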