The R package CVEK introduces a suite of flexible machine learning models and robust hypothesis tests for learning the joint nonlinear effects of multiple covariates in limited samples. It implements the Cross-validated Ensemble of Kernels (CVEK) (Liu and Coull 2017), an ensemble-based kernel machine learning method that adaptively learns the joint nonlinear effect of multiple covariates from data and provides powerful hypothesis tests for both main effects of features and interactions among features. The package offers a flexible, easy-to-use implementation with a wide range of choices for the kernel family (for instance, polynomial, radial basis function, Matérn, and neural network kernels), model selection criteria, ensembling method (averaging, exponential weighting, or cross-validated stacking), and type of hypothesis test (asymptotic or parametric bootstrap). Through extensive simulations, we demonstrate the validity and robustness of this approach and provide practical guidelines on how to design an estimation strategy for optimal performance in different data scenarios.
Many modern statistical applications involve inference for complicated stochastic models for which the likelihood function is difficult or even impossible to calculate, and hence conventional likelihood-based inferential techniques cannot be used. In such settings, Bayesian inference can be performed using Approximate Bayesian Computation (ABC). However, in spite of many recent developments in ABC methodology, in many applications the computational cost of ABC necessitates the choice of summary statistics and tolerances that can severely bias the estimate of the posterior. We propose a new piecewise ABC approach suitable for discretely observed Markov models that involves writing the posterior density of the parameters as a product of factors, each a function of only a subset of the data, and then using ABC within each factor. The approach has the advantage of side-stepping the need to choose a summary statistic, and it enables a stringent tolerance to be set, making the posterior less approximate. We investigate two methods for estimating the posterior density based on ABC samples for each of the factors: the first is to use a Gaussian approximation for each factor, and the second is to use a kernel density estimate. Both methods have their merits. The Gaussian approximation is simple, fast, and probably adequate for many applications. On the other hand, a kernel density estimate has the benefit of consistently estimating the true ABC posterior as the number of ABC samples tends to infinity. We illustrate the piecewise ABC approach with three examples; in each case, the approach enables exact matching between simulations and data and offers fast and accurate inference.
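The Gaussian variant of the combination step can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that each of the m factors is the prior multiplied by one likelihood term, so that the posterior is proportional to the prior raised to the power 1 - m times the product of the factors, and that the prior itself is Gaussian. Under these assumptions the product is again Gaussian and the precisions add. The function name and arguments are hypothetical.

```python
import numpy as np

def combine_gaussian_factors(factor_means, factor_covs, prior_mean, prior_cov):
    """Combine Gaussian approximations of m factor posteriors into one Gaussian.

    Assumption: posterior is proportional to prior^(1 - m) * prod_i factor_i,
    where factor_i = prior x (one likelihood term) and the prior is Gaussian.
    """
    m = len(factor_means)
    prior_prec = np.linalg.inv(np.atleast_2d(prior_cov))
    prior_mean = np.atleast_1d(prior_mean).astype(float)

    # The prior appears m times in the product of factors, but only once in
    # the posterior, so it is corrected with exponent (1 - m).
    precision = (1 - m) * prior_prec
    shift = (1 - m) * prior_prec @ prior_mean

    # Each Gaussian factor contributes its precision and precision-weighted mean.
    for mean, cov in zip(factor_means, factor_covs):
        prec = np.linalg.inv(np.atleast_2d(cov))
        precision = precision + prec
        shift = shift + prec @ np.atleast_1d(mean).astype(float)

    # Valid only if the combined precision is positive definite.
    cov = np.linalg.inv(precision)
    return cov @ shift, cov
```

In practice the factor means and covariances would be estimated from the ABC samples drawn for each factor; with a single factor the prior correction vanishes and the function returns that factor unchanged.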
Stochastic differential equations (SDEs) are established tools for modelling physical phenomena whose dynamics are affected by random noise. By estimating the parameters of an SDE, the intrinsic randomness of a system around its drift can be identified and separated from the drift itself. When it is of interest to model dynamics within a given population, i.e. to model simultaneously the performance of several experiments or subjects, mixed-effects modelling allows between-experiment and within-experiment variability to be distinguished. A framework for modelling dynamics within a population using SDEs is proposed, representing several sources of variation simultaneously: variability between experiments using a mixed-effects approach and stochasticity in the individual dynamics using SDEs. These stochastic differential mixed-effects models have applications in, e.g., pharmacokinetics/pharmacodynamics and biomedical modelling. A parameter estimation method is proposed and computational guidelines for an efficient implementation are given. Finally, the method is evaluated using simulations from standard models such as the two-dimensional Ornstein-Uhlenbeck (OU) and the square root models.
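The two sources of variation can be illustrated with a small simulation. The sketch below is not the estimation method of the paper. It assumes a one-dimensional OU process (the paper evaluates a two-dimensional one) in which each subject has its own long-run mean drawn from a normal distribution, and it uses a plain Euler-Maruyama discretization. All names and parameter values are hypothetical.

```python
import numpy as np

def simulate_ou_mixed_effects(n_subjects, n_steps, dt, theta, mu, omega, sigma,
                              x0=0.0, seed=0):
    """Simulate dX = -theta * (X - mu_i) dt + sigma dW for each subject i.

    Between-subject variability: mu_i ~ N(mu, omega^2) (the random effect).
    Within-subject variability: the Brownian noise term scaled by sigma.
    """
    rng = np.random.default_rng(seed)
    subject_means = rng.normal(mu, omega, size=n_subjects)

    paths = np.empty((n_subjects, n_steps + 1))
    paths[:, 0] = x0
    for k in range(n_steps):
        # Euler-Maruyama step: deterministic drift plus Gaussian increment.
        drift = -theta * (paths[:, k] - subject_means)
        noise = sigma * np.sqrt(dt) * rng.standard_normal(n_subjects)
        paths[:, k + 1] = paths[:, k] + drift * dt + noise
    return paths, subject_means
```

Setting `sigma=0` removes the within-subject noise, so each path relaxes to its own subject-specific mean; setting `omega=0` removes the between-subject variation instead.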
Dealing with biased data samples is a common task across many statistical fields. In survey sampling, bias often occurs due to unrepresentative samples. In causal studies with observational data, the assignment to treated versus untreated groups is often correlated with covariates, i.e., not random. Empirical calibration is a generic weighting method that presents a unified view on correcting or reducing the data biases for the tasks mentioned above. We provide a Python library EC to compute the empirical calibration weights. The problem is formulated as a convex optimization and solved efficiently in the dual form. Compared to existing software, EC is both more efficient and more robust. EC also accommodates different optimization objectives, supports weight clipping, and allows inexact calibration, which improves usability. We demonstrate its usage across various experiments with both simulated and real-world data.
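The dual formulation can be sketched for one common objective. The example below is an illustration under stated assumptions and does not use or reproduce the EC library's API. It assumes an entropy objective with exact calibration, for which the weights take the form w_i proportional to exp(lambda . (x_i - target)) and the dual problem is the minimization of a log-sum-exp function of lambda. The function name is hypothetical, and the solver is a plain Newton iteration with step halving.

```python
import numpy as np

def entropy_calibration_weights(covariates, target_means, max_iter=100, tol=1e-8):
    """Weights summing to 1 whose weighted covariate means match target_means.

    Assumption: entropy objective, exact calibration, solved in the dual.
    """
    x = np.asarray(covariates, dtype=float)
    centered = x - np.asarray(target_means, dtype=float)

    def dual(lam):
        # Numerically stable log-sum-exp of the centered linear scores.
        z = centered @ lam
        return z.max() + np.log(np.exp(z - z.max()).sum())

    def weights(lam):
        z = centered @ lam
        w = np.exp(z - z.max())
        return w / w.sum()

    lam = np.zeros(x.shape[1])
    for _ in range(max_iter):
        w = weights(lam)
        grad = w @ centered  # weighted mean minus target
        if np.linalg.norm(grad) < tol:
            break
        # Hessian of the dual: weighted covariance of the centered covariates.
        hess = (centered * w[:, None]).T @ centered - np.outer(grad, grad)
        step = np.linalg.solve(hess, grad)
        # Halve the step until the dual objective does not increase.
        t, f0 = 1.0, dual(lam)
        while dual(lam - t * step) > f0 + 1e-12 and t > 1e-8:
            t *= 0.5
        lam = lam - t * step
    return weights(lam)
```

The dual has one variable per calibrated covariate rather than one per sample, which is why solving it is cheap even for large samples. The sketch requires the target to lie strictly inside the convex hull of the covariate rows; otherwise no exact solution exists.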
Health economic evaluations often require predictions of survival rates beyond the follow-up period. Parametric survival models can be more convenient for economic modeling than the Cox model. The generalized gamma (GG) and generalized F (GF) distributions are extensive families that contain almost all commonly used distributions with various hazard shapes and arbitrary complexity. In this study, we present a new SAS macro for implementing a wide variety of flexible parametric models, including the GG and GF distributions and their special cases, as well as the Gompertz distribution. Proper custom distributions are also supported. Unlike existing SAS procedures, this macro supports regression not only on the location parameter but also on ancillary parameters, which greatly increases model flexibility. In addition, the SAS macro supports weighted regression, stratified regression, and robust inference. This study demonstrates with several examples how the SAS macro can be used for flexible survival modeling and extrapolation.
We present a joint copula-based model for insurance claims and sizes. It uses bivariate copulae to accommodate the dependence between these quantities. We derive the general distribution of the policy loss without the restrictive assumption of independence. We illustrate that this distribution tends to be skewed and multi-modal, and that an independence assumption can lead to substantial bias in the estimation of the policy loss. Further, we extend our framework to regression models by combining marginal generalized linear models with a copula. We show that this approach leads to a flexible class of models, and that the parameters can be estimated efficiently using maximum likelihood. We propose a test procedure for the selection of the optimal copula family. The usefulness of our approach is illustrated in a simulation study and in an analysis of car insurance policies.
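The effect of dependence on the policy loss can be illustrated by simulation. The sketch below is not the model of the paper. It assumes a Gaussian copula, a Poisson number of claims, a gamma-distributed average claim size, and a policy loss equal to the number of claims times the average claim size. The function name and all parameter values are hypothetical.

```python
import numpy as np
from scipy import stats

def simulate_policy_loss(rho, n_sims=50_000, claim_rate=2.0,
                         size_shape=2.0, size_scale=500.0, seed=0):
    """Simulate policy losses with copula dependence between count and size.

    Assumptions: Gaussian copula with correlation rho, Poisson claim counts,
    gamma average claim sizes, loss = count * average size.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])

    # Correlated normals mapped to uniforms give a Gaussian copula sample.
    z = rng.multivariate_normal(np.zeros(2), cov, size=n_sims)
    u = stats.norm.cdf(z)

    # Inverse-CDF transforms impose the assumed marginal distributions.
    counts = stats.poisson.ppf(u[:, 0], claim_rate)
    avg_size = stats.gamma.ppf(u[:, 1], size_shape, scale=size_scale)
    return counts * avg_size
```

Comparing `simulate_policy_loss(0.7).mean()` with `simulate_policy_loss(0.0).mean()` shows how positive dependence raises the expected loss relative to the independence case, since the expected product then includes a positive covariance term.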