No Arabic abstract
Residual Networks (ResNets) have become state-of-the-art models in deep learning and several theoretical studies have been devoted to understanding why ResNet works so well. One attractive viewpoint on ResNet is that it is optimizing the risk in a functional space by combining an ensemble of effective features. In this paper, we adopt this viewpoint to construct a new gradient boosting method, which is known to be very powerful in data analysis. To do so, we formalize the gradient boosting perspective of ResNet mathematically using the notion of functional gradients and propose a new method called ResFGB for classification tasks by leveraging ResNet perception. Two types of generalization guarantees are provided from the optimization perspective: one is the margin bound and the other is the expected risk bound by the sample-splitting technique. Experimental results show superior performance of the proposed method over state-of-the-art methods such as LightGBM.
Automatic machine learning performs predictive modeling with high performing machine learning tools without human interference. This is achieved by making machine learning applications parameter-free, i.e. only a dataset is provided while the complete model selection and model building process is handled internally through (often meta) optimization. Projects like Auto-WEKA and auto-sklearn aim to solve the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem resulting in huge configuration spaces. However, for most real-world applications, the optimization over only a few different key learning algorithms can not only be sufficient, but also potentially beneficial. The latter becomes apparent when one considers that models have to be validated, explained, deployed and maintained. Here, less complex model are often preferred, for validation or efficiency reasons, or even a strict requirement. Automatic gradient boosting simplifies this idea one step further, using only gradient boosting as a single learning algorithm in combination with model-based hyperparameter tuning, threshold optimization and encoding of categorical features. We introduce this general framework as well as a concrete implementation called autoxgboost. It is compared to current AutoML projects on 16 datasets and despite its simplicity is able to achieve comparable results on about half of the datasets as well as performing best on two.
In this paper, we initiate a study of functional minimization in Federated Learning. First, in the semi-heterogeneous setting, when the marginal distributions of the feature vectors on client machines are identical, we develop the federated functional gradient boosting (FFGB) method that provably converges to the global minimum. Subsequently, we extend our results to the fully-heterogeneous setting (where marginal distributions of feature vectors may differ) by designing an efficient variant of FFGB called FFGB.C, with provable convergence to a neighborhood of the global minimum within a radius that depends on the total variation distances between the client feature distributions. For the special case of square loss, but still in the fully heterogeneous setting, we design the FFGB.L method that also enjoys provable convergence to a neighborhood of the global minimum but within a radius depending on the much tighter Wasserstein-1 distances. For both FFGB.C and FFGB.L, the radii of convergence shrink to zero as the feature distributions become more homogeneous. Finally, we conduct proof-of-concept experiments to demonstrate the benefits of our approach against natural baselines.
In this paper, we propose a density estimation algorithm called textit{Gradient Boosting Histogram Transform} (GBHT), where we adopt the textit{Negative Log Likelihood} as the loss function to make the boosting procedure available for the unsupervised tasks. From a learning theory viewpoint, we first prove fast convergence rates for GBHT with the smoothness assumption that the underlying density function lies in the space $C^{0,alpha}$. Then when the target density function lies in spaces $C^{1,alpha}$, we present an upper bound for GBHT which is smaller than the lower bound of its corresponding base learner, in the sense of convergence rates. To the best of our knowledge, we make the first attempt to theoretically explain why boosting can enhance the performance of its base learners for density estimation problems. In experiments, we not only conduct performance comparisons with the widely used KDE, but also apply GBHT to anomaly detection to showcase a further application of GBHT.
Predictive modeling based on genomic data has gained popularity in biomedical research and clinical practice by allowing researchers and clinicians to identify biomarkers and tailor treatment decisions more efficiently. Analysis incorporating pathway information can boost discovery power and better connect new findings with biological mechanisms. In this article, we propose a general framework, Pathway-based Kernel Boosting (PKB), which incorporates clinical information and prior knowledge about pathways for prediction of binary, continuous and survival outcomes. We introduce appropriate loss functions and optimization procedures for different outcome types. Our prediction algorithm incorporates pathway knowledge by constructing kernel function spaces from the pathways and use them as base learners in the boosting procedure. Through extensive simulations and case studies in drug response and cancer survival datasets, we demonstrate that PKB can substantially outperform other competing methods, better identify biological pathways related to drug response and patient survival, and provide novel insights into cancer pathogenesis and treatment response.
Discovery of causal relationships from observational data is an important problem in many areas. Several recent results have established the identifiability of causal DAGs with non-Gaussian and/or nonlinear structural equation models (SEMs). In this paper, we focus on nonlinear SEMs defined by non-invertible functions, which exist in many data domains, and propose a novel test for non-invertible bivariate causal models. We further develop a method to incorporate this test in structure learning of DAGs that contain both linear and nonlinear causal relations. By extensive numerical comparisons, we show that our algorithms outperform existing DAG learning methods in identifying causal graphical structures. We illustrate the practical application of our method in learning causal networks for combinatorial binding of transcription factors from ChIP-Seq data.