No Arabic abstract
A generalized additive model (GAM, Hastie and Tibshirani (1987)) is a nonparametric model by the sum of univariate functions with respect to each explanatory variable, i.e., $f({mathbf x}) = sum f_j(x_j)$, where $x_jinmathbb{R}$ is $j$-th component of a sample ${mathbf x}in mathbb{R}^p$. In this paper, we introduce the total variation (TV) of a function as a measure of the complexity of functions in $L^1_{rm c}(mathbb{R})$-space. Our analysis shows that a GAM based on TV-regularization exhibits a Rademacher complexity of $O(sqrt{frac{log p}{m}})$, which is tight in terms of both $m$ and $p$ in the agnostic case of the classification problem. In result, we obtain generalization error bounds for finite samples according to work by Bartlett and Mandelson (2002).
Machine Learning algorithms are increasingly being used in recent years due to their flexibility in model fitting and increased predictive performance. However, the complexity of the models makes them hard for the data analyst to interpret the results and explain them without additional tools. This has led to much research in developing various approaches to understand the model behavior. In this paper, we present the Explainable Neural Network (xNN), a structured neural network designed especially to learn interpretable features. Unlike fully connected neural networks, the features engineered by the xNN can be extracted from the network in a relatively straightforward manner and the results displayed. With appropriate regularization, the xNN provides a parsimonious explanation of the relationship between the features and the output. We illustrate this interpretable feature--engineering property on simulated examples.
Additive models form a widely popular class of regression models which represent the relation between covariates and response variables as the sum of low-dimensional transfer functions. Besides flexibility and accuracy, a key benefit of these models is their interpretability: the transfer functions provide visual means for inspecting the models and identifying domain-specific relations between inputs and outputs. However, in large-scale problems involving the prediction of many related tasks, learning independently additive models results in a loss of model interpretability, and can cause overfitting when training data is scarce. We introduce a novel multi-task learning approach which provides a corpus of accurate and interpretable additive models for a large number of related forecasting tasks. Our key idea is to share transfer functions across models in order to reduce the model complexity and ease the exploration of the corpus. We establish a connection with sparse dictionary learning and propose a new efficient fitting algorithm which alternates between sparse coding and transfer function updates. The former step is solved via an extension of Orthogonal Matching Pursuit, whose properties are analyzed using a novel recovery condition which extends existing results in the literature. The latter step is addressed using a traditional dictionary update rule. Experiments on real-world data demonstrate that our approach compares favorably to baseline methods while yielding an interpretable corpus of models, revealing structure among the individual tasks and being more robust when training data is scarce. Our framework therefore extends the well-known benefits of additive models to common regression settings possibly involving thousands of tasks.
Data clustering is a fundamental problem with a wide range of applications. Standard methods, eg the $k$-means method, usually require solving a non-convex optimization problem. Recently, total variation based convex relaxation to the $k$-means model has emerged as an attractive alternative for data clustering. However, the existing results on its exact clustering property, ie, the condition imposed on data so that the method can provably give correct identification of all cluster memberships, is only applicable to very specific data and is also much more restrictive than that of some other methods. This paper aims at the revisit of total variation based convex clustering, by proposing a weighted sum-of-$ell_1$-norm relating convex model. Its exact clustering property established in this paper, in both deterministic and probabilistic context, is applicable to general data and is much sharper than the existing results. These results provided good insights to advance the research on convex clustering. Moreover, the experiments also demonstrated that the proposed convex model has better empirical performance when be compared to standard clustering methods, and thus it can see its potential in practice.
We consider total variation minimization for manifold valued data. We propose a cyclic proximal point algorithm and a parallel proximal point algorithm to minimize TV functionals with $ell^p$-type data terms in the manifold case. These algorithms are based on iterative geodesic averaging which makes them easily applicable to a large class of data manifolds. As an application, we consider denoising images which take their values in a manifold. We apply our algorithms to diffusion tensor images, interferometric SAR images as well as sphere and cylinder valued images. For the class of Cartan-Hadamard manifolds (which includes the data space in diffusion tensor imaging) we show the convergence of the proposed TV minimizing algorithms to a global minimizer.
Recently the shape-restricted inference has gained popularity in statistical and econometric literature in order to relax the linear or quadratic covariate effect in regression analyses. The typical shape-restricted covariate effect includes monotonic increasing, decreasing, convexity or concavity. In this paper, we introduce the shape-restricted inference to the celebrated Cox regression model (SR-Cox), in which the covariate response is modeled as shape-restricted additive functions. The SR-Cox regression approximates the shape-restricted functions using a spline basis expansion with data driven choice of knots. The underlying minimization of negative log-likelihood function is formulated as a convex optimization problem, which is solved with an active-set optimization algorithm. The highlight of this algorithm is that it eliminates the superfluous knots automatically. When covariate effects include combinations of convex or concave terms with unknown forms and linear terms, the most interesting finding is that SR-Cox produces accurate linear covariate effect estimates which are comparable to the maximum partial likelihood estimates if indeed the forms are known. We conclude that concave or convex SR-Cox models could significantly improve nonlinear covariate response recovery and model goodness of fit.