No Arabic abstract
The effectiveness and performance of artificial neural networks, particularly for visual tasks, depends in crucial ways on the receptive field of neurons. The receptive field itself depends on the interplay between several architectural aspects, including sparsity, pooling, and activation functions. In recent literature there are several ad hoc proposals trying to make receptive fields more flexible and adaptive to data. For instance, different parameterizations of convolutional and pooling layers have been proposed to increase their adaptivity. In this paper, we propose the novel theoretical framework of density-embedded layers, generalizing the transformation represented by a neuron. Specifically, the affine transformation applied on the input is replaced by a scalar product of the input, suitably represented as a piecewise constant function, with a density function associated with the neuron. This density is shown to describe directly the receptive field of the neuron. Crucially, by suitably representing such a density as a linear combination of a parametric family of functions, we can efficiently train the densities by means of any automatic differentiation system, making it adaptable to the problem at hand, and computationally efficient to evaluate. This framework captures and generalizes recent methods, allowing a fine tuning of the receptive field. In the paper, we define some novel layers and we experimentally validate them on the classic MNIST dataset.
In many applications, there is a need to predict the effect of an intervention on different individuals from data. For example, which customers are persuadable by a product promotion? which patients should be treated with a certain type of treatment? These are typical causal questions involving the effect or the change in outcomes made by an intervention. The questions cannot be answered with traditional classification methods as they only use associations to predict outcomes. For personalised marketing, these questions are often answered with uplift modelling. The objective of uplift modelling is to estimate causal effect, but its literature does not discuss when the uplift represents causal effect. Causal heterogeneity modelling can solve the problem, but its assumption of unconfoundedness is untestable in data. So practitioners need guidelines in their applications when using the methods. In this paper, we use causal classification for a set of personalised decision making problems, and differentiate it from classification. We discuss the conditions when causal classification can be resolved by uplift (and causal heterogeneity) modelling methods. We also propose a general framework for causal classification, by using off-the-shelf supervised methods for flexible implementations. Experiments have shown two instantiations of the framework work for causal classification and for uplift (causal heterogeneity) modelling, and are competitive with the other uplift (causal heterogeneity) modelling methods.
Time Delay Neural Networks (TDNNs) are widely used in both DNN-HMM based hybrid speech recognition systems and recent end-to-end systems. Nevertheless, the receptive fields of TDNNs are limited and fixed, which is not desirable for tasks like speech recognition, where the temporal dynamics of speech are varied and affected by many factors. This paper proposes to use deformable TDNNs for adaptive temporal dynamics modeling in end-to-end speech recognition. Inspired by deformable ConvNets, deformable TDNNs augment the temporal sampling locations with additional offsets and learn the offsets automatically based on the ASR criterion, without additional supervision. Experiments show that deformable TDNNs obtain state-of-the-art results on WSJ benchmarks (1.42%/3.45% WER on WSJ eval92/dev93 respectively), outperforming standard TDNNs significantly. Furthermore, we propose the latency control mechanism for deformable TDNNs, which enables deformable TDNNs to do streaming ASR without accuracy degradation.
As one type of efficient unsupervised learning methods, clustering algorithms have been widely used in data mining and knowledge discovery with noticeable advantages. However, clustering algorithms based on density peak have limited clustering effect on data with varying density distribution (VDD), equilibrium distribution (ED), and multiple domain-density maximums (MDDM), leading to the problems of sparse cluster loss and cluster fragmentation. To address these problems, we propose a Domain-Adaptive Density Clustering (DADC) algorithm, which consists of three steps: domain-adaptive density measurement, cluster center self-identification, and cluster self-ensemble. For data with VDD features, clusters in sparse regions are often neglected by using uniform density peak thresholds, which results in the loss of sparse clusters. We define a domain-adaptive density measurement method based on K-Nearest Neighbors (KNN) to adaptively detect the density peaks of different density regions. We treat each data point and its KNN neighborhood as a subgroup to better reflect its density distribution in a domain view. In addition, for data with ED or MDDM features, a large number of density peaks with similar values can be identified, which results in cluster fragmentation. We propose a cluster center self-identification and cluster self-ensemble method to automatically extract the initial cluster centers and merge the fragmented clusters. Experimental results demonstrate that compared with other comparative algorithms, the proposed DADC algorithm can obtain more reasonable clustering results on data with VDD, ED and MDDM features. Benefitting from a few parameter requirements and non-iterative nature, DADC achieves low computational complexity and is suitable for large-scale data clustering.
How can we efficiently gather information to optimize an unknown function, when presented with multiple, mutually dependent information sources with different costs? For example, when optimizing a robotic system, intelligently trading off computer simulations and real robot testings can lead to significant savings. Existing methods, such as multi-fidelity GP-UCB or Entropy Search-based approaches, either make simplistic assumptions on the interaction among different fidelities or use simple heuristics that lack theoretical guarantees. In this paper, we study multi-fidelity Bayesian optimization with complex structural dependencies among multiple outputs, and propose MF-MI-Greedy, a principled algorithmic framework for addressing this problem. In particular, we model different fidelities using additive Gaussian processes based on shared latent structures with the target function. Then we use cost-sensitive mutual information gain for efficient Bayesian global optimization. We propose a simple notion of regret which incorporates the cost of different fidelities, and prove that MF-MI-Greedy achieves low regret. We demonstrate the strong empirical performance of our algorithm on both synthetic and real-world datasets.
In deep neural nets, lower level embedding layers account for a large portion of the total number of parameters. Tikhonov regularization, graph-based regularization, and hard parameter sharing are approaches that introduce explicit biases into training in a hope to reduce statistical complexity. Alternatively, we propose stochastically shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). Because SSE integrates seamlessly with existing SGD algorithms, it can be used with only minor modifications when training large scale neural networks. We develop tw