Intelligent agents should be able to learn useful representations by observing changes in their environment. We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation. First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. Third, we perform a large-scale empirical study and show that such pairs of observations are sufficient to reliably learn disentangled representations on several benchmark data sets. Finally, we evaluate our learned representations and find that they are simultaneously useful on a diverse suite of tasks, including generalization under covariate shifts, fairness, and abstract reasoning. Overall, our results demonstrate that weak supervision enables learning of useful disentangled representations in realistic scenarios.
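A minimal sketch of the weak supervision setup described above, under assumptions not stated in the abstract: a toy discrete factor space and a hypothetical renderer that would map factor vectors to images. It only illustrates how a pair differing in exactly k factors can be constructed when k is known but the changed factors are not.

```python
import numpy as np

rng = np.random.default_rng(0)
num_factors, factor_values = 5, 10  # hypothetical discrete factor space

def sample_pair(k):
    """Sample a pair of factor vectors that differ in exactly k factors."""
    z1 = rng.integers(0, factor_values, size=num_factors)
    z2 = z1.copy()
    changed = rng.choice(num_factors, size=k, replace=False)
    for i in changed:
        new = rng.integers(0, factor_values)
        while new == z1[i]:          # resample until the factor actually changes
            new = rng.integers(0, factor_values)
        z2[i] = new
    return z1, z2                    # a renderer would map these to the image pair (x1, x2)

z1, z2 = sample_pair(k=2)
print(z1, z2, (z1 != z2).sum())      # exactly 2 factors differ
```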
We propose a simple but efficient method, termed Guided Learning, for weakly-labeled semi-supervised sound event detection (SED). Weakly-labeled SED implies two sub-targets: audio tagging and boundary detection. Instead of designing a single model that trades off the two sub-targets, we design a teacher model aimed at audio tagging to guide a student model aimed at boundary detection in learning from unlabeled data. The guidance is ensured by the audio tagging performance gap between the two models. Meanwhile, the student model, freed from the trade-off, can produce better boundary detection results. We propose a principle for designing the two models based on the relation between the temporal compression scale and the two sub-targets. We also propose an end-to-end semi-supervised learning process that allows the abilities of the two models to improve in alternation. Experiments on the DCASE2018 Task 4 dataset show that our approach achieves competitive performance.
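A minimal sketch (PyTorch; the layer sizes and pooling factors are assumed, not taken from the paper) of the teacher/student idea: the teacher compresses time aggressively and targets clip-level tagging, the student keeps a fine temporal resolution for boundary detection, and on unlabeled clips the teacher's clip-level predictions supervise the student's aggregated output.

```python
import torch
import torch.nn as nn

n_mels, n_classes = 64, 10

class Teacher(nn.Module):           # coarse temporal resolution -> audio tagging
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(n_mels, 128, 3, padding=1),
                                  nn.ReLU(), nn.MaxPool1d(8))   # strong temporal compression
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                        # x: (batch, n_mels, frames)
        h = self.conv(x).mean(dim=-1)            # global pooling over time
        return torch.sigmoid(self.head(h))       # clip-level tag probabilities

class Student(nn.Module):           # fine temporal resolution -> boundary detection
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(n_mels, 128, 3, padding=1), nn.ReLU())
        self.head = nn.Conv1d(128, n_classes, 1)

    def forward(self, x):
        frame_probs = torch.sigmoid(self.head(self.conv(x)))   # (batch, classes, frames)
        clip_probs = frame_probs.mean(dim=-1)                   # aggregate for the tagging loss
        return frame_probs, clip_probs

teacher, student = Teacher(), Student()
unlabeled = torch.randn(4, n_mels, 400)
with torch.no_grad():
    pseudo_tags = (teacher(unlabeled) > 0.5).float()            # teacher guidance on unlabeled data
_, clip_probs = student(unlabeled)
unsup_loss = nn.functional.binary_cross_entropy(clip_probs, pseudo_tags)
```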
Generative modeling of 3D shapes has become an important problem due to its relevance to many applications across Computer Vision, Graphics, and VR. In this paper, we build upon recently introduced 3D mesh-convolutional Variational Autoencoders, which have shown great promise for learning rich representations of deformable 3D shapes. We introduce a supervised generative 3D mesh model that disentangles the latent shape representation into independent generative factors. Our extensive experimental analysis shows that learning an explicitly disentangled representation can both improve random shape generation and successfully address downstream tasks such as pose and shape transfer, shape-invariant temporal synchronization, and pose-invariant shape matching.
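A minimal sketch (PyTorch) of the latent-space idea only: plain MLPs stand in for the paper's mesh convolutions, and the split into "shape" and "pose" blocks is an assumed example of independent generative factors. It shows how transfer can be performed by swapping the corresponding latent block between two meshes.

```python
import torch
import torch.nn as nn

n_vertices, z_shape, z_pose = 500, 8, 8    # hypothetical sizes

encoder = nn.Sequential(nn.Flatten(), nn.Linear(n_vertices * 3, 256), nn.ReLU(),
                        nn.Linear(256, z_shape + z_pose))
decoder = nn.Sequential(nn.Linear(z_shape + z_pose, 256), nn.ReLU(),
                        nn.Linear(256, n_vertices * 3))

mesh_a, mesh_b = torch.randn(1, n_vertices, 3), torch.randn(1, n_vertices, 3)
za, zb = encoder(mesh_a), encoder(mesh_b)
# pose transfer: keep mesh A's shape code, take mesh B's pose code
z_transfer = torch.cat([za[:, :z_shape], zb[:, z_shape:]], dim=1)
transferred = decoder(z_transfer).view(1, n_vertices, 3)
```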
We introduce supervised feature ranking and feature subset selection algorithms for multivariate time series (MTS) classification. Unlike most existing supervised and unsupervised feature selection algorithms for MTS, our techniques do not require a feature extraction step to generate a one-dimensional feature vector from the time series. Instead, they directly compute similarities between individual time series and assess how well the resulting cluster structure matches the labels. The techniques are amenable to heterogeneous MTS data, where the time series measurements may have different sampling resolutions, and to multi-modal data.
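A minimal sketch (NumPy, SciPy, scikit-learn) of the ranking idea, under assumptions the abstract does not fix: equal-length series and plain Euclidean distance stand in for a resolution-aware similarity such as DTW. Each variable is scored by how well clustering on its pairwise similarities recovers the class labels, with no feature extraction step.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_rand_score

def rank_variables(X, y):
    """X: (n_samples, n_variables, n_timesteps); y: class labels."""
    n_classes = len(np.unique(y))
    scores = []
    for v in range(X.shape[1]):
        d = pdist(X[:, v, :])                            # pairwise distances between series
        clusters = fcluster(linkage(d, method="average"),
                            t=n_classes, criterion="maxclust")
        scores.append(adjusted_rand_score(y, clusters))  # cluster/label agreement
    scores = np.array(scores)
    return np.argsort(scores)[::-1], scores              # best variables first

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 4, 100))
X[:30, 0, :] += 2.0                                      # variable 0 separates the classes
y = np.array([0] * 30 + [1] * 30)
order, scores = rank_variables(X, y)
print(order, np.round(scores, 2))
```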
Action detection and temporal segmentation of actions in videos are topics of increasing interest. While fully supervised systems have gained much attention lately, full annotation of each action within the video is costly and impractical for large amounts of video data. Thus, weakly supervised action detection and temporal segmentation methods are of great importance. While most works in this area assume that an ordered sequence of the occurring actions is given, our approach only uses a set of actions. Such action sets provide much less supervision, since neither the action ordering nor the number of action occurrences is known. In exchange, they can be easily obtained, for instance from meta-tags, while ordered sequences still require human annotation. We introduce a system that automatically learns to temporally segment and label actions in a video, where the only supervision used is action sets. An evaluation on three datasets shows that our method still achieves good results, although the amount of supervision is significantly smaller than for other related methods.
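A minimal sketch (NumPy) of the weak supervision constraint only: a random frame-score matrix stands in for a trained model, and simple masking plus argmax stands in for the paper's full learning procedure. It shows how an unordered action set restricts the per-frame labels and how a segmentation follows from contiguous runs of equal labels.

```python
import numpy as np

classes = ["background", "pour", "stir", "cut", "serve"]   # hypothetical label space
action_set = {"background", "pour", "stir"}                # weak supervision for this video

frame_scores = np.random.rand(300, len(classes))           # (frames, classes), e.g. model outputs
mask = np.array([c in action_set for c in classes])
frame_scores[:, ~mask] = -np.inf                           # forbid actions outside the set
labels = frame_scores.argmax(axis=1)                       # frame-wise labeling under the constraint

# contiguous runs of equal labels give the temporal segmentation
boundaries = np.flatnonzero(np.diff(labels)) + 1
segments = np.split(labels, boundaries)
print(len(segments), "segments")
```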
We consider the problem of extracting semantic attributes, supervised only with classification labels. For example, when learning to classify images of birds into species, we would like to observe the emergence of the features that zoologists use to classify birds. To tackle this problem, we propose training a neural network with discrete features in the last layer, which is followed by two heads: a multi-layered perceptron (MLP) and a decision tree. Since decision trees use simple binary decision stumps, we expect these discrete features to acquire semantic meaning. We present a theoretical analysis as well as a practical method for learning in the intersection of two hypothesis classes. Our results on multiple benchmarks show an improved ability to extract a set of features that are highly correlated with the set of unseen attributes.
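A minimal sketch (PyTorch plus scikit-learn; the layer sizes and the straight-through binarization are assumed details) of a backbone that emits discrete features followed by the two heads: an MLP trained end to end and a decision tree fit on the same binary codes, which is what pushes the codes toward stump-friendly, attribute-like features.

```python
import torch
import torch.nn as nn
from sklearn.tree import DecisionTreeClassifier

n_features, n_discrete, n_classes = 32, 16, 5

backbone = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_discrete))
mlp_head = nn.Sequential(nn.Linear(n_discrete, 32), nn.ReLU(), nn.Linear(32, n_classes))

def discretize(logits):
    """Binarize with a straight-through estimator so gradients still flow."""
    probs = torch.sigmoid(logits)
    hard = (probs > 0.5).float()
    return hard + probs - probs.detach()

x = torch.randn(128, n_features)
y = torch.randint(0, n_classes, (128,))

codes = discretize(backbone(x))                            # discrete features in the last layer
loss = nn.functional.cross_entropy(mlp_head(codes), y)     # MLP head, trained end to end
loss.backward()

tree_head = DecisionTreeClassifier(max_depth=4)            # tree head fit on the binary codes
tree_head.fit(codes.detach().numpy(), y.numpy())
```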