ﻻ يوجد ملخص باللغة العربية
Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.
Automatic affect recognition is a challenging task due to the various modalities emotions can be expressed with. Applications can be found in many domains including multimedia retrieval and human computer interaction. In recent years, deep neural net
Image representations, from SIFT and bag of visual words to Convolutional Neural Networks (CNNs) are a crucial component of almost all computer vision systems. However, our understanding of them remains limited. In this paper we study several landmar
The aim of this study is developing an automatic system for detection of gait-related health problems using Deep Neural Networks (DNNs). The proposed system takes a video of patients as the input and estimates their 3D body pose using a DNN based met
Hierarchical structures exist in both linguistics and Natural Language Processing (NLP) tasks. How to design RNNs to learn hierarchical representations of natural languages remains a long-standing challenge. In this paper, we define two different typ
Understanding spoken language is a highly complex problem, which can be decomposed into several simpler tasks. In this paper, we focus on Spoken Language Understanding (SLU), the module of spoken dialog systems responsible for extracting a semantic i