ﻻ يوجد ملخص باللغة العربية
Causality knowledge is crucial for many artificial intelligence systems. Conventional textual-based causality knowledge acquisition methods typically require laborious and expensive human annotations. As a result, their scale is often limited. Moreover, as no context is provided during the annotation, the resulting causality knowledge records (e.g., ConceptNet) typically do not take the context into consideration. To explore a more scalable way of acquiring causality knowledge, in this paper, we jump out of the textual domain and investigate the possibility of learning contextual causality from the visual signal. Compared with pure text-based approaches, learning causality from the visual signal has the following advantages: (1) Causality knowledge belongs to the commonsense knowledge, which is rarely expressed in the text but rich in videos; (2) Most events in the video are naturally time-ordered, which provides a rich resource for us to mine causality knowledge from; (3) All the objects in the video can be used as context to study the contextual property of causal relations. In detail, we first propose a high-quality dataset Vis-Causal and then conduct experiments to demonstrate that with good language and visual representation models as well as enough training signals, it is possible to automatically discover meaningful causal knowledge from the videos. Further analysis also shows that the contextual property of causal relations indeed exists, taking which into consideration might be crucial if we want to use the causality knowledge in real applications, and the visual signal could serve as a good resource for learning such contextual causality.
Mixed-integer linear programs (MILPs) are widely used in artificial intelligence and operations research to model complex decision problems like scheduling and routing. Designing such programs however requires both domain and modelling expertise. In
Navigating and understanding the real world remains a key challenge in machine learning and inspires a great variety of research in areas such as language grounding, planning, navigation and computer vision. We propose an instruction-following task t
We introduce the Neural State Machine, seeking to bridge the gap between the neural and symbolic views of AI and integrate their complementary strengths for the task of visual reasoning. Given an image, we first predict a probabilistic graph that rep
A machine intelligence pipeline usually consists of six components: problem, representation, model, loss, optimizer and metric. Researchers have worked hard trying to automate many components of the pipeline. However, one key component of the pipelin
Recent research efforts enable study for natural language grounded navigation in photo-realistic environments, e.g., following natural language instructions or dialog. However, existing methods tend to overfit training data in seen environments and f