
OCID-Ref: A 3D Robotic Dataset With Embodied Language For Clutter Scene Grounding


Publication date: 2021
Research language: English
Created by Shamra Editor





To effectively deploy robots in working environments and assist humans, it is essential to develop and evaluate how visual grounding (VG) affects machine performance on occluded objects. However, current VG works are limited in working environments such as offices and warehouses, where objects are usually occluded due to space utilization issues. In our work, we propose the novel OCID-Ref dataset, featuring a referring expression segmentation task with referring expressions of occluded objects. OCID-Ref consists of 305,694 referring expressions from 2,300 scenes, providing both RGB image and point cloud inputs. To resolve challenging occlusion issues, we argue that it is crucial to take advantage of both 2D and 3D signals. Our experimental results demonstrate the effectiveness of aggregating 2D and 3D signals, but referring to occluded objects remains challenging for modern visual grounding systems. OCID-Ref is publicly available at https://github.com/lluma/OCID-Ref
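The abstract does not spell out the evaluation protocol, but referring expression segmentation is typically scored by intersection-over-union between a predicted mask and the ground-truth mask of the referred object. As a minimal sketch of that standard metric (the function name and toy masks below are illustrative, not taken from the OCID-Ref codebase):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean segmentation masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 0.0

# Toy 4x4 masks standing in for a referred-object prediction and ground truth.
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True  # 4 pixels
gt = np.zeros((4, 4), dtype=bool); gt[1:3, 1:4] = True      # 6 pixels
print(mask_iou(pred, gt))  # intersection 4, union 6 -> 0.666...
```

A benchmark would aggregate this per-expression score over the dataset (e.g. mean IoU, or accuracy at an IoU threshold).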




Related research

This paper introduces a new video-and-language dataset with human actions for multimodal logical inference, which focuses on intentional and aspectual expressions that describe dynamic human actions. The dataset consists of 200 videos, 5,554 action labels, and 1,942 action triplets of the form (subject, predicate, object) that can be easily translated into logical semantic representations. The dataset is expected to be useful for evaluating multimodal inference systems between videos and semantically complicated sentences including negation and quantification.
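The abstract says the (subject, predicate, object) triplets translate easily into logical semantic representations. One common target is a neo-Davidsonian event formula; the sketch below is an illustrative rendering of that idea, not the dataset's actual representation:

```python
def triplet_to_logic(subject: str, predicate: str, obj: str) -> str:
    # Render an action triplet as a neo-Davidsonian formula string:
    # an event variable e, with the participants attached via thematic roles.
    return f"exists e. {predicate}(e) & Agent(e, {subject}) & Theme(e, {obj})"

print(triplet_to_logic("person", "open", "door"))
# exists e. open(e) & Agent(e, person) & Theme(e, door)
```

Representations like this make negation and quantification (mentioned in the abstract) expressible by ordinary logical operators over the formula.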
This paper describes the annotation process of an offensive language data set for Romanian on social media. To facilitate comparable multi-lingual research on offensive language, the annotation guidelines follow some of the recent annotation efforts for other languages. The final corpus contains 5000 micro-blogging posts annotated by a large number of volunteer annotators. The inter-annotator agreement and the initial automatic discrimination results we present are in line with earlier annotation efforts.
People live in various environments, yet they can understand the scenes around them with just a glance. To do this, they rely on their strong ability to process visual data effectively and connect it to broad prior knowledge about what they expect to see. This is not the case for computers, which have not yet reached high levels of scene understanding. Most research treats scene understanding as a standard classification problem, classifying scenes into predefined, limited categories (forest, city, garden), typically with classification or machine learning algorithms; this limits the ability to understand scenes and reduces practical applicability because of the required training phase. Some research tries to use knowledge in ontologies to reach high-level scene understanding, but such work is still limited to specific domains. In this thesis we try to understand scene images without any prior knowledge about their domain. We do not treat this problem as a standard classification problem; instead, we extract high-level concepts from scene images. These concepts not only represent objects in the scene but also reflect the places and events in it. To do this, we develop a novel algorithm named SMHITS. It depends on a semantically rich common-sense knowledge base to extract concepts associated with a primitive group of concepts. To use SMHITS in scene understanding, we also develop a system named ICES. Instead of using a classification or machine learning algorithm, ICES depends on a large dataset of images that is independent of any scene domain. Results show the superiority of SMHITS compared to the current ConceptNet associated-concept extraction algorithm, as it has higher precision and can take advantage of expansion of its knowledge base. Results also show that ICES output concepts are semantically rich.
In this paper we present a prototypical implementation of a pipeline that allows the automatic generation of a German Sign Language avatar from 2D video material. The presentation is accompanied by the source code. We record human pose movements during signing with computer vision models. The joint coordinates of hands and arms are imported as landmarks to control the skeleton of our avatar. From the anatomically independent landmarks, we create another skeleton based on the avatar's skeletal bone architecture to calculate the bone rotation data. This data is then used to control our human 3D avatar. The avatar is displayed on AR glasses and can be placed virtually in the room so that it can be perceived simultaneously with the verbal speaker. In further work, it is to be enhanced with speech recognition and machine translation methods to serve as a sign language interpreter. The prototype has been shown to people in the deaf and hard-of-hearing community to assess its comprehensibility. Problems emerged with the transferred hand rotations: hand gestures were hard to recognize on the avatar due to deformations such as twisted finger meshes.
Reviewing contracts is a time-consuming procedure that incurs large expenses to companies and social inequality to those who cannot afford it. In this work, we propose "document-level natural language inference (NLI) for contracts", a novel, real-world application of NLI that addresses such problems. In this task, a system is given a set of hypotheses (such as "Some obligations of Agreement may survive termination.") and a contract, and it is asked to classify whether each hypothesis is "entailed by", "contradicting to", or "not mentioned by" (neutral to) the contract, as well as identifying "evidence" for the decision as spans in the contract. We annotated and release the largest corpus to date, consisting of 607 annotated contracts. We then show that existing models fail badly on our task and introduce a strong baseline, which (a) models evidence identification as multi-label classification over spans instead of trying to predict start and end tokens, and (b) employs more sophisticated context segmentation for dealing with long documents. We also show that linguistic characteristics of contracts, such as negations by exceptions, contribute to the difficulty of this task and that there is much room for improvement.
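The baseline's key design choice is framing evidence identification as multi-label classification over candidate spans rather than predicting a single start/end token pair, so any number of spans can be selected at once. A minimal sketch of that selection step (the scores and threshold below are hypothetical stand-ins for a model's sigmoid outputs):

```python
def select_evidence_spans(span_scores, threshold=0.5):
    """Multi-label span selection: every span whose score clears the
    threshold is returned, unlike start/end prediction, which commits
    to exactly one contiguous span."""
    return [i for i, s in enumerate(span_scores) if s >= threshold]

# Hypothetical per-span sigmoid scores for four candidate contract spans.
scores = [0.9, 0.2, 0.7, 0.4]
print(select_evidence_spans(scores))  # [0, 2]
```

This naturally handles hypotheses whose evidence is scattered across several non-adjacent clauses, which a single start/end prediction cannot represent.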
