Commonsense is defined as the knowledge on which everyone agrees. However, certain types of commonsense knowledge are correlated with culture and geographic location and are only shared locally. For example, the scenes of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art vision-and-language models, VisualBERT and ViLBERT, trained on VCR, a standard benchmark whose images come primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models for non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than that for the Western region. We analyze the reasons behind this disparity and find that the performance gap is larger on QA pairs that: 1) concern culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. The dataset and code are released at https://github.com/WadeYin9712/GD-VCR.
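A minimal sketch of the per-region evaluation described above, assuming each prediction record carries a region label and the model's chosen answer index. The file name and field names ("region", "prediction", "answer_label") are illustrative assumptions, not the dataset's actual schema.

```python
# Group GD-VCR predictions by region and compute per-region accuracy,
# so the Western vs. non-Western gap can be measured directly.
import json
from collections import defaultdict

def accuracy_by_region(pred_path):
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(pred_path) as f:
        for line in f:
            rec = json.loads(line)   # e.g. {"region": "East Asia", ...} (hypothetical schema)
            total[rec["region"]] += 1
            correct[rec["region"]] += int(rec["prediction"] == rec["answer_label"])
    return {r: correct[r] / total[r] for r in total}

# Example: compare Western accuracy against the lowest-scoring region.
# scores = accuracy_by_region("visualbert_gdvcr_preds.jsonl")
# gap = scores["West"] - min(scores.values())
```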
This paper presents a study that compares the non-manual markers of polar and wh-questions with those of statements in Kazakh-Russian Sign Language (KRSL), using a dataset collected for NLP tasks. The primary focus of the study is to demonstrate the utility of computer vision solutions for the linguistic analysis of non-manuals in sign languages, although additional corrections are required to account for biases in their output. To this end, we analyze recordings of 10 triplets of sentences produced by 9 native signers, using both manual annotation and computer vision solutions (such as OpenFace). We utilize and improve the computer vision solution, and briefly describe the results of the linguistic analysis.
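A minimal sketch of how OpenFace's per-frame CSV output could be summarized for such an analysis, assuming the OpenFace 2.x schema (action-unit intensities such as AU01_r/AU04_r and head-pose angles such as pose_Rx). The choice of features and the per-recording averaging are illustrative, not the paper's exact pipeline.

```python
# Summarize brow activity and head pose from one OpenFace output CSV,
# keeping only frames where the face was successfully tracked.
import pandas as pd

def summarize_nonmanuals(csv_path):
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.strip()    # OpenFace pads column names with spaces
    df = df[df["success"] == 1]            # drop frames without a tracked face
    return {
        "brow_raise": df[["AU01_r", "AU02_r"]].mean().mean(),  # inner/outer brow raiser
        "brow_lower": df["AU04_r"].mean(),                     # brow lowerer
        "head_pitch": df["pose_Rx"].mean(),                    # forward/backward tilt
    }

# Example: contrast a polar question with its matching statement
# (file names are hypothetical).
# q = summarize_nonmanuals("signer3_polar_q.csv")
# s = summarize_nonmanuals("signer3_statement.csv")
```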
This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.
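A minimal sketch of the text-to-video retrieval setup, assuming queries and video clips have already been encoded into a shared multilingual embedding space (the encoders themselves are the paper's contribution and are stubbed out here with random stand-ins).

```python
# Rank videos by cosine similarity to each text query and report Recall@k,
# the standard retrieval metric for this task.
import numpy as np

def recall_at_k(text_emb, video_emb, k=10):
    """text_emb[i] and video_emb[i] are the matching pair; both L2-normalized."""
    sims = text_emb @ video_emb.T            # cosine similarity matrix
    ranks = (-sims).argsort(axis=1)          # best-scoring videos first
    hits = [i in ranks[i, :k] for i in range(len(text_emb))]
    return float(np.mean(hits))

# Illustrative usage with random embeddings in place of the learned ones:
# t = np.random.randn(100, 512); t /= np.linalg.norm(t, axis=1, keepdims=True)
# v = np.random.randn(100, 512); v /= np.linalg.norm(v, axis=1, keepdims=True)
# print(recall_at_k(t, v, k=10))
```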
Vision-language navigation requires an agent to navigate through a 3D environment based on natural language instructions. One key challenge in this task is grounding instructions in the visual information that the agent currently perceives. Most existing work employs soft attention over individual words to locate the parts of the instruction relevant to the next action. However, different words have different functions in a sentence (e.g., modifiers convey attributes, verbs convey actions). Syntactic information such as dependencies and phrase structures can help the agent locate the important parts of an instruction. Hence, in this paper, we propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance the alignment between the instruction and the current visual scene. Empirically, our agent outperforms a baseline model that does not use syntax information on the Room-to-Room dataset, especially in unseen environments. Moreover, our agent achieves a new state of the art on the Room-Across-Room dataset, which contains instructions in 3 languages (English, Hindi, and Telugu). We also show, via qualitative visualizations, that our agent is better at aligning instructions with the current visual information.
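A minimal sketch of deriving the kind of syntax signal described above: parsing an instruction into a dependency tree and pulling out action verbs and their modifiers. spaCy is used purely for illustration; the paper's agent consumes such dependency information to reweight attention over instruction tokens.

```python
# Extract action verbs and attribute/manner modifiers from a navigation
# instruction via its dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")

def syntax_cues(instruction):
    doc = nlp(instruction)
    verbs = [tok.text for tok in doc if tok.pos_ == "VERB"]
    mods = [(tok.text, tok.head.text) for tok in doc
            if tok.dep_ in ("amod", "advmod")]   # modifier -> head word pairs
    return verbs, mods

# Illustrative: syntax_cues("Turn left and stop at the wooden door.")
# surfaces the action verbs "Turn"/"stop" and modifiers such as ("wooden", "door").
```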
The interpretation of the knowledge learned by multi-head self-attention in transformers has been one of the central questions in NLP. However, most work has focused on models trained for uni-modal tasks, e.g., machine translation. In this paper, we examine masked self-attention in a multi-modal transformer trained for the task of image captioning. In particular, we test whether the multi-modality of the task objective affects the learned attention patterns. Our visualisations of masked self-attention demonstrate that (i) it can learn general linguistic knowledge of the textual input, and (ii) its attention patterns incorporate artefacts from the visual modality even though it never accesses it directly. We compare our transformer's attention patterns with masked attention in distilgpt-2 tested for uni-modal text generation of image captions. Based on the maps of extracted attention weights, we argue that masked self-attention in the image-captioning transformer appears to be enhanced with semantic knowledge from images, exemplifying joint language-and-vision information in its attention patterns.
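A minimal sketch of extracting masked self-attention maps of the kind referenced above, using the Hugging Face transformers library with distilgpt-2 (the uni-modal baseline in the comparison). A multi-modal captioning transformer would be probed the same way, given access to its attention tensors; the input caption here is an arbitrary example.

```python
# Run a caption through distilgpt-2 and plot one head's attention map.
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tok("a man riding a horse on a beach", return_tensors="pt")
out = model(**inputs, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer.
attn = out.attentions[-1][0, 0].detach().numpy()   # last layer, first head
labels = tok.convert_ids_to_tokens(inputs["input_ids"][0])
plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.title("distilgpt-2 masked self-attention (last layer, head 1)")
plt.show()
```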
The limits of applicability of vision-and-language models are defined by the coverage of their training data. Tasks like visual question answering (VQA) often require commonsense and factual information beyond what can be learned from task-specific datasets. This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers. We use an auxiliary training objective that encourages the learned representations to align with graph embeddings of matching entities in a KB. We empirically study the relevance of various KBs to multiple tasks and benchmarks. The technique brings clear benefits to knowledge-demanding question answering tasks (OK-VQA, FVQA) by capturing semantic and relational knowledge absent from existing models. More surprisingly, the technique also benefits visual reasoning tasks (NLVR2, SNLI-VE). We perform probing experiments and show that the injection of additional knowledge regularizes the space of embeddings, which improves the representation of lexical and semantic similarities. The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
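A minimal sketch of an auxiliary alignment objective of the kind described above, assuming we already have (i) the transformer's pooled representation for an entity mention and (ii) a pretrained KB graph embedding for that entity. The linear projection and cosine loss are illustrative assumptions, not the paper's exact formulation.

```python
# Auxiliary loss that pulls projected mention representations toward the
# matching entity's graph embedding, added on top of the main task loss.
import torch.nn as nn
import torch.nn.functional as F

class KBAlignmentLoss(nn.Module):
    def __init__(self, model_dim=768, kb_dim=200):
        super().__init__()
        self.proj = nn.Linear(model_dim, kb_dim)  # map model space -> KB space

    def forward(self, mention_repr, graph_emb):
        # Encourage the projected mention representation to point in the same
        # direction as the entity's graph embedding.
        pred = self.proj(mention_repr)
        return (1 - F.cosine_similarity(pred, graph_emb, dim=-1)).mean()

# Combined objective (lambda_kb is a hypothetical weighting hyperparameter):
# total_loss = task_loss + lambda_kb * KBAlignmentLoss()(mention_repr, graph_emb)
```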
This study aims to assess the availability of the dimensions of organizational intelligence in the private insurance companies on the Syrian coast, and to examine the nature and strength of the relationship between the dimensions of organizational intelligence and performance. To achieve this, three hypotheses were formulated. The researcher used a questionnaire to collect the data, which was analyzed using statistical tests, the most important of which were the one-sample t-test, the Pearson correlation test, and simple regression. The researcher reached several results, the most important of which are: organizational intelligence in the studied companies is evaluated as good; there is a positive relationship between the dimensions of organizational intelligence and performance; and organizational intelligence has a statistically significant effect on performance in the companies under study.
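A minimal sketch of the three statistical tests named above, run on illustrative survey data: Likert-scale scores for organizational intelligence (oi) and performance (perf). The simulated data, sample size, and the test value of 3 (the scale midpoint) are assumptions for demonstration only.

```python
# One-sample t-test, Pearson correlation, and simple regression with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
oi = rng.normal(3.8, 0.6, 120)              # stand-in questionnaire scores
perf = 0.7 * oi + rng.normal(0, 0.4, 120)   # stand-in performance scores

t, p_t = stats.ttest_1samp(oi, popmean=3)   # is the mean above the midpoint?
r, p_r = stats.pearsonr(oi, perf)           # strength of the relationship
reg = stats.linregress(oi, perf)            # simple regression of perf on oi

print(f"one-sample t: t={t:.2f}, p={p_t:.3g}")
print(f"Pearson r={r:.2f}, p={p_r:.3g}")
print(f"regression: perf = {reg.slope:.2f}*oi + {reg.intercept:.2f}")
```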
Reconstructing 3D human poses from a single 2D image is a problem that poses a challenge to many researchers. In recent years, there has been a rising trend toward analyzing the 3D geometry of objects, including their shapes and poses, rather than merely producing bounding boxes. 3D geometric reasoning provides richer information about the scene for subsequent high-level tasks such as scene understanding, augmented reality, and human-computer interaction, and it also improves object detection [3], [4]. 3D reconstruction has therefore been a well-studied problem, and many practically applicable techniques exist, such as structure from motion, multi-view stereo systems, and depth sensors, but these techniques are limited in some scenarios. In this paper, we show how the problem has been approached over the past few decades, analyze recent developments in the field, and discuss potential directions for future research.
This research aims to study the requirements for implementing strategic planning in the Port Governmental Company of Tartous, and to assess the extent to which some of these requirements are already present in the company.
This research aims to examine the relationship between organizational culture and knowledge management practices at Tishreen University. The researcher distributed (222) questionnaires to a sample of the University's colleges, of which (205) were recovered and (158) were valid for analysis. To test the relationship, the researcher used the one-sample Student's t-test as well as the Pearson correlation coefficient.