Do you want to publish a course? Click here

New Domain, Major Effort? How Much Data is Necessary to Adapt a Temporal Tagger to the Voice Assistant Domain

مجال جديد، جهد كبير؟ما مقدار البيانات الضرورية لتكييف علامة زمنية إلى مجال مساعد الصوت

323   0   0   0.0 ( 0 )
 Publication date 2021
and research's language is English
 Created by Shamra Editor




Ask ChatGPT about the research

Reliable tagging of Temporal Expressions (TEs, e.g., Book a table at L'Osteria for Sunday evening) is a central requirement for Voice Assistants (VAs). However, there is a dearth of resources and systems for the VA domain, since publicly-available temporal taggers are trained only on substantially different domains, such as news and clinical text. Since the cost of annotating large datasets is prohibitive, we investigate the trade-off between in-domain data and performance in DA-Time, a hybrid temporal tagger for the English VA domain which combines a neural architecture for robust TE recognition, with a parser-based TE normalizer. We find that transfer learning goes a long way even with as little as 25 in-domain sentences: DA-Time performs at the state of the art on the news domain, and substantially outperforms it on the VA domain.



References used
https://aclanthology.org/
rate research

Read More

Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impa ct of pretraining data size on the knowledge of the models. We explore this impact on the syntactic capabilities of RoBERTa, using models trained on incremental sizes of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode a higher amount of syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing and paraphrase identification. We complement our study with an analysis of the cost-benefit trade-off of training such models. Our experiments show that while models pretrained on more data encode more syntactic knowledge and perform better on downstream applications, they do not always offer a better performance across the different syntactic phenomena and come at a higher financial and environmental cost.
In this study, basic methodologies and procedures for generation synthetic time histories in time domain and frequency domain are summarized. These synthetic time histories are matching Syrian spectrum and compatible with wide range of buildings m odels and soil types according to the seismic parameters of Lattakia city. This paper will discuss the Selection and scaling criteria of three real time history records available in strong ground motion databases to satisfy the Syrian spectrum, and the suitability as input to time history analysis of civil engineering structures.
In the few recent years, besides the traditional web a new web has appeared. It is called the Web of Linked Data. It has been developed to present data in a machinereadable form. The main idea is to describe data using a set of terms called web ont ology. At this time, tools and standards related to the semantic web are becoming comprehensive and stable; however, publishing university data as linked data still faces some major challenges. First of all, there is no unified, well-accepted vocabulary for describing university-related information. This article aims to find the ontology which could be used to describe the data in the university domain, so it could be possible to integrate this data with data from other universities and do queries on it. The web ontology was built by reusing the available vocabularies on the web and adding new classes and properties. The ontology has been organized by using Protégé.
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.
One of the first building blocks to create a voice assistant relates to the task of tagging entities or attributes in user queries. This can be particularly challenging when entities are in the tenth of millions, as is the case of e.g. music catalogs . Training slot tagging models at an industrial scale requires large quantities of accurately labeled user queries, which are often hard and costly to gather. On the other hand, voice assistants typically collect plenty of unlabeled queries that often remain unexploited. This paper presents a weakly-supervised methodology to label large amounts of voice query logs, enhanced with a manual filtering step. Our experimental evaluations show that slot tagging models trained on weakly-supervised data outperform models trained on hand-annotated or synthetic data, at a lower cost. Further, manual filtering of weakly-supervised data leads to a very significant reduction in Sentence Error Rate, while allowing us to drastically reduce human curation efforts from weeks to hours, with respect to hand-annotation of queries. The method is applied to successfully bootstrap a slot tagging system for a major music streaming service that currently serves several tens of thousands of daily voice queries.

suggested questions

comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا