
Building a Corpus for Corporate Websites Machine Translation Evaluation. A Step by Step Methodological Approach


Publication date: 2021
Research language: English





The aim of this paper is to describe the process carried out to develop a parallel corpus comprising texts extracted from the corporate websites of southern Spanish SMEs in the sanitary sector, which will serve as the basis for MT quality assessment. The stages for compiling the parallel corpus were: (i) selection of websites with content translated into English and Spanish, (ii) downloading of the HTML files of the selected websites, (iii) filtering of the files and pairing of English files with their Spanish equivalents, (iv) compilation of individual corpora (EN and ES) for each of the selected websites, (v) merging of the individual corpora into two general corpora, one in English and the other in Spanish, (vi) selection of a representative sample of segments to be used as originals (ES) and reference translations (EN), and (vii) building of the parallel corpus intended for MT evaluation. The parallel corpus generated will serve for future Machine Translation quality assessment. In addition, the monolingual corpora generated during the process could serve as a basis for research focused on linguistic analysis, whether bilingual or monolingual.
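To make stages (ii) and (iii) concrete, below is a minimal Python sketch: it downloads pages, strips the HTML down to running text, and pairs language versions by filename. The example URL and the "slug.en.html" / "slug.es.html" naming convention are assumptions made for illustration; the paper does not specify its tooling.

import pathlib
import requests
from bs4 import BeautifulSoup

RAW = pathlib.Path("raw_html")
RAW.mkdir(exist_ok=True)

def download(url: str, out_name: str) -> None:
    """Stage (ii): save the raw HTML of one corporate web page."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    (RAW / out_name).write_text(resp.text, encoding="utf-8")

def extract_text(html_path: pathlib.Path) -> str:
    """Strip markup so only running text enters the corpus."""
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    return soup.get_text(separator="\n", strip=True)

def pair_files() -> list[tuple[pathlib.Path, pathlib.Path]]:
    """Stage (iii): match each English file with its Spanish equivalent."""
    pairs = []
    for en_file in RAW.glob("*.en.html"):
        es_file = en_file.with_name(en_file.name.replace(".en.", ".es."))
        if es_file.exists():
            pairs.append((en_file, es_file))
    return pairs

# Hypothetical usage: download("https://example-sme.es/en/about", "about.en.html")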



Related research

In this paper we describe the process of building a corporate corpus that will be used as a reference for modelling and computing threads from conversations generated using communication and collaboration tools. The overall goal of the reconstruction of threads is to be able to provide value to the collaborator in various use cases, such as highlighting the important parts of a running discussion, reviewing the upcoming commitments or deadlines, etc. Since, to our knowledge, there is no available corporate corpus for the French language which could allow us to address this problem of thread constitution, we present here a method for building such corpora, including the different aspects and steps which allowed the creation of a pipeline to pseudo-anonymise data. Such a pipeline is a response to the constraints induced by the General Data Protection Regulation (GDPR) in Europe and to compliance with the secrecy of correspondence.
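A minimal sketch of what one pass of such a pseudo-anonymisation pipeline could look like, assuming regex-detectable identifiers (emails, phone numbers) and a stable placeholder table; the patterns and placeholder scheme are illustrative, not the authors' actual pipeline.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d .-]{7,}\d")

def pseudo_anonymise(text: str, table: dict[str, str]) -> str:
    """Replace each identifier with a reusable pseudonym (EMAIL_1, PHONE_2, ...)."""
    def sub(match: re.Match, kind: str) -> str:
        value = match.group(0)
        if value not in table:
            table[value] = f"{kind}_{len(table) + 1}"
        return table[value]

    text = EMAIL.sub(lambda m: sub(m, "EMAIL"), text)
    text = PHONE.sub(lambda m: sub(m, "PHONE"), text)
    return text

mapping: dict[str, str] = {}  # kept outside so pseudonyms stay consistent across documents
print(pseudo_anonymise("Contact jean.dupont@acme.fr or +33 1 23 45 67 89", mapping))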
The current random behavior of stakeholders within the Al-Abrash river basin in the Syrian coastal region, both the lake and the river, threatens more than ever to pollute the whole basin. The goal of this paper is to address the state of shared management of water resources among local players through the application of game theory, based on two self-interested strategies per player, to reach a balance point, taking into consideration the government's intervention as the organizer of the game. Therefore, non-cooperative game theory (NCGT) was adopted as an analytical approach for modeling planning-asset conflicts, and ArcGIS software was used to define different areas according to their risk/land-use types. The results show that the equilibrium point, the "non-cooperate/non-cooperate" strategy pair, could lean towards the "cooperate/cooperate" pair under the influence of the provincial government adopting innovative competitive planning policies. That would lead to an interactive economic-environmental balance in the river basin and help reach rational decisions. This paper can therefore be classified among the studies seeking to apply a participatory planning approach toward sustainable development. Index Terms: Al-Abrash river basin, environmental protection strategy, game theory, participatory approach.
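To make the equilibrium argument concrete, here is a toy 2x2 version of such a game in Python; the payoff numbers are invented for illustration and are not from the paper. With these payoffs, no player gains by unilaterally deviating from (non-cooperate, non-cooperate), which is exactly the trap that a regulator's intervention would have to reshape by changing the payoffs.

import itertools

STRATS = ("cooperate", "non-cooperate")
# payoffs[(row, col)] = (player 1 utility, player 2 utility); illustrative values
payoffs = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "non-cooperate"): (0, 4),
    ("non-cooperate", "cooperate"): (4, 0),
    ("non-cooperate", "non-cooperate"): (1, 1),
}

def is_nash(s1: str, s2: str) -> bool:
    """A profile is a Nash equilibrium if no player gains by switching alone."""
    u1, u2 = payoffs[(s1, s2)]
    best1 = all(payoffs[(alt, s2)][0] <= u1 for alt in STRATS)
    best2 = all(payoffs[(s1, alt)][1] <= u2 for alt in STRATS)
    return best1 and best2

for s1, s2 in itertools.product(STRATS, STRATS):
    if is_nash(s1, s2):
        print("Nash equilibrium:", s1, s2)  # -> non-cooperate, non-cooperate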
Conversations are often held in laboratories and companies. A summary is vital for grasping the content of a discussion for people who did not attend it. If the summary is illustrated as an argument structure, it helps grasp the discussion's essentials immediately. Our purpose in this paper is to predict the link structure between nodes that consist of utterances in a conversation: classification of each node pair as "linked" or "not-linked". One approach to predicting the structure is to utilize machine learning models. However, the result tends to over-generate links between nodes. To solve this problem, we introduce a two-step method for the structure prediction task. We utilize a machine learning-based approach as the first step: a link prediction task. Then, we apply a score-based approach as the second step: a link selection task. Our two-step method dramatically improved accuracy compared with one-step methods based on SVM and BERT.
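A minimal sketch of the two-step idea, with the step-1 classifier scores faked as fixed numbers (an SVM or BERT model would produce them in practice) and a hypothetical one-parent-per-node constraint standing in for the paper's score-based selection; both the scores and the constraint are assumptions for illustration.

# (child utterance, candidate parent) -> link probability from the step-1 model
scores = {
    (1, 0): 0.9,
    (2, 0): 0.6, (2, 1): 0.7,
    (3, 0): 0.2, (3, 1): 0.65, (3, 2): 0.6,
}

def select_links(scores: dict, threshold: float = 0.5) -> dict:
    """Step 2: for each child, keep only its highest-scoring parent above threshold,
    which curbs the over-generation of links from step 1."""
    best = {}
    for (child, parent), p in scores.items():
        if p >= threshold and (child not in best or p > best[child][1]):
            best[child] = (parent, p)
    return {child: parent for child, (parent, _) in best.items()}

print(select_links(scores))  # {1: 0, 2: 1, 3: 1} -- one link kept per node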
Recently, a number of commercial Machine Translation (MT) providers have started to offer glossary features allowing users to enforce terminology in the output of a generic model. However, to the best of our knowledge it is not clear how such features impact terminology accuracy and the overall quality of the output. The present contribution aims at providing a first insight into the performance of the glossary-enhanced generic models offered by four providers. Our tests involve two different domains and language pairs, i.e. Sportswear En-Fr and Industrial Equipment De-En. The output of each generic model and of the glossary-enhanced one is evaluated relying on Translation Error Rate (TER), to take into account the overall output quality, and on accuracy, to assess compliance with the glossary. This is followed by a manual evaluation. The present contribution mainly focuses on understanding how these glossary features can be fruitfully exploited by language service providers (LSPs), especially in a scenario in which a customer glossary is already available and is added to the generic model as is.
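As an illustration of the automatic part of such an evaluation, the sketch below computes corpus-level TER with the sacrebleu library and a naive glossary-compliance ratio. The sentences, the glossary, and the term_accuracy helper are invented for the example; only sacrebleu's TER metric is a real library API.

from sacrebleu.metrics import TER

hyps = ["the running shoes are waterproof"]        # MT output (invented)
refs = ["the trainers are waterproof"]             # reference translation (invented)
glossary = {"zapatillas": "trainers"}              # source term -> required target term

ter = TER()
print(ter.corpus_score(hyps, [refs]))  # lower TER = closer to the reference

def term_accuracy(hyps: list[str], glossary: dict[str, str]) -> float:
    """Share of glossary target terms that actually appear in the MT output."""
    hits = sum(any(term in h for h in hyps) for term in glossary.values())
    return hits / len(glossary)

print(f"glossary accuracy: {term_accuracy(hyps, glossary):.0%}")  # 0% in this toy case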
This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
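A minimal sketch of such a benchmark under stated assumptions: the translate() stub and the placeholder test pairs stand in for a real Sanskrit-English model and the actual Itihasa test split, and BLEU via sacrebleu stands in for whatever metrics the paper reports.

from sacrebleu.metrics import BLEU

def translate(shloka: str) -> str:
    """Stand-in for whichever Sanskrit->English model is being benchmarked."""
    return "..."  # plug a real model in here

test_pairs = [
    ("<sanskrit shloka 1>", "<reference English translation 1>"),
    ("<sanskrit shloka 2>", "<reference English translation 2>"),
]

hyps = [translate(src) for src, _ in test_pairs]
refs = [ref for _, ref in test_pairs]
print(BLEU().corpus_score(hyps, [refs]))  # higher BLEU = closer to the references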


