Creating and Evaluating a Synthetic Norwegian Clinical Corpus for De-Identification

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Abstract

Building tools to remove sensitive information such as personal names, addresses, and telephone numbers, so-called Protected Health Information (PHI), from clinical free text is an important task for making clinical texts available for research. These de-identification tools must be assessed for quality in terms of precision and recall. To assess such tools, gold standards (annotated clinical text) must be available. Such gold standards exist for larger languages. For Norwegian, however, there are no such resources. Therefore, an existing Norwegian synthetic clinical corpus, NorSynthClinical, has been extended with PHIs and annotated by two annotators, obtaining an inter-annotator agreement of 0.94 F1-measure. In total, the corpus has 409 annotated PHI instances and is called NorSynthClinical PHI. A hybrid de-identification tool (machine learning and rule-based methods) for Norwegian was developed and trained with openly available resources, and obtained an overall F1-measure of 0.73 and a recall of 0.62 when tested on NorSynthClinical PHI. NorSynthClinical PHI is openly available on GitHub for use by the research community.
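
As a rough illustration of how precision, recall, and F1-measure are typically computed for a de-identification tool, the following minimal sketch compares predicted PHI spans against gold-standard spans using exact matching. The span format `(start, end, label)`, the function name `precision_recall_f1`, and the sample data are assumptions made for this example; the abstract does not specify the matching criterion used in the paper.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of PHI spans.

    Each span is a hypothetical (start, end, label) tuple; a prediction
    counts as correct only if all three fields match a gold span.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans for illustration only (not taken from NorSynthClinical PHI).
gold_spans = {(0, 10, "NAME"), (25, 33, "PHONE"), (40, 52, "ADDRESS"), (60, 70, "DATE")}
predicted_spans = {(0, 10, "NAME"), (25, 33, "PHONE"), (80, 85, "NAME")}

p, r, f = precision_recall_f1(gold_spans, predicted_spans)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The same function can be applied to two annotators' span sets, treating one as gold and the other as predicted, which is one common way to express inter-annotator agreement as an F1-measure.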

References used

https://aclanthology.org/

Creating and Evaluating Resources for Sentiment Analysis in the Low-resource Language: Sindhi

628 - Association for Computational Linguistics 2021 article

In this paper, we develop a Sindhi subjective lexicon using a merger of existing English resources: NRC lexicon, list of opinion words, SentiWordNet, Sindhi-English bilingual dictionary, and collection of Sindhi modifiers. A positive or negative sentiment score is assigned to each Sindhi opinion word. Afterwards, we determine the coverage of the proposed lexicon with subjectivity analysis. Moreover, we crawl a multi-domain tweet corpus of news, sports, and finance. The crawled corpus is annotated by experienced annotators using the Doccano text annotation tool. The sentiment-annotated corpus is evaluated by employing support vector machine (SVM), recurrent neural network (RNN) variants, and convolutional neural network (CNN).

low-resource language, creating and evaluating, evaluating resources

Corpus Creation and Evaluation for Speech-to-Text and Speech Translation

753 - Association for Computational Linguistics 2021 article

The National Virtual Translation Center (NVTC) seeks to acquire human language technology (HLT) tools that will facilitate its mission to provide verbatim English translations of foreign language audio and video files. In the text domain, NVTC has been using translation memory (TM) for some time and has reported on the incorporation of machine translation (MT) into that workflow (Miller et al., 2020). While we have explored the use of speech-to-text (STT) and speech translation (ST) in the past (Tzoukermann and Miller, 2018), we have now invested in the creation of a substantial human-made corpus to thoroughly evaluate alternatives. Results from our analysis of this corpus and the performance of HLT tools point the way to the most promising ones to deploy in our workflow.

virtual translation center, national virtual translation, verbatim English translations

Negation in Norwegian: an annotated dataset

539 - Association for Computational Linguistics 2021 article

This paper introduces NorecNeg, the first annotated dataset of negation for Norwegian. Negation cues and their in-sentence scopes have been annotated across more than 11K sentences spanning more than 400 documents for a subset of the Norwegian Review Corpus (NoReC). In addition to providing in-depth discussion of the annotation guidelines, we also present a first set of benchmark results based on a graph-parsing approach.

Norwegian Review Corpus, annotated dataset, Norwegian

HB Deid - HB De-identification tool demonstrator

956 - Association for Computational Linguistics 2021 article

This paper describes a freely available web-based demonstrator called HB Deid. HB Deid identifies so-called protected health information (PHI) in a text written in Swedish and removes, masks, or replaces it with surrogates or pseudonyms. PHIs are named entities such as personal names, locations, ages, phone numbers, and dates. HB Deid uses a CRF model trained on non-sensitive annotated text in Swedish, as well as a rule-based post-processing step for finding PHI. The final step in obscuring the PHI is then to either mask it, show only the class name, or use a rule-based pseudonymisation system to replace it.
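
The three obscuring options named in this abstract (masking, showing only the class name, and pseudonymisation) can be sketched as follows. This is a minimal illustration and not HB Deid's implementation: it assumes the PHI spans have already been detected, and the function `obscure`, the span format `(start, end, label)`, the surrogate table, and the English sample sentence are all hypothetical.

```python
def obscure(text, spans, mode="mask", surrogates=None):
    """Replace detected PHI spans in `text` according to `mode`.

    spans: non-overlapping (start, end, label) tuples (assumed format).
    mode:  "mask" (same-length X run), "class" (show the label only),
           or "pseudonym" (look up a surrogate per label, falling back
           to the class name when no surrogate is given).
    """
    surrogates = surrogates or {}
    pieces = []
    position = 0
    for start, end, label in sorted(spans):
        pieces.append(text[position:start])  # untouched text before the span
        if mode == "mask":
            pieces.append("X" * (end - start))
        elif mode == "class":
            pieces.append(f"[{label}]")
        elif mode == "pseudonym":
            pieces.append(surrogates.get(label, f"[{label}]"))
        else:
            raise ValueError(f"unknown mode: {mode}")
        position = end
    pieces.append(text[position:])  # remaining text after the last span
    return "".join(pieces)


# Made-up example sentence and spans, for illustration only.
sample = "Anna Berg called 555-0100 on Monday."
phi_spans = [(0, 9, "NAME"), (17, 25, "PHONE")]

masked = obscure(sample, phi_spans, mode="mask")
classed = obscure(sample, phi_spans, mode="class")
pseudonymised = obscure(
    sample, phi_spans, mode="pseudonym",
    surrogates={"NAME": "Eva Lind", "PHONE": "555-0199"},
)
print(masked)
print(classed)
print(pseudonymised)
```

A fixed surrogate per label is the simplest possible stand-in; a rule-based pseudonymisation system would normally choose surrogates that preserve properties such as gender or date offsets.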

de-identification tool demonstrator, de-identification tool, deid

Synthetic Examples Improve Cross-Target Generalization: A Study on Stance Detection on a Twitter corpus.

807 - Association for Computational Linguistics 2021 article

Cross-target generalization is a known problem in stance detection (SD), where systems tend to perform poorly when exposed to targets unseen during training. Given that data annotation is expensive and time-consuming, finding ways to leverage abundant unlabeled in-domain data can offer great benefits. In this paper, we apply a weakly supervised framework to enhance cross-target generalization through synthetically annotated data. We focus on Twitter SD and show experimentally that integrating synthetic data is helpful for cross-target generalization, leading to significant improvements in performance, with gains in F1 scores ranging from +3.4 to +5.1.

improve cross-target generalization, improve cross-target
