تعتمد نماذج التعلم العميق الحديثة لمعالجة اللغة الطبيعية بشكل كبير على كميات كبيرة من النصوص المشروح.ومع ذلك، قد يكون الحصول على مثل هذه النصوص صعبة عندما تحتوي على معلومات شخصية أو سرية، على سبيل المثال، في المجالات الصحية أو القانونية.في هذا العمل، نقترح طريقة لإلغاء تحديد المستندات النصية النموذج الحرة من خلال تقسيم البيانات الحساسة بعناية فيها.نظهر أن طريقتنا تحافظ على الأداة المساعدة للبيانات لتصنيف النصوص ووضع التسلسل والتسجيل الإجابة على المهام.
Modern deep learning models for natural language processing rely heavily on large amounts of annotated texts. However, obtaining such texts may be difficult when they contain personal or confidential information, for example, in health or legal domains. In this work, we propose a method of de-identifying free-form text documents by carefully redacting sensitive data in them. We show that our method preserves data utility for text classification, sequence labeling and question answering tasks.
References used
https://aclanthology.org/
Neural language models are known to have a high capacity for memorization of training samples. This may have serious privacy im- plications when training models on user content such as email correspondence. Differential privacy (DP), a popular choice
We curated WikiPII, an automatically labeled dataset composed of Wikipedia biography pages, annotated for personal information extraction. Although automatic annotation can lead to a high degree of label noise, it is an inexpensive process and can ge
News recommendation techniques can help users on news platforms obtain their preferred news information. Most existing news recommendation methods rely on centrally stored user behavior data to train models and serve users. However, user data is usua
Recent neural text-to-SQL models can effectively translate natural language questions to corresponding SQL queries on unseen databases. Working mostly on the Spider dataset, researchers have proposed increasingly sophisticated solutions to the proble
Latent Dirichlet allocation (LDA), a widely used topic model, is often employed as a fundamental tool for text analysis in various applications. However, the training process of the LDA model typically requires massive text corpus data. On one hand,