Differential Privacy for Text Analytics via Natural Text Sanitization

Published by Xiang Yue

Publication date: 2021

Research field: Informatics Engineering

Research language: English

Authors: Xiang Yue - Minxin Du - Tianhao Wang

Computation and Language, Cryptography and Security

Texts convey sophisticated knowledge, but they also convey sensitive information. Despite the success of general-purpose language models and domain-specific mechanisms with differential privacy (DP), existing text sanitization mechanisms still provide low utility because of the curse of high-dimensional text representations. The companion issue of using sanitized texts for downstream analytics is also under-explored. This paper takes a direct approach to text sanitization. Our insight is to consider both sensitivity and similarity via our new local DP notion. The sanitized texts also contribute to our sanitization-aware pretraining and fine-tuning, enabling privacy-preserving natural language processing over the BERT language model with promising utility. Surprisingly, the high utility does not boost the success rate of inference attacks.
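To make the idea of sanitizing text token by token under a similarity-aware privacy notion more concrete, the following is a minimal sketch, not the authors' mechanism. It assumes an exponential-mechanism-style sampler in which a token is replaced by a vocabulary word with probability proportional to exp(-epsilon/2 * distance), where distance is the Euclidean distance between word embeddings. The names `sanitize_token`, `sanitize_text`, and `TOY_EMBEDDINGS`, as well as the toy vectors, are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical 2-D embeddings; a real system would use learned word vectors.
TOY_EMBEDDINGS = {
    "doctor": (0.0, 0.0),
    "nurse": (0.1, 0.0),
    "banana": (5.0, 5.0),
}


def sanitize_token(token, embeddings, epsilon, rng):
    """Sample a replacement for `token` from the vocabulary.

    Assumption: each candidate is weighted by exp(-epsilon/2 * d), where d is
    the Euclidean distance between the embeddings of `token` and the candidate.
    Closer (more similar) words are therefore more likely to be chosen.
    """
    source = embeddings[token]
    candidates = sorted(embeddings)  # fixed order for reproducibility
    weights = [
        math.exp(-0.5 * epsilon * math.dist(source, embeddings[c]))
        for c in candidates
    ]
    return rng.choices(candidates, weights=weights, k=1)[0]


def sanitize_text(tokens, embeddings, epsilon, seed=0):
    """Sanitize every token independently with the sampler above."""
    rng = random.Random(seed)
    return [sanitize_token(t, embeddings, epsilon, rng) for t in tokens]


if __name__ == "__main__":
    # Small epsilon: replacements are close to uniform over the vocabulary.
    print(sanitize_text(["doctor", "nurse"], TOY_EMBEDDINGS, epsilon=0.5))
    # Large epsilon: tokens are almost always kept unchanged.
    print(sanitize_text(["doctor", "nurse"], TOY_EMBEDDINGS, epsilon=1000.0))
```

In this sketch a smaller `epsilon` gives stronger privacy and noisier output, while a larger `epsilon` keeps the text closer to the original. How sensitivity is taken into account is not described in the abstract, so it is left out here.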

199 - Liwei Song, Xinwei Yu, Hsuan-Tung Peng 2020

Recent work has demonstrated the vulnerability of modern text classifiers to universal adversarial attacks, which are input-agnostic sequences of words added to text processed by classifiers. Despite being successful, the word sequences produced in such attacks are often ungrammatical and can be easily distinguished from natural text. We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems when added to benign inputs. We leverage an adversarially regularized autoencoder (ARAE) to generate triggers and propose a gradient-based search that aims to maximize the downstream classifier's prediction loss. Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models as per automatic detection metrics and human-subject studies. Our aim is to demonstrate that adversarial attacks can be made harder to detect than previously thought and to enable the development of appropriate defenses.

Computation and Language, Cryptography and Security

Generalised Differential Privacy for Text Document Processing

85 - Natasha Fernandes, Mark Dras, Annabelle McIver 2018

We address the problem of how to obfuscate texts by removing stylistic clues which can identify authorship, whilst preserving (as much as possible) the content of the text. In this paper we combine ideas from generalised differential privacy and machine learning techniques for text processing to model privacy for text documents. We define a privacy mechanism that operates at the level of text documents represented as bags-of-words - these representations are typical in machine learning and contain sufficient information to carry out many kinds of classification tasks including topic identification and authorship attribution (of the original documents). We show that our mechanism satisfies privacy with respect to a metric for semantic similarity, thereby providing a balance between utility, defined by the semantic content of texts, and the obfuscation of stylistic clues. We demonstrate our implementation on a fan fiction dataset, confirming that it is indeed possible to disguise writing style effectively whilst preserving enough information and variation for accurate content classification tasks.

Cryptography and Security, Machine Learning

Text Analytics for Resilience-Enabled Extreme Events Reconnaissance

63 - Alicia Y. Tsai, Selim Gunay, Minjune Hwang 2020

Post-hazard reconnaissance for natural disasters (e.g., earthquakes) is important for understanding the performance of the built environment, speeding up the recovery, enhancing resilience and making informed decisions related to current and future hazards. Natural language processing (NLP) is used in this study for the purposes of increasing the accuracy and efficiency of natural hazard reconnaissance through automation. The study particularly focuses on (1) automated data (news and social media) collection hosted by the Pacific Earthquake Engineering Research (PEER) Center server, (2) automatic generation of reconnaissance reports, and (3) use of social media to extract post-hazard information such as the recovery time. Obtained results are encouraging for further development and wider usage of various NLP methods in natural hazard reconnaissance.

Computation and Language, Computers and Society

Privacy Guarantees for De-identifying Text Transformations

123 - David Ifeoluwa Adelani, Ali Davody, Thomas Kleinbauer 2020

Machine Learning approaches to Natural Language Processing tasks benefit from a comprehensive collection of real-life user data. At the same time, there is a clear need for protecting the privacy of the users whose data is collected and processed. For text collections such as transcripts of voice interactions or patient records, replacing sensitive parts with benign alternatives can provide de-identification. However, how much privacy is actually guaranteed by such text transformations, and are the resulting texts still useful for machine learning? In this paper, we derive formal privacy guarantees for general text transformation-based de-identification methods on the basis of Differential Privacy. We also measure the effect that different ways of masking private information in dialog transcripts have on a subsequent machine learning task. To this end, we formulate different masking strategies and compare their privacy-utility trade-offs. In particular, we compare a simple redaction approach with more sophisticated word-by-word replacement using deep learning models on multiple natural language understanding tasks like named entity recognition, intent detection, and dialog act classification. We find that only word-by-word replacement is robust against performance drops in various tasks.

Computation and Language
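The difference between the two masking strategies named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract describes replacement driven by deep learning models, whereas this sketch substitutes a random pick from a fixed list. The names `redact`, `replace_word_by_word`, `REDACT_TOKEN`, and the example word lists are hypothetical.

```python
import random

REDACT_TOKEN = "[REDACTED]"  # hypothetical placeholder string


def redact(tokens, sensitive):
    """Replace every sensitive token with one fixed placeholder."""
    return [REDACT_TOKEN if t.lower() in sensitive else t for t in tokens]


def replace_word_by_word(tokens, sensitive, substitutes, seed=0):
    """Replace every sensitive token with a benign alternative.

    Assumption: the alternative is drawn uniformly from `substitutes`.
    A model-based system would pick a context-appropriate word instead.
    """
    rng = random.Random(seed)
    return [
        rng.choice(substitutes) if t.lower() in sensitive else t
        for t in tokens
    ]


if __name__ == "__main__":
    utterance = ["Call", "Alice", "tomorrow"]
    sensitive_words = {"alice"}
    print(redact(utterance, sensitive_words))
    print(replace_word_by_word(utterance, sensitive_words, ["Maria", "John"]))
```

Redaction leaves a visible gap in the text, while word-by-word replacement keeps the sentence looking natural, which is one plausible reason a downstream model might cope better with replaced text.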

Selective Differential Privacy for Language Modeling

240 - Weiyan Shi, Aiqi Cui, Evan Li 2021

With the increasing adoption of language models in applications involving sensitive data, it has become crucial to protect these models from leaking private information. Previous work has attempted to tackle this challenge by training RNN-based language models with differential privacy guarantees. However, applying classical differential privacy to language models leads to poor model performance as the underlying privacy notion is over-pessimistic and provides undifferentiated protection for all tokens of the data. Given that the private information in natural language is sparse (for example, the bulk of an email might not carry personally identifiable information), we propose a new privacy notion, selective differential privacy, to provide rigorous privacy guarantees on the sensitive portion of the data to improve model utility. To realize such a new notion, we develop a corresponding privacy mechanism, Selective-DPSGD, for RNN-based language models. Besides language modeling, we also apply the method to a more concrete application -- dialog systems. Experiments on both language modeling and dialog system building show that the proposed privacy-preserving mechanism achieves better utilities while remaining safe under various privacy attacks compared to the baselines. The data, code and models are available at https://github.com/wyshi/lm_privacy.

Computation and Language, Cryptography and Security
