Detecting Cybersecurity Events from Noisy Short Text

114 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Mehmet Saygin Seyfioglu

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Semih Yagcioglu - Mehmet Saygin Seyfioglu - Begum Citamak

الحساب واللغة التشفير والأمن التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

It is very critical to analyze messages shared over social networks for cyber threat intelligence and cyber-crime prevention. In this study, we propose a method that leverages both domain-specific word embeddings and task-specific features to detect cyber security events from tweets. Our model employs a convolutional neural network (CNN) and a long short-term memory (LSTM) recurrent neural network which takes word level meta-embeddings as inputs and incorporates contextual embeddings to classify noisy short text. We collected a new dataset of cyber security related tweets from Twitter and manually annotated a subset of 2K of them. We experimented with this dataset and concluded that the proposed model outperforms both traditional and neural baselines. The results suggest that our method works well for detecting cyber security events from noisy short text.

قيم البحث

81 - Zekun Xu , Abhinav Aggarwal , Oluwaseyi Feyisetan 2020

Balancing the privacy-utility tradeoff is a crucial requirement of many practical machine learning systems that deal with sensitive customer data. A popular approach for privacy-preserving text analysis is noise injection, in which text data is first mapped into a continuous embedding space, perturbed by sampling a spherical noise from an appropriate distribution, and then projected back to the discrete vocabulary space. While this allows the perturbation to admit the required metric differential privacy, often the utility of downstream tasks modeled on this perturbed data is low because the spherical noise does not account for the variability in the density around different words in the embedding space. In particular, words in a sparse region are likely unchanged even when the noise scale is large. %Using the global sensitivity of the mechanism can potentially add too much noise to the words in the dense regions of the embedding space, causing a high utility loss, whereas using local sensitivity can leak information through the scale of the noise added. In this paper, we propose a text perturbation mechanism based on a carefully designed regularized variant of the Mahalanobis metric to overcome this problem. For any given noise scale, this metric adds an elliptical noise to account for the covariance structure in the embedding space. This heterogeneity in the noise scale along different directions helps ensure that the words in the sparse region have sufficient likelihood of replacement without sacrificing the overall utility. We provide a text-perturbation algorithm based on this metric and formally prove its privacy guarantees. Additionally, we empirically show that our mechanism improves the privacy statistics to achieve the same level of utility as compared to the state-of-the-art Laplace mechanism.

الحساب واللغة التشفير والأمن التعلم الآلي

Embedding Convolutions for Short Text Extreme Classification with Millions of Labels

71 - Siddhant Kharbanda , Atmadeep Banerjee , Akash Palrecha 2021

Automatic annotation of short-text data to a large number of target labels, referred to as Short Text Extreme Classification, has recently found numerous applications in prediction of related searches and product recommendation tasks. The conventiona l usage of Convolutional Neural Network (CNN) to capture n-grams in text-classification relies heavily on uniformity in word-ordering and the presence of long input sequences to convolve over. However, this is missing in short and unstructured text sequences encountered in search and recommendation. In order to tackle this, we propose an orthogonal approach by recasting the convolution operation to capture coupled semantics along the embedding dimensions, and develop a word-order agnostic embedding enhancement module to deal with the lack of structure in such queries. Benefitting from the computational efficiency of the convolution operation, Embedding Convolutions, when applied on the enriched word embeddings, result in a light-weight and yet powerful encoder (InceptionXML) that is robust to the inherent lack of structure in short-text extreme classification. Towards scaling our model to problems with millions of labels, we also propose InceptionXML+, which addresses the shortcomings of the dynamic hard-negative mining framework in the recently proposed LightXML by improving the alignment between the label-shortlister and extreme classifier. On popular benchmark datasets, we empirically demonstrate that the proposed method outperforms state-of-the-art deep extreme classifiers such as Astec by an average of 5% and 8% on the P@k and propensity-scored PSP@k metrics respectively.

الحساب واللغة الذكاء الاصطناعي التعلم الآلي

Differential Privacy for Text Analytics via Natural Text Sanitization

338 - Xiang Yue , Minxin Du , Tianhao Wang 2021

Texts convey sophisticated knowledge. However, texts also convey sensitive information. Despite the success of general-purpose language models and domain-specific mechanisms with differential privacy (DP), existing text sanitization mechanisms still provide low utility, as cursed by the high-dimensional text representation. The companion issue of utilizing sanitized texts for downstream analytics is also under-explored. This paper takes a direct approach to text sanitization. Our insight is to consider both sensitivity and similarity via our new local DP notion. The sanitized texts also contribute to our sanitization-aware pretraining and fine-tuning, enabling privacy-preserving natural language processing over the BERT language model with promising utility. Surprisingly, the high utility does not boost up the success rate of inference attacks.

الحساب واللغة التشفير والأمن

Semantic Scan: Detecting Subtle, Spatially Localized Events in Text Streams

76 - Abhinav Maurya , Kenton Murray , Yandong Liu 2016

Early detection and precise characterization of emerging topics in text streams can be highly useful in applications such as timely and targeted public health interventions and discovering evolving regional business trends. Many methods have been pro posed for detecting emerging events in text streams using topic modeling. However, these methods have numerous shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. In this paper, we describe Semantic Scan (SS) that has been developed specifically to overcome these shortcomings in detecting new spatially compact events in text streams. Semantic Scan integrates novel contrastive topic modeling with online document assignment and principled likelihood ratio-based spatial scanning to identify emerging events with unexpected patterns of keywords hidden in text streams. This enables more timely and accurate detection and characterization of anomalous, spatially localized emerging events. Semantic Scan does not require manual intervention or labeled training data, and is robust to noise in real-world text data since it identifies anomalous text patterns that occur in a cluster of new documents rather than an anomaly in a single new document. We compare Semantic Scan to alternative state-of-the-art methods such as Topics over Time, Online LDA, and Labeled LDA on two real-world tasks: (i) a disease surveillance task monitoring free-text Emergency Department chief complaints in Allegheny County, and (ii) an emerging business trend detection task based on Yelp reviews. On both tasks, we find that Semantic Scan provides significantly better event detection and characterization accuracy than competing approaches, while providing up to an order of magnitude speedup.

استرجاع المعلومات التعلم الالي

Text Analytics for Resilience-Enabled Extreme Events Reconnaissance

63 - Alicia Y. Tsai , Selim Gunay , Minjune Hwang 2020

Post-hazard reconnaissance for natural disasters (e.g., earthquakes) is important for understanding the performance of the built environment, speeding up the recovery, enhancing resilience and making informed decisions related to current and future h azards. Natural language processing (NLP) is used in this study for the purposes of increasing the accuracy and efficiency of natural hazard reconnaissance through automation. The study particularly focuses on (1) automated data (news and social media) collection hosted by the Pacific Earthquake Engineering Research (PEER) Center server, (2) automatic generation of reconnaissance reports, and (3) use of social media to extract post-hazard information such as the recovery time. Obtained results are encouraging for further development and wider usage of various NLP methods in natural hazard reconnaissance.

الحساب واللغة أجهزة الكمبيوتر والمجتمع