Stacking classifiers for anti-spam filtering of e-mail

Published by: Ion Androutsopoulos

Publication date: 2001

Research field: Informatics Engineering

Language: English

Authors: G. Sakkis, I. Androutsopoulos, G. Paliouras

Categories: Computation and Language, Artificial Intelligence

Abstract

We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, or spam, floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.
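The abstract does not detail the stacking setup, so the following is only a minimal Python sketch of stacked generalization under stated assumptions. It uses two toy base learners (a Naive Bayes classifier and a memory-based nearest-neighbour classifier, both chosen here for illustration), a lookup-table meta-learner, and a made-up twelve-message corpus. The function names `train_nb`, `train_knn` and `train_stacked` are hypothetical, and none of this is the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Naive Bayes over word sets (Bernoulli model, add-one smoothing)."""
    vocab = set().union(*docs)
    n = Counter(labels)
    df = {c: Counter() for c in n}
    for doc, y in zip(docs, labels):
        df[y].update(doc)

    def predict(doc):
        def score(c):
            s = math.log(n[c] / len(labels))
            for w in vocab:
                p = (df[c][w] + 1) / (n[c] + 2)
                s += math.log(p if w in doc else 1 - p)
            return s
        return max(sorted(n), key=score)
    return predict

def train_knn(docs, labels, k=3):
    """Memory-based learner: majority label among the k most overlapping messages."""
    def predict(doc):
        ranked = sorted(zip(docs, labels), key=lambda pair: -len(doc & pair[0]))
        return Counter(y for _, y in ranked[:k]).most_common(1)[0][0]
    return predict

def train_stacked(docs, labels, base_trainers, folds=3):
    # Step 1: out-of-fold predictions of the base learners become meta-level features.
    meta_rows = [None] * len(docs)
    for f in range(folds):
        rest = [i for i in range(len(docs)) if i % folds != f]
        models = [t([docs[i] for i in rest], [labels[i] for i in rest])
                  for t in base_trainers]
        for i in range(f, len(docs), folds):
            meta_rows[i] = tuple(m(docs[i]) for m in models)
    # Step 2: toy meta-learner, the majority true label per combination of base outputs.
    table = defaultdict(Counter)
    for row, y in zip(meta_rows, labels):
        table[row][y] += 1
    default = Counter(labels).most_common(1)[0][0]
    # Step 3: retrain the base learners on all data for use at prediction time.
    final = [t(docs, labels) for t in base_trainers]

    def predict(doc):
        row = tuple(m(doc) for m in final)
        return table[row].most_common(1)[0][0] if row in table else default
    return predict

# Made-up toy corpus (assumption): each message is a set of words.
docs = [{"free", "money", "now"}, {"win", "free", "prize"}, {"cheap", "money", "offer"},
        {"free", "offer", "now"}, {"win", "money", "prize"}, {"cheap", "free", "prize"},
        {"meeting", "tomorrow", "agenda"}, {"project", "report", "agenda"},
        {"lunch", "tomorrow", "project"}, {"report", "meeting", "notes"},
        {"notes", "project", "lunch"}, {"agenda", "lunch", "report"}]
labels = ["spam"] * 6 + ["legit"] * 6

stacked = train_stacked(docs, labels, [train_nb, train_knn])
print(stacked({"free", "prize", "money"}))
print(stacked({"meeting", "report", "project"}))
```

The point of the out-of-fold step is that the meta-learner sees base-classifier outputs on messages those classifiers were not trained on, which is what separates stacking from simple voting.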

Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos and Constantine D. Spyropoulos, 2000

The growing problem of unsolicited bulk e-mail, also known as spam, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in encrypted form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns, and which is part of a widely used e-mail reader.

Categories: Computation and Language, Information Retrieval, Machine Learning

An evaluation of Naive Bayesian anti-spam filtering

Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, 2000

It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail (spam). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter's performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.
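The abstract mentions cost-sensitive evaluation measures without defining them. As a hedged illustration, the sketch below computes a weighted accuracy and a total cost ratio, one formulation used in this line of anti-spam work, in which blocking a legitimate message is treated as `lam` times as costly as letting a spam message through. The function names, the value of `lam` and the confusion counts are assumptions made up for this example and are not results from the paper.

```python
def weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam):
    """Accuracy in which each legitimate message counts as `lam` messages.

    n_ll: legitimate kept, n_ls: legitimate wrongly blocked,
    n_ss: spam blocked,    n_sl: spam wrongly let through.
    """
    return (lam * n_ll + n_ss) / (lam * (n_ll + n_ls) + (n_ss + n_sl))

def total_cost_ratio(n_ls, n_ss, n_sl, lam):
    """Cost of using no filter divided by the cost of using the filter.

    Values above 1 mean the filter is better than no filter at all.
    """
    cost_with_filter = lam * n_ls + n_sl
    if cost_with_filter == 0:
        return float("inf")  # a filter that makes no errors
    return (n_ss + n_sl) / cost_with_filter

# Hypothetical confusion counts, for illustration only.
wacc = weighted_accuracy(n_ll=580, n_ls=2, n_ss=150, n_sl=30, lam=9)
tcr = total_cost_ratio(n_ls=2, n_ss=150, n_sl=30, lam=9)
print(round(wacc, 4), tcr)
```

With a large `lam`, even a handful of blocked legitimate messages can push the total cost ratio below 1, which is consistent with the abstract's remark that additional safety nets are needed in practice.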

Categories: Computation and Language, Artificial Intelligence

Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach

Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, 2000

We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method to construct automatically anti-spam filters with superior performance. We investigate thoroughly the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures. Both methods achieve very accurate spam filtering, outperforming clearly the keyword-based filter of a widely used e-mail reader.

Categories: Computation and Language, Information Retrieval, Machine Learning

Clasificarea distribuita a mesajelor de e-mail (Distributed classification of e-mail messages)

Florin Pop, Diana Petrescu, Ştefan Trauşan-Matu, 2011

Electronic mail, with its various implications, is a basic component of Internet applications. The paper proposes a mechanism for automatically classifying e-mails and creating dynamic groups to which these messages belong. The proposed mechanisms will be based on natural language processing techniques and will be designed to facilitate human-machine interaction in this direction.

Categories: Human-Computer Interaction, Computation and Language

Likelihood Ratios and Generative Classifiers for Unsupervised Out-of-Domain Detection In Task Oriented Dialog

Varun Gangal, Abhinav Arora, Arash Einolghozati, 2019

The task of identifying out-of-domain (OOD) input examples directly at test-time has seen renewed interest recently due to increased real world deployment of models. In this work, we focus on OOD detection for natural language sentence inputs to task-based dialog systems. Our findings are three-fold: First, we curate and release ROSTD (Real Out-of-Domain Sentences From Task-oriented Dialog) - a dataset of 4K OOD examples for the publicly available dataset from (Schuster et al. 2019). In contrast to existing settings which synthesize OOD examples by holding out a subset of classes, our examples were authored by annotators with a priori instructions to be out-of-domain with respect to the sentences in an existing dataset. Second, we explore likelihood ratio based approaches as an alternative to currently prevalent paradigms. Specifically, we reformulate and apply these approaches to natural language inputs. We find that they match or outperform the latter on all datasets, with larger improvements on non-artificial OOD benchmarks such as our dataset. Our ablations validate that specifically using likelihood ratios rather than plain likelihood is necessary to discriminate well between OOD and in-domain data. Third, we propose learning a generative classifier and computing a marginal likelihood (ratio) for OOD detection. This allows us to use a principled likelihood while at the same time exploiting training-time labels. We find that this approach outperforms both simple likelihood (ratio) based and other prior approaches. We are hitherto the first to investigate the use of generative classifiers for OOD detection at test-time.
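To make the likelihood-ratio idea concrete, here is a minimal Python sketch. It swaps in add-one-smoothed unigram language models for whatever generative models the paper actually uses, which the abstract does not specify, and it scores a sentence by the log-likelihood under an in-domain model minus the log-likelihood under a broader background model. The corpora, the function names `train_unigram` and `llr`, and the implied threshold of zero are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_unigram(sentences, vocab):
    """Add-one-smoothed unigram model; returns a sentence log-likelihood function."""
    def norm(w):
        return w if w in vocab else "<unk>"
    counts = Counter(norm(w) for s in sentences for w in s.split())
    total = sum(counts.values())

    def log_prob(sentence):
        return sum(math.log((counts[norm(w)] + 1) / (total + len(vocab)))
                   for w in sentence.split())
    return log_prob

# Made-up toy corpora (assumption): task-oriented commands vs. general text.
in_domain = ["set an alarm for seven", "what is the weather today",
             "remind me to call mom", "set a reminder for tomorrow"]
background = in_domain + ["who won the football game", "tell me a joke about cats",
                          "what is the capital of france", "play some jazz music"]

vocab = {w for s in background for w in s.split()} | {"<unk>"}
lp_in = train_unigram(in_domain, vocab)
lp_bg = train_unigram(background, vocab)

def llr(sentence):
    """Log-likelihood ratio: higher means more in-domain-like."""
    return lp_in(sentence) - lp_bg(sentence)

for s in ["set an alarm for tomorrow", "tell me a joke about football"]:
    print(s, round(llr(s), 3), "in-domain" if llr(s) > 0 else "OOD")
```

The ratio cancels out how common the words are in general, so a sentence is not flagged merely for being long or for containing rare words, which is one intuition for why the abstract reports ratios working better than plain likelihood.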

Categories: Computation and Language, Artificial Intelligence