An evaluation of Naive Bayesian anti-spam filtering

Published by: Ion Androutsopoulos

Publication date: 2000

Research field: Informatics Engineering

Paper language: English

Authors: Ion Androutsopoulos - John Koutsias - Konstantinos V. Chandrinos

Computation and Language, Artificial Intelligence

Abstract

It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail (spam). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter's performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.
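To make the idea in the abstract concrete, here is a minimal sketch of a Naive Bayesian spam filter with a cost-sensitive decision rule. It is an illustration under assumptions, not the authors' implementation: the binary word-presence features, the Laplace smoothing, the function names (`train`, `spam_probability`, `is_spam`), the `cost_ratio` parameter with its threshold `cost_ratio / (1 + cost_ratio)`, and the toy messages are all hypothetical and do not come from the abstract.

```python
import math
from collections import Counter

LABELS = ("spam", "legit")

def train(messages):
    """messages: list of (tokens, label) pairs, label in LABELS.
    Assumes at least one training message per label."""
    doc_counts = Counter()
    token_counts = {label: Counter() for label in LABELS}
    vocab = set()
    for tokens, label in messages:
        doc_counts[label] += 1
        for tok in set(tokens):          # binary presence, not frequency
            token_counts[label][tok] += 1
            vocab.add(tok)
    return doc_counts, token_counts, vocab

def spam_probability(tokens, model):
    """Posterior P(spam | message) under the naive independence assumption."""
    doc_counts, token_counts, vocab = model
    total = sum(doc_counts.values())
    present = set(tokens)
    log_score = {}
    for label in LABELS:
        score = math.log(doc_counts[label] / total)   # class prior
        for tok in vocab:
            # Laplace-smoothed estimate of P(token present | label)
            p = (token_counts[label][tok] + 1) / (doc_counts[label] + 2)
            score += math.log(p if tok in present else 1.0 - p)
        log_score[label] = score
    # Normalize in log space to avoid underflow
    m = max(log_score.values())
    exp = {label: math.exp(s - m) for label, s in log_score.items()}
    return exp["spam"] / (exp["spam"] + exp["legit"])

def is_spam(tokens, model, cost_ratio=9.0):
    """Flag as spam only if P(spam) exceeds cost_ratio / (1 + cost_ratio).
    cost_ratio is an assumed knob: how many times worse it is to block a
    legitimate message than to let a spam message through."""
    threshold = cost_ratio / (1.0 + cost_ratio)
    return spam_probability(tokens, model) > threshold

# Toy training data (hypothetical)
model = train([
    ({"free", "money"}, "spam"),
    ({"free", "offer"}, "spam"),
    ({"meeting", "monday"}, "legit"),
    ({"project", "meeting"}, "legit"),
])
```

With this toy model, `is_spam({"free", "money"}, model)` flags the message at the default `cost_ratio`, but a much stricter setting such as `cost_ratio=999` does not, since the posterior no longer clears the threshold. That trade-off is one way to read the abstract's point that a filter of this kind may need extra safety nets when blocking legitimate mail is very costly.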

103 - Ion Androutsopoulos , John Koutsias , Konstantinos V. Chandrinos and Constantine D. Spyropoulos 2000

The growing problem of unsolicited bulk e-mail, also known as spam, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in encrypted form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns and is part of a widely used e-mail reader.

Computation and Language, Information Retrieval, Machine Learning

Stacking classifiers for anti-spam filtering of e-mail

168 - G. Sakkis , I. Androutsopoulos , G. Paliouras 2001

We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, or spam, floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.

Computation and Language, Artificial Intelligence

Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach

340 - Ion Androutsopoulos , Georgios Paliouras , Vangelis Karkaletsis 2000

We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method to automatically construct anti-spam filters with superior performance. We investigate thoroughly the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures. Both methods achieve very accurate spam filtering, clearly outperforming the keyword-based filter of a widely used e-mail reader.

Computation and Language, Information Retrieval, Machine Learning

A Simple Approach to Building Ensembles of Naive Bayesian Classifiers for Word Sense Disambiguation

63 - Ted Pedersen 2000

This paper presents a corpus-based approach to word sense disambiguation that builds an ensemble of Naive Bayesian classifiers, each of which is based on lexical features that represent co-occurring words in varying sized windows of context. Despite the simplicity of this approach, empirical results disambiguating the widely studied nouns line and interest show that such an ensemble achieves accuracy rivaling the best previously published results.

Computation and Language

BlonD: An Automatic Evaluation Metric for Document-level Machine Translation

85 - Yuchen Jiang , Shuming Ma , Dongdong Zhang 2021

Standard automatic metrics (such as BLEU) are problematic for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones nor can they identify the specific discourse phenomena that caused the translation errors. To address these problems, we propose an automatic metric BlonD for document-level machine translation evaluation. BlonD takes discourse coherence into consideration by calculating the recall and distance of check-pointing phrases and tags, and further provides comprehensive evaluation scores by combining with n-gram. Extensive comparisons between BlonD and existing evaluation metrics are conducted to illustrate their critical distinctions. Experimental results show that BlonD has a much higher document-level sensitivity with respect to previous metrics. The human evaluation also reveals high Pearson R correlation values between BlonD scores and manual quality judgments.

Computation and Language, Artificial Intelligence