Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach


Published by: Ion Androutsopoulos

Publication date: 2000

Research field: Informatics Engineering

Language: English

Authors: Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis

Computation and Language, Information Retrieval, Machine Learning


We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method for automatically constructing anti-spam filters with superior performance. We thoroughly investigate the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures. Both methods achieve very accurate spam filtering, clearly outperforming the keyword-based filter of a widely used e-mail reader.
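As a rough illustration of the kind of filter the abstract describes, the sketch below trains a small Naive Bayesian classifier on binary word-presence attributes and flags a message as spam only when the estimated spam probability exceeds a threshold. The toy messages, the Laplace smoothing, and the `threshold` parameter are assumptions made for illustration. They are not taken from the paper, and its attribute selection and cost-sensitive evaluation measures are not reproduced here.

```python
import math
from collections import Counter

# Hypothetical toy training data: (set of words in the message, label).
TRAIN = [
    ({"cheap", "pills", "offer"}, "spam"),
    ({"win", "money", "offer"}, "spam"),
    ({"meeting", "agenda", "monday"}, "legit"),
    ({"project", "report", "monday"}, "legit"),
]

def train(messages):
    """Count messages per class and, per class, messages containing each word."""
    class_counts = Counter(label for _, label in messages)
    word_counts = {"spam": Counter(), "legit": Counter()}
    vocab = set()
    for words, label in messages:
        word_counts[label].update(words)  # a set, so each word counts once per message
        vocab |= words
    return class_counts, word_counts, vocab

def spam_probability(model, words):
    """Estimate P(spam | message) from binary word-presence attributes."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in ("spam", "legit"):
        n = class_counts[label]
        score = math.log(n / total)  # class prior
        for w in vocab:
            # Laplace-smoothed P(word present | class); smoothing is an assumption.
            p = (word_counts[label][w] + 1) / (n + 2)
            score += math.log(p if w in words else 1.0 - p)
        log_scores[label] = score
    # Normalize in log space to avoid underflow.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp["spam"] / (exp["spam"] + exp["legit"])

def is_spam(model, words, threshold=0.5):
    """Flag as spam only above the threshold (a hypothetical cost-related knob)."""
    return spam_probability(model, words) > threshold

model = train(TRAIN)
print(is_spam(model, {"cheap", "offer"}))
print(is_spam(model, {"meeting", "monday"}))
```

Raising `threshold` makes the sketch more reluctant to flag a message, which is one simple way to reflect that blocking a legitimate message is usually costlier than letting a spam message through.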


Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos and Constantine D. Spyropoulos (2000)

The growing problem of unsolicited bulk e-mail, also known as spam, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in encrypted form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns and is part of a widely used e-mail reader.

Computation and Language, Information Retrieval, Machine Learning

Stacking classifiers for anti-spam filtering of e-mail

G. Sakkis, I. Androutsopoulos, G. Paliouras (2001)

We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, or spam, floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.

Computation and Language, Artificial Intelligence

An evaluation of Naive Bayesian anti-spam filtering

Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos (2000)

It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail (spam). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter's performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.

Computation and Language, Artificial Intelligence

A Simple Approach to Building Ensembles of Naive Bayesian Classifiers for Word Sense Disambiguation

Ted Pedersen (2000)

This paper presents a corpus-based approach to word sense disambiguation that builds an ensemble of Naive Bayesian classifiers, each of which is based on lexical features that represent co-occurring words in varying sized windows of context. Despite the simplicity of this approach, empirical results disambiguating the widely studied nouns "line" and "interest" show that such an ensemble achieves accuracy rivaling the best previously published results.

Computation and Language

Sentiment Analysis of Yelp Reviews: A Comparison of Techniques and Models

Siqi Liu (2020)

We use over 350,000 Yelp reviews on 5,000 restaurants to perform an ablation study on text preprocessing techniques. We also compare the effectiveness of several machine learning and deep learning models on predicting user sentiment (negative, neutral, or positive). For machine learning models, we find that using a binary bag-of-words representation, adding bi-grams, imposing minimum frequency constraints and normalizing texts have positive effects on model performance. For deep learning models, we find that using pre-trained word embeddings and capping maximum length often boost model performance. Finally, using macro F1 score as our comparison metric, we find simpler models such as Logistic Regression and Support Vector Machine to be more effective at predicting sentiments than more complex models such as Gradient Boosting, LSTM and BERT.
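To make the preprocessing choices named in this abstract concrete, here is a minimal sketch of a binary bag-of-words representation with bi-grams and a minimum document-frequency cutoff. The function names, the whitespace tokenizer, the lowercasing used as a stand-in for text normalization, and the `min_df` value are all assumptions for illustration, not the author's actual pipeline.

```python
from collections import Counter

def features(text, use_bigrams=True):
    """Lowercase, split on whitespace, and optionally append adjacent word pairs."""
    tokens = text.lower().split()
    feats = list(tokens)
    if use_bigrams:
        feats += [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return feats

def build_vocab(docs, min_df=2, use_bigrams=True):
    """Keep only features that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(features(doc, use_bigrams)))  # once per document
    return sorted(f for f, count in doc_freq.items() if count >= min_df)

def binary_vector(doc, vocab, use_bigrams=True):
    """1 if the feature appears in the document at all, else 0 (counts are ignored)."""
    present = set(features(doc, use_bigrams))
    return [1 if f in present else 0 for f in vocab]

# Hypothetical toy reviews.
docs = ["Great food", "great food and great service", "Bad service"]
vocab = build_vocab(docs, min_df=2)
print(vocab)
print([binary_vector(d, vocab) for d in docs])
```

Note that "great" occurs twice in the second review but still maps to 1, which is what distinguishes a binary representation from a count-based one.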

Computation and Language, Information Retrieval, Machine Learning