spam talk for casa marketing draft5

Published on December 28, 2007

Author: Emma

Source: authorstream.com

Content

(Naive) Bayesian Text Classification for Spam Filtering
David D. Lewis, Ph.D.
Ornarose, Inc. & David D. Lewis Consulting
www.daviddlewis.com
Presented at the ASA Chicago Chapter Spring Conference, Loyola University, May 7, 2004

Menu
- Spam
- Spam Filtering
- Classification for Spam Filtering
- Classification
- Bayesian Classification
- Naive Bayesian Classification
- Naive Bayesian Text Classification
- Naive Bayesian Text Classification for Spam Filtering
- (Feature Extraction for) Spam Filtering
- Text Classification (for Marketing)
- (Better) Bayesian Classification

Spam
- Unsolicited bulk email or, in practice, whatever email you don't want
- Large fraction of all email sent: Brightmail est. 64%, Postini est. 77%
- Still growing
- Estimated cost to US businesses exceeded $30 billion in 2003

Approaches to Spam Control
- Economic (email pricing, ...)
- Legal (CAN-SPAM, ...)
- Societal pressure (trade groups, ...)
- Securing infrastructure (email servers, ...)
- Authentication (challenge/response, ...)
- Filtering

Spam Filtering
- Intensional (feature-based) vs. extensional (white/blacklist-based)
- Applied at sender vs. receiver
- Applied at email client vs. mail server vs. ISP

Statistical Classification
- Define classes of objects
- Specify a probability distribution model connecting classes to observable features
- Fit the parameters of the model to data
- Observe features on inputs and compute the probability of class membership
- Assign the object to a class

[Slide 7: diagram — feature extraction feeds a classifier, whose output goes to an interpreter]

Classification for Spam Filtering
- Define classes
- Extract features from header and content
- Train a classifier
- Classify each message and process it: block the message, insert a tag, put it in a folder, etc.

Two Classes of Classifier
- Generative (naive Bayes, LDA, ...): model the joint distribution of class and features; derive the class probability by Bayes' rule
- Discriminative (logistic regression, CART, ...): model the conditional distribution of the class given known feature values; the model directly estimates the class probability

Bayesian Classification (1)
1. Define classes
2. Specify a probability model
2b. ...and a prior distribution over parameters
3. Find the posterior distribution of the model parameters, given data
4. Compute class probabilities using the posterior distribution (or an element of it)
5. Classify the object

Bayesian Classification (2) = "Naive"/"Idiot"/"Simple" Bayes
- A particular generative model
- Assumes independence of observable features within each class of messages
- Bayes' rule is used to compute the class probability
- Might or might not use a prior on model parameters

Naive Bayes for Text Classification - History
- Maron (JACM, 1961) - automated indexing
- Mosteller and Wallace (1964) - author identification
- van Rijsbergen, Robertson, Sparck Jones, Croft, Harper (early 1970s) - search engines
- Sahami, Dumais, Heckerman, Horvitz (1998) - spam filtering

Bayesian Classification (3)
- Graham's "A Plan for Spam", and its mutant offspring...
- A naive Bayes-like classifier with unusual parameter estimation
- Widely used in spam filters
- Classic naive Bayes is superior when appropriately used

NB & Friends: Advantages
- Simple to implement: no numerical optimization, matrix algebra, etc.
- Efficient to train and use: fitting = computing means of feature values; easy to update with new data; equivalent to a linear classifier, so fast to apply
- Binary or polytomous

NB & Friends: Advantages (cont.)
- Independence allows parameters to be estimated on different data sets, e.g. estimate content features from messages with headers omitted, and header features from messages with content missing

NB & Friends: Advantages (cont.)
- Generative model: comparatively good effectiveness with small training sets; unlabeled data can be used in parameter estimation (in theory)

NB & Friends: Disadvantages
- The independence assumption is wrong: absurd estimates of class probabilities, and the threshold must be tuned rather than set analytically
- Generative model: generally lower effectiveness than discriminative techniques (e.g. logistic regression); improving the parameter estimates can hurt classification effectiveness

Feature Extraction
- Convert the message to a feature vector
- Header: sender, recipient, routing, ... (possibly break up domain names)
- Text: words, phrases, and character strings become binary or numeric features
- URLs, HTML tags, images, ...

[Slide 21: example spam message. From: Sam Elegy <[email protected]>; To: [email protected]; Subject: you can buy [email protected]. Annotations: spamlike content in image form; irrelevant legitimate content that doubles as a hash buster; typographic variations; randomly generated name and email address]

Defeating Feature Extraction
- Misspellings, character set choice, and HTML games mislead the extraction of words
- Put content in images
- Forge headers (to avoid identification, but this also interferes with classification)
- Innocuous content to mimic the distribution in nonspam
- Hash busters (zyArh73Gf) clog dictionaries

Survival of the Fittest
- Filter designers get to see spam
- Spammers use spam filters
- An unprecedented arms race for a statistical field
- Countermeasures mostly target feature extraction, not modeling assumptions

Miscellany
- Getting legitimate bulk mail past spam filters
- Other uses of text classification in marketing
- Frontiers in Bayesian classification

Getting Legit Bulk Email Past Filters
- Test email against several filters: send to accounts on multiple ISPs, and use multiple client-based filters if particularly concerned
- Coherent content, correctly spelled
- Non-tricky headers and markup
- Avoid spam keywords where possible
- Don't use spammer tricks

Text Classification in Marketing
- Routing incoming email: responses to promotions; detecting opportunities for selling (automated response is sometimes possible)
- Analysis of text/mixed data on customers, e.g. customer or CSR comments
- Content analysis: focus groups, email, chat, blogs, news, ...

Better Bayesian Classification
- Discriminative: logistic regression with informative priors; sharing strength across related problems; calibration and confidence of predictions
- Generative: Bayesian networks/graphical models; use of unlabeled and partially labeled data
- Hybrid

BBR
- Logistic regression with informative priors: a Gaussian prior gives ridge logistic regression; a Laplace prior gives lasso logistic regression
- Sparse data structures and a fast optimizer: 10^4 cases, 10^5 predictors, a few seconds!
- Accuracy competitive with SVMs
- Free for research use: www.stat.rutgers.edu/~madigan/BBR/
- Joint work with Madigan & Genkin (Rutgers)

[Slide 29: figure — Gaussian vs. Laplace prior]

Future of Spam Filtering
- More attention to training data selection and personalization
- Image processing
- Robustness against word variations
- More linguistic sophistication
- Replacing naive Bayes with better learners
- Keep hoping for an economic cure

Summary
- By volume, spam filtering is easily the biggest application of text classification, and possibly of supervised learning
- Filters have helped a lot
- Naive Bayes is just a starting point
- There are other interesting applications of Bayesian classification
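To make the naive Bayes recipe in the slides concrete — estimate per-class priors and per-feature frequencies, then combine them with Bayes' rule under the independence assumption — here is a minimal sketch. The toy corpus, the Bernoulli (word-presence) model, and the add-one smoothing constant are illustrative assumptions, not details from the talk.

```python
import math
from collections import Counter

def train_nb(messages, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model: a prior per class plus
    per-class word-presence probabilities with add-alpha smoothing."""
    vocab = {w for m in messages for w in m.lower().split()}
    counts = {c: Counter() for c in set(labels)}
    totals = Counter(labels)
    for m, c in zip(messages, labels):
        counts[c].update(set(m.lower().split()))
    model = {}
    for c in counts:
        prior = totals[c] / len(labels)
        probs = {w: (counts[c][w] + alpha) / (totals[c] + 2 * alpha)
                 for w in vocab}
        model[c] = (prior, probs)
    return model, vocab

def classify_nb(model, vocab, message):
    """Score each class by log prior plus per-word log likelihoods
    (Bayes' rule up to a shared normalizing constant), multiplying
    the terms only because of the independence assumption."""
    words = set(message.lower().split())
    scores = {}
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for w in vocab:
            p = probs[w]
            score += math.log(p) if w in words else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

spam = ["buy cheap pills now", "cheap pills buy now", "win money now"]
ham = ["meeting agenda attached", "lunch tomorrow", "see agenda for the meeting"]
model, vocab = train_nb(spam + ham, ["spam"] * 3 + ["ham"] * 3)
print(classify_nb(model, vocab, "buy pills"))         # -> spam
print(classify_nb(model, vocab, "meeting tomorrow"))  # -> ham
```

Note that fitting really is just counting, as the advantages slide says, and the final score is a sum of per-word weights — i.e. a linear classifier in log space.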
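The feature-extraction step described in the deck (header features such as the sender's domain, possibly broken into parts, plus binary word features from subject and body) might be sketched as follows using Python's standard email parser; the feature-name prefixes and the example message are hypothetical.

```python
from email.parser import Parser

def extract_features(raw_message):
    """Turn a raw RFC 822 message into a set of binary features:
    header-derived features (sender domain, split into parts) and
    content-derived features (lowercased words from subject and body)."""
    msg = Parser().parsestr(raw_message)
    features = set()
    sender = msg.get("From", "")
    if "@" in sender:
        domain = sender.rsplit("@", 1)[1].strip(">").lower()
        features.add("from_domain:" + domain)
        for part in domain.split("."):  # break up domain names
            features.add("from_domain_part:" + part)
    for word in msg.get("Subject", "").lower().split():
        features.add("subject:" + word)
    body = msg.get_payload()
    if isinstance(body, str):  # skip multipart messages in this sketch
        for word in body.lower().split():
            features.add("body:" + word)
    return features

raw = """From: Sam Elegy <sam@example.com>
Subject: you can buy

cheap pills now
"""
feats = extract_features(raw)
print(sorted(feats))
```

Prefixing each feature with its origin ("subject:" vs. "body:") keeps header and content evidence separate, which is what lets the parameters for the two be estimated on different data sets, as noted above.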
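The BBR slide's point that a Laplace prior yields lasso (L1-penalized) logistic regression can be illustrated with a small proximal-gradient sketch — this is not BBR's actual optimizer, and the learning rate, penalty weight, and toy data are all assumptions. The soft-threshold step is what drives weights of uninformative features exactly to zero.

```python
import math

def fit_lasso_logreg(X, y, lam=0.1, lr=0.1, epochs=200):
    """MAP logistic regression with a Laplace prior (L1 penalty):
    a gradient step on the average log loss, then soft-thresholding,
    which zeroes small weights and yields a sparse model."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(d):
                grad[j] += (p - yi) * xi[j]
        for j in range(d):
            w[j] -= lr * grad[j] / n
            # proximal (soft-threshold) step for the L1 penalty
            w[j] = math.copysign(max(abs(w[j]) - lr * lam, 0.0), w[j])
    return w

# Feature 0 perfectly predicts the label; feature 1 is independent noise.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
w = fit_lasso_logreg(X, y)
print(w)  # w[1] is pruned to zero by the Laplace/L1 penalty
```

Compare this with a Gaussian prior (ridge), whose quadratic penalty only shrinks weights toward zero without ever making them exactly zero — the contrast the Gaussian-vs.-Laplace figure in the deck is drawing.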
