Prasad p3


Published on September 6, 2007

Author: Waldarrama

Source: authorstream.com

Content

Finding Structure in Noisy Text: Topic Classification and Unsupervised Clustering

Rohit Prasad, Prem Natarajan, Krishna Subramanian, Shirin Saleem, and Rich Schwartz
{rprasad,pnataraj}@bbn.com
Presented by Daniel Lopresti, 8 January 2007

Outline
- Research objectives and challenges
- Overview of supervised classification using HMMs
- Supervised topic classification of newsgroup messages
- Unsupervised topic discovery and clustering
- Rejection of off-topic messages

Objectives
- Develop a system that performs topic-based categorization of newsgroup messages in two modes
- Mode 1 – Supervised classification: topics of interest to the user are known a priori to the system
  - Spot messages that are on topics of interest to a user
  - Requires rejecting 'off-topic' messages to ensure low false-alarm rates
- Mode 2 – Unsupervised classification: topics of interest to the user are not known
  - Discover topics in a large corpus without human supervision
  - Automatically organize/cluster the messages to support efficient navigation

Challenges Posed by Newsgroup Messages
- Text in newsgroup messages tends to be 'noisy'
  - Abbreviations and misspellings
  - Colloquial (non-grammatical) language
  - Discursive structure with frequent switching between topics
  - Lack of context in some messages makes it impossible to understand a message without access to the complete thread
- Supervised classification requires annotating newsgroup messages with a set of topic labels
  - Every non-trivial message contains multiple topics
  - No completely annotated corpus of newsgroup messages exists; by complete annotation we mean tagging each message with ALL relevant topics

Supervised Topic Classification
- Example input: 'President Clinton dumped his embattled Mexican bailout today. Instead, he announced another plan that doesn't need congressional approval.'
- Pipeline: text, audio, or images (from CNN, NBC, CBS, NPR, etc.) pass through ASR or OCR, and a topic classifier then assigns topics such as 'Clinton, Bill', 'Mexico', 'Money', and 'Economic assistance, American'
- Applications: news sorting, information retrieval, detection of key events, improved speech recognition
- Topic models are trained on a topic-labeled broadcast news corpus (e.g., Primary Source Media: 4-5 topics per story, 40,000 stories per year, 5,000 topics)

OnTopic™ HMM Topic Model
- A probabilistic hidden Markov model (HMM) that attempts to capture the generation of a story
- Assumes a story can be on multiple topics, with different words related to different topics
- Uses an explicit state for General Language because most words in a story are not related to any topic
- Scales to a large number of topics and requires only topic labels for each story to train the model
- Language-independent methodology

Experiment Setup
- Performed experiments with two newsgroup corpora:
- Automated Front End (AFE) newsgroup corpus collected by Washington Univ.
- 20 Newsgroup (NG) corpus from http://people.csail.mit.edu/jrennie/20Newsgroups/
- Assumed the name of the newsgroup is the ONLY associated topic for each message
  - Although cost-effective, this assumption leads to inaccuracies in estimating system performance
  - Messages typically contain multiple topics, some of which may be related to the dominant theme of another newsgroup

AFE Newsgroups Corpus
- Google newsgroups data collected by Washington University from 12 diverse newsgroups
- Messages posted to 11 newsgroups are considered in-topic; all messages posted to the 'talk.origins' newsgroup are considered off-topic
- Message headers were stripped to exclude the newsgroup name from training and test messages
- Split the corpus into training, test, and validation sets according to the distribution specified in the config.xml file provided by Washington University
  - But since the filenames were truncated, we could not select the same messages as Washington University

Closed-set Classification Accuracy on AFE
- Trained OnTopic models on 11 newsgroups
  - Excluded messages from the talk.origins newsgroup because they are 'off-topic' w.r.t. the topics of interest
- Used stemming since some newsgroups had only a few training messages
- Classified 376 in-topic messages
- Achieved overall top-choice accuracy of 91.2%
  - Top-choice accuracy: percentage of times the top-choice (best) topic returned by OnTopic was the correct answer
- Top-choice accuracy was worse on newsgroups with fewer training examples

Closed-set Classification Accuracy (Contd.)
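The top-choice decision and the accuracy metric defined above can be sketched with a toy two-state mixture in the spirit of OnTopic's General Language state. This is a minimal illustration, not the deck's actual model: the vocabularies, probabilities, and the gl_weight below are invented.

```python
import math

# Toy per-topic unigram distributions plus a General Language (GL)
# distribution that absorbs topic-neutral words (all numbers invented).
TOPIC_WORDS = {
    "economy":  {"bailout": 0.3, "money": 0.3, "plan": 0.2, "the": 0.2},
    "politics": {"clinton": 0.3, "congress": 0.3, "plan": 0.2, "the": 0.2},
}
GL_WORDS = {"the": 0.5, "plan": 0.2, "money": 0.1, "clinton": 0.1,
            "bailout": 0.05, "congress": 0.05}

def score(words, topic, gl_weight=0.7, floor=1e-6):
    """Log-likelihood of a message under a mixture of the GL state and
    one candidate topic, mimicking OnTopic's explicit GL state."""
    return sum(math.log(gl_weight * GL_WORDS.get(w, floor)
                        + (1 - gl_weight) * TOPIC_WORDS[topic].get(w, floor))
               for w in words)

def top_choice(words):
    """Best-scoring topic for a message."""
    return max(TOPIC_WORDS, key=lambda t: score(words, t))

def top_choice_accuracy(messages, references):
    """Fraction of messages whose top-choice topic matches the reference
    label -- the deck's 'top-choice accuracy'."""
    hits = sum(top_choice(m) == r for m, r in zip(messages, references))
    return hits / len(references)
```

On this toy scale, the reported 91.2% on 376 in-topic AFE messages corresponds to roughly 343 correct top choices.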
'20 Newsgroups' Corpus
- Downloaded the 20 Newsgroups corpus ('20 NG') from http://people.csail.mit.edu/jrennie/20Newsgroups/
- Corpus characteristics:
  - Messages from 20 newsgroups with an average of 941 messages per newsgroup
  - Average of 350 threads in each newsgroup
  - Average message length of 300 words (170 words after headers and 'replied-to' text are excluded)
  - Some newsgroups are similar – the 20 newsgroups span 6 broad subjects
- Data pre-processing:
  - Stripped message headers, e-mail IDs, and signatures to exclude newsgroup-related information
  - Corpus was split into training, development, and validation sets for topic classification experiments

Distribution of Messages Across Newsgroups

Organization of Newsgroups by Subject Matter

Splits for Training and Testing
- 80:20 split between training and test/validation sets for three different partitioning schemes
  - Thread partitioning: an entire thread is assigned to one of the training, development, or validation sets
  - Chronological partitioning: messages in each thread are split between training, test, and validation; the first 80% go to training and the rest to test and validation
  - Random partitioning: 80:20 split between training and test/validation, without regard to thread or chronology
- Prior work by other researchers on 20 NG used random partitioning

Closed-set Classification Results
- Trained an OnTopic model set consisting of 20 topics
- Classified 2K test messages
- Two test conditions: one where 'replied-to' text (from previous messages) is included, and one where it is stripped from the test message
- Classification accuracy is low due to the following:
  - Significant subject overlap between newsgroups
  - Lack of useful a priori probabilities due to the almost uniform distribution of topics, unlike the AFE newsgroup data

Detailed Results for Thread Partitioned

Detailed Results for Chronological

Manual Clustering and Human Review
- Manually clustered the newsgroups into 12 topics after reviewing the content of training messages
- Recomputed top-choice classification accuracy using the cluster information
- Effect of multiple topics per message and an incomplete reference topic label set:
  - Manually reviewed messages from the 4 categories with the lowest performance for the 'Chronological' split
  - Accuracy increases to 88.0% (from 84.8%) after manual rescoring

Cluster Table

The Problem
- Why unsupervised topic discovery and clustering?
  - Topics of interest may not be known a priori
  - It may not be feasible to annotate documents with a large number of topics
- Goals:
  - Discover topics and meaningful topic names
  - Cluster topics instead of messages to automatically organize messages/documents for navigation at multiple levels

Unsupervised Topic Discovery [3]
- Add phrases: extract frequent phrases using an MDL criterion and names using IdentiFinder™ (produces augmented docs)
- Initial topics for each doc: select words/phrases with the highest tf-idf; keep topic names that occur in >3 documents (produces topic names)
- Topic training (the key step): associate many words/phrases with each topic via EM training in the OnTopic™ system (produces topic models)
- Topic classification: assign topics to all documents (produces the topic-annotated corpus)

[3] S. Sista et al. An Algorithm for Unsupervised Topic Discovery from Broadcast News Stories. In Proceedings of ACM HLT, San Diego, CA, 2002.
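The "initial topics" step of UTD, selecting each document's highest tf-idf terms as candidate topic names, can be sketched as follows. The toy documents are invented, and the real pipeline also adds MDL-selected phrases and IdentiFinder names before this step.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Pick each document's highest tf-idf words as candidate topic
    names, roughly the 'initial topics' step of UTD."""
    df = Counter()                     # document frequency per word
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    keywords = []
    for doc in docs:
        tf = Counter(doc)              # term frequency within the doc
        scored = {w: tf[w] * math.log(n / df[w]) for w in tf}
        keywords.append(sorted(scored, key=scored.get, reverse=True)[:top_k])
    return keywords

# Invented toy documents (already tokenized).
docs = [["tax", "cut", "senate", "vote"],
        ["senate", "vote", "bill"],
        ["tax", "rates", "economy"]]
keywords = tfidf_keywords(docs)
```

Words that appear in every document get an idf of zero, so shared vocabulary never becomes a topic name, which is the intended effect of the tf-idf filter.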
UTD output (English document)
- News source: Associated Press – November 2001

UTD output (Arabic document)
- News source: Al-Hayat (Aug–Nov 2001)

Unsupervised Topic Clustering
- Organize the automatically discovered topics (rather than documents) into a hierarchical topic tree
  - Leaves of the topic tree are the fine-grained topics discovered by the UTD process
  - Intermediate nodes are logical collections of topics
- Each node in the topic tree has a set of messages associated with it
  - A message can be assigned to multiple topic clusters by virtue of the multiple topic labels assigned to it by the UTD process
  - This overcomes the single-cluster assignment of documents prevalent in most document clustering approaches
- The resulting topic tree enables browsing a large corpus at multiple levels of granularity; the same message can be reached through different sequences of navigation steps

Topic Clustering Algorithm
- Agglomerative clustering organizes topics into a hierarchical tree structure
- Topic clustering algorithm:
  - Step 1: Assign each topic to its own individual cluster
  - Step 2: For every pair of clusters, compute the distance between the two clusters
  - Step 3: If the smallest distance is below a threshold, merge the closest pair into a single cluster and go to Step 2; otherwise, stop clustering
- Modification: merge more than two clusters at each iteration to limit the number of levels in the tree
- Other constraints can also be added, e.g., limiting the branching factor and the number of levels
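Steps 1-3 of the clustering algorithm above can be sketched as follows. This is a minimal sketch under two stated assumptions: symmetric KL (J-) divergence between topic word distributions is used as the cluster distance, and a merged cluster's distribution is formed by simple averaging; the toy topics are invented.

```python
import math

def j_divergence(p, q, floor=1e-9):
    """Symmetric KL (J-) divergence between two word distributions,
    given as dicts mapping word -> probability."""
    vocab = set(p) | set(q)
    return sum((p.get(w, floor) - q.get(w, floor))
               * math.log(p.get(w, floor) / q.get(w, floor))
               for w in vocab)

def agglomerate(topics, threshold):
    """Step 1: each topic is its own cluster.  Steps 2-3: repeatedly
    merge the closest pair while its distance is below threshold."""
    clusters = [([name], dist) for name, dist in topics.items()]
    while len(clusters) > 1:
        pairs = [(j_divergence(a[1], b[1]), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        best, i, j = min(pairs)
        if best >= threshold:
            break                      # no pair is close enough; stop
        names = clusters[i][0] + clusters[j][0]
        vocab = set(clusters[i][1]) | set(clusters[j][1])
        merged = {w: (clusters[i][1].get(w, 0) + clusters[j][1].get(w, 0)) / 2
                  for w in vocab}      # averaging is an assumption here
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((names, merged))
    return [c[0] for c in clusters]

# Invented toy topics: two hockey-like topics and one football topic.
topics = {"nhl": {"hockey": 0.6, "ice": 0.4},
          "nfl": {"football": 0.6, "field": 0.4},
          "hockey2": {"hockey": 0.5, "ice": 0.5}}
result = agglomerate(topics, 1.0)
```

The deck's modification (merging up to 4 clusters per iteration, capping branching factor and depth) would replace the single pairwise merge with a grouped merge, but the stopping rule stays the same.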
Distance Metrics for Topic Clustering
- Metrics computed from topic co-occurrences:
  - Co-occurrence probability
  - Mutual information
- Metrics computed from support/key word distributions:
  - Support-word overlap between Ti and Tj
  - Kullback-Leibler (KL) and J-divergence between the two probability mass functions

Clustering Example

Evaluation of UTC
- Initial topic clustering experiments were performed on the 20 NG corpus
  - 3,343 topics discovered from 19K messages
  - Allowed a maximum of 4 topics to be clustered at each iteration
- Evaluation of UTC has been mostly subjective, with a few objective metrics used to evaluate the clustering
  - Clustering rate: the rate of increase of clusters with more than one topic seems well correlated with subjective judgments
- A combination of J-divergence and topic co-occurrence seems to produce the most uniform, logical clusters

Key Statistics of the UTC Topic Tree for the 20 NG Corpus
- Measured some key features of the topic tree that could have a significant impact on user experience

Screenshot of the UTC-based Message Browser

Off-topic Message Rejection
- A significant fraction of messages processed by the topic classification system is likely to be off-topic
- Rejection problem: design a binary classifier for accepting or rejecting the top-choice topic
  - Accepting a message means asserting that the message contains the top-choice topic
  - Rejecting a message means asserting that the message does not contain the top-choice topic

Rejection Algorithm
- Use the General Language (GL) topic model as the model for off-topic messages
- Compute the ratio of the log-posterior of the top-choice topic Tj and the GL topic as a relevance score
- Accept the top-choice topic Tj if the relevance score exceeds a threshold
- The threshold can be topic-independent or topic-specific

Parametric Topic-Specific Threshold Estimation
- Compute the empirical distribution (mean m and standard deviation s) of the log-likelihood-ratio score over a large corpus of off-topic documents
  - Can assume most messages in the corpus are off-topic
  - Gives more reliable statistics than if computed over on-topic messages
- Normalize the score for a test message before comparing it to a topic-independent threshold
  - Can be thought of as a transformation of the topic-independent threshold rather than score normalization
- This amounts to a null-hypothesis test using the score distribution of the off-topic messages
  - Example histogram of normalized test scores (y-axis scaled to magnify the view of on-topic messages) shows separated off-topic and on-topic score distributions
  - A message that is not off-topic is on-topic; a message several standard deviations away from the off-topic mean is very likely to be on-topic

Non-Parametric Threshold Estimation
- Accept the top-choice topic Tj if its score exceeds a topic-specific threshold a(Tj)
- Select a(Tj) by constrained optimization

Experiment Configuration
- In-topic messages come from 14 newsgroups of the 20 NG corpus
  - Messages from six newsgroups were discarded due to significant subject overlap with off-topic messages
- Off-topic/chaff messages come from two sources:
  - The talk.origins newsgroup from the AFE corpus
  - A large collection of messages from 4 Yahoo! groups
- Used jack-knifing to estimate rejection thresholds on the Train+Dev set and then applied them to the validation set

Comparison of Threshold Estimation Techniques

Conclusions
- HMM-based topic classification delivers performance on the 20 NG and AFE corpora comparable to [1], [2]
- Closed-set classification accuracy on 20 NG data after clustering is slightly worse than on AFE data
  - The key reason is significant subject overlap between the newsgroups
  - Clustered categories still exhibited significant subject overlap across clusters
  - The data set creators assign only six different subjects (topics) to the 20 NG set

[1] J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In Proceedings of ICML 2003, Washington, D.C., 2003.
[2] S. Eick, J. Lockwood, R. Loui, J. Moscola, C. Kastner, A. Levine, and D. Weishar. Transformation Algorithms for Data Streams. In Proceedings of IEEE AAC, March 2005.

Conclusions (Contd.)
- Novel estimation of topic-specific thresholds outperforms a topic-independent threshold for rejecting off-topic messages
- Introduced the novel concept of unsupervised topic clustering for organizing messages
- Built a demonstration prototype for topic-tree-based browsing of a large corpus of archived messages
- Future work will focus on measuring the utility of UTC for user experience and on objective metrics to evaluate UTC performance
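The parametric topic-specific threshold estimation behind that conclusion can be sketched as follows: fit per-topic mean/std of the relevance score on a corpus assumed to be mostly off-topic, then normalize a test score before comparing it to one topic-independent threshold. The chaff scores and the z-threshold of 3.0 below are illustrative assumptions, not values from the deck.

```python
import statistics

def fit_off_topic_stats(scores_by_topic):
    """Per-topic (mean, std) of the relevance score over a corpus
    assumed to be mostly off-topic."""
    return {t: (statistics.mean(s), statistics.stdev(s))
            for t, s in scores_by_topic.items()}

def accept(topic, score, stats, z_threshold=3.0):
    """Normalize the test score by the topic's off-topic statistics and
    compare it to a single topic-independent threshold; in effect a
    null-hypothesis test against the off-topic score distribution."""
    m, s = stats[topic]
    return (score - m) / s >= z_threshold

# Illustrative relevance scores of chaff messages for one topic.
chaff = {"hockey": [-1.2, -0.8, -1.0, -0.9, -1.1]}
stats = fit_off_topic_stats(chaff)
```

Because each topic's score is reduced to a z-score against its own off-topic distribution, a single threshold behaves like a per-topic threshold, which is the transformation the deck describes.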
