IDAR26


Published on August 26, 2007

Author: Malbern

Source: authorstream.com

Content

Topic Oriented Semi-supervised Document Clustering
Jiangtao Qiu, Changjie Tang
Computer School, Sichuan University

OUTLINE
1. Introduction
2. Motivation
3. Topic Semantic Annotation
4. Optimizing Hierarchical Clustering
5. Experiments
6. Conclusion

1. INTRODUCTION
We are developing a text-mining prototype system. It aims to mine associative events, generate hypotheses, and so on. At present we have completed content extraction from web pages, document classification, and document clustering.

Prototype pipeline: web pages → collect text → preprocess (remove noise, extract feature vectors) → classification and clustering → derive the needed texts → mining (associative events, etc.) → presentation.

2. MOTIVATION
Traditional document clustering is usually treated as unsupervised learning. The general method:
1. Extract feature vectors.
2. Compute similarity among the vectors.
3. Build a dissimilarity matrix.
4. Run a clustering algorithm.

The new challenge: can we group documents according to the user's need?

3. Topic Semantic Annotation
We propose a new semi-supervised document clustering approach that can group documents according to the user's need: topic-oriented document clustering. Three issues need to be addressed:
(1) How do we represent the user's need?
(2) How do we represent the relationship between the need and the documents?
(3) How do we evaluate the similarity of documents with respect to the need?

3.1 How do we represent the user's need?
We propose a multiple-attribute topic structure to represent the user's need. A topic is a user's focus, represented by a word. We use the concept set C of an ontology as the attribute set: the attributes of a topic are a collection of concepts {p1, ..., pn} ⊆ C, and these attributes describe the topic well.

For example, suppose we are collecting documents about Yao Ming and the corpus mentions several different people named Yao Ming. We want to group the documents by person, so we set 'Yao Ming' as the topic and choose background, place, and named entity as its attributes, for three reasons:

1. Many words have a background; cancer, for instance, has a medicine background. When words such as coach and stadium appear in a document, it can be inferred that the people involved in that document are related to sport. We have modified the ontology to add a background to its words.

2. Place distinguishes different people well: the places where people have grown up and lived may separate one person from another.

3. Named entities may be used to describe the semantics of the topic. People, institution, and organization names that do not occur in the dictionary are called named entities.
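To make the structure of 3.1 concrete, here is a minimal sketch in Python. The slides only specify the shape (a topic word plus a set of ontology concepts {p1, ..., pn} ⊆ C); the class and field names below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    """A user's need: a focus word plus the ontology concepts that describe it."""
    focus: str                                      # the topic word, e.g. "Yao Ming"
    attributes: list = field(default_factory=list)  # concepts {p1, ..., pn} drawn from C

# The running example from the slides:
yao_ming = Topic(focus="Yao Ming",
                 attributes=["background", "place", "named entity"])
```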
3.2 How do we represent the relationship between the need and the documents?
We represent the relationship between the topic and the documents by annotating topic semantics for the documents. Given a topic T with attributes p1, ..., pn and a document S with words {t1, ..., tn}: if a word ti can be mapped through the ontology to an attribute pj, and ti is semantically correlated with T, then ti is inserted into the attribute vector Pj. (We call ti and T semantically correlated when the distance between ti and T is not larger than a threshold.) When all words have been explored, we obtain the attribute vectors P1 = {..., ti}, ..., Pn = {..., tm}. We call this process topic-semantic annotation; a sketch follows.
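A minimal sketch of the annotation loop just described. Since the slides do not specify the ontology API, the `ontology` object and its two methods below are stand-in assumptions.

```python
from collections import Counter

def annotate(topic, words, ontology, threshold):
    """Topic-semantic annotation: build one frequency vector per attribute.

    Assumed, hypothetical ontology interface (not given in the slides):
      - ontology.attribute_of(word, attributes): the attribute pj that
        `word` maps to, or None if it maps to no attribute
      - ontology.distance(word, focus): semantic distance between two words
    """
    vectors = {p: Counter() for p in topic.attributes}
    for t in words:
        p = ontology.attribute_of(t, topic.attributes)
        # keep t only if it maps to some attribute AND is semantically
        # correlated with the topic (distance within the threshold)
        if p is not None and ontology.distance(t, topic.focus) <= threshold:
            vectors[p][t] += 1
    return vectors  # e.g. {"background": Counter({"sport": 4}), ...}
```

Run on the example sentence below, this loop would produce the vectors P1, P2, P3 shown there.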
Example: "Houston Rockets center Yao Ming grabs a rebound in front of Detroit Pistons forward Rasheed Wallace and Rockets forward Shane Battier during the first half of their NBA game in Auburn Hills, Michigan."

Topic: Yao Ming. Attributes: p1 = background, p2 = place, p3 = named entity.
Feature vectors:
P1 = {<sport, 4>}
P2 = {<Houston, 1>, <Michigan, 1>, <Detroit, 1>}
P3 = {<Rasheed Wallace, 1>, <Shane Battier, 1>, <Auburn Hills, 1>}

3.3 How do we evaluate the similarity of documents with respect to the need?
After annotation, each document is represented by its attribute vectors V1, ..., Vn; the similarity of two documents d1 and d2 is evaluated over their corresponding attribute vectors.
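The slides leave the similarity measure itself unspecified. Purely as a hedged illustration, the sketch below compares two annotated documents attribute by attribute with cosine similarity and averages the results; both the per-attribute cosine and the unweighted average are assumptions, not the authors' formula.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse frequency vectors (dict or Counter)."""
    dot = sum(w * b.get(term, 0) for term, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_similarity(vectors1, vectors2, attributes):
    """Similarity of two annotated documents under one topic: the average
    of the per-attribute similarities (an assumed combination rule)."""
    sims = [cosine(vectors1.get(p, {}), vectors2.get(p, {})) for p in attributes]
    return sum(sims) / len(sims) if sims else 0.0
```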
4. Optimizing Hierarchical Clustering
Motivation: current clustering algorithms often require the user to set parameters such as the number of clusters, a radius, or a density threshold. If users lack the experience to choose these parameters, it is difficult to produce a good clustering solution.

Solution:
1. Build a clustering tree with a hierarchical clustering algorithm.
2. Recommend the best clustering solution on the tree to the user by means of a criterion function.

The two extremes of the tree, all samples in one cluster and each sample in its own cluster, are the worst solutions. Combining inner-cluster (within-cluster) distance with inter-cluster distance, we propose a criterion function, so the best clustering solution can be provided to the user without any parameter setting. Building the tree bottom-up over samples A, B, C, D, E yields levels 1 through 5; the recommended solution is the level with the smallest DistanceSum. (A sketch of this selection step appears at the end of this transcript.)

5. Experiments
To the best of our knowledge, topic-oriented document clustering has not been well addressed in existing work, so our experiments compare the new approach with the unsupervised clustering approach.

Dataset: web pages involving three different people named 'Li Ming'. Purpose: cluster the documents by person.

Experiment 1: comparison with TF-IDF on time performance and on dimensionality.

Experiment 2:
1. Build dissimilarity matrices with both the new approach and the traditional approach.
2. Run document clustering on each matrix.
3. Compare the clustering solutions using the F-measure.

(The results of both experiments were presented as charts in the original slides.)

6. Conclusion
Experiments show that the new approach is feasible and effective. However, to further improve performance, some work remains to be done, such as improving the accuracy of named-entity recognition.

Thanks! Any questions?
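As referenced in Section 4, here is a minimal sketch of the best-level selection on the clustering tree. `levels` is assumed to be the list of partitions read off a bottom-up (agglomerative) tree, one partition per level; the transcript does not reproduce the actual criterion function, so the DistanceSum below (within-cluster scatter plus the reciprocal of total centroid separation) is only an illustrative stand-in.

```python
import numpy as np

def distance_sum(clusters):
    """Illustrative DistanceSum for one level of the clustering tree.

    The paper combines inner-cluster with inter-cluster distance, but its
    exact formula is not given in this transcript; this combination is an
    assumption. `clusters` is a list of 2-D arrays, one row per sample.
    """
    centroids = [c.mean(axis=0) for c in clusters]
    within = sum(np.linalg.norm(c - m, axis=1).sum()
                 for c, m in zip(clusters, centroids))
    between = sum(np.linalg.norm(mi - mj)
                  for i, mi in enumerate(centroids)
                  for mj in centroids[i + 1:])
    return within + 1.0 / between if between else float("inf")

def best_level(levels):
    """Recommend the tree level with the smallest DistanceSum. The slides
    call the two extremes (one big cluster, all singletons) the worst
    solutions, so only intermediate levels are scored."""
    n = sum(len(c) for c in levels[0])
    candidates = [lv for lv in levels if 1 < len(lv) < n]
    return min(candidates, key=distance_sum)
```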
