hsearch

Information about hsearch

Published on November 20, 2007

Author: Lassie

Source: authorstream.com

Content

Web Page Clustering Using Heuristic Search in the Web Graph:  Web Page Clustering Using Heuristic Search in the Web Graph Ron Bekkerman Shlomo Zilberstein James Allan Example application:  Example application Given a person name Find out everything about this person What is available in the Web Possible solution: Query a search engine with the person name Retrieve documents Cluster retrieved documents Build largest clusters possible Query::  “Michel Décary” Query: “Michel Décary” Query::  Query: “Michel Décary” Lawyer Computer scientist Chanson singer Query::  Query: “Michel Décary” Lawyer Computer scientist Chanson singer Query::  Query: “Michel Décary” Lawyer Computer scientist Chanson singer Query::  Query: “Michel Décary” Lawyer Computer scientist Chanson singer + Summary of observations:  Summary of observations Topical clustering is not enough Although not to be ignored Web graph topology should be exploited Close pages tend to be semantically related There’s a reason for hyperlinking page A → page B Be careful: arbitrary connections exist as well Apply beam search to find close pages Use heuristics to prune undesired branches Example: breadth first search:  Example: breadth first search Clustering by multi-agent search:  Clustering by multi-agent search Each page is represented by a Web agent Whose task is to traverse the Web graph And meet as many other agents as possible Real-world case:  Real-world case The Web is tightly interconnected About 70% agents meet after 3 search iterations Which is clearly an undesired outcome Heuristic 1:  Heuristic 1 Elimination of high-degree nodes Both high in-degree and high out-degree They often connect unrelated pages www.google.com Heuristic 2:  Heuristic 2 Person name sharing Expanded nodes share a hyperlink AND a person name (ignore too popular names) Jonathan Stern Jonathan Stern Heuristic 3:  Heuristic 3 Anchor text sharing Anchor texts often summarize the content of hyperlinked pages Same idea as in person name sharing But much simpler to implement No sophisticated information extraction needed Shallow parsing of HTML is enough Again, ignore too frequent anchor texts Like “Contact Us” or “Copyright” Algorithmic enhancement:  Algorithmic enhancement Unpleasant artifact: too long connections Too weak semantic relationships Proposed solution: keep track of cluster’s diameter Start with a tightly connected component Add pages found within one hop Experimentation domains:  Experimentation domains Web appearance disambiguation Given pages retrieved on N people names From one social network Filter out pages that refer to their unrelated namesakes Clustering of Web search results Represent ranked lists of retrieved documents as clusters of semantically related documents Bekkerman & McCallum, WWW-05 Hearst & Pedersen, SIGIR-96 Slide17:  Disambiguation dataset 12 names out of Melinda Gervasio’s social network Disambiguation results:  Disambiguation results SHS – sequential heuristic search (basic algorithm) IHS – incremental heuristic search With the enhancement of diameter tracking h/d – high degree node elimination names – person name heuristic Each point: one iteration of search Only 2 iterations are enough Jaguar dataset:  Jaguar dataset 100 pages retrieved on query “Jaguar” 23 different categories! The task is to build 3 clusters Of cars, Mac OS and wild cats Jaguar results:  Jaguar results SHS algorithm fails 70 agents meet together anchors – anchor text heuristic Best performance: High degree AND anchors heuristics Topical & topological clustering:  Topical & topological clustering Build topical clusters Enrich topical clusters with pages obtained by heuristic search based clustering Bekkerman & McCallum, WWW-05 Best performance: after one iteration of heuristic search only! Conclusion:  Conclusion First application of heuristic search to the Web graph Very simple algorithms / heurstics Heuristic search theory yet to be applied E.g., can an admissible heuristic be proposed? Search can be performed in real time! Modern search engines store the link structure of most of the Web Maximum 2 search iterations are enough Fully distributable in a multi-agent fashion

Related presentations


Other presentations created by Lassie

Presentation Accenture
12. 03. 2008
0 views

Presentation Accenture

Taxation
19. 10. 2007
0 views

Taxation

population
21. 10. 2007
0 views

population

19 sethusamudram ppt
30. 09. 2007
0 views

19 sethusamudram ppt

sample seminar
02. 05. 2008
0 views

sample seminar

Oral Radiology
02. 05. 2008
0 views

Oral Radiology

Rainfall
07. 04. 2008
0 views

Rainfall

Andrew
19. 02. 2008
0 views

Andrew

campaigns and elections
07. 01. 2008
0 views

campaigns and elections

Emergency Planning
03. 10. 2007
0 views

Emergency Planning

stjerna 060307
09. 10. 2007
0 views

stjerna 060307

ElectiveWkshpGoalSet
10. 10. 2007
0 views

ElectiveWkshpGoalSet

p5animals
12. 10. 2007
0 views

p5animals

RT
12. 10. 2007
0 views

RT

pinarsut
13. 10. 2007
0 views

pinarsut

Lecture19 Ch19 111405
15. 10. 2007
0 views

Lecture19 Ch19 111405

Poster APS summary
16. 10. 2007
0 views

Poster APS summary

david ingleby
17. 10. 2007
0 views

david ingleby

Ole Lund
23. 10. 2007
0 views

Ole Lund

acmmm02
24. 10. 2007
0 views

acmmm02

rebecca
11. 12. 2007
0 views

rebecca

HHH Scandale
17. 10. 2007
0 views

HHH Scandale

1 WMO
19. 10. 2007
0 views

1 WMO

Great Britain
02. 11. 2007
0 views

Great Britain

ENYA
02. 11. 2007
0 views

ENYA

5AlpaShah
06. 11. 2007
0 views

5AlpaShah

Rectoria Panama 2006
25. 10. 2007
0 views

Rectoria Panama 2006

L10a 4345 Sp02
07. 11. 2007
0 views

L10a 4345 Sp02

Shahriar
15. 11. 2007
0 views

Shahriar

lakshmi wireless
15. 11. 2007
0 views

lakshmi wireless

CONTEMPORARY DANCE LESSONS
23. 11. 2007
0 views

CONTEMPORARY DANCE LESSONS

price iso 15926 as owl
07. 11. 2007
0 views

price iso 15926 as owl

acute 060718 neuroemergencies
23. 10. 2007
0 views

acute 060718 neuroemergencies

The Olmec
21. 11. 2007
0 views

The Olmec

wetlands inventory
03. 01. 2008
0 views

wetlands inventory

plainChairdesign
04. 01. 2008
0 views

plainChairdesign

Polymorphic Robotics at ISI
07. 01. 2008
0 views

Polymorphic Robotics at ISI

t Campout with Foodborne Illness
07. 01. 2008
0 views

t Campout with Foodborne Illness

poly web cast
15. 10. 2007
0 views

poly web cast

MTAC 11 06v1 03georgewright
06. 11. 2007
0 views

MTAC 11 06v1 03georgewright

awmapres041905
31. 10. 2007
0 views

awmapres041905

Insecta
23. 10. 2007
0 views

Insecta

ficci wo pictures
29. 12. 2007
0 views

ficci wo pictures

Modelos de control Ley 24156
22. 10. 2007
0 views

Modelos de control Ley 24156

severe wx
07. 10. 2007
0 views

severe wx

Egypt presentation salem 2nd
21. 10. 2007
0 views

Egypt presentation salem 2nd

nursinghome
29. 11. 2007
0 views

nursinghome

IMechE 19th May
15. 10. 2007
0 views

IMechE 19th May

EGEE Summer School 2007
17. 10. 2007
0 views

EGEE Summer School 2007

Carolyn
31. 12. 2007
0 views

Carolyn

4 Systematic chemistry web cec
16. 02. 2008
0 views

4 Systematic chemistry web cec

kimkidu
24. 02. 2008
0 views

kimkidu

IntroMiscConcl
27. 02. 2008
0 views

IntroMiscConcl

dutchhistoryfordummi es
27. 02. 2008
0 views

dutchhistoryfordummi es

hansen2
29. 10. 2007
0 views

hansen2

sciencenews
29. 09. 2007
0 views

sciencenews

Running a vegetarian campaign
04. 03. 2008
0 views

Running a vegetarian campaign

D3 NomuraResearch
25. 03. 2008
0 views

D3 NomuraResearch

presentationkpimchan ThaiAirways
30. 03. 2008
0 views

presentationkpimchan ThaiAirways

ch19 Kreitner 2004 6e OB
08. 04. 2008
0 views

ch19 Kreitner 2004 6e OB

indiapres110606
14. 04. 2008
0 views

indiapres110606

MMM CEO
18. 04. 2008
0 views

MMM CEO

ch26 hedgingrisk
16. 04. 2008
0 views

ch26 hedgingrisk

myCH14
07. 05. 2008
0 views

myCH14

CROSSGRID VO SEC KD
08. 05. 2008
0 views

CROSSGRID VO SEC KD

Prescribed Fire at UNF
02. 01. 2008
0 views

Prescribed Fire at UNF

md apr quality82604 mtan
30. 04. 2008
0 views

md apr quality82604 mtan

bowles
01. 05. 2008
0 views

bowles

PSRS partI
02. 05. 2008
0 views

PSRS partI

pres1
02. 05. 2008
0 views

pres1

Waittimes presentation e
02. 05. 2008
0 views

Waittimes presentation e

ae8 eman over
26. 02. 2008
0 views

ae8 eman over

CH 9
04. 01. 2008
0 views

CH 9

Horton ALA Moving Mountains
29. 02. 2008
0 views

Horton ALA Moving Mountains

remsim
01. 11. 2007
0 views

remsim

CERLS
16. 10. 2007
0 views

CERLS

forum creativite5
23. 10. 2007
0 views

forum creativite5

Ceanothus
14. 12. 2007
0 views

Ceanothus

Pollard
24. 10. 2007
0 views

Pollard

Typology 7mfamelev
02. 11. 2007
0 views

Typology 7mfamelev

ST3
24. 10. 2007
0 views

ST3

UNITROL Service Nov2006 Aend A
19. 10. 2007
0 views

UNITROL Service Nov2006 Aend A

energetech
17. 04. 2008
0 views

energetech

Presentaton
23. 10. 2007
0 views

Presentaton

LTCOPhistoryandresp
03. 10. 2007
0 views

LTCOPhistoryandresp

schmerge experimental results
21. 11. 2007
0 views

schmerge experimental results

martemiyanov
27. 09. 2007
0 views

martemiyanov

Lecture9 JF
16. 10. 2007
0 views

Lecture9 JF

helen meeks
02. 10. 2007
0 views

helen meeks

11 Gatorpops
17. 12. 2007
0 views

11 Gatorpops

Mol gen 9910
16. 10. 2007
0 views

Mol gen 9910