Stalker

Information about Stalker

Published on November 1, 2007

Author: Garrick

Source: authorstream.com

Content

A Hierarchical Approach to Wrapper Induction [Muslea et al, In Proceedings of Agents ‘99]:  A Hierarchical Approach to Wrapper Induction [Muslea et al, In Proceedings of Agents ‘99] Acknowledgements: Weijun Lou Dr. Graig Knoblock Dr. Ion Muslea Dr. Steve Minton Dr. Subbarao Kambhampati Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Slide3:  Name: KFC Cuisine: Fast Food Locations: Venice (310) 123-4567, (800) 888-4412. L.A. (213) 987-6543. Encino (818) 999-4567, (888) 727-3131. RESTAURANT Name List ( Locations ) Cuisine City List (PhoneNumbers) AreaCode Phone The Embedded Catalog Tree (ECT) The Embedded Catalog Tree (ECT) (continuous):  The Embedded Catalog Tree (ECT) (continuous) Common conventions for structuring HTML page: The information on a page often exhibits some hierarchy Semistructured information is often presented in the form of lists of tuples The formalism (ECT) can describe the structure of a wide-wage of semistructured documents The Embedded Catalog Tree (ECT) (continuous):  The Embedded Catalog Tree (ECT) (continuous) The formalism: The leaves are the items of interest for user The internal nodes (include root) represent lists of k-tuples Each item in the k-tuple can be either a leaf l or another list L (called an embedded list) k indicates the number of items in the tuple Outline of the presentation :  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Extraction Rules:  Extraction Rules Why need rules: Given the ECT together with an extraction rule attached each edge and a iteration rule associated with each list node, the problem of extracting any item of interest by simply determining the path P from root to the corresponding leaf and by successively extracting each node x ∈P from its parent p. Slide8:  Training Example: 1: <p> Name: <b> Yala </b> <p>Cuisine: Thai <p> <i> 2: 4000 Colfax, Phoenix, AZ 85258 (602) 508-1570 3: </i><br><i> 4: 523 Vernon, Las Vegas, NV 89104 (702) 578-2293 5: </i><br><i> 6: 403 Pico, LA, CA 90007 (213) 789-0008 7: </i> Example of Rule Induction To extract the name, apply R1: SkipTo(<b>) start rule and R2: SkipTo</b> end rule to the parent node to identify the prefix and suffix of the item. Alternative rules to R1 can be R3: SkipTo(Name) SkipTo(<b>) or R4: SkipTo(Name Symbol HtmlTag). - match correctly. To extract list node, apply start rule SkipTo(<p><i>) and end rule skipTo(</i>). To break the list into individual tuples, apply SkipTo(<i>) and end rule SkipTo(</i>) iterately to the list node. Extraction rules:  Extraction rules How to express extraction rules ? – a sequence of landmarks: Landmark – a group of consecutive tokens (e.g. words, numbers, HTML tags, substrings, etc), eg. <b> Wildcard – a class of tokens, eg. Number, Sign, HtmlTag, AllCaps, etc. Function-like expression which take a landmark or a wildcard as arguments, eg. SkipTo(:<b>) Cascade functions, eg. SkipTo(AllCaps) NextLandmark(Number) Disjunctive rule, eg. either SkipTo(<b>) or SkipTo(<i>) Slide10:  Training Example: 1: <p> Name: <b> Yala </b> <p>Cuisine: Thai <p> <i> 2: 4000 Colfax, Phoenix, AZ 85258 (602) 508-1570 3: </i><br><i> 4: 523 Vernon, Las Vegas, NV 89104 (702) 578-2293 5: </i><br><i> 6: 403 Pico, LA, CA 90007 (213) 789-0008 7: </i> Example of Rule Induction (cont’s) To extract area-code, apply SkipTo(‘(‘) and end rule SkipTo(‘)’) iterately to each tuple To find beginning of ZIP code, apply R5: SkipTo(,) SkipUntil(Number) or R6: SkipTo(AllCaps) NextLandmark(Number) Extraction Rules (cont’s):  Extraction Rules (cont’s) Advantages Hierarchical extraction based on the ECT formalism allows to wrap info sources that have arbitrary many levels Each node is extracted independently of its siblings, allows missing items or items ordering variation in info sources. In general, using inductive algorithm (Stalker) with hierarchical extraction (ECT) turns an extremely hard problem into several simple ones: rather than finding a single rule, find a number of simpler rules which deal with the easier task of extracting one item from its ECT Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Extraction Rules as Finite Automata:  Extraction Rules as Finite Automata Landmark automata: a group of function-like SkipTo()s that must be applied in a pro-established order represents a landmark automaton. Any extraction rule in previous slides is a landmark automaton. Linear landmark: landmark described by a sequence of tokens and wildcards. Linear landmarks are sufficiently expressive to allow efficient navigation within ECT Linear landmarks can be generated and refined in a simple way during learning Extraction Rules as Finite Automata (cont’s):  Extraction Rules as Finite Automata (cont’s) Training Examples: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b>800</b>-555-1515 E2: 90 Colfax, <b>Palms</b>, Phone: (818)508-1570 E3: 523 1st St., <b>LA </b?, Phone: 1-<b>888</b>-578-2293 E4: 403 Vernon, <b>Watts</b>, Phone: (310)798-0008 Rule for start of the area code: either Skipto(‘(‘) or SkipTo (Phone) SkipTo (<b>) An SLG for the start of the area code Landmark automata (LA) are nondeterministic finite automata in which Si->Sj (i=j) is labeled by a landmark li,j; i.e. the transition takes place iff the input in the state is a string that is accepted by the landmark li,j. Simple Landamark Grammars (SLGs) are the class of Las that correspond to the disjuctive rules Extraction Rules as Finite Automata (cont’s):  Extraction Rules as Finite Automata (cont’s) The properties of SLG: The initial state S0 has a branching –factor of k It has exactly k accepting states (one per branch) All k branches that leave the S0 are sequential Las Linear Landmarks label each non loop transitions; All looping transitions have the meaning “consume all tokens until you encounter the linear landmark that leads to the next state s3 s0 s1 s2 ( Phone <b> s0 Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Example of Extraction Task:  Example of Extraction Task NAME Casablanca Restaurant STREET 220 Lincoln Boulevard CITY Venice PHONE (310) 392-5751 The Wrapper Architecture:  Extraction Rules Query Data Information Extractor EC Tree The Wrapper Architecture Learning the Extraction Rules:  Inductive Learning System Extraction Rules Labeled Pages Learning the Extraction Rules GUI The STALKER Algorithm:  The STALKER Algorithm STALKER(Examples) let RetVal be an empty SLG WHILE Examples =0 aDisjunct = LearnDisjunct(Examples) remove all examples covered by aDisjunct add aDisjunct to RetVal return RetVal LearnDisjunct(Examples) Terminals = Wildcards U GetTokens(Examples) Candidates = GetInitailCandidateds(Examples) WHILE candidates ≠0 DO let D = BestDisjunct (Candidates) IF D is a perfect Disjunct THEN return D FOR EACH t ∈Terminals DO Candidates = Candidates U Refine(D, t) remove D from Candidates return best disjunct Slide21:  Example of Rule Induction Training Examples: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b>800</b>-555-1515 E2: 90 Colfax, <b>Palms</b>, Phone: (818)508-1570 E3: 523 1st St., <b>LA </b?, Phone: 1-<b>888</b>-578-2293 E4: 403 Vernon, <b>Watts</b>, Phone: (310)798-0008 Initial candidates in the first iteration R1 R2 R3 R4 Slide22:  Example of Rule Induction Training Examples: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b>800</b>-555-1515 E3: 523 1st St., <b>LA </b?, Phone: 1-<b>888</b>-578-2293 Initial candidates and Refinements in the second iteration Slide23:  STALKER (cont’s) Stalker is a typical sequential covering algorithm Stalker can be easily extended to uses SkipUntil() and NextLandMark() feature No change on the rule refining process GenerateInitCandidates() need to change to also generate SkipTo()SkipUntil() and SkipTo()NexLandmark() Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Experimental Results:  Experimental Results Illustrative information sources Comparison between WIEN and STALKER Correctness levels for S1 and S2 Results Analysis:  Results Analysis Stalker usually requires less than 10 examples to obtain a 97% average accuracy over 500 trials Termination criteria Stalker: either reach the 97% accuracy threshold or consumed 10 examples WIEN: 100% accuracy The most impressiveness lies in WIEN uses two orders of magnitude more example than Stalker While Stalker allows both missing items and various item ordering, accuracy ranges between 85% and 100% Stalker is reasonably fast Stalker: less than 20 sec per experiment for S1 and S2; less than 40 sec per item for S2 and S4. WIEN: 5 sec per experiment for S1 and 83 sec for S2 For easier task like S2 which has extremely regular structure, except for list extraction rule, all other rules have a 100% accuracy based on a single example! The learning curves for the list extraction and list iteration for all four sources show that these two rules can be extracted relative independently of the extraction difficulty of different source, they all converges quick and get accuracy level above 95% - From another perspective, it proves the usefulness of EC formalism Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A Strength vs Weakness:  Strength vs Weakness Strength: Powerful extraction language EC Formalism: turns hard problem into a problem that is extremely easy in practice. Stalker: allows miss items and various item orderings One hard-to-extract item does not affect others Disadvantage: Requires hand-craft labelling, though usually 10 examples are sufficient. Man-made thresholds in termination criteria. For some hard tasks like S4, the accuracy obtained is still not satisfactory. Outline of the presentation:  Outline of the presentation Main Contents in the paper ECT – Embedded Catalog Tree Extraction Rules Extraction Rules as Finite Automata Rules Learning – Stalker Experimental Results Strength vs Weakness Q/A

Related presentations


Other presentations created by Garrick

1 3 Dynamic Stability
06. 11. 2007
0 views

1 3 Dynamic Stability

adobe flex
28. 11. 2007
0 views

adobe flex

Pennington Cape Cod
10. 12. 2007
0 views

Pennington Cape Cod

Basko EPS2006 iac
29. 10. 2007
0 views

Basko EPS2006 iac

H113j
02. 11. 2007
0 views

H113j

carnival
06. 11. 2007
0 views

carnival

a gavin
06. 11. 2007
0 views

a gavin

demo
12. 11. 2007
0 views

demo

Qiu
15. 11. 2007
0 views

Qiu

Name That Pit Bull
16. 11. 2007
0 views

Name That Pit Bull

20031212 SUPC overview
19. 11. 2007
0 views

20031212 SUPC overview

pe conference presentation
22. 11. 2007
0 views

pe conference presentation

Blind Men
26. 11. 2007
0 views

Blind Men

power point for final project
17. 12. 2007
0 views

power point for final project

CUX Forced Ranking Webinar
14. 11. 2007
0 views

CUX Forced Ranking Webinar

PCIJ PHILIPPINES
07. 12. 2007
0 views

PCIJ PHILIPPINES

Nefropatia
28. 12. 2007
0 views

Nefropatia

self determination handbook
29. 12. 2007
0 views

self determination handbook

OperatorOverloading
30. 12. 2007
0 views

OperatorOverloading

christie workshop presentation
03. 01. 2008
0 views

christie workshop presentation

AGU 2002 Presentation
03. 01. 2008
0 views

AGU 2002 Presentation

Rhetoric
04. 01. 2008
0 views

Rhetoric

lec4 robot classification
07. 01. 2008
0 views

lec4 robot classification

EE541 451 class18
07. 01. 2008
0 views

EE541 451 class18

Livermore Seminar
21. 11. 2007
0 views

Livermore Seminar

GVJ seminar 2000
04. 10. 2007
0 views

GVJ seminar 2000

CA1
23. 11. 2007
0 views

CA1

fre intro curator
28. 11. 2007
0 views

fre intro curator

RocketFuelPumpJPC39
08. 11. 2007
0 views

RocketFuelPumpJPC39

12121
11. 12. 2007
0 views

12121

ToyStory2007studycar ds
20. 02. 2008
0 views

ToyStory2007studycar ds

TURKISHYACHTINGSECTOR
23. 11. 2007
0 views

TURKISHYACHTINGSECTOR

REMI Structure
24. 02. 2008
0 views

REMI Structure

12 3 07
27. 02. 2008
0 views

12 3 07

II 5 Meloe3powerpt
30. 12. 2007
0 views

II 5 Meloe3powerpt

marie midterm
19. 11. 2007
0 views

marie midterm

Canadell Kobe2002
26. 11. 2007
0 views

Canadell Kobe2002

2C Plex Status and Plans
30. 11. 2007
0 views

2C Plex Status and Plans

BecomingaLibraryAdvo cate
23. 12. 2007
0 views

BecomingaLibraryAdvo cate

HLAtimeManagement
13. 11. 2007
0 views

HLAtimeManagement

dania planowanie podatkowe
06. 11. 2007
0 views

dania planowanie podatkowe

icot00zimmer
31. 10. 2007
0 views

icot00zimmer