04 05 knowitall

Published on October 9, 2007

Author: Cubemiddle

Source: authorstream.com

Content

KnowItAll:
April 5, 2007
William Cohen

Announcements:
Reminder: project presentations (or progress report)
  Sign up for a 30min presentation (or else)
  First pair of slots is April 17
  Last pair of slots is May 10
William is out of town April 6-April 9
  So, no office hours Friday.
Next week: no critiques assigned
  But I will lecture

Bootstrapping:
BM’98, Brin’98, Hearst ’92
Scalability, surface patterns, use of web crawlers…
Learning, semi-supervised learning, dual feature spaces…
Deeper linguistic features, free text…
Collins & Singer ’99, Riloff & Jones ’99, Cucerzan & Yarowsky ’99
Etzioni et al 2005
Rosenfeld and Feldman 2006
… …
Stevenson & Greenwood 2005
Clever idea for learning relation patterns & strong experimental results
De-emphasize duality, focus on distance between patterns.

Know It All:

Architecture:
Set of (disjoint?) predicates to consider + two names for each
~= [H92]
Context: keywords from user to filter out non-domain pages
… ?

Architecture:

Bootstrapping - 1:
“city” query template rule

Bootstrapping - 2:
Each discriminator U is a function: fU(x) = hits(“city x”)/hits(“x”)
  i.e. fU(“Pittsburgh”) = hits(“city Pittsburgh”)/hits(“Pittsburgh”)
These are then used to create features: fU(x)>θ and fU(x)<θ

Bootstrapping - 3:
Submit the queries & apply the rules to produce initial seeds.
Evaluate each seed with each discriminator U: e.g., compute PMI stats like: |hits(“city Boston”)| / |hits(“Boston”)|
Take the top seeds from each class and call them POSITIVE, then use disjointness of classes to find NEGATIVE seeds.
Train a NaiveBayes classifier using thresholded U’s as features.
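The discriminator and classifier steps on the Bootstrapping slides can be sketched roughly as follows. This is only an illustration under stated assumptions: the hit counts are made-up numbers standing in for search-engine results, the second discriminator phrase and both thresholds are hypothetical, and the small Bernoulli Naive Bayes with add-one smoothing is a stand-in, not necessarily the classifier setup used in KnowItAll.

```python
import math

# Made-up hit counts standing in for search-engine results (assumption).
HITS = {
    "Pittsburgh": 1000, "city Pittsburgh": 50, "cities such as Pittsburgh": 20,
    "Boston": 2000, "city Boston": 120, "cities such as Boston": 60,
    "banana": 1500, "city banana": 1, "cities such as banana": 0,
    "Tuesday": 3000, "city Tuesday": 3, "cities such as Tuesday": 0,
    "Denver": 800, "city Denver": 40, "cities such as Denver": 16,
    "Friday": 2500, "city Friday": 2, "cities such as Friday": 0,
}
TEMPLATES = ["city {}", "cities such as {}"]  # hypothetical discriminator phrases
THETA = [0.01, 0.005]                         # hypothetical thresholds


def f_u(template, x, hits=HITS):
    """Discriminator score: hits(template filled with x) / hits(x)."""
    denom = hits.get(x, 0)
    return hits.get(template.format(x), 0) / denom if denom else 0.0


def features(x):
    """One boolean per discriminator: True means f_U(x) > theta.
    The complementary feature f_U(x) < theta is covered by the False case."""
    return [f_u(t, x) > th for t, th in zip(TEMPLATES, THETA)]


def train_nb(pos_rows, neg_rows):
    """Bernoulli Naive Bayes with add-one smoothing over boolean features."""
    total = len(pos_rows) + len(neg_rows)
    model = {}
    for label, rows in (("pos", pos_rows), ("neg", neg_rows)):
        n = len(rows)
        probs = [(sum(r[j] for r in rows) + 1) / (n + 2)
                 for j in range(len(rows[0]))]
        model[label] = (n / total, probs)
    return model


def prob_positive(model, feats):
    """Posterior probability of the positive class for one feature vector."""
    logp = {}
    for label, (prior, probs) in model.items():
        logp[label] = math.log(prior) + sum(
            math.log(p if f else 1.0 - p) for f, p in zip(feats, probs))
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return math.exp(logp["pos"] - m) / z


# POSITIVE seeds for "city"; NEGATIVE seeds taken from other (disjoint) classes.
model = train_nb([features(s) for s in ("Pittsburgh", "Boston")],
                 [features(s) for s in ("banana", "Tuesday")])
print(prob_positive(model, features("Denver")))
print(prob_positive(model, features("Friday")))
```

With these toy counts, "Denver" clears both thresholds and scores well above 0.5, while "Friday" clears neither and scores well below.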
Bootstrapping - 4:
Estimate using the classifier based on the previously-trained discriminators
Some ad hoc stopping conditions… (“signal to noise” ratio)

Architecture - 2:

Extensions to KnowItAll:
Problem: Unsupervised learning finds clusters: what if the text doesn’t support the clustering we want?
  E.g., target is “scientist”, but natural clusters are “biologist”, “physicist”, “chemist”
Solution: subclass extraction
  Modify template/rule system to extract subclasses of target class (e.g., scientist → chemist, biologist, …)
  Check extracted subclasses with WordNet and/or PMI-like method (as for instances)
  Extract from each subclass recursively

Extensions to KnowItAll:
Problem: Set of rules is limited:
  Derived from fixed set of “templates” (general patterns ~ from H92)
Solution 1: Pattern learning: augment the initial set of rules derivable from templates
  Search for instances I on the web
  Generate patterns: some substring of I in context: “b1 … b4 I a1 … a4”
  Assume classes are disjoint and estimate recall/precision of each pattern P
  Exclude patterns that cover only one seed (very low recall)
  Take the top 200 remaining patterns and
    Evaluate them as extractors “using PMI” (?)
    Evaluate them as discriminators (in usual way?)
  Examples: “headquartered in <city>”, “<city> hotels”, …

Extensions to KnowItAll:
Solution 2: List extraction: augment the initial set of rules with rules that are local to a specific web page
  Search for pages containing small sets of instances (e.g., “London Paris Rome Pittsburgh”)
  For each page P:
    Find subtrees T of the DOM tree that contain >k seeds
    Find longest common prefix/suffix of the seeds in T
      [Some heuristics added to generalize this further]
    Find all other strings inside T with the same prefix/suffix
  Heuristically select the “best” wrapper for a page
  Wrapper = (P, T, prefix, suffix)

Slide15:
T1
  w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

Slide16:
T2
  w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w2: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

Slide17:
T3
  w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w2: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w3: Italy, Japan, France, Israel, Spain, Brazil

Slide18:
T4
  w1: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w2: Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator
  w3: Italy, Japan, France, Israel, Spain, Brazil
  w4: Italy, Japan

Slide19:
[…]

Results - City:

Results - Film:

Results - Scientist:

Observations:
Corpus is accessed indirectly through Google API
  Only use top k discriminators
  Run extractors via query keywords & extract
  Limited by network access time
Lots of moving parts to engineer
  Rule templates
  Signal-to-noise
  LE wrapper evaluation details
  Parameters: number of discriminators, number of seeds to keep, number of names per concept, …
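Going back to the list-extraction idea (Solution 2), the prefix/suffix wrapper step might look roughly like the sketch below. It makes several assumptions that are not in the slides: the subtree T is represented as a plain HTML string, "prefix/suffix of the seeds" is read as the text immediately surrounding each seed occurrence, and the function names `learn_wrapper` and `apply_wrapper` are hypothetical. The generalization heuristics and the "best wrapper" selection are not modeled.

```python
import os
import re


def common_prefix(strings):
    """Longest common leading substring (character-wise)."""
    return os.path.commonprefix(strings)


def common_suffix(strings):
    """Longest common trailing substring, via reversed common prefix."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def learn_wrapper(subtree_html, seeds, k=1):
    """Return (prefix, suffix) shared by the contexts of the seeds found
    in this subtree, or None if k or fewer seeds occur or no context is shared."""
    lefts, rights = [], []
    for seed in seeds:
        i = subtree_html.find(seed)
        if i < 0:
            continue  # seed not present in this subtree
        lefts.append(subtree_html[:i])
        rights.append(subtree_html[i + len(seed):])
    if len(lefts) <= k:
        return None
    prefix = common_suffix(lefts)   # text right before every seed
    suffix = common_prefix(rights)  # text right after every seed
    if not prefix or not suffix:
        return None
    return prefix, suffix


def apply_wrapper(subtree_html, wrapper):
    """Find all strings in the subtree that sit between prefix and suffix."""
    prefix, suffix = wrapper
    # The lookahead leaves the suffix unconsumed, so adjacent items still match.
    pattern = re.escape(prefix) + r"(.+?)(?=" + re.escape(suffix) + r")"
    return re.findall(pattern, subtree_html)


html = ("<p>Visit <b>Italy</b> first, then <b>Japan</b> and "
        "<b>France</b>, or maybe <b>Spain</b>.</p>")
wrapper = learn_wrapper(html, ["Italy", "France"])
print(wrapper)
print(apply_wrapper(html, wrapper))
```

Here two seeds are enough to induce a wrapper that also picks up "Japan" and "Spain". A wrapper learned this way extracts whatever shares the same markup, which is why the slides mention selecting among candidate wrappers.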
KnowItNow: Son of KnowItAll:
Goal: faster results, not better results
Difference 1: Store documents locally
  Build local index (Bindings Engine) optimized for finding instances of KnowItAll rules and patterns
  Based on inverted index term → (doc, position, contextInfo)

KnowItNow: Son of KnowItAll:
Difference 2: New model (URNS model) to merge information from multiple extraction rules
Intuition: instances generated from each extractor are assumed to be a mixture of two distributions
  Random noise from large instance pool
  Stuff with known structure (e.g., uniform, Zipf’s law, …)
Using EM you can estimate mixture probabilities and parameters of non-noisy data
  Prob(x is noise | x extracted)

KnowItNow: Son of KnowItAll:
…
137 colors = 41% of mass
15,346 colors = 59% of mass
Prob(noise) = 0.59
Non-noisy data: uniform over 137 instances
…
59% of mass doesn’t
Prob(noise) = 0.59
Non-noisy data: Zipf’s over >N instances
41% of mass fits power law
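To make the mixture intuition concrete, here is a small Bayes-rule sketch of the probability that an instance is noise given how often it was extracted. It is a simplification under loud assumptions: both components are taken to be uniform, the mixture parameters are assumed to be already known (the EM estimation step is not shown), and the slide's numbers are read as 137 non-noisy instances holding 41% of the mass and a noise pool of 15,346 instances holding 59%. The function name `prob_noise` is hypothetical.

```python
import math


def log_kernel(p, k, n):
    """log of p^k (1-p)^(n-k); the binomial coefficient cancels in the ratio."""
    return k * math.log(p) + (n - k) * math.log1p(-p)


def prob_noise(k, n, n_true=137, n_noise=15346, noise_mass=0.59):
    """P(instance is noise | it was extracted k times in n draws).

    Each draw is noise with probability noise_mass, uniform over n_noise
    instances; otherwise it is uniform over n_true non-noisy instances.
    """
    p_true = (1.0 - noise_mass) / n_true   # per-draw prob. of one true instance
    p_noise = noise_mass / n_noise         # per-draw prob. of one noise instance
    log_true = math.log(n_true) + log_kernel(p_true, k, n)
    log_noise = math.log(n_noise) + log_kernel(p_noise, k, n)
    m = max(log_true, log_noise)           # subtract the max for stability
    t = math.exp(log_true - m)
    z = math.exp(log_noise - m)
    return z / (t + z)


print(prob_noise(1, 1))      # a single draw: just the noise mass
print(prob_noise(1, 1000))   # seen once in many draws: probably noise
print(prob_noise(5, 1000))   # seen repeatedly: probably not noise
```

Under these assumptions a single extraction is noise with probability equal to the noise mass, and repeated extractions of the same instance drive that probability down quickly, since any one noise string is far rarer than any one non-noisy instance.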
