Why Can’t We All Get Along? (Structured Data and Information Retrieval)


Published on November 16, 2007

Author: Sudiksha

Source: authorstream.com

Content

Why Can’t We All Get Along? (Structured Data and Information Retrieval)
Bruce Croft, Computer Science Department, University of Massachusetts Amherst

Overview
- History of structured data in IR
- Conceptual similarities and differences
- What is the goal?
- The Indri system
- Examples of using IR for structured data: XML retrieval, relevance models, entity retrieval

History
- IR systems have had Boolean field restrictions since the 1970s
  - metadata: date, type, source, keywords
  - content structure: title, body
- Implementing IR systems using a relational DBMS was first done in the 70s (Crawford and McCleod, 1978-1983)
- Efficiency issues with this approach persisted until the 90s (e.g. DeFazio et al., SIGIR 95)
- The INQUERY IR system successfully used an object management system (Brown, SIGIR 95)

History (continued)
- Modifying the DBMS model to incorporate probabilities in order to integrate DB/IR
  - e.g. probabilistic relational algebra (Fuhr and Rolleke, ACM TOIS 1994)
  - e.g. probabilistic Datalog (Fuhr, SIGIR 95)
- Text retrieval as a SQL function in commercial DBMSs (e.g. Oracle, early 90s)

History (continued)
- Ranked retrieval of "complex" documents
  - e.g. office documents with structure and significant text content (Croft, Krovetz and Turtle, IPM 1990)
  - Bayesian inference net model to combine evidence from different parts of the document structure (Croft and Turtle, EDT 1992)
  - e.g. marked-up documents (Croft, Smith, and Turtle, SIGIR 1992)
- XML retrieval: INEX (2002)

Similarities and Differences
- Common interest in providing efficient access to information on a very large scale; indexing and optimization are key topics
- Until recently, concern about the effectiveness (accuracy) of access was the domain of IR
- The focus on structured vs. unstructured data is historically true but less relevant today
- Statistical inference and ranking are central to IR, and becoming more important in DB

Similarities and Differences (continued)
- IR systems have focused on providing access to information rather than answers (e.g. Web search); evaluation is typically based on topical relevance and user relevance rather than correctness (except QA)
- IR works with multiple databases, but not multiple relations
- IR query languages are more like a calculus than an algebra
- Integrity, security, and concurrency are central for DB, less so in IR

What is the Goal?
- One unified information system?
  - i.e. a single conceptual and formal framework to support the entire range of information needs
  - at least a grand challenge; or is it the Web?
- An integrated DB/IR system?
  - i.e. extend the database model to fully support statistical inference and ranking
  - a major challenge, given established systems and models

What is the Goal? (continued)
- An IR system with extended capability for structured data
  - i.e. extend the IR model to combine evidence from the structured and unstructured components of complex objects (documents)
  - a backend database system is used to store the objects (cf. "one hand clapping")
  - many applications look like this (e.g. desktop search, web shopping)
  - users seem to prefer this approach (simple queries or forms, plus ranking)

What is the Goal? (continued)
- What about important database functionality?
  - Source data can be stored in databases; the extended IR system constructs separate indexes
- What about optimization? Search engines worry about optimization! Ideas from DB optimization can be incorporated
- What about updates? Search engines worry about updates! The backend database system is still available
- What about joins? Interesting; treat IR objects as a view?

Indri – A Candidate IR System
- Indri is a separate, downloadable component of the Lemur Toolkit
- Influences:
  - INQUERY [Callan et al. '92]: inference network framework; query language
  - Lemur [http://www.lemurproject.org]: language modeling (LM) toolkit
  - Lucene [http://jakarta.apache.org/lucene/docs/index.html]: popular off-the-shelf Java-based IR system, based on heuristic retrieval models
- Designed for new retrieval environments, i.e. GALE, CALO, AQUAINT, Web retrieval, and XML retrieval

Zoology 101
- The indri is the largest type of lemur
- When first spotted, the natives yelled "Indri! Indri!", Malagasy for "Look! Over there!"

Design Goals
- Off the shelf (Windows, *NIX, Mac platforms); simple to set up and use; fully functional API with language wrappers for Java, etc.
- Robust retrieval model: inference net + language modeling [Metzler and Croft '04]
- Powerful query language: designed to be simple to use, yet support complex information needs; provides "adaptable, customizable scoring"
- Scalable: highly efficient code, distributed retrieval, incremental update

Model
- Based on the original inference network retrieval framework [Turtle and Croft '91], which casts retrieval as inference in a simple graphical model
- Extensions made to the original model:
  - probabilities based on language modeling rather than tf.idf
  - multiple language models allowed in the network (one per indexed context)

[Network diagram: document node D (observed); model hyperparameters α, β (observed); context language models θtitle, θbody, θh1; representation nodes r1 … rN (terms, phrases, etc.); belief nodes (#combine, #not, #max); information need node I (itself a belief node).]

P( r | θ )
- Probability of observing a term, phrase, or feature given a context language model
- The ri nodes are binary
- Assume r ~ Bernoulli( θ ): "Model B" [Metzler, Lavrenko, Croft '04]

P( θ | α, β, D )
- The prior over a context language model is determined by α and β
- Assume P( θ | α, β ) ~ Beta( α, β ), the conjugate prior of the Bernoulli
- αr = μ P( r | C ) + 1 and βr = μ P( ¬r | C ) + 1, where μ is a free parameter

P( q | r ) and P( I | r )
- Belief nodes are created dynamically based on the query
- Belief node estimates are derived from standard link matrices
  - combine evidence from parents in various ways
  - allow fast inference by making marginalization computationally tractable
- The information need node is simply a belief node that combines all network evidence into a single value
- Documents are ranked according to P( I | α, β, D )

[Diagram: example link matrix for an #AND belief node Q with parents A and B.]

Query Language
- Extension of the INQUERY query language
- "Structured" query language: term weighting, ordered/unordered windows, synonyms
- Additional features: language-modeling-motivated constructs; added flexibility to deal with fields via contexts; generalization of passage retrieval (extent retrieval)

Document Representation
A document such as

  <html>
  <head><title>Department Descriptions</title></head>
  <body>
  The following list describes …
  <h1>Agriculture</h1> …
  <h1>Chemistry</h1> …
  <h1>Computer Science</h1> …
  <h1>Electrical Engineering</h1> …
  …
  <h1>Zoology</h1>
  </body>
  </html>

is indexed as a set of contexts and extents:

- <title> context: "department descriptions"; <title> extents: 1. department descriptions
- <h1> context: "agriculture", "chemistry", …, "zoology"; <h1> extents: 1. agriculture, 2. chemistry, …, 36. zoology
- <body> context: "the following list describes … agriculture …"; <body> extents: 1. the following list describes <h1>agriculture</h1> …

Terms, Proximity, Context Restriction, Context Evaluation, Belief Operators
[Operator tables from these slides are not preserved in this transcript.]
- Note: #wsum is still available in Indri, but should be used with discretion

Extent Retrieval Example
For the document

  <document>
  <section><head>Introduction</head>
  Statistical language modeling allows formal methods to be applied to information retrieval. …
  </section>
  <section><head>Multinomial Model</head>
  Here we provide a quick review of multinomial language models. …
  </section>
  <section><head>Multiple-Bernoulli Model</head>
  We now examine two formal methods for statistically modeling documents and queries based on the multiple-Bernoulli distribution. …
  </section>
  …
  </document>

and the query #combine[section]( dirichlet smoothing ):
- Treat each section extent as a "document"
- Score each "document" according to #combine( … )
- Return a ranked list of extents:

  SCORE  DOCID   BEGIN  END
  0.50   IR-352  51     205
  0.35   IR-352  405    548
  0.15   IR-352  0      50
  …      …       …      …

Indri Examples
- "Where was George Washington born?"
    #combine[sentence]( #1( george washington ) born #any:place )
- Paragraphs from news feed articles published between 1991 and 2000 that mention a person, a monetary amount, and the company InfoCom:
    #filreq( #band( NewsFeed.doctype #date:between(1991 2000) )
             #combine[paragraph]( #any:person #any:money InfoCom ) )

Example Indri Web Query

  #weight(
    0.1 #weight( 1.0 #prior(pagerank) 0.75 #prior(inlinks) )
    1.0 #weight(
      0.9 #combine(
        #wsum( 1 stellwagen.(inlink) 1 stellwagen.(title)
               3 stellwagen.(mainbody) 1 stellwagen.(heading) )
        #wsum( 1 bank.(inlink) 1 bank.(title)
               3 bank.(mainbody) 1 bank.(heading) ) )
      0.1 #combine(
        #wsum( 1 #uw8( stellwagen bank ).(inlink) 1 #uw8( stellwagen bank ).(title)
               3 #uw8( stellwagen bank ).(mainbody) 1 #uw8( stellwagen bank ).(heading) ) ) ) )

Examples of Using IR for Structured Data
- XML search
- Relevance models for incomplete data
- Extracted entity retrieval

XML Search
- The INEX workshop is similar to TREC, but focused on XML documents
- Queries contain varying degrees of structural specification
- It is not clear that these queries are realistic
  - an earlier study showed that people are not good at remembering structure
  - document structure can nevertheless provide valuable evidence for content representation

Example INEX Query
[Query example not preserved in this transcript.]

Hierarchical Language Models
- Estimate a language model for each component of a document tree (Ogilvie 2004, 2005)
- Smooth using a weighted mixture of a background model, a document model, a parent model, and a mixture of the children's models

Does it work?
[Results tables from Ogilvie, 2003, not preserved in this transcript.]

Indri INEX Extensions
- Indri incorporates hierarchical language models
- Weights can be set for different language models and component types
- The query language is extended to reference parent and child extents:
  - use the .\field operator to access a child reference
  - use the ./field operator to access a parent reference
  - use the .//field operator to access an ancestor reference
  - e.g. #combine[section]( bootstrap #combine[./title]( methodology ) )

Relevance Models for Incomplete Data
- Relevance models (Lavrenko, 2001) are used for query expansion in IR, based on generative language models; they estimate dependencies between words from a training set or an initial ranking
- Recently extended to semi-structured data for applications where records are missing data (Lavrenko, Yi, Allan, 2006)
- e.g. the NSDL collection, with fields title, description, subject, content, audience: 24% of its 650,000 records have no subject field, 30% no author, and 96% no audience

Relevance Models for Incomplete Data (continued)
- Basic process: estimate a relevance model for each field from training data for a query, then rank test records by comparison to the relevance models
- The relevance model estimates how likely it is that a word occurs in a field of a record, given that the record matches the specified query fields
- Ranking is done using a weighted cross-entropy, where the weights reflect the importance of each field

Relevance Models for Incomplete Data (continued)
- In the NSDL experiment, 127 queries of the form {subject='philosophy' AND audience='high school'} were used; in the test collection, all records had their subject and audience field values removed
- Retrieved records had a precision of 30% in the top 10, compared to 15% for a baseline that ranked text records containing all the fields
- This shows the potential of probabilistic models for this type of application; structured queries can also be generated (Calado et al., CIKM 02)

Extracted Entity Retrieval
- Information extraction extracts structure from text, e.g. names, addresses, email addresses, CVs, publications, tables
- It creates semi-structured (and noisy) data, rather than databases
- Table extraction can be the basis for question answering (Wei, Croft and McCallum, 2006)
- Publication extraction is the basis of CITESEER-like systems (e.g. REXA, McCallum, 2005)
- Person extraction can be the basis for "expert finding"

Expert Finding
- Evaluated in the TREC Enterprise Track
- People are represented by text that co-occurs with their names (which names? what text?)
- People are ranked for a query using this text "profile"
- The relevance model approach is effective

Conclusion
- For many applications involving retrieval of semi-structured data, the right approach is an IR system based on a probabilistic retrieval model as the front-end and a database system as the back-end (but the IR system is not implemented using the database system)
- "Right" means it gives effective results and supports users' world view
- IR systems based on language models (e.g. Indri) are good candidates
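The Beta prior on the slides (αr = μ P(r|C) + 1, βr = μ P(¬r|C) + 1) yields a closed-form smoothed term estimate. A minimal sketch of the posterior-mean estimate of P( r | α, β, D ), treating each of the N positions in a context as a Bernoulli trial; this is an illustration derived from the slide's formulas, not Indri's actual implementation, and the function and variable names are my own:

```python
from collections import Counter

def beta_smoothed_prob(term, doc_terms, collection_terms, mu=2500.0):
    """Posterior-mean estimate of P( r | alpha, beta, D ) for one term.

    Follows the slide's prior: alpha_r = mu * P(r|C) + 1 and
    beta_r = mu * P(not-r|C) + 1, with theta_r ~ Beta(alpha_r, beta_r).
    With n_r occurrences in the N positions of the context, the Beta
    posterior mean is (n_r + alpha_r) / (N + alpha_r + beta_r).
    """
    n_r = Counter(doc_terms)[term]
    big_n = len(doc_terms)
    p_c = Counter(collection_terms)[term] / len(collection_terms)  # P( r | C )
    alpha_r = mu * p_c + 1.0
    beta_r = mu * (1.0 - p_c) + 1.0
    return (n_r + alpha_r) / (big_n + alpha_r + beta_r)
```

Note that αr + βr = μ + 2 regardless of the term, so the denominator is the same for every term in a given context; μ controls how strongly the collection statistics pull the estimate away from the observed counts.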
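The extent retrieval procedure on the slides (treat each section extent as a "document", score it, return a ranked list of extents) can be sketched with a simple Dirichlet-smoothed query-likelihood scorer standing in for #combine; all names are illustrative and this is not Indri's implementation:

```python
import math
from collections import Counter

def score_extent(query_terms, extent_terms, coll_counts, coll_len, mu=2500.0):
    # Dirichlet-smoothed query likelihood for one extent treated as a "document"
    counts = Counter(extent_terms)
    n = len(extent_terms)
    return sum(
        math.log((counts[q] + mu * coll_counts[q] / coll_len) / (n + mu))
        for q in query_terms)

def rank_extents(query_terms, extents, coll_counts, coll_len, mu=2500.0):
    """extents: list of (begin, end, tokens). Returns (score, begin, end), best first."""
    ranked = [(score_extent(query_terms, toks, coll_counts, coll_len, mu), b, e)
              for b, e, toks in extents]
    ranked.sort(reverse=True)
    return ranked
```

The returned list plays the role of the SCORE/BEGIN/END table in the extent retrieval example above.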
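The hierarchical language models used for XML retrieval smooth each tree component with a weighted mixture of related models. A simplified one-step sketch of that mixture, combining a component's own maximum-likelihood model with its (already-smoothed) parent model and a background model; the children component from the slides is omitted, and the weights and names are illustrative assumptions:

```python
from collections import Counter

def mle(tokens):
    # Maximum-likelihood language model for one tree component
    counts = Counter(tokens)
    n = len(tokens) or 1
    return {t: k / n for t, k in counts.items()}

def smooth_node(term, node_tokens, parent_probs, bg_probs,
                lam_self=0.5, lam_parent=0.3, lam_bg=0.2):
    """One level of the hierarchical mixture: a component's model is a
    weighted mix of its own MLE, its parent's smoothed model, and the
    background model (weights must sum to 1)."""
    own = mle(node_tokens).get(term, 0.0)
    return (lam_self * own
            + lam_parent * parent_probs.get(term, 0.0)
            + lam_bg * bg_probs.get(term, 0.0))
```

Applying smooth_node top-down from the document root gives each component a model that borrows evidence from its ancestors, which is the intent of the mixture described in the slides.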
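The weighted cross-entropy ranking used with the field relevance models can be sketched as below: the score is a per-field weighted sum of relevance-model-weighted log record-field probabilities (equivalently, a negated cross-entropy, so higher is better). The field names, weighting scheme, and the eps guard against log(0) are illustrative assumptions, not the exact formulation of Lavrenko, Yi, and Allan:

```python
import math

def weighted_cross_entropy_score(rel_models, record_models, field_weights, eps=1e-9):
    """Score one record against per-field relevance models.

    rel_models / record_models: {field: {term: probability}}
    field_weights: {field: importance weight}
    Returns sum over fields of w_f * sum_t RM_f(t) * log P(t | record field f);
    this is the negative cross-entropy, so larger scores rank higher.
    """
    score = 0.0
    for field, w in field_weights.items():
        rm = rel_models.get(field, {})
        rec = record_models.get(field, {})
        score += w * sum(p * math.log(rec.get(t, 0.0) + eps) for t, p in rm.items())
    return score
```

A record whose field language models place high probability on the relevance model's terms scores closer to zero (less negative) and therefore ranks above records whose fields miss those terms.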
