hemal thesis talk

Information about hemal thesis talk

Published on February 5, 2008

Author: Obama

Source: authorstream.com

Content

Query Processing over Incomplete Autonomous Web Databases:
MS Thesis Defense by Hemal Khatri
Committee Members: Prof. Subbarao Kambhampati (chair), Prof. Chitta Baral, Prof. Yi Chen, Prof. Huan Liu

Introduction to Web databases:
Many websites allow user queries through a form-based interface and are supported by backend databases.
Consider used-car sales websites such as Cars.com, Yahoo! Autos, etc.

Incompleteness in Web databases:
Web databases are often populated by lay individuals without any curation, e.g. Cars.com, Yahoo! Autos.
Web databases are also populated using automated information extraction techniques, which are inherently imperfect.
The local schema of a data source may not support certain attributes supported by the global schema.
Incomplete/Uncertain tuple: a tuple in which one or more attributes have a missing value.

Problem Statement:
Many entities corresponding to tuples with missing values might be relevant to the user query.
Current query processing techniques return only answers that exactly satisfy the user query.
Such techniques return results with high precision but low recall.
Relevant Uncertain tuple: a tuple which does not exactly satisfy the query predicates, but the entity represented by that tuple might be relevant to the query.
How to support query processing over incomplete autonomous databases in order to retrieve ranked uncertain results?
Example query Q: Make=Honda

Challenges Involved:
How to predict missing values in autonomous databases?
As autonomous databases are accessible only through form-based interfaces, how to retrieve relevant uncertain answers?
How to keep query processing cost manageable in retrieving uncertain tuples?
How to rank the retrieved uncertain answers?
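The problem statement above can be illustrated with a minimal sketch. The toy relation, the attribute names, and the helper functions below are all hypothetical and are not taken from the thesis; `None` stands in for a missing value. Exact matching on Q: Make=Honda returns only the certain answer and misses the incomplete tuples that might describe a Honda.

```python
# Hypothetical toy relation; None marks a missing (null) attribute value.
cars = [
    {"make": "Honda", "model": "Accord", "year": 2001},
    {"make": None,    "model": "Civic",  "year": 2003},  # incomplete tuple
    {"make": "BMW",   "model": "Z4",     "year": 2004},
    {"make": None,    "model": "Z4",     "year": 2005},  # incomplete tuple
]


def is_incomplete(t):
    """A tuple is incomplete if at least one attribute value is missing."""
    return any(v is None for v in t.values())


def certain_answers(relation, attr, value):
    """Tuples that exactly satisfy attr = value (traditional query semantics)."""
    return [t for t in relation if t[attr] == value]


def possibly_relevant(relation, attr):
    """Tuples whose value for the query attribute is missing.

    They do not satisfy the predicate exactly, but the entity they
    describe might still be relevant to the query.
    """
    return [t for t in relation if t[attr] is None]


exact = certain_answers(cars, "make", "Honda")
candidates = possibly_relevant(cars, "make")
print(len(exact), "certain answer(s);", len(candidates), "incomplete candidate(s)")
```

Which of the incomplete candidates should actually be shown, and in what order, is the ranking question the rest of the talk addresses.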
Related Work:
Probabilistic databases
  Incomplete databases are similar to probabilistic databases once we assess the probabilities for missing values.
  TRIO: uncertainty with lineage
  ConQuer: handling inconsistency over databases
  These systems assume probability distributions are given for uncertain or inconsistent attributes.
  We assess the probability distribution for the missing attribute and use it to rank rewritten queries to retrieve relevant answers, since the probabilities cannot be stored in the databases.
  Our query rewriting framework is general and can be used by these systems if the databases are autonomous.
Handling Missing Values
  EM algorithm, Bayes Net, Association rules

Possible Approaches:
For a query Q: body style = convt
1. Certain Answers Only (CAO): Return certain answers only, as in traditional databases. (Low recall)
2. All Uncertain Answers (AUA): Null matches any concrete value, hence return all answers having body style=convt along with answers having body style as null. (Low precision, infeasible)
3. Relevant Uncertain Answers (RUA): Rank answers by predicting values of the missing attribute. (Costly, infeasible)

Outline:
Introduction
QPIAD: Query Processing over Incomplete Autonomous Databases
Data Integration over Incomplete Autonomous Databases
Other Contributions
Conclusion

QPIAD System Architecture:
[Figure only; no slide text]

RRUA: Generating Rewritten Queries:
The Restricted Relevant Uncertain Answers (RRUA) approach retrieves only relevant incomplete tuples instead of retrieving all tuples as in AUA and RUA.
Consider a query Q: Body style=convt
Rewritten queries are based on the determining set from the AFD (approximate functional dependency) for Body style: Model ~~> Body style (confidence 0.9)
Q1: model='a4'
Q2: model='z4'
Q3: model='boxster'
Figure labels: Determining Attribute set (dtrSet); Base Result Set: RS(Q)

Learning Attribute Correlations:
AFD: VIN ~~> Model, where VIN is an Approximate Key (AKey) with high confidence.
VIN will not be useful for query rewriting and feature selection since it will not be able to retrieve additional new tuples.

RRUA: Ranking Rewritten Queries:
Not all queries are equally good at retrieving relevant answers.
"z4" model cars are more likely to be convertibles than "a4" model cars.
When database or network resources are limited, the mediator can choose to issue only the top K queries to get the most relevant uncertain answers.

Learning Value Distributions:
Used to rank queries based on the determining set of attributes from the AFD for the query attribute.
We use a Naïve Bayes Classifier with m-estimates, with the AFD as a feature selection step.
Rank of a rewritten query Qi: R(Qi) = P(Am=vm | ti), where ti ∈ Π_dtrSet(Am)(RS(Q))
Q1: model='a4', R(Q1) = P(bodystyle=convt | model=a4) = 0.4
Q2: model='z4', R(Q2) = P(bodystyle=convt | model=z4) = 1.0
Q3: model='boxster', R(Q3) = P(bodystyle=convt | model=boxster) = 0.7
R(Q2) > R(Q3) > R(Q1)
Relevant uncertain answers are ranked based on the rank of the rewritten query that retrieved them.

Combining AFDs and Classifiers:
More than one AFD may exist for some attributes.
Experimented with several approaches:
  Only the best AFD, having the highest confidence
  All attributes, ignoring AFDs
  Hybrid One-AFD
  Ensemble of classifiers

Empirical Evaluation of QPIAD:
Test Databases: AutoTrader database containing 100K tuples and Census database from the UCI Repository containing 50K tuples.
Oracular study: To evaluate the effectiveness of our system against a ground truth, we artificially insert missing values in 10% of the tuples within these databases.

RRUA vs AUA vs RUA:
[Figure only; no slide text]

Precision over Top K Tuples:
[Figure only; no slide text]

Ranking the Rewritten Queries:
[Figure panels: Cars database; Census database]

Robustness of QPIAD:
[Figure only; no slide text]

User Relevance Issues with QPIAD:
When the query processor presents incomplete tuples, it becomes a recommender system.
For a query Q: year=2000
How to convince users to believe the system's results?

Leveraging Correlations between Data Sources:
Mediator: GS(Make, Model, Year, Price, Mileage, Body style)
Q: Body style=coupe

Correlated Source and Maximum Correlated Source:
Consider four sources with schema:
  S1(Make, Model, Year, Price)
  S2(Engine, Drive, Body style), AFD: {Engine, Drive} -> Body style, confidence 0.7
  S3(Make, Model, Body style), AFD: Model -> Body style, confidence 0.8
  S4(Make, Price, Body style), AFD: {Make, Price} -> Body style, confidence 0.6
Mediator global schema GS(Make, Model, Year, Price, Body style, Engine, Drive)
S3 and S4 are correlated sources with S1 on the Body style attribute.
S3 is the maximum correlated source for S1 on the Body style attribute.

Retrieving Relevant Uncertain Answers from CarsDirect.com:
Consider a query Q: body style = coupe (GS)
Cars.com has an AFD: Model ~~> Body style (0.9)
Cars.com is the maximum correlated source for CarsDirect.com, which doesn't support Body style but supports the Model attribute.
Q1: model=Accord
Q2: model=Mustang
Q3: model=Legend
Q4: model=325

Empirical Evaluation of using Correlation between Data Sources:
We consider a mediator performing data integration over three sources: Cars.com, Yahoo! Autos and CarsDirect.com.
Yahoo! Autos and CarsDirect.com do not allow querying on body style, but when the tuples are retrieved we can check the body style attribute to determine whether a retrieved tuple has the body style specified in the query.
Evaluation uses attribute correlations and value distributions learned from Cars.com for 5 test queries on the body style attribute.

Retrieving Relevant Answers using Correlations from Cars.com:
[Figure only; no slide text]

Handling Joins over Incomplete Autonomous databases:
Mediator performing data integration across two sources:
  Source S1 is incomplete
  Source S2 is complete

Issues in Handling Joins:
Performing joins over probabilistic databases will lead to a disjunction in join results.
Consider joining uncertain tuples from the two sources.
Figure labels: alternatives joined by "or" with probabilities 0.6 and 0.4; Approximation

Handling Join Queries:
Q: σMake=Honda(UsedCars)
Assume AFDs: {Make, Year} ~~> Model, Model ~~> Make
Figure labels: 1.0, 0.6; 0.6 Civic, 0.4 Accord
Q1: Model=Odyssey: R(Q1)=1
Q2: Model=Accord: R(Q2)=1
Queries on source S2 to join:
Q3: Model=Odyssey: R(Q3)=1
Q4: Model=Accord: R(Q4)=1
Q5: Model=Civic: R(Q5)=0.6

Experimental Results Joins:
[Figure only; no slide text]

QUIC: Querying under Imprecision and Incompleteness:
Consider a query Q: model like Civic (Cars)
The user might be interested in similar cars like "Accord", "Camry", etc.
Ranking results in the presence of both similar and incomplete tuples.

Other Contributions [*Collaboration with Garrett Wolf]:
Handling multi-attribute selection queries for incomplete databases*
QUIC system for query processing under imprecision and incompleteness
Online learning of value distributions based on the base result set to avoid sample biases

Conclusion:
The thesis proposed a framework for query processing over incomplete autonomous web databases:
  QPIAD: Query processing over incomplete autonomous databases
  QPIAD: Data integration over multiple incomplete data sources
Results of empirical evaluation on real-world databases show that our system returns relevant answers with high precision while keeping the query processing cost manageable.

Thank You!! Questions??
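The RRUA rewriting and ranking steps described in the talk can be sketched as follows. This is a minimal illustration under stated assumptions, not the QPIAD implementation: the toy sample, the attribute names, the function names `train`, `posterior` and `rewrite_and_rank`, the uniform prior inside the m-estimate, and the choice m=1 are all hypothetical. It assumes the AFD Model ~~> Body style has already been mined, so the determining set is just `["model"]`.

```python
from collections import Counter


def train(sample, dtr_set, target):
    """Collect the counts needed by a Naive Bayes classifier over dtr_set."""
    rows = [t for t in sample
            if t[target] is not None and all(t[a] is not None for a in dtr_set)]
    class_counts = Counter(t[target] for t in rows)
    feat_counts = Counter((a, t[a], t[target]) for t in rows for a in dtr_set)
    domain_size = {a: len({t[a] for t in rows}) for a in dtr_set}
    return class_counts, feat_counts, domain_size


def posterior(model, dtr_values, target_value, m=1.0):
    """P(target = target_value | dtr_values), smoothing likelihoods with m-estimates."""
    class_counts, feat_counts, domain_size = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # class prior P(c)
        for a, v in dtr_values.items():
            p = 1.0 / domain_size[a]  # assumed uniform prior for the m-estimate
            score *= (feat_counts[(a, v, c)] + m * p) / (n_c + m)
        scores[c] = score
    return scores.get(target_value, 0.0) / sum(scores.values())


def rewrite_and_rank(base_result_set, model, dtr_set, target_value, k=None):
    """One rewritten query per distinct dtr_set projection of RS(Q), best first."""
    seen, ranked = set(), []
    for t in base_result_set:
        key = tuple(t[a] for a in dtr_set)
        if key in seen or None in key:
            continue
        seen.add(key)
        query = dict(zip(dtr_set, key))
        ranked.append((posterior(model, query, target_value), query))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked[:k]  # k=None keeps every rewritten query


def rows(model_name, body_style, n):
    return [{"model": model_name, "body_style": body_style} for _ in range(n)]


# Hypothetical sample of the database used for learning.
sample = (rows("a4", "convt", 2) + rows("a4", "sedan", 3) + rows("z4", "convt", 3)
          + rows("boxster", "convt", 2) + rows("boxster", "coupe", 1))

nb = train(sample, ["model"], "body_style")
base = [t for t in sample if t["body_style"] == "convt"]  # RS(Q) for Q: body style=convt
ranked = rewrite_and_rank(base, nb, ["model"], "convt")
for prob, query in ranked:
    print(query, round(prob, 3))
```

On this toy sample the rewritten queries come out in the same order as on the slide (z4, then boxster, then a4), although the probabilities differ from the slide's values because the data and the smoothing are made up. Passing `k` limits the mediator to the top K rewritten queries when resources are constrained.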
