SIGIR04

Information about SIGIR04

Published on November 20, 2007

Author: Danior

Source: authorstream.com

Content

Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval (CLIR)
Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, and Lee-Feng Chien
Academia Sinica, Taiwan

Outline:
- Introduction
- The proposed approaches: anchor-text-based approach; search-result-based approach
- Experiments
- Applications: LiveTrans (http://livetrans.iis.sinica.edu.tw/lt.html)
- Discussion & conclusions

Query Translation for CLIR [Problem]:
- Pipeline: source query S -> query translation (via translation dictionaries) -> translated query T -> monolingual IR.

Problem [Problem]:
- Most queries are proper nouns, e.g. "George Bush", "Sheffield", "Yahoo", "Document Classification".

Observation from Query Logs [Problem]:
- Most real queries are short: 2.3 English words [Silverstein '98] and 3.18 Chinese characters [Pu '02].
- Most are out-of-dictionary: 82.9% of high-frequency query terms.
- 12.4% of English queries for Chinese documents are unknown terms.
- Most of their Chinese translations are also found in the logs.
- Hence the demand for query translation.

The Web as Corpora [Idea]:
- The same pipeline, with the Web as the translation resource: anchor texts [Lu TOIS '04] and search-result pages.

Purpose [Idea]:
- To increase translation coverage: unknown queries, general domains.
- To improve CLIR performance: query expansion, combination of multiple translation approaches.
- To benefit cross-language Web search: speed.

Difference from Conventional Approaches [Idea]:
- (comparison table not reproduced in this transcript)

Our Ideas [Idea]:
- Anchor-text-based approach [Lu TOIS '04]
- Search-result-based approach

Anchor Text in Multiple Languages [Lu '04] [Idea]:
- Anchor text: the descriptive text of a link pointing to a Web page.

Probabilistic Inference Model [Lu '04] [Approach]:
- Based on page authority and co-occurrence (formula not reproduced in this transcript).

Drawbacks of the Anchor-Text-Based Approach [Approach]:
- Limited domains.
- Powerful spiders required.
- Large training corpora.
- More network bandwidth and storage.

Our Ideas (revisited) [Idea]:
- Anchor-text-based approach [Lu TOIS '04]
- Search-result-based approach

Multilingual Search-Result Pages [Idea]:
- Example: the Chinese search-result page returned for the English query "Yahoo", with its snippets.

Correct Translations [Idea]:
- The mixed-language characteristic of Chinese pages yields correct translations.

Relevant Translations [Idea]:
- Effective for query expansion.

Observation [Idea]:
- Coverage of top-ranked translation candidates in search-result pages: 95% for popular queries, 70% for random queries.
- Many relevant translations are also found.

Challenges:
- To extract translation candidates with correct lexical boundaries.
- To select correct or relevant translation candidates.
- To integrate translations extracted by different approaches to improve CLIR performance.

Search-Result-Based Approach [Approach]:
- Pipeline: source query S -> search engine(s) -> search-result pages -> term extraction -> translation candidates -> translation selection -> translated query T (a rough sketch of this pipeline follows below).
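The transcript names the pipeline stages but includes no code; below is a minimal sketch of the candidate-collection step, assuming the Chinese search-result snippets for the source query have already been fetched from a search engine. The function name extract_candidates and the toy snippets are illustrative only, not part of the paper or of LiveTrans.

```python
import re
from collections import Counter

# Common CJK character range; the snippets are assumed to have been fetched
# already from the search engine for the source query (hypothetical input).
CJK = re.compile(r"[\u4e00-\u9fff]+")

def extract_candidates(snippets, max_len=6):
    """Collect Chinese character n-grams from mixed-language snippets.

    Every maximal run of CJK characters is broken into n-grams (n <= max_len);
    these n-grams form the raw pool of translation candidates that the
    term-extraction step described next (SCPCD in the paper) would trim
    to terms with correct lexical boundaries.
    """
    counts = Counter()
    for snippet in snippets:
        for run in CJK.findall(snippet):
            for n in range(1, min(max_len, len(run)) + 1):
                for i in range(len(run) - n + 1):
                    counts[run[i:i + n]] += 1
    return counts

# Toy usage: two snippets that a Chinese result page for the English
# query "Yahoo" might contain.
snippets = ["Yahoo!奇摩提供免費信箱", "雅虎台灣 Yahoo! Taiwan 入口網站"]
print(extract_candidates(snippets).most_common(5))
```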
Challenge 1: Term Extraction [Approach]:
- SCP (Symmetric Conditional Probability): measures the cohesion holding the words together; low-frequency or long terms tend to be discarded [Silva '99].
- CD (Context Dependency): measures dependence on the left- or right-adjacent word/character; low-frequency or long terms can still be extracted [Chien '97].

Term Extraction (II) [Approach]:
- SCPCD: a combination of SCP and CD.
- Uses a PAT tree as the data structure and LocalMaxs as the key-term selection algorithm; no threshold is required.
- (performance chart not reproduced in this transcript)

Challenge 2: Translation Selection [Approach]:
- Given a query term S and candidates T1, T2, ..., Tn, estimate the similarity between S and each Ti.
- Example: query term "Yahoo"; translation candidates 雅虎 (Yahoo!), 奇摩 (Kimo), 雅虎台灣 (Yahoo! Taiwan).
- Similarity estimation: S and Ti frequently co-occur in the same pages (not true for synonyms); S and Ti have similar co-occurring context terms.

Chi-Square Test [Approach]:
- A statistical method based on co-occurrence.
- Each translation candidate needs only 3 Web searches to obtain its co-occurrence counts.

Boolean Query [Approach]:
- (example queries not reproduced in this transcript)

Context Vector Analysis [Approach]:
- A vector-space model that uses co-occurring context terms as feature vectors.
- Weighting scheme and similarity measure: (formulas not reproduced in this transcript).

Comparison of Chi-Square and Context Vector Methods [Approach]:
- (comparison table not reproduced in this transcript; FE = feature extraction, N = number of translation candidates)
- A rough sketch of both similarity scores follows below.
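Neither selection formula survives in this transcript, so the following is a minimal sketch of the two scores under common assumptions: a standard 2x2 chi-square statistic computed from Web page counts (the three searches being for s, for t, and for "s AND t", plus an estimated total page count), and plain cosine similarity between context-term vectors. The paper's exact weighting scheme may differ; the helper names and toy numbers are illustrative.

```python
import math
from collections import Counter

def chi_square(hits_s, hits_t, hits_st, total_pages):
    """2x2 chi-square score for a source term s and a candidate translation t.

    hits_s, hits_t and hits_st come from three Web searches (s, t, "s AND t");
    total_pages is an estimate of the number of pages indexed by the engine.
    """
    a = hits_st                   # pages containing both s and t
    b = hits_s - hits_st          # pages containing s but not t
    c = hits_t - hits_st          # pages containing t but not s
    d = total_pages - a - b - c   # pages containing neither
    num = total_pages * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def context_similarity(ctx_s, ctx_t):
    """Cosine similarity between the context-term vectors of s and t.

    ctx_s and ctx_t map context terms to weights (e.g. frequencies or
    tf-idf weights gathered from the search-result snippets of s and t).
    """
    dot = sum(w * ctx_t[term] for term, w in ctx_s.items() if term in ctx_t)
    norm = (math.sqrt(sum(w * w for w in ctx_s.values()))
            * math.sqrt(sum(w * w for w in ctx_t.values())))
    return dot / norm if norm else 0.0

# Toy usage: score one candidate translation of "Yahoo".
print(chi_square(hits_s=1_200_000, hits_t=800_000, hits_st=600_000,
                 total_pages=4_000_000_000))
print(context_similarity(Counter({"portal": 5, "email": 3, "search": 8}),
                         Counter({"portal": 4, "search": 6, "入口": 2})))
```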
Challenge 3: CLIR [Approach]:
- Retrieval model [Xu '01] (the formula is not reproduced in this transcript; a sketch follows below).

Estimation of P(s|t) [Approach]:
- Considers various ranges of similarity values.
- Combines, for each translation method m, the score ranking of a candidate in m with a weight assigned to m (the exact formula is not reproduced in this transcript).
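The retrieval model itself is only cited here, not reproduced. For reference, a commonly used form of the probabilistic model in [Xu '01] is sketched below; the smoothing weight and the general-corpus term are assumptions about the standard formulation, not read off the slides.

$$
P(Q_s \mid D_t) \;=\; \prod_{s \in Q_s} \Big[\, \alpha\, P(s \mid G_s) \;+\; (1-\alpha) \sum_{t} P(s \mid t)\, P(t \mid D_t) \,\Big]
$$

Here $Q_s$ is the source-language query, $D_t$ a target-language document, $G_s$ a general source-language collection used for smoothing, and $P(s \mid t)$ the translation probability that the previous slide estimates from the per-method score rankings and weights.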
Experiments [Evaluation]:
- Experiments on the NTCIR-2 English-Chinese task.
- Experiments on translating Web-query terms.
- Experiments on translating scientists' names and disease names (English to Chinese/Japanese/Korean).

Experiments on the NTCIR-2 English-Chinese Task [Evaluation]:
- (task setup not reproduced in this transcript)

Translation Performance [Evaluation]:
- Resources compared: Hong Kong law parallel text collection (238K paragraphs) [Kwok '01]; Web corpora; search results; anchor-text collection (109K URLs) [Lu '04]; search results + anchor texts.
- (result charts not reproduced in this transcript)

Performance Metric [Evaluation]:
- Top-k inclusion rate: the percentage of queries whose translations can be found among the first k extracted translations.

Translation Performance (II) [Evaluation]:
- CV has higher precision rates than X2.
- CV+X2 performs better than CV or X2 alone.

Translation Performance (III) [Evaluation]:
- AT has higher precision rates than CV+X2, while CV+X2 has higher coverage rates than AT: the two are complementary.
- CV+X2+AT has the best overall performance.

Extracted Correct Translations / Extracted Relevant Translations [Evaluation]:
- (example tables not reproduced in this transcript)

CLIR Performance [Evaluation]:
- Compared runs: Dic = LDC English-Chinese lexicon (102K entries); SR = X2+CV; SR+AT = X2+CV+AT; All = X2+CV+AT+dictionary.

CLIR Performance (II) [Evaluation]:
- Dic has higher precision rates than SR and SR+AT at K = 1 (Top-1 inclusion rates shown: 50.3%, 61.2%).

CLIR Performance (III) [Evaluation]:
- SR and SR+AT have higher precision rates than Dic when K > 3 (Top-3 inclusion rates shown: 68.0%, 78.1%).
- The curves start to converge.

CLIR Performance (IV) [Evaluation]:
- Using only the dictionary vs. using the dictionary plus our approaches.
- Improvement: 0.043, 0.061, 0.064, 0.059, 0.063, 0.064.
- OOV inclusion rates: 68.1%, 81.8%, 86.3%.
- CLIR performance improves by translating OOV terms.

Experiments on Translating Web-Query Terms [Evaluation]:
- Web-query logs and test query sets (details not reproduced in this transcript).

Web-Query Translation Performance [Evaluation]:
- Popular Web queries are translated better than random Web queries.
- AT performs worse for random Web queries.

Web-Query Translation Performance by Query Type [Evaluation]:
- Popular query set (search-result-based approach): Place > People > Computer & Network > Others > Organization.

Common Nouns and Verbs [Evaluation]:
- The proposed search-result-based approach is less reliable for common terms.

Experiments on Translating Scientists' Names and Technical Terms (English to Chinese/Japanese/Korean) [Evaluation]:
- (results not reproduced in this transcript)

An Example of Multilingual Translation [Evaluation]:
- (example not reproduced in this transcript)

Applications: LiveTrans (http://livetrans.iis.sinica.edu.tw/lt.html) [Application]:
- A cross-language meta-search engine.
- Provides an online translation service of query terms for cross-language Web search.
- Example query "Sheffield": the transliteration is returned, together with related terms such as industry, city in mid U.K., Sheffield Univ., and Sheffield Hallam Univ.

Discussion and Conclusions [Conclusion]:
- Advantages:
  - Can translate unknown queries to improve CLIR performance.
  - Can provide query expansion for CLIR.
  - Can extract translations with multiple meanings.
  - Flexible for query specification.
  - Useful for online cross-language Web search.
- Disadvantages:
  - Dependent on the employed search engines.
  - Does not perform well for common terms.
  - Not applicable to language pairs without the mixed-language characteristic.

Examples [Conclusion]:
- "Jaguar": Jaguar the car vs. jaguar the animal (multiple meanings).
- "SARS": severe acute respiratory syndrome, pneumonia, have a temperature of 38°C (relevant expansion terms).

Thank you for your attention! Q&A
