Cognate or False Friend? Ask the Web!


Published on December 7, 2007

Uploaded by: Goldie

Source: authorstream.com

Content

Cognate or False Friend? Ask the Web!
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Preslav Nakov, University of California, Berkeley
Elena Paskaleva, Bulgarian Academy of Sciences
A Workshop on Acquisition and Management of Multilingual Lexicons

Introduction
- Cognates are pairs of words in different languages that sound similar and are translations of each other.
- False friends are pairs of words in two languages that sound similar but differ in their meanings.
- The problem: design an algorithm that can distinguish between cognates and false friends.

Cognates and False Friends
- Examples of cognates: ден in Bulgarian = день in Russian (day); idea in English = идея in Bulgarian (idea).
- Examples of false friends: майка in Bulgarian (mother) ≠ майка in Russian (vest); prost in German (cheers) ≠ прост in Bulgarian (stupid); gift in German (poison) ≠ gift in English (present).

The Paper in One Slide
- Measuring semantic similarity: analyze the words' local contexts, using the Web as a corpus.
- Similar contexts imply similar words; translating the context gives cross-lingual similarity.
- Evaluation: 200 pairs of words (100 cognates and 100 false friends); 11pt average precision: 95.84%.

Contextual Web Similarity
- What is local context? A few words before and after the target word. The words in the local context of a given word are semantically related to it.
- Stop words (prepositions, pronouns, conjunctions, etc.) must be excluded, since they appear in all contexts. A sufficiently big corpus is needed.
- Example context for "flowers": "Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers."
- The Web as a corpus: the Web can be used as a corpus to extract the local context of a given word. It is the largest possible corpus and contains big corpora in any language.
- Searching for a word in Google returns up to 1,000 text excerpts; the target word comes with its local context (a few words before and after it), and the target language can be specified. (The slide shows a Google query for "flower"; the screenshot is not included in this transcript.)
- Measuring semantic similarity: for two given words, their local contexts are extracted from the Web as sets of words with their frequencies. Semantic similarity is then measured as the similarity between these local contexts: each context is represented as a frequency vector over a given set of words, and the cosine between the two frequency vectors is calculated (a short sketch follows the next slide).
- (The slides include example context-word frequencies and frequency vectors v1 for "flower" and v2 for "computer", with similarity = cosine(v1, v2); the tables are not included in this transcript.)

Cross-Lingual Similarity
- We are given two words in different languages L1 and L2, and a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}.
- Measuring cross-lingual similarity: we extract the local contexts of the target words from the Web, C1 (in L1) and C2 (in L2); we translate the context C1 into L2 using G, obtaining C1*; and we measure the similarity between C1* and C2.
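To make the two slides above concrete, here is a minimal Python sketch of the contextual and cross-lingual similarity computation. It is an illustration rather than the authors' implementation: it assumes the Web excerpts for each word have already been fetched (for example from a search engine API), that the glossary is a plain dictionary mapping L1 words to L2 words, and all function and variable names are hypothetical.

```python
import math
import re
from collections import Counter

def context_vector(excerpts, target, stop_words, window=3):
    """Count the words occurring within `window` positions of `target`
    in the given text excerpts, skipping stop words."""
    counts = Counter()
    for text in excerpts:
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token != target:
                continue
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            for neighbour in neighbours:
                if neighbour != target and neighbour not in stop_words:
                    counts[neighbour] += 1
    return counts

def cosine(v1, v2):
    """Cosine between two sparse frequency vectors (word -> count)."""
    dot = sum(freq * v2.get(word, 0) for word, freq in v1.items())
    norm1 = math.sqrt(sum(freq * freq for freq in v1.values()))
    norm2 = math.sqrt(sum(freq * freq for freq in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def cross_lingual_similarity(excerpts1, word1, stop1,
                             excerpts2, word2, stop2,
                             glossary, window=3):
    """Translate word1's context into word2's language via the glossary,
    then compare the two context vectors with the cosine measure."""
    c1 = context_vector(excerpts1, word1, stop1, window)
    c2 = context_vector(excerpts2, word2, stop2, window)
    c1_translated = Counter()
    for word, freq in c1.items():
        if word in glossary:   # context words missing from the glossary are dropped
            c1_translated[glossary[word]] += freq
    return cosine(c1_translated, c2)
```

With stop-word filtering, no lemmatization and a window of three words, this roughly corresponds to the WEB3 configuration evaluated later; TF.IDF weighting or lemmatization would be applied to the context vectors before the cosine step.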
Reverse Context Lookup
- Local context extracted from the Web can contain arbitrary parasite words like "online", "home", "search", "click", etc. Such Internet terms appear in any Web page and are not likely to be associated with the target word.
- Example (for the word "flowers"): "send flowers online", "flowers here", "order flowers here". Will the word "flowers" appear in the local context of "send", "online" and "here"?
- If two words are semantically related, both should appear in the local contexts of each other.
- Let #(x, y) be the number of occurrences of x in the local context of y. For any word w and a word wc from its local context, we define their strength of semantic association as p(w, wc) = min{ #(w, wc), #(wc, w) }.
- We use p(w, wc) as the vector coordinates when measuring semantic similarity (see the sketch below).
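A correspondingly small sketch of the reverse-context weighting, reusing the hypothetical context_vector helper from the previous snippet. The excerpts_for mapping (word to the excerpts retrieved for that word) is an assumption of the example; in the paper each context word would be looked up with its own Web query.

```python
from collections import Counter

def reverse_lookup_vector(word, excerpts_for, stop_words, window=3):
    """Weight each context word wc by p(word, wc) = min(#(wc, word), #(word, wc)),
    where #(x, y) counts occurrences of x in the local context of y.

    `excerpts_for` maps a word to the Web excerpts retrieved for it."""
    forward = context_vector(excerpts_for[word], word, stop_words, window)  # #(wc, word)
    weights = Counter()
    for wc, count_around_word in forward.items():
        # #(word, wc): how often `word` shows up in the context of wc,
        # measured on the excerpts retrieved for wc itself
        count_around_wc = context_vector(excerpts_for.get(wc, []), wc,
                                         stop_words, window)[word]
        weights[wc] = min(count_around_word, count_around_wc)
    return weights
```

The min has the effect described on the slide: a parasite word like "online" occurs around many target words, but the target word rarely occurs in the contexts retrieved for "online", so its coordinate is pushed toward zero.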
Web Similarity Using Seed Words
- An adaptation of the Fung & Yee '98 algorithm.*
- We have a bilingual glossary G: L1 → L2 of translation pairs, and target words w1 and w2.
- We search in Google for co-occurrences of the target words with the glossary entries.
- We compare the co-occurrence vectors: for each {p, q} ∈ G, we compare max(google#("w1 p"), google#("p w1")) with max(google#("w2 q"), google#("q w2")).
* P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414-420, 1998.

Evaluation Data Set
- We use 200 Bulgarian/Russian pairs of words: 100 cognates and 100 false friends.
- Manually assembled by a linguist and manually checked in several large monolingual and bilingual dictionaries.
- Limited to nouns only.

Experiments
- We tested several modifications of our contextual Web similarity algorithm: TF.IDF weighting, preserving the stop words, lemmatization of the context words, different context sizes (2, 3, 4 and 5), and small vs. large bilingual glossaries.
- We compared it with the seed words algorithm and with traditional orthographic similarity measures: LCSR and MEDR.
- The evaluated algorithms:
  BASELINE: random ordering
  MEDR: minimum edit distance ratio
  LCSR: longest common subsequence ratio
  SEED: the "seed words" algorithm
  WEB3: the Web-based similarity algorithm with the default parameters (context size = 3, small glossary, stop word filtering, no lemmatization, no reverse context lookup, no TF.IDF weighting)
  NO-STOP: WEB3 without stop word removal
  WEB1, WEB2, WEB4 and WEB5: WEB3 with context sizes of 1, 2, 4 and 5
  LEMMA: WEB3 with lemmatization
  HUGEDICT: WEB3 with the huge glossary
  REVERSE: the "reverse context lookup" algorithm
  COMBINED: WEB3 + lemmatization + huge glossary + reverse context lookup

Resources
- Bilingual Bulgarian/Russian glossary: 3,794 translation pairs.
- Huge bilingual glossary: 59,583 word pairs.
- A list of 599 Bulgarian stop words and a list of 508 Russian stop words.
- Bulgarian lemma dictionary: 1,000,000 wordforms and 70,000 lemmata.
- Russian lemma dictionary: 1,500,000 wordforms and 100,000 lemmata.

Evaluation
- We order the pairs of words from the test dataset by the calculated similarity; false friends are expected to appear at the top and cognates at the bottom.
- We evaluate the 11pt average precision of the obtained ordering (a sketch of the metric is given at the end of this transcript).

Results (11pt Average Precision)
- Comparison of the BASELINE, LCSR, MEDR, SEED and WEB3 algorithms (chart not included in this transcript).
- Comparison of different context sizes, keeping the stop words (chart not included in this transcript).
- Comparison of the different improvements of the WEB3 algorithm (chart not included in this transcript).

Results (Precision-Recall Graph)
- Comparison of the recall-precision graphs of the evaluated algorithms (graph not included in this transcript).
- The ordering produced by WEB3 (not included in this transcript).

Discussion
- Our approach is original because it introduces a semantic similarity measure rather than an orthographic or phonetic one; it uses the Web as a corpus and does not rely on any preexisting corpora; it uses reverse-context lookup, which brings a significant improvement in quality; and it is applied to an original problem: classifying almost identically spelled true/false friends.
- Accuracy is very good (over 95%), but not 100%. Typical mistakes involve synonyms, hyponyms, and words influenced by cultural, historical and geographical differences.
- The Web as a corpus introduces noise: Google returns only the first 1,000 results, it ranks news portals, travel agencies and retail sites higher than books, articles and forum posts, and the local context can contain noise.

Conclusion and Future Work
- Conclusion: an algorithm that can distinguish between cognates and false friends by analyzing the words' local contexts, using the Web as a corpus.
- Future work: better glossaries, automatic augmentation of the glossary, and different language pairs.

Questions?
Cognate or False Friend? Ask the Web!
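A closing note on the evaluation metric: the test pairs are ranked by similarity and the ranking is scored with 11pt average precision. The sketch below uses the common TREC-style interpolated definition (precision interpolated at recall levels 0.0, 0.1, ..., 1.0); whether the paper computes the metric in exactly this way is not stated in the slides, and the input format (a list of booleans marking false friends in ranked order) is an assumption of the example.

```python
def eleven_point_average_precision(relevant):
    """11pt interpolated average precision of a ranked list.

    `relevant[i]` is True if the item at rank i+1 belongs to the target
    class (here: the word pair is a false friend)."""
    total = sum(relevant)
    if total == 0:
        return 0.0
    hits = 0
    points = []                     # (recall, precision) after each rank
    for rank, is_hit in enumerate(relevant, start=1):
        if is_hit:
            hits += 1
        points.append((hits / total, hits / rank))
    # interpolated precision at the 11 recall levels 0.0, 0.1, ..., 1.0
    levels = [i / 10 for i in range(11)]
    interpolated = [max((p for r, p in points if r >= level), default=0.0)
                    for level in levels]
    return sum(interpolated) / len(interpolated)

# Toy usage: false friends concentrated near the top of the ranking score high.
print(eleven_point_average_precision([True, True, False, True, False, False]))
```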
