bcs 03 nottingham

Published on November 26, 2007

Author: Javier

Source: authorstream.com

Content

Multi-Source and MultiLingual Information Extraction
Diana Maynard
Natural Language Processing Group
University of Sheffield, UK
BCS-SIGAI Workshop, Nottingham Trent University, 12 September 2003

Outline
- Introduction to Information Extraction (IE)
- The MUSE system for Named Entity Recognition
- Multilingual MUSE
- Future directions

IE is not IR
- IE pulls facts and structured information from the content of large text collections (usually corpora)
- IR pulls documents from large text collections (usually the Web) in response to specific keywords

Extraction for Document Access
- With traditional query engines, getting the facts can be hard and slow
  - Where has the Queen visited in the last year?
  - Which places on the East Coast of the US have had cases of West Nile Virus?
- Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool.
- Even if results are not always accurate, they can be valuable if linked back to the original text

Extraction for Document Access
- For access to news
  - identify major relations and event types (e.g. within foreign affairs or business news)
- For access to scientific reports
  - identify principal relations of a scientific subfield (e.g. pharmacology, genomics)

Application Example (1)
- Ontotext’s KIM query and results

Application Example (2)

What is Named Entity Recognition?
- Identification of proper names in texts, and their classification into a set of predefined categories of interest
  - Persons
  - Organisations (companies, government organisations, committees, etc)
  - Locations (cities, countries, rivers, etc)
  - Date and time expressions
  - Various other types as appropriate

Basic Problems in NE
- Variation of NEs – e.g. John Smith, Mr Smith, John.
- Ambiguity of NE types:
  - John Smith (company vs. person)
  - June (person vs. month)
  - Washington (person vs. location)
  - 1945 (date vs. time)
- Ambiguity between common words and proper nouns, e.g. "may"

More complex problems in NE
- Issues of style, structure, domain, genre etc.
- Punctuation, spelling, spacing, formatting

  Dept. of Computing and Maths
  Manchester Metropolitan University
  Manchester
  United Kingdom

  > Tell me more about Leonardo
  > Da Vinci

Two kinds of approaches
Knowledge Engineering
- rule based
- developed by experienced language engineers
- make use of human intuition
- require only small amount of training data
- development can be very time consuming
- some changes may be hard to accommodate
Learning Systems
- use statistics or other machine learning
- developers do not need LE expertise
- require large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus

List lookup approach - baseline
- System that recognises only entities stored in its lists (gazetteers).
- Advantages - Simple, fast, language independent, easy to retarget (just create lists)
- Disadvantages - collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity

Shallow Parsing Approach (internal structure)
- Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location:
  - Cap. Word + {City, Forest, Center, River}, e.g. Sherwood Forest
  - Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street

Problems with the shallow parsing approach
- Ambiguously capitalised words (first word in sentence)
  - [All American Bank] vs. All [State Police]
- Semantic ambiguity
  - "John F. Kennedy" = airport (location)
  - "Philip Morris" = organisation
- Structural ambiguity
  - [Cable and Wireless] vs. [Microsoft] and [Dell]
  - [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

Shallow Parsing Approach with Context
- Use of context-based patterns is helpful in ambiguous cases
- "David Walton" and "Goldman Sachs" are indistinguishable
- But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly.

Identification of Contextual Information
- Use KWIC index and concordancer to find windows of context around entities
- Search for repeated contextual patterns of either strings, other entities, or both
- Manually post-edit list of patterns, and incorporate useful patterns into new rules
- Repeat with new entities

Examples of context patterns
- [PERSON] earns [MONEY]
- [PERSON] joined [ORGANIZATION]
- [PERSON] left [ORGANIZATION]
- [PERSON] joined [ORGANIZATION] as [JOBTITLE]
- [ORGANIZATION]'s [JOBTITLE] [PERSON]
- [ORGANIZATION] [JOBTITLE] [PERSON]
- the [ORGANIZATION] [JOBTITLE]
- part of the [ORGANIZATION]
- [ORGANIZATION] headquarters in [LOCATION]
- price of [ORGANIZATION]
- sale of [ORGANIZATION]
- investors in [ORGANIZATION]
- [ORGANIZATION] is worth [MONEY]
- [JOBTITLE] [PERSON]
- [PERSON], [JOBTITLE]

Caveats
- Patterns are only indicators based on likelihood
- Can set priorities based on frequency thresholds
- Need training data for each domain
- More semantic information would be useful (e.g. to cluster groups of verbs)

MUSE – MUlti-Source Entity Recognition
- An IE system developed within GATE
- Performs NE and coreference on different text types and genres
- Uses knowledge engineering approach with hand-crafted rules
- Performance rivals that of machine learning methods
- Easily adaptable

MUSE Modules
- Document format and genre analysis
- Tokenisation
- Sentence splitting
- POS tagging
- Gazetteer lookup
- Semantic grammar
- Orthographic coreference
- Nominal and pronominal coreference

Switching Controller
- Rather than have a fixed chain of processing resources, choices can be made automatically about which modules to use
- Texts are analysed for certain identifying features which are used to trigger different modules
- For example, texts with no case information may need different POS tagger or gazetteer lists
- Not all modules are language-dependent, so some can be reused directly

Multilingual MUSE
- MUSE has been adapted to deal with different languages
- Currently systems for English, French, German, Romanian, Bulgarian, Russian, Cebuano, Hindi, Chinese, Arabic
- Separation of language-dependent and language-independent modules and sub-modules
- Annotation projection experiments

IE in Surprise Languages
- Adaptation to an unknown language in a very short timespan
- Cebuano:
  - Latin script, capitalisation, words are spaced
  - Few resources and little work already done
  - Medium difficulty
- Hindi:
  - Non-Latin script, different encodings used, no capitalisation, words are spaced
  - Many resources available
  - Medium difficulty

What does multilingual NE require?
- Extensive support for non-Latin scripts and text encodings, including conversion utilities
  - Automatic recognition of encoding
  - Occupied up to 2/3 of the TIDES Hindi effort
- Bilingual dictionaries
- Annotated corpus for evaluation
- Internet resources for gazetteer list collection (e.g., phone books, yellow pages, bi-lingual pages)

Editing Multilingual Data
- GATE Unicode Kit (GUK)
- Complements Java’s facilities
- Support for defining Input Methods (IMs)
- currently 30 IMs for 17 languages
- Pluggable in other applications (e.g. JEdit)

Processing Multilingual Data
- All processing, visualisation and editing tools use GUK

Future directions
- Tools and techniques
  - Further incorporation of ML methods
  - Annotation projection experiments
  - Automatic pattern generation
  - Tools for morphological analysis and parsing
- Applications
  - Electronic text corpus of Sumerian literature
  - Tools for semantic web
  - Bioinformatics
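
Illustrative sketch (not part of the original slides)

The three rule-based ideas described in the slides (list lookup, internal evidence, and the "[Person] of [Organization]" context pattern) can be combined in a few lines of Python. This is a minimal sketch under stated assumptions and is not the MUSE or GATE implementation: the gazetteer entries, the function names, and the simplification of a "Cap. Word" to one capital letter followed by lowercase letters are all hypothetical choices made for illustration.

```python
import re

# Hypothetical toy gazetteer; a real system would load much larger lists.
GAZETTEER = {
    "David Walton": "PERSON",
    "Manchester": "LOCATION",
}

# Assumption: a "Cap. Word" is one capital letter followed by lowercase letters.
CAP_WORD = r"[A-Z][a-z]+"

# Internal evidence: Cap. Word + location keyword (keywords taken from the slides).
LOCATION_KEYWORDS = ("City", "Forest", "Center", "River",
                     "Street", "Boulevard", "Avenue", "Crescent", "Road")
INTERNAL_LOCATION = re.compile(
    rf"\b{CAP_WORD} (?:{'|'.join(LOCATION_KEYWORDS)})\b"
)


def lookup(text):
    """List lookup baseline: return (start, end, type) for each gazetteer hit."""
    spans = []
    for name, etype in GAZETTEER.items():
        for m in re.finditer(rf"\b{re.escape(name)}\b", text):
            spans.append((m.start(), m.end(), etype))
    return spans


def internal_locations(text):
    """Internal evidence: names such as 'Sherwood Forest' or 'Portobello Street'."""
    return [(m.start(), m.end(), "LOCATION")
            for m in INTERNAL_LOCATION.finditer(text)]


def person_of_org(text, spans):
    """Context pattern '[Person] of [Organization]' applied to known persons."""
    found = []
    for _, end, etype in spans:
        if etype != "PERSON":
            continue
        # Look for ' of ' followed by a run of capitalised words.
        m = re.match(rf" of ({CAP_WORD}(?: {CAP_WORD})*)", text[end:])
        if m:
            found.append((end + m.start(1), end + m.end(1), "ORGANIZATION"))
    return found


def annotate(text):
    """Run all three steps and return (entity text, type) pairs in text order."""
    spans = lookup(text) + internal_locations(text)
    spans += person_of_org(text, spans)
    return [(text[s:e], t) for s, e, t in sorted(spans)]


if __name__ == "__main__":
    sentence = ("David Walton of Goldman Sachs walked down "
                "Portobello Street in Manchester.")
    for entity, etype in annotate(sentence):
        print(f"{etype:13} {entity}")
```

In this sketch "Goldman Sachs" is absent from the gazetteer and is only labelled because the already recognised person "David Walton" precedes " of ". The internal-evidence rule also inherits the weaknesses listed in the slides: a sentence-initial capitalised word followed by a keyword (for example "The Street") would be labelled as a location.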
