Text Information Retrieval Mining and Exploitation


Published on September 3, 2007

Author: Gourmet

Source: authorstream.com

Content

CS276B Text Information Retrieval, Mining, and Exploitation
  Lecture 12: Text Mining I
  Feb 25, 2003
  (includes slides borrowed from Marti Hearst)

The Reason for Text Mining…

Corporate Knowledge 'Ore'
  Email
  Insurance claims
  News articles
  Web pages
  Patent portfolios
  IRC
  Scientific articles
  Customer complaint letters
  Contracts
  Transcripts of phone calls with customers
  Technical documents

Text Knowledge Extraction Tasks
  Small Stuff. Useful nuggets of information that a user wants:
    Question Answering
    Information Extraction (DB filling)
    Thesaurus Generation
  Big Stuff. Overviews:
    Summary Extraction (documents or collections)
    Categorization (documents)
    Clustering (collections)
  Text Data Mining: Interesting unknown correlations that one can discover

Text Mining
  The foundation of most commercial 'text mining' products is all the stuff we have already covered:
    Information Retrieval engine
    Web spider/search
    Text classification
    Text clustering
    Named entity recognition
    Information extraction (only sometimes)
  Is this text mining? What else is needed?

One tool: Question Answering
  Goal: Use an encyclopedia or other source to answer 'Trivial Pursuit-style' factoid questions
  Example: 'What famed English site is found on Salisbury Plain?'
  Method:
    Heuristics about question type: who, when, where
    Match up noun phrases within and across documents (much use of named entities)
    Coreference is a classic IE problem too!
  More focused response to user need than standard vector space IR
  Murax, Kupiec, SIGIR 1993; huge amount of recent work

Another tool: Summarizing
  High-level summary or survey of all main points?
  How to summarize a collection?
  Example: sentence extraction from a single document (Kupiec et al. 1995; much subsequent work)
    Start with a training set, which allows evaluation
    Create heuristics to identify important sentences: position, IR score, particular discourse cues
    Classification function estimates the probability that a given sentence is included in the abstract
    42% average precision

IBM Text Miner terminology: Example of Vocabulary found
  Certificate of deposit
  CMOs
  Commercial bank
  Commercial paper
  Commercial Union Assurance
  Commodity Futures Trading Commission
  Consul Restaurant
  Convertible bond
  Credit facility
  Credit line
  Debt security
  Debtor country
  Detroit Edison
  Digital Equipment
  Dollars of debt
  End-March
  Enserch
  Equity warrant
  Eurodollar
  …

What is Text Data Mining?
  People's first thought: Make it easier to find things on the Web.
    But this is information retrieval!
  The metaphor of extracting ore from rock:
    Does make sense for extracting documents of interest from a huge pile.
    But does not reflect notions of DM in practice. Rather:
      finding patterns across large collections
      discovering heretofore unknown information

Real Text DM
  What would finding a pattern across a large text collection really look like?
  Discovering heretofore unknown information is not what we usually do with text.
    (If it weren't known, it could not have been written by someone!)
  However, there is a field whose goal is to learn about patterns in text for its own sake …
  Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

Definitions of Text Mining
  Text mining is mainly about extracting information and knowledge from text. Two definitions:
    Any operation related to gathering and analyzing text from external sources for business intelligence purposes
    Discovery of knowledge previously unknown to the user in text
  Text mining is the process of compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers and to discover relationships between related facts that span wide domains of inquiry.

TDM using Metadata (instead of Text)
  Data: Reuters newswire (22,000 articles, late 1980s)
  Categories: commodities, time, countries, people, and topic
  Goals:
    distributions of categories across time (trends)
    distributions of categories between collections
    category co-occurrence (e.g., topic|country)
  Interactive interface: lists, pie charts, 2D line plots
  (Dagan, Feldman, and Hirsh, SDAIR '96)

True Text Data Mining: Don Swanson's Medical Work
  Given
    medical titles and abstracts
    a problem (incurable rare disease)
    some medical expertise
  find causal links among titles
    symptoms
    drugs
    results
  E.g.: Magnesium deficiency related to migraine
    This was found by extracting features from medical literature on migraines and nutrition

Swanson Example (1991)
  Problem: Migraine headaches (M)
    Stress is associated with migraines;
    Stress can lead to a loss of magnesium;
    Calcium channel blockers prevent some migraines;
    Magnesium is a natural calcium channel blocker;
    Spreading cortical depression (SCD) is implicated in some migraines;
    High levels of magnesium inhibit SCD;
    Migraine patients have high platelet aggregability;
    Magnesium can suppress platelet aggregability.
  All extracted from medical journal titles

Swanson's TDM
  Two of his hypotheses have received some experimental verification.
  His technique
    Only partially automated
    Required medical expertise
  Few people are working on this kind of information aggregation problem.

Gathering Evidence
  [Figure: migraine and magnesium connected through stress, CCB, PA, and SCD; literature sets labeled All Nutrition Research and All Migraine Research]

Or maybe it was already known?

Extracting Metadata from documents

Why metadata?
  Metadata = 'data about data'
  'Normalized' semantics
  Enables easy searches otherwise not possible:
    Time
    Author
    URL / filename
  And gives information on non-text content
    Images
    Audio
    Video

For Effective Metadata We Need:
  Semantics
    Commonly understood terms to describe information resources
  Syntax
    Standard grammar for connecting terms into meaningful 'sentences'
  Exchange framework
    So we can recombine and exchange metadata across applications and subjects

Dublin Core Element Set
  Title (e.g., Dublin Core Element Set)
  Creator (e.g., Hinrich Schuetze)
  Subject (e.g., keywords)
  Description (e.g., an abstract)
  Publisher (e.g., Stanford University)
  Contributor (e.g., Chris Manning)
  Date (e.g., 2002.12.03)
  Type (e.g., presentation)
  Format (e.g., ppt)
  Identifier (e.g., http://www.stanford.edu/class/cs276a/syllabus.html)
  Source (e.g., http://dublincore.org/documents/dces/)
  Language (e.g., English)
  Coverage (e.g., San Francisco Bay Area)
  Rights (e.g., Copyright Stanford University)

RDF = Resource Description Framework
  Emerging standard for metadata
  W3C standard
    Part of W3C's metadata framework
  Specialized for WWW
  Desiderata
    Combine different metadata modules (e.g., different subject areas)
    Syndication, aggregation, threading

RDF example in XML
  <?xml version='1.0'?>
  <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
           xmlns:dc='http://purl.org/dc/elements/1.1/'>
    <rdf:Description rdf:about='http://www.ilrt.org/people/cmdjb/'>
      <dc:title>Dave Beckett's Home Page</dc:title>
      <dc:creator>Dave Beckett</dc:creator>
      <dc:publisher>ILRT, University of Bristol</dc:publisher>
    </rdf:Description>
  </rdf:RDF>

RDF example
  [Figure: graph in which the node 'My Homepage' is linked by the edges 'has a title of', 'created by', and 'published by' to 'Dave Beckett's Home Page', 'Dave Beckett', and 'ILRT, University of Bristol']

Resource Description Framework (RDF)
  RDF was conceived as a way to wrap metadata assertions (e.g., Dublin Core information) around a web resource.
  The central concept of the RDF data model is the triple, represented as a labeled edge between two nodes.
    The subject, the object, and the predicate are all resources, represented by URIs
  Properties can be multivalued for a resource, and values can be literals instead of resources
  Graph pieces can be chained and nested
  RDF Schema gives a frame-based language for ontologies and reasoning over RDF.
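
As a rough illustration of the triple model described on the RDF slides, the three Dublin Core statements from the XML example can be held as plain (subject, predicate, object) tuples. This is only a sketch: the Triple type, the objects_of helper, and the variable names are hypothetical and do not come from the lecture or from any RDF library.

```python
from typing import NamedTuple

# Dublin Core namespace, as declared in the XML example.
DC = "http://purl.org/dc/elements/1.1/"


class Triple(NamedTuple):
    """One RDF statement: a labeled edge from subject to object (hypothetical type)."""
    subject: str    # URI of the resource being described
    predicate: str  # URI of the property (the edge label)
    obj: str        # a URI or, as here, a literal value


PAGE = "http://www.ilrt.org/people/cmdjb/"

# The three statements from the RDF/XML example, held as a set of triples.
graph = {
    Triple(PAGE, DC + "title", "Dave Beckett's Home Page"),
    Triple(PAGE, DC + "creator", "Dave Beckett"),
    Triple(PAGE, DC + "publisher", "ILRT, University of Bristol"),
}


def objects_of(graph, subject, predicate):
    """Return every value of one property for one resource.

    A list is returned because properties can be multivalued.
    """
    return sorted(t.obj for t in graph
                  if t.subject == subject and t.predicate == predicate)


creators = objects_of(graph, PAGE, DC + "creator")
```

Storing the graph as a set of tuples keeps the point of the slide visible: merging metadata from two sources is just a set union, which is what makes recombining modules straightforward.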
  [Figure: example triple with the nodes mailto:[email protected] and http://www.infoloom.com and the edge label http://purl.org/DC/elements/1.1#Creator]

Metadata Pros and Cons
  CONS
    Most authors are unwilling to spend time and energy on
      learning a metadata standard
      annotating documents they author
    Authors are unable to foresee all reasons why a document may be interesting.
    Authors may be motivated to sabotage metadata (patents).
  PROS
    Information retrieval often does not work.
    Words poorly approximate meaning.
    For truly valuable content, it pays to add metadata.
  Synthesis
    In reality, most documents have some valuable metadata
    If metadata is available, it improves relevance and user experience
    But most interesting content will always have inconsistent and spotty metadata coverage

Metadata and TextCat/IE
  The claim of metadata proponents is that metadata has to be explicitly annotated, because we can't hope to get, say, a book price from varied documents like:
    <H1> <The Rhyme of the Ancient Mariner> </H1>
    <i>The Rhyme of the Ancient Mariner</i>, by Samuel Coleridge, is available for the low price of $9.99. This Dover reprint is beautifully illustrated by Gustave Dore.
    <p> Julian Schnabel recently directed a movie, <i>Pandemonium</i>, about the relationship between Coleridge and Wordsworth.

Metadata and TextCat/IE
  … but with IE/TextCat, these are exactly the kind of things we can do
  Of course, we can do it more accurately with human-authored metadata
  But, of course, the metadata might not match the text (metadata spamming)
  Opens up an interesting world where agents use metadata if it's there, but can synthesize it if it isn't (by TextCat/IE), and can verify metadata for correctness against text
  Seems a promising area; not much explored!

Lexicon Construction

What is a Lexicon?
  A database of the vocabulary of a particular domain (or a language)
  More than a list of words/phrases
  Usually some linguistic information
    Morphology (manag- e/es/ing/ed -> manage)
    Syntactic patterns (transitivity etc.)
  Often some semantic information
    Is-a hierarchy
    Synonymy

Lexica in Text Mining
  Many text mining tasks require named entity recognition.
  Named entity recognition requires a lexicon in most cases.
  Example 1: Question answering
    Where is Mount Everest?
    A list of geographic locations increases accuracy
  Example 2: Information extraction
    Consider scraping book data from amazon.com
    Template contains field 'publisher'
    A list of publishers increases accuracy
  Manual construction is expensive: 1000s of person hours!
  Sometimes an unstructured inventory is sufficient
  Often you need more structure, e.g., hierarchy

Lexicon Construction (Riloff)
  Attempt 1: Iterative expansion of phrase list
  1. Start with:
       Large text corpus
       List of seed words
  2. Identify 'good' seed word contexts
  3. Collect close nouns in contexts
  4. Compute confidence scores for nouns
  5. Iteratively add high-confidence nouns to seed word list. Go to 2.
  Output: Ranked list of candidates

Lexicon Construction: Example
  Category: weapon
  Seed words: bomb, dynamite, explosives
  Context: <new-phrase> and <seed-phrase>
  Iterate:
    Context: They use TNT and other explosives.
    Add word: TNT
  Other words added by algorithm: rockets, bombs, missile, arms, bullets

Lexicon Construction: Attempt 2
  Multilevel bootstrapping (Riloff and Jones 99)
  Generate two data structures in parallel
    The lexicon
    A list of extraction patterns
  Input as before
    Corpus (not annotated)
    List of seed words

Multilevel Bootstrapping
  Initial lexicon: seed words
  Level 1: Mutual bootstrapping
    Extraction patterns are learned from lexicon entries.
    New lexicon entries are learned from extraction patterns
    Iterate
  Level 2: Filter lexicon
    Retain only most reliable lexicon entries
    Go back to level 1
  2-level performs better than just level 1.

Scoring of Patterns
  Example
    Concept: company
    Pattern: owned by <x>
  Patterns are scored as follows
    score(pattern) = (F / N) * log(F)
    F = number of unique lexicon entries produced by the pattern
    N = total number of unique phrases produced by the pattern
  Selects for patterns that are
    Selective (F/N part)
    Have a high yield (log(F) part)

Scoring of Noun Phrases
  Noun phrases are scored as follows
    score(NP) = sum_k (1 + 0.01 * score(pattern_k))
    where we sum over all patterns that fire for NP
  Main criterion is number of independent patterns that fire for this NP.
  Give higher score for NPs found by high-confidence patterns.
  Example:
    New candidate phrase: boeing
    Occurs in: owned by <x>, sold to <x>, offices of <x>

Shallow Parsing
  Shallow parsing needed
    For identifying noun phrases and their heads
    For generating extraction patterns
  For scoring, when are two noun phrases the same?
    Head phrase matching
    X matches Y if X is the rightmost substring of Y
    'New Zealand' matches 'Eastern New Zealand'
    'New Zealand cheese' does not match 'New Zealand'

Seed Words

Mutual Bootstrapping

Extraction Patterns

Level 1: Mutual Bootstrapping
  Drift can occur.
  It only takes one bad apple to spoil the barrel.
  Example: head
  Introduce level 2 bootstrapping to prevent drift.

Level 2: Meta-Bootstrapping

Evaluation

Collins & Singer: CoTraining
  Similar back and forth between an extraction algorithm and a lexicon
  New: They use word-internal features
    Is the word all caps? (IBM)
    Is the word all caps with at least one period? (N.Y.)
    Non-alphabetic character? (AT&T)
    The constituent words of the phrase ('Bill' is a feature of the phrase 'Bill Clinton')
  Classification formalism: Decision Lists

Collins & Singer: Seed Words
  Note that categories are more generic than in the case of Riloff/Jones.

Collins & Singer: Algorithm
  Train decision rules on current lexicon (initially: seed words).
    Result: new set of decision rules.
  Apply decision rules to training set
    Result: new lexicon
  Repeat

Collins & Singer: Results
  Per-token evaluation?

Lexica: Limitations
  Named entity recognition is more than lookup in a list.
  Linguistic variation
    Manage, manages, managed, managing
  Non-linguistic variation
    Human gene MYH6 in lexicon, MYH7 in text
  Ambiguity
    What if a phrase has two different semantic classes?
    Bioinformatics example: gene/protein metonymy

Lexica: Limitations - Ambiguity
  Metonymy is a widespread source of ambiguity.
  Metonymy: A figure of speech in which one word or phrase is substituted for another with which it is closely associated. (king – crown)
  Gene/protein metonymy
    The gene name is often used for its protein product.
    TIMP1 inhibits the HIV protease.
    TIMP1 could be a gene or protein.
    Important difference if you are searching for TIMP1 protein/protein interactions.
  Some form of disambiguation necessary to identify correct sense.

Discussion
  Partial resources often available.
    E.g., you have a gazetteer, you want to extend it to a new geographic area.
  Some manual post-editing necessary for high quality.
  Semi-automated approaches offer good coverage with much reduced human effort.
  Drift not a problem in practice if there is a human in the loop anyway.
  Approach that can deal with diverse evidence preferable.
  Hand-crafted features (period for 'N.Y.') help a lot.
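
Looking back at the multilevel bootstrapping slides, the two scoring rules and the head-phrase matching test can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the toy data are invented for illustration, the logarithm is taken as base 2 (the slide only says log(F)), and lexicon membership is tested by exact string match rather than head matching.

```python
import math


def score_pattern(extracted_phrases, lexicon):
    """Score one extraction pattern as (F / N) * log2(F).

    F = unique extracted phrases that are already lexicon entries,
    N = all unique phrases the pattern extracted.
    Base 2 is an assumption; the slide does not name the base.
    """
    unique = set(extracted_phrases)
    f = len(unique & set(lexicon))
    if f == 0:
        return 0.0  # no lexicon hits; also avoids log(0)
    return (f / len(unique)) * math.log2(f)


def score_noun_phrase(firing_pattern_scores):
    """Sum (1 + 0.01 * score) over the patterns that fire for a noun phrase."""
    return sum(1 + 0.01 * s for s in firing_pattern_scores)


def head_match(x, y):
    """True if phrase x is the rightmost word sequence of phrase y."""
    xw, yw = x.lower().split(), y.lower().split()
    return bool(xw) and len(xw) <= len(yw) and yw[-len(xw):] == xw


# Toy data, invented for illustration (not from the lecture).
lexicon = {"ibm", "intel"}
extractions = {
    "owned by <x>": ["ibm", "intel", "the family"],
    "sold to <x>": ["ibm", "boeing"],
}
pattern_scores = {p: score_pattern(phrases, lexicon)
                  for p, phrases in extractions.items()}

# Score the candidate "boeing" using only the patterns that extracted it.
firing = [pattern_scores[p] for p, phrases in extractions.items()
          if "boeing" in phrases]
boeing_score = score_noun_phrase(firing)
```

One consequence of the pattern formula is visible in the toy data: a pattern with a single lexicon hit scores 0 because log(1) = 0, so the noun-phrase score is dominated by the count of firing patterns, which matches the slide's remark that the number of independent patterns is the main criterion.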

Terminology Acquisition
  Goal: find heretofore unknown noun phrases in a text corpus (similar to lexicon construction)
  Lexicon construction
    Emphasis on finding noun phrases in a specific semantic class (companies)
    Application: Information extraction
  Terminology Acquisition
    Emphasis on term normalization (e.g., viral and bacterial infections -> viral_infection)
    Applications: translation dictionaries, information retrieval

Lexica For Research Index
  Lexica of which classes would be useful?

References
  Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. http://citeseer.nj.nec.com/kupiec95trainable.html
  Julian Kupiec. Murax: A robust linguistic approach for question answering using an on-line encyclopedia. In Proceedings of the 16th SIGIR Conference, Pittsburgh, PA, 1993.
  Don R. Swanson. Analysis of Unintended Connections Between Disjoint Science Literatures. SIGIR 1991: 280-289.
  Tim Berners-Lee on the semantic web: http://www.sciam.com/2001/0501issue/0501berners-lee.html
  http://www.xml.com/pub/a/2001/01/24/rdf.html
  Ellen Riloff and Rosie Jones. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
  Michael Collins and Yoram Singer. Unsupervised Models for Named Entity Classification. 1999.
