SemTag and Seeker

Published on November 20, 2007

Author: Francisco

Source: authorstream.com

Content

SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation
Presented by: Samir Tartir, Fall 2004
Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, Jason Y. Zien
IBM Almaden Research Center
http://www.almaden.ibm.com/webfountain/resources/semtag.pdf

Outline:
Motivation
Related work
TAP
SemTag: Architecture, Phases, TBD, Results, Methodology
Seeker: Design, Architecture, Environment
Conclusion and Future work

Motivation:
Natural language processing is the most significant obstacle in automating web annotation.
For the Semantic Web to become a reality we need:
Web services to maintain metadata.
Annotated documents (OWL, RDF, XML, ...).

Motivation, Cont’d:
Problem: Applications that can use such data are needed, but these applications cannot be useful unless there is enough semantically tagged data.

Related Work:
Systems built as a result of the Semantic Web fall into two types: ontology creation and page annotation.
Examples: Protégé, OntoAnnotate, Annotea, SHOE, …
Some AI approaches were used, but they need a lot of training.
Some used other NL understanding techniques, e.g. ALPHA.

TAP:
SemTag uses TAP.
TAP is a public, broad, shallow knowledge base.
TAP contains lexical and taxonomic information about popular objects such as music, movies, sports, etc.

SemTag:
Applied to a collection of 264 million web pages, and generated 434 million automatically disambiguated semantic tags, published to the web as a label bureau (http://www.w3.org/PICS).
SemTag uses the TBD (taxonomy-based disambiguation) algorithm to resolve natural language ambiguities.

SemTag:
The goal is to add semantic tags to the existing HTML body of the web.
Example: “The Chicago Bulls announced yesterday that Michael Jordan will…” becomes:
“The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan_Michael">Michael Jordan</resource> will...”
Uses a Semantic Label Bureau to store the resulting annotations and to query the results.

SemTag Architecture:

SemTag Phases:
1. Spotting:
Retrieve documents from Seeker.
Tokenize documents.
Find contexts (10 words + label + 10 words) related to TAP nodes.
2. Learning:
Use a representative sample to determine the distribution of terms in the taxonomy.

SemTag Phases, cont’d:
3. Tagging:
Disambiguate windows (using TBD).
Add the results to the database.

TBD Overview:
Ambiguity types:
The same label appears at multiple locations in TAP.
Some entities have labels that occur in contexts not covered by TAP.
Training is done with:
Automatic metadata: the larger part.
Manual metadata: the smaller part, only for highly ambiguous labels.

TBD Overview, cont’d:
Each node has a set of labels. E.g., the nodes for cats, football, and cars all contain the label Jaguar.
A spot is a label in a context.
Each internal node in TAP has a similarity function that determines whether a particular context belongs to that node.

The Sim Algorithm:

The TBD Algorithm:

SemTag Results:
Applied to 264 million pages.
Produced 550 million labels and 434 million spots.
Accuracy: 82%.

SemTag Methodology:
1. Lexicon generation:
Built a collection of 1.4 million unique words occurring in a random subset of windows containing approximately 90 million total words.
Took the 200,100 most frequent words.
Removed the 100 most frequent of those.
The remaining 200,000 words are used in further computations.

SemTag Methodology, cont’d:
2. Similarity functions: Used the distribution of words at TAP nodes to derive f_u.
3. Measurement values: Used human judges to find m_u^a for the 24 largest TAP nodes.
4. Full TBD processing.
5. Evaluation: Compared TBD results with human judgments.

Seeker:
A platform used by SemTag and other increasingly sophisticated text analytics applications.
Provides scalable, extensible extraction from erratic resources.
An erratic resource is one that may have limited availability, a rapid rate of change, contain conflicting or questionable content, or may be impossible to ingest in totality (e.g., the World Wide Web).

Seeker Design Goals:
Composability: Ability to develop complex annotations from simple ones.
Modularity: Support for different annotation methodologies.
Extensibility: Ability to cope with the rapid evolution of technologies.

Seeker Design Goals, Cont’d:
Scalability: Ability to run on a sample as well as on all the data.
Robustness: Dealing with failure.

Seeker Design:
To achieve modularity and extensibility, an SOA (service-oriented architecture) was used, in which communication is done through language-independent, network-level APIs.
To achieve scalability and robustness, some components were considered “infrastructure”.

Seeker Architecture:

Seeker Environment:
128 dual-processor, 1 GHz machines.
Each machine is attached via a switched gigabit network to a half terabyte of network-attached storage.
IO occupies one of the 128 machines.

Seeker Substrate:
SOA: a local-area, loosely coupled, pull-based, distributed system.
Requirements: high speed, high availability, and efficient support for multiple programming languages.
Vinci: A SOAP-derived package for high-performance intranet applications.
Uses lightly encoded XML over raw TCP sockets to provide the required speed.

Conclusion:
Automatic semantic tagging is essential to bootstrap the Semantic Web.
It’s possible to achieve good accuracy with simple disambiguation approaches.

Future Work:
Develop more approaches and algorithms for automated tagging.
Make the annotated data public and offer Seeker as a public service.

Questions?
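
Spotting sketch:
The spotting phase described in the slides (tokenize a document, then keep 10 words + label + 10 words around each TAP label) can be illustrated roughly as follows. This is a minimal sketch, not SemTag's actual implementation: the regular-expression tokenizer, the function names tokenize and find_spots, and the two-label list standing in for TAP are all assumptions made for the example.

```python
import re

WINDOW = 10  # words of context kept on each side of a label, per the Spotting slide


def tokenize(text):
    """Split text into word tokens (a simple tokenizer assumed for this sketch)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def find_spots(text, labels, window=WINDOW):
    """Return (label, left_context, right_context) for each label occurrence.

    `labels` is a hypothetical stand-in for the labels attached to TAP nodes.
    Matching is case-insensitive and supports multi-word labels.
    """
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    spots = []
    for label in labels:
        parts = [p.lower() for p in tokenize(label)]
        n = len(parts)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == parts:
                # Up to `window` words before and after the label.
                left = tokens[max(0, i - window):i]
                right = tokens[i + n:i + n + window]
                spots.append((label, left, right))
    return spots


text = "The Chicago Bulls announced yesterday that Michael Jordan will return"
for label, left, right in find_spots(text, ["Chicago Bulls", "Michael Jordan"]):
    print(label, "|", " ".join(left), "|", " ".join(right))
```

Each returned window is what the slides call a spot (a label in a context); in SemTag these windows would then be passed to TBD for disambiguation, which this sketch does not attempt.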
