Published on November 20, 2007
SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation
Presented by: Samir Tartir, Fall 2004
Authors: Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, Jason Y. Zien (IBM Almaden Research Center)
http://www.almaden.ibm.com/webfountain/resources/semtag.pdf

Outline
- Motivation
- Related work
- TAP
- SemTag: architecture, phases, TBD, results, methodology
- Seeker: design goals, architecture, environment
- Conclusion and future work

Motivation
- Natural language processing is the most significant obstacle in automating web annotation.
- For the Semantic Web to become a reality, we need:
  - Web services to maintain metadata.
  - Annotated documents (OWL, RDF, XML, ...).
- Problem: applications that can use such data are needed, but these applications cannot be useful unless there is enough semantically tagged data.

Related Work
- Systems built for the Semantic Web fall into two types: ontology creation and page annotation. Examples: Protégé, OntoAnnotate, Annotea, SHOE, ...
- Some AI approaches have been used, but they require a lot of training.
- Others use natural-language understanding techniques, e.g., ALPHA.

TAP
- SemTag uses TAP, a public, broad, shallow knowledge base.
- TAP contains lexical and taxonomic information about popular objects such as music, movies, and sports.

SemTag
- Applied to a collection of 264 million web pages; generated 434 million automatically disambiguated semantic tags, published to the web as a label bureau (http://www.w3.org/PICS).
- Uses the TBD (Taxonomy-Based Disambiguation) algorithm to resolve natural-language ambiguities.
- The goal is to add semantic tags to the existing HTML body of the web.
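The kind of in-place HTML tagging SemTag aims for can be sketched with a small, hypothetical helper (the `annotate` function and the `TAP_RESOURCES` mapping are illustrative; only the two URIs come from the slide's example):

```python
import re

# Hypothetical label -> TAP resource mapping; the URIs follow the
# slide's "Chicago Bulls" / "Michael Jordan" example.
TAP_RESOURCES = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan_Michael",
}

def annotate(text, resources):
    """Wrap each known label in a <resource ref="..."> tag."""
    for label, uri in resources.items():
        pattern = re.compile(re.escape(label))
        text = pattern.sub(f'<resource ref="{uri}">{label}</resource>', text)
    return text

print(annotate("The Chicago Bulls announced that Michael Jordan will...",
               TAP_RESOURCES))
```

This naive substitution tags every occurrence of a label; the point of SemTag's TBD step, described below, is precisely to decide *which* occurrences should be tagged with *which* TAP node.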
Example: "The Chicago Bulls announced that Michael Jordan will..." becomes:
The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan_Michael">Michael Jordan</resource> will...
- SemTag uses a Semantic Label Bureau to store the resulting annotations and to query the results.

SemTag Architecture

SemTag Phases
1. Spotting:
   - Retrieve documents from Seeker.
   - Tokenize the documents.
   - Find contexts (10 words + label + 10 words) related to TAP nodes.
2. Learning:
   - Use a representative sample to determine the distribution of terms in the taxonomy.
3. Tagging:
   - Disambiguate windows (using TBD).
   - Add the results to the database.

TBD Overview
- Ambiguity types:
  - The same label appears at multiple locations in TAP.
  - Some entities have labels that occur in contexts not covered by TAP.
- Training is done with:
  - Automatic metadata (the larger part).
  - Manual metadata (the smaller part, used only for highly ambiguous labels).
- Each node has a set of labels; e.g., the cats, football, and cars nodes all contain the label "Jaguar".
- A spot is a label in a context.
- Each internal node in TAP has a similarity function that determines whether the node belongs to a particular context.

The Sim Algorithm

The TBD Algorithm

SemTag Results
- Applied to 264 million pages.
- Produced about 550 million spots, of which 434 million labels were retained.
- Accuracy: 82%.

SemTag Methodology
1. Lexicon generation: built a collection of 1.4 million unique words occurring in a random subset of windows containing approximately 90 million total words; took the most frequent 200,100 words, then removed the 100 most frequent, leaving 200,000 words for further computations.
2. Similarity functions: used the distribution of words at TAP nodes to derive each node's similarity function.
3. Measurement values: used human judges to obtain measurement values for the 24 largest TAP nodes.
4. Full TBD processing.
5. Evaluation: compared TBD results with human judgments.

Seeker
- A platform used by SemTag and other increasingly sophisticated text-analytics applications.
- Provides scalable, extensible extraction from erratic resources.
- An erratic resource is one that may have limited availability, a rapid rate of change, conflicting or questionable content, or may be impossible to ingest in its totality (e.g., the World Wide Web).

Seeker Design Goals
- Composability: build complex annotators from simple ones.
- Modularity: support different annotation methodologies.
- Extensibility: cope with the rapid evolution of technologies.
- Scalability: support both a sample and the full data.
- Robustness: deal with failure.

Seeker Design
- To achieve modularity and extensibility, a service-oriented architecture (SOA) is used, where communication is done through language-independent, network-level APIs.
- To achieve scalability and robustness, some components are treated as "infrastructure".

Seeker Architecture

Seeker Environment
- 128 dual-processor, 1 GHz machines.
- Each machine is attached via a switched gigabit network to half a terabyte of network-attached storage.
- I/O occupies one of the 128 machines.

Seeker Substrate
- SOA: a local-area, loosely coupled, pull-based, distributed system.
- Requirements: high speed, high availability, and efficient support for multiple programming languages.
- Vinci: a SOAP-derived package for high-performance intranet applications; uses lightly encoded XML over raw TCP sockets to provide the required speed.

Conclusion
- Automatic semantic tagging is essential to bootstrap the Semantic Web.
- It is possible to achieve good accuracy with simple disambiguation approaches.
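The "simple disambiguation" idea behind TBD — comparing a spot's context window against each candidate TAP node's learned term distribution and picking the best match above a threshold — can be sketched as follows. All names, node distributions, and the threshold value are illustrative assumptions, not the paper's actual parameters:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(context_words, node_distributions, threshold=0.1):
    """Pick the TAP node whose term distribution best matches the
    spot's context window; return None if no node is similar enough."""
    ctx = Counter(context_words)
    best_node, best_score = None, threshold
    for node, dist in node_distributions.items():
        score = cosine(ctx, dist)
        if score > best_score:
            best_node, best_score = node, score
    return best_node

# Toy term distributions for the ambiguous label "Jaguar" (invented numbers).
nodes = {
    "Cat_Jaguar": Counter({"cat": 5, "wild": 3, "prey": 2}),
    "Car_Jaguar": Counter({"car": 5, "engine": 3, "drive": 2}),
}
print(disambiguate(["the", "engine", "of", "the", "car"], nodes))
```

The threshold plays the role of TBD's "no tag" decision: a spot whose context matches no node well enough is left unannotated, which is how the system trades recall for its reported accuracy.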
Future Work
- Develop more approaches and algorithms for automated tagging.
- Make the annotated data public and offer Seeker as a public service.

Questions?