Published on November 19, 2007
Mining Keywords from Large Topic Taxonomies: Mining Keywords from Large Topic Taxonomies Dissertation Presentation Christiana Christophi Supervisor: Dr. Marios Dikaiakos 12 November 2004 Outline: Outline Goal Motivation Related Work and Technology System Description Evaluation Conclusions and Future Work Goal: Goal This dissertation researches the area of Information Retrieval and Web Mining in order to create a system that dynamically enriches a large topic taxonomy with a set of related keywords at each topic node. What is a taxonomy: What is a taxonomy A structure that organizes large text databases hierarchically by topic to help searching, browsing and filtering. Many corpora are organized into taxonomies like: internet directories (Yahoo) digital libraries (Library of Congress Catalogue) patent databases (IBM patent database) Motivation: Motivation There is not a publicly available general-purpose keyword taxonomy. Construction of hierarchical keyword taxonomies requires a continuous and intensive human effort. The result has a restricted size and subject coverage. Our approach: Exploit an existing taxonomy and then use the contents of pages within to produce a general-purpose keyword catalog. Slow process of classifying documents A classification task requires an example document set for each topic and then performing classification based on all the content of the example set. Our approach: Classify the document based on keywords of each topic-node. Drawbacks of keyword based searches. Search engines simply return results that contain the user’s querying terms Our approach: The catalog could help build a taxonomy classifier. Synonymy and Polysemy: Synonymy and Polysemy Synonymy the same concept can be expressed using different sets of terms e.g. classification categorization taxonomy negatively affects recall Polysemy identical terms can be used in very different semantic contexts e.g. jaguar negatively affects precision Precision and Recall: Precision and Recall Precision Represents the percentage of retrieved documents that are relevant to a query Recall Represents the percentage of documents that are relevant to the query and were retrieved. Related Work: Related Work Extraction of keywords for classification or link analysis. Corpus of documents that was processed was relatively small. WebKB knowledge base that consists of barely 8,282 pages describing university people, courses and research projects Hoover’s data set consisting of 4,285 companies corporate information. Datasets are created be extracting sub-trees from Web directories like Yahoo! and ODP hierarchy (aka Dmoz) e.g. Study for topic-based properties of the Web by Chakrabarti uses as an initial set a 482-topic Dmoz collection. The ACCIO system by Davidov et al in “Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory” which performs automatic acquisition of labeled datasets for text categorization. Related Work: Related Work Hierarchical classification by Chakrabarti Effort to select features for text documents. Two models of documents Bernoulli: number of documents a term appears in Binary: a term either appears or not in a document Focus on classification performance rather than better keyword extraction. Information Retrieval : Information Retrieval Information Retrieval (IR) constructs an index for a given corpus and responds to queries by retrieving all the relevant corpus objects and as few non-relevant objects as possible. index a collection of documents (access efficiency) given user’s query rank documents by importance (accuracy) determine a maximally related subset of documents Indexing: Indexing An index structure is a hash indexed or B+ tree indexed table. This table consists of a set of records each containing two fields: ID and posting_list. Objects from the corpus pointing to a list of lexical items Inverted index effective for very large collections of documents associates lexical items to their occurrences in the collection A dictionary example each key is a term t V, where V is the Vocabulary associated value p(t) points to a posting list Index: Index Inverted Index: Inverted Index IR Models: IR Models Statistical:Use statistical measures to rank documents that best satisfy the query terms. Boolean model Documents are represented by a set of index terms Each term is viewed as Boolean variable (e.g. TRUE: Term is present) Vector space model (VSM) Semantic:Use syntactic and semantic analysis to simulate human understanding of text. Natural Language Processing (NLP) Match query and documents semantic content. Latent Semantic Indexing (LSI) Reduces space so that each dimension in the reduced space tends to represent those terms that are more likely to co-occur in the collection. Statistical IR Models: Statistical IR Models Theoretic models - Boolean model Documents are represented by a set of index terms Each term is viewed as Boolean variable (e.g. TRUE: Term is present) User query represents a Boolean expression. Documents retrieved satisfy the Boolean expression. Terms are combined using Boolean operators like ‘AND’, ‘OR’ and ‘NOT’. No ranking of the results is possible; a document either satisfies the query or not. Vector space model (VSM): Vector space model (VSM) Documents are represented by a set of vectors in a vector space, where each unique term represents one dimension of that space. A vector-space representation of m text documents is as follows: Let n denote the number of terms in the vocabulary. Consider an n x m matrix A, whose entry aij indicates either the presence or absence of term i in document j. The presence of a word can either be denoted by a 1 or a numeric weight. Semantic IR Models: Semantic IR Models Natural Language Processing (NLP) Match query and documents semantic content. Indexing phrases (improve precision), thesaurus-group (improve recall) Latent Semantic Indexing (LSI) Vector information retrieval method, which in coordination with singular value decomposition (SVD) is used for dimensionality reduction. Each dimension in the reduced space tends to represent those terms that are more likely to co-occur in the collection. LSI considers as semantically close documents that have many words in common and as semantically distant documents with few words in common. Term Weighting: Term Weighting A term that appears many times within a document is likely to be more important than a term that appears only once A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents Large documents must be treated equally to smaller documents Term Weighting: Term Weighting VSM weighting Term Frequency Element (Local Weight of term i in document j): Lij Collection Frequency Element (Global Weight of term i): Gi Normalization Length Element (Normalization factor for document j): Nj wij= Lij * Gi * Nj Local Weight : Local Weight Global Weight : Global Weight Normalization length : Normalization length Web Mining: Web Mining Web Mining: The use of Data Mining techniques to automatically discover and extract information from the WWW documents and services. The Web provides a fertile ground for data mining: an immense and dynamic collection of pages with countless hyperlinks and huge volumes of access and usage information. Searching, comprehending, and using the semistructured information stored on the Web poses a significant challenge because: Web page complexity exceeds the complexity of traditional text collection. The Web constitutes a highly dynamic information source. The Web serves a broad spectrum of user communities. A small portion of Web pages contain truly relevant or useful information. DM to supplement keyword-based indexing, which is the cornerstone for Web search engines. Classification maps data into predefined groups or classes Classification schemes: Classification schemes Nearest Neighbour (NN) Index documents. Given a document d, fetch k training documents that are closer to d. The class that occurs most times is assigned to d. Classification schemes: Classification schemes Naïve Bayes (NB) Naïve assumption: attribute independence P(E1,…,Ek|C) = P(E1|C)·…·P(Ek|C) C: class value of instance E: Evidence (=instance) Support Vector Machines (SVM) Construct a direct function from term space to the class variable. Ability to learn can be independent of the dimensionality of the feature space. Classification evaluation: Classification evaluation Hold-out estimate Training set and testing set used are mutually independent. Repeatedly resample the dataset to calculate hold-out estimates by randomly reordering and partitioning it into training (2/3) and test sets (1/3) Computing average and standard deviation of accuracy. Cross-validation Randomly reorder the data set and then split it into n folds of equal size. In each iteration, 1 fold used for testing and n-1 folds for training accuracy: averaged test results over all folds. The folds can be random or modified to reflect class distributions in each fold as in the complete dataset (stratified cross-validation). Leave-one-out (loo) cross-validation n = # of examples. Less reliable results. Used with small datasets System Description: System Description Focus: enrich a topic taxonomy with good-quality keywords. Extract URLs from the ODP and download the pages. Process these pages by extracting keywords. Assess the keywords gathered for each document and coalesce them into keywords for the category to which the Web page belongs in the topic taxonomy. Design Issues: Design Issues Scalable: process a large corpus of documents Flexible: incorporate more documents and additively process them without requiring a reprocessing of the whole document body Highly customizable: use different system parameters Modular: Different algorithms can be used like “plug&play” The taxonomy used: The taxonomy used Acquired from the Open Directory Project (ODP). Indexed over 4 million sites catalogued in 590,000 categories Initially named Gnuhoo and then NewHoo by Skrenta and Truel (1998) Within 2 weeks - 200 editors, 27000 sites and 2000 categories. Within 5 weeks - acquired by Netcenter. Basic idea: As the Internet expands, the number of Internet users also augments. These users can each help organise a small portion of the web. Available through a Resource Description Framework (RDF). System Architecture: System Architecture The system comprises of several modules. Each module takes as input the results of the previous module in line and outputs an Index. Each module of our system is composed by Worker class, which performs the processing tasks for each step, Index Creator class that coordinatesWorkers by delegating tasks Indexer class that outputs the results of each module to a file in binary form Index Reader class that reloads the Index from the file back to memory. The whole procedure is split into two phases. Document Keyword Extraction: result - a catalog with a set of keywords for each document of the Information Repository, Topic Keyword Extraction: result - a keyword catalog for each topic in the hierarchical taxonomy. System Architecture: System Architecture Document Keyword Extraction: Document Keyword Extraction Preprocessing: Preprocessing Preprocessing Extracts a list of URLs from the ODP taxonomy down to a specified depth. Downloads Web Pages. Assigns a unique number (docID) to each page Forward Index Creator: Forward Index Creator Forward Index Creator Processes Web pages Parses each file and extracts keywords. Ignores stop word list. 336 common English words Perform stemming Calculate local weight Produce a Forward Index and Topics Index Stemming: Stemming Reduce all morphological variants of a word to a single index term Porter Stemming Algorithm Five rules for reduction, which are applied to the word sequentially. The rule chosen applies to the longest suffix. Examples of the trimming performed are: processes process policies polici emotional emotion e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE BINARIZATION => BINARIZE Local Weight: Local Weight Calculate local weight for each word in each document. Consider the HTML tag within which a word appears. Group tags into 7 class. Assign each class with an importance factor called CI (Class Importance). Calculate the frequency in each class and produce Term Frequency Vector (TFV). Local weight Lij: Forward Index Creator: Forward Index Creator Forward Index Creator Web documents Forward Index Topics Index Lexicon Creator: Lexicon Creator Lexicon Creator Processes Forward Indexes. For each word, it calculates the global weight. Produces a Lexicon Index Global weight is Inverted Document Frequency (IDF). IDF scales down words that appear in many documents. Lexicon Creator: Lexicon Creator Lexicon Creator Lexicon Index FwIndexes Normalization Index Creator: Normalization Index Creator Normalization Index Creator Processes Forward Indexes and Lexicons. For each document, calculate the normalization length. Produces a Normalization Index Cosine normalization to compensate for document length. This data transformation essentially projects each vector onto the unit Euclidian sphere. Normalization Index Creator: Normalization Index Creator Normalization Index Creator Normalization Index FwIndexes Lexicon Weight Calculator: Weight Calculator Weight Calculator Processes Forward Indexes, Lexicon and Normalization Indexes. Calculate the weight of each word in each document Weight Calculator: Weight Calculator Weight Calculator Forward Index FwIndexes Lexicon NormIndexes Topic Keyword Extraction: Topic Keyword Extraction Forward Index and Topics Index Sorter : Forward Index and Topics Index Sorter Must match the document set of keywords to topic set of keywords Must replace docID with topicID Have to sort indexes to do this efficiently. Large volume of the indexes to be processed. Data cannot fit into main memory External Sorting FwIndex & TopicsIndex Sorter External Sorting: External Sorting Repetitively load on memory a segment from the file, sort it with an algorithm like Quicksort, and write the sorted data back to the file. Merge sorted blocks with MergeSort by making several passes through the file, creating consecutively larger sorted blocks until the whole file is sorted. The most common scheme is the "k-Way Merging“. A 4-way merge on 16 runs Topics ID Replacer: Topics ID Replacer Replace each docID in the sorted Forward Index with the corresponding topicID TopicsID Replacer Topics ID Replacer: Topics ID Replacer TopicsID Replacer Topics Keywords Index Topics Keywords Index Sorter: Topics Keywords Index Sorter Sort the Topics Keyword Index by topicID Topics Keyword Index Sorter Topics Records Merger: Topics Records Merger Process the Topics Keyword Index Coalesce together multiple records of the same “topicID, wordID” by calculating the mean value of the weight Topics Record Merger topicID, wordID, ƒ(weight1,…, weightn) Evaluating Results: Evaluating Results Extracted URLs from ODP whose topics are up to depth 6. Root-Topic/ [Sub-topic1]/ [Sub-topic2] / [Sub-topic3] / [Sub-topic4] / [Sub-topic5] / [Sub-topic6] Extracted 1,153,086 such URLs. Downloaded those pages using the eRACE crawler. The crawler managed to download 1,046,021 webpages. From these pages, 257,978 pages were under the topic Top/World, (the majority are non English). For the processing of these pages, we used a non-exclusive Solaris machine having 4 CPUs and 8GB of memory. We executed the Java program using 4GB of memory. Experiments and Results: Experiments and Results Experiments and Results: Experiments and Results Experiments and Results: Experiments and Results Experiments and Results: Experiments and Results In order to evaluate the keywords produces we try to classify documents based on them. Tool used was Weka Weka is a collection of machine learning algorithms for data mining tasks Contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Open source software issued under the GNU General Public License. Input data are represented in ARFF (Attribute-Relation File Format). Includes a collection of instances that have a set of common characteristics (features). Experiment 1: Experiment 1 Select only web documents located under the topic ‘/Top/Arts/Movies/’. reduction of the dataset to 28102 webpages 5756 unique topics and 123253 wordIDs Minimum threshold of the weight of a word to be 0,01. This reduced the keywords’ set to 19137. A big number to represent features: we selected the 40 most important keywords. For the majority of topics only a few webpages as samples We selected topics for which there were at least 10 webpages This reduced the set of topics from 5756 to 93 and the number of instances (webpages) of the data set from 28102 to 2432. Experiment 1: Experiment 1 Topic Appearance before selection Experiment 1 – Classification Results: Experiment 1 – Classification Results Experiment 2: Experiment 2 Dataset includes documents that belong to topics of depth 2 in the taxonomy. The result is a dataset of 3939 documents, 194 topics and 16373 words. We produced an ARFF file, where each record represents a document and states the weight for each of the 16373 keywords in the particular document, as well as the topic to which the document belongs. Experiment 2 – Classification Results: Experiment 2 – Classification Results Experiment 2 (NN, k=5): Experiment 2 (NN, k=5) TP Rate: It is the fraction of positive instances predicted as positive. FP Rate: It is the fraction of negative instances predicted as positive. Experiment 2 (NN, k=1): Experiment 2 (NN, k=1) TP Rate: It is the fraction of positive instances predicted as positive. FP Rate: It is the fraction of negative instances predicted as positive. Conclusions: Conclusions We have proposed a system that takes advantage of the content of Web pages by using well-known IR techniques to extract keywords for enriching a large-scale taxonomy. The challenge was to build a scalable, flexible and highly customizable system. We have tested the system with a corpus of over 1,000,000 documents with satisfactory performance and speed. The algorithms are designed to eliminate large in-memory processing. The memory is analogous to the size of the lexicon (words) and not the number of documents The system is able to incorporate more documents and additively process them without requiring a reprocessing of the whole document corpus. It is customizable and modular enough to use different parameters IR algorithms. This gives the ability to compare the results of different techniques or to better tune parameter values. The results presented have indicated the good-quality of the keywords produced, although it seems there is room for improvement. Future Work: Future Work We believe that there is a lot of work that can be done in improving and exploiting this tool. Reduce the processing time by improving the algorithms used. Better management of Input/Output tasks, e.g. the deployment of the Mapped IO functionality of Java or employment of lower level languages to perform such tasks. Recognize the language of each Internet page, and therefore use the appropriate stemmer and stop word list for each one. Extract meta-data assigned by the volunteer editors from the ODP dump. We must investigate what is the optimal set of attributes that can describe a given topic and maximize the quality of the application as well as what parameters can affect this decision The large database of keywords for each category can be exploited in different ways. Help automatically catalog documents into a topic hierarchy while they can complement the documents with metadata. Useful tool for statistically analyzing large hierarchies like ODP and Yahoo. It can be used to assign a topic to user queries and thus filter out irrelevant topics. Future Work: Future Work A more sophisticated version of the tool could also exploit thesaurus like WordNet. This could enforce the semantic meaning of the results. We could study methodologies of diffusing keywords up or down the hierarchy. We will need some visualization tools for the presentation of the results. The system could also be provided as an Internet service where a user could query the keywords of a particular topic or sub-tree. The results could be presented using some visualization technique. Slide66: THANK YOU FOR YOUR ATTENTION. QUESTIONS?