Published on November 1, 2007
TagSense - Marrying Folksonomy and Ontology
By: Zixin Wu
Advisor: Amit P. Sheth
Committee: John A. Miller, Prashant Doshi

Outline:
- Background and Motivation
- Approach Overview
- Tag Normalization
- Sense Indexing
- Utilizing Ontologies
- Semantic Search and Ranking
- Implementation and Evaluations
- Conclusions
- Demo

Folksonomy:
- Web page and photos from Flickr.com
- Web page from del.icio.us

Folksonomy Definitions:
- The behavior of massive tagging in a social context and its product: tags for Web resources. It is collaborative metadata extraction and annotation.
- (Thomas Vander Wal): "Folksonomy is the result of personal free tagging of information and objects (anything with a URL) for one's own retrieval. The tagging is done in a social environment (usually shared and open to others)."
- (Tom Gruber): "the emergent labeling of lots of things by people in a social context."

Features of Folksonomy:
- Makes metadata extraction from multimedia Web resources easier.
- Extracts information from the perspective of the information consumer, e.g. tags about the house in a photo but not the dog in it.
- Popular tags prevail, and tags for a Web resource converge over time.

The Long Tail (figure)

Power Law Distribution of Tags (figure)

Folksonomy Triad [4,5]:
- The person tagging
- The Web resource being tagged
- The tag(s) used on that Web resource
We can use any two of the elements to find the third, e.g. find persons with similar interests by comparing the Web resources they tagged and the tags they used.

Motivation Scenarios - Ambiguous Words:
- Search for "apple"
- Search for "turkey"

Disambiguation:
- What people usually do: add more keywords for disambiguation
- Trade-off between precision and recall

Motivation Scenarios - Background Knowledge:
- Task: find photos about cities in Europe
- Solution 1: search "city Europe"
- Solution 2: try the names of cities in Europe one by one
- Could be improved if the system knows which term/concept is a city and which city is in Europe

Significant Drawbacks of Folksonomy:
- Keyword ambiguity
- Lack of background knowledge

Ontology:
- An important term in Knowledge Representation and the key enabler of the Semantic Web
- A formal specification of a conceptualization
- Ontologies state knowledge explicitly by using URIs and relationships, e.g. "#Paris #is_located_in #Europe"
- Current specifications: RDF(S) [7,8], OWL, etc.

Semantic Annotation (figure)

Multiple Ontologies:
- One ontology cannot always be comprehensive enough
- Ontologies may be incompatible
- If multiple ontologies are used, we need to select and rank ontologies for a query.
Objectives:
- Shorten the time and effort of information retrieval in folksonomy
- Improve recall by considering synonyms and enabling semantic search
- Improve ranking by putting the most appropriate items at the top of query results

Approach Overview:
- Do not add any burden to our users: they should be able to use only tags to describe and search Web resources
- Do not expect our users to have a Semantic Web background
- Utilize ontologies as background knowledge in information retrieval

Some Terms:
- Web resource: anything with a URL
- Label: one or more keywords, e.g. "air ticket"
- Tag: a label attached to a Web resource. Two different tags may have the same label.
- Sense cluster (or cluster): a grouping of tags with similar meanings. Ideally, a cluster corresponds to one meaning, but often a meaning is represented by multiple clusters together.
- Semantic annotation: associating a cluster with ontological concepts

Approach Overview (figure: a dot is a tag, a blue circle is a sense cluster, a yellow circle is an actual meaning; Ontology 1, Ontology 2)

Data Cleanup: "Dirty" Tags:
- "bird" and "birds"
- "ebook" and "e-book"; "air-ticket", "airticket", and "air ticket"
- "freephotos" should be "free", "photo"
- "travelagent" should be "travel agent"
- "sculture" should be "sculpture"
- "@pub-travel", "Europe2005"

Tag Normalization:
- Check two online dictionaries: Webster.com and Dict.cn
- Webster.com: stemming and misspellings, e.g. "swimming" -> "swim", "dogs" -> "dog", "sculture" -> "sculpture"
- Dict.cn: more words and compound words, e.g. "ibm" and "open source" (not in Webster.com, but in Dict.cn)
- Try to split tags: "freephotos" -> "free" and "photo"
- Ignore pure numbers, such as "2005", "07_01_2005"

Sense Indexing (figure: the keyword "ticket" maps to several senses, e.g. an access permit and a fine for an offender)

Sense Indexing:
- The mappings between keywords and senses are n:m
- Index Web resources by senses instead of keywords
- Put tags with similar meanings into the same cluster
- Need to disambiguate each tag when indexing

Differences from Word Sense Disambiguation [11-15]:
- No sentences: no sentence structure, no part-of-speech analysis. The order of the labels in a Web resource is not necessarily relevant.
- Produced in a social context: a significant number of terms are not in lexicons, and terms change more frequently. That means we need to create senses for those terms.
- Relatively less noise.
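The tag-normalization steps above (stemming/misspelling lookup, splitting run-together tags, ignoring pure numbers) can be sketched as below. This is a minimal sketch: the real system queries Webster.com and Dict.cn online, so the `DICTIONARY` and `STEMS` tables here are tiny illustrative stand-ins.

```python
import re

# Stand-ins for the online dictionaries; the real system queries
# Webster.com (stemming/misspellings) and Dict.cn (more words, compounds).
DICTIONARY = {"free", "photo", "bird", "travel", "agent", "air", "ticket"}
STEMS = {"birds": "bird", "photos": "photo", "dogs": "dog", "sculture": "sculpture"}

def split_compound(word):
    """Greedily split a run-together tag into dictionary words,
    e.g. 'freephotos' -> ['free', 'photo']. Returns None on failure."""
    word = STEMS.get(word, word)
    if word in DICTIONARY:
        return [word]
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        if head in DICTIONARY:
            rest = split_compound(tail)
            if rest:
                return [head] + rest
    return None

def normalize(tag):
    """Normalize one raw tag into a list of clean keywords."""
    words = []
    for w in re.split(r"[-_\s]+", tag.lower().lstrip("@")):
        if not w or w.isdigit():          # ignore pure numbers like "2005"
            continue
        w = STEMS.get(w, w)               # stemming / misspelling fixes
        parts = split_compound(w)         # e.g. "freephotos" -> free, photo
        words.extend(parts if parts else [w])
    return words
```

For example, `normalize("freephotos")` yields `["free", "photo"]` and `normalize("07_01_2005")` yields `[]`, matching the behavior described on the slides.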
Why Clustering (1):
- Since we will match the clusters to ontological concepts, why not annotate each tag individually?
- Some terms are not in any ontology
- By aggregating the contexts of the tags in the same cluster, we learn which contexts are important and which are noise (especially for narrow folksonomy)
- (example tag sets: "apple mac powerbook light paint long" vs. "apple mac powerbook ajax web design")

Why Clustering (2):
- We get more context for semantic annotation
- (example: "Athens University Georgia" vs. "Athens University Greece")

Synonyms:
- It seems impossible to automatically detect synonyms based only on the context of tags: sufficiently similar contexts do not imply synonymy
- Solution: use WordNet's synsets as synonym lists

Polysemy:
- Cluster tags that have the same label (or synonymous labels) into "sense clusters" based on the similarity of their contexts

Context of Tags:
- The context of a tag T is the set of other tags that co-occur with T in a Web resource, together with the co-occurrence frequencies
- e.g. User1: "turkey, istanbul, mosque"; User2: "turkey, istanbul, tour"
- In narrow folksonomy, all co-occurrence frequencies are 1
- (figure: co-occurrence graph over "turkey", "istanbul", "tour", "mosque")

Relatedness of Tags:
- Basic idea: TF-IDF over co-occurrence counts
- TF: normalize each co-occurrence frequency by the tag's total co-occurrence frequency (e.g. 1/2, 2/4, 2/4, 1/4 in the figure), then multiply by IDF

Context of a Cluster:
- The other clusters whose tags co-occur with the tags in this cluster
- The co-occurrence frequency of two clusters is the aggregation of the co-occurrence frequencies of the tags in the clusters

Relatedness of Clusters:
- The same calculation as the relatedness of tags

Important Context of a Cluster (figure: the context ordered by relatedness into Important Context Levels 1, 2, and 3)

Motivation for Building Senses:
- To search for photos about the turkey bird, some people use "bird" besides "turkey"; some use "animal", "food", "wild", etc.
- Can we collect all these tags and use them to build a sense?
- The clue for recognizing these tags is that they co-occur with each other more often than with the other tags in the context of "turkey"

Tag Disambiguation Process:
- Put all tags with the same label (or synonymous labels) into one cluster
- Run the following three phases to build senses

Tag Disambiguation Phase 1:
- Identify Important Context Level 1
- Create an undirected weighted graph called the Context Graph: each node is a cluster in Important Context Level 1, and the weight of an edge is the relatedness of the two clusters (relatedness is asymmetric; we take the larger value)
- Apply a threshold to the edges of the Context Graph so that the graph breaks into one or more connected components
- Create a sense for each component, and use the clusters in the component as the context of that sense
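Phase 1 above can be sketched as follows. This assumes the cluster-relatedness weights are already computed; the function and variable names are illustrative, not from the original system.

```python
# Sketch of Tag Disambiguation Phase 1: threshold the Context Graph, then
# treat each surviving connected component as one sense.
from collections import defaultdict

def phase1_senses(edges, threshold):
    """edges: {(cluster_a, cluster_b): relatedness}. Relatedness is
    asymmetric, so callers should pass max(rel(a, b), rel(b, a)).
    Returns a list of components; each becomes the context of one sense."""
    graph = defaultdict(set)
    nodes = set()
    for (a, b), w in edges.items():
        nodes.update((a, b))
        if w >= threshold:              # drop weak edges
            graph[a].add(b)
            graph[b].add(a)
    seen, senses = set(), []
    for start in nodes:                 # collect connected components (DFS)
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            component.add(n)
            stack.extend(graph[n] - seen)
        senses.append(component)
    return senses
```

For the "turkey" example, strong edges among {istanbul, mosque} and among {bird, animal} with weak cross edges would yield two components, i.e. two candidate senses.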
Tag Disambiguation Phase 1 (figure: we are disambiguating "turkey", so the cluster "turkey" itself is hidden for better illustration)

Tag Disambiguation Phase 2:
- The purpose of this phase is to find senses missed in Phase 1 because they are not used often in the dataset
- Identify Important Context Level 2
- For each cluster in Important Context Level 2, find the most related sense built in Phase 1 (above a threshold). If there is such a sense, merge the cluster into that sense's context; otherwise, build a new sense and use the cluster as its context.

Tag Disambiguation Phase 2 (figure: the red clusters are newly discovered in Phase 2)

Tag Disambiguation Phase 3:
- Identify Important Context Level 3
- Similar to Phase 2, but do not create any new senses; just enrich the context of the senses built in Phases 1 and 2

Tag Disambiguation Process (continued):
- Compare each tag under consideration with the senses; select the best-matched sense and assign the tag to it
- Repeat steps 2 and 3 whenever the number of tags under consideration has increased by a certain percentage
- (figure: matching "turkey" against a sense through "istanbul" and "turkish" with relatedness x and y; MatchScore = x + y)

Utilizing Ontologies:
- Match each cluster to ontological concepts where appropriate
- But there are no named relationships between tags, so we cannot compare by relationship names
- We therefore need the relatedness of ontological concepts, and also their similarity for semantic search

Relatedness of Ontological Concepts:
- Basic idea: TF-IDF
- 0 for any pair of concepts without a relationship
- TF-IDF(c1, c2) = TF(c1, c2) * IDF(c1)

Relatedness of Ontological Concepts:
- TF (c1 to c2): issue the query "c1 c2" to the Yahoo! search engine and get the hit count h; issue a query "cx c2" for each concept cx connected to c2 and get hit counts hx; TF(c1, c2) = h / Σ hx
- IDF(c): issue the query "c" and get the hit count h; with Yahoo!'s current index size of 20 billion pages, IDF(c) = -log(h / 20 billion)

Similarity of Ontological Concepts:
- First, consider only the taxonomy in the ontology
- Information content: IC(c) = -log(prob(c))
- Sim(c1, c2) = 2 * IC(ancestor) / (IC(c1) + IC(c2))
- Worked example (Car with subclasses Sedan and Coupe): hit counts 58 M (Sedan), 76 M (Coupe), 1040 M (Car; 1174 M summed with its subclasses); probabilities 0.0029, 0.0038, 0.0587; information content IC(Sedan) = 2.54, IC(Coupe) = 2.42, IC(Car) = 2.23
- Sim(Sedan, Coupe) = 2 * 2.23 / (2.54 + 2.42) = 0.899

Similarity of Ontological Concepts:
- Also consider other types of relationships by using the Jaccard similarity coefficient
- (example: Athens is_located_in Georgia; Atlanta is_located_in Georgia)

Matching Clusters to Ontologies:
- Compare the important context of a cluster with the context (related concepts) of an ontological concept
- Sum up the relatedness of the matched context clusters
- Select the ontological concept with the best matching score, provided it is above a threshold

Matching Clusters to Ontologies:
A context cluster x is considered matched to a context concept y if:
- they have the same label (or synonymous labels), or
- x is matched to y' and the relatedness (or similarity) of y' to y is above a threshold, or
- the relatedness of x' (which is matched to y) to x is above a threshold

Matching Clusters to Ontologies - example (figure: three cases of matching the "turkey"/"bird" cluster context to the concepts "turkey", "bird", and "animal", propagating the semantic annotation when Rel(bird, animal) > threshold or Sim(bird, animal) > threshold)

Semantic Search [19,20]:
- Search by ontological relationships; currently only "subclass" and "type" relationships are considered
- Map to the corresponding clusters via semantic annotations
- Expand the corresponding clusters by including other clusters with the same label, because some clusters that should have a semantic annotation do not

Semantic Search (figure: Ottawa, Madrid, and Seoul clusters connected through a geography domain ontology and a politics domain ontology)

Most-Desired Senses Ranking:
- We need to rank the candidate clusters
- The system shows one photo for each candidate cluster; the user selects the best photo from the samples; the system ranks the other clusters based on the selection

Most-Desired Senses Ranking:
- The basic idea is finding shortest paths in a graph from a single source
- Put a constant energy on the source cluster and distribute the energy to the other clusters
- The weight of an edge is the similarity of the clusters

(figure: a constant energy of 1 on the source cluster spreading with decreasing values across the Ottawa, Madrid, and Seoul clusters in the two domain ontologies)

Clusters Similarity:
- If the semantic annotations of two clusters refer to the same ontology, use the similarity of the corresponding ontological concepts
- Otherwise, calculate cluster similarity from the contexts of the two clusters
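The information-content similarity used here (defined on the "Similarity of Ontological Concepts" slide) can be reproduced directly from the slide's formulas. The base-10 logarithm and the 20-billion-page index size are taken from the slides; everything else is illustrative.

```python
# Information-content similarity sketch: probabilities come from search-engine
# hit counts over an assumed 20-billion-page index, and
# Sim(c1, c2) = 2 * IC(common ancestor) / (IC(c1) + IC(c2)).
import math

INDEX_SIZE = 20e9  # Yahoo! index size assumed on the slides

def information_content(prob):
    """IC(c) = -log(prob(c)); base 10, matching the slide's numbers."""
    return -math.log10(prob)

def taxonomy_similarity(ic_ancestor, ic_c1, ic_c2):
    """Similarity of two concepts given the IC of their common ancestor."""
    return 2 * ic_ancestor / (ic_c1 + ic_c2)

# The slide's worked example uses IC(Sedan) = 2.54, IC(Coupe) = 2.42, and
# IC(Car) = 2.23, where Car is the common ancestor:
sim = taxonomy_similarity(2.23, 2.54, 2.42)
# round(sim, 3) == 0.899, as on the slide
```

Note that `information_content(0.0029)` gives roughly 2.54, agreeing with the slide's value for Sedan.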
Cluster Similarity by Context:
- A modified version of the Dice similarity
- To compare cluster1 and cluster2, compare only their important contexts and calculate the percentage of overlapping context
- Whether context cluster c1 of cluster1 matches context cluster c2 of cluster2 is decided the same way as in matching clusters to ontologies

Ontology Ranking [21,23]:
- Ontologies come from a repository
- If multiple ontologies are used for a query, we need to give a weight to each ontology
- The ontology with the higher weight has more "power" in deciding the similarity/relatedness of two ontological concepts
- Rank ontologies using the 4 most recent queries of the same user

Ontology Ranking (figure: centrality measure, based on the depth D(c) and height H(c) of a concept relative to "Thing")

Ontology Ranking (figure: density measure)

Ontology Ranking (figure)

System Overview (figure: Tag Cleanup Module, Sense Indexing Module, Sense Index, Ontology Mapping Module, Ontology Mapping, Ontology Measuring Module, Ontology Measures, Ontology Ranking Module, Ontology Ranks, Semantic Query Module, Search Engine, and Query History; inputs are photos with tags, queries, and ontologies; output is the query result)

Evaluation Measures:
- Compare with Google Desktop on the same datasets
- How much time a user has to spend to find the required photos
- How many mouse clicks a user needs to find the required photos
- How many different queries a user has to issue to find the required photos
- The user may change the query at any time if he feels it necessary
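The cluster similarity described under "Cluster Similarity by Context" can be sketched as below. The slides call it a modified Dice similarity without spelling out the modification, so this is plain Dice over matched important-context clusters, with the match test (same label, synonym, or relatedness above a threshold) passed in as a function; all names here are illustrative.

```python
# Sketch of cluster similarity by context: the fraction of important-context
# clusters of the two clusters that find a match on the other side.
def context_dice(context1, context2, is_match):
    """context1/context2: lists of important-context clusters.
    is_match(c1, c2) decides whether two context clusters overlap,
    as in matching clusters to ontologies."""
    if not context1 or not context2:
        return 0.0
    matched1 = sum(1 for c1 in context1
                   if any(is_match(c1, c2) for c2 in context2))
    matched2 = sum(1 for c2 in context2
                   if any(is_match(c1, c2) for c1 in context1))
    return (matched1 + matched2) / (len(context1) + len(context2))

# With exact-label matching, the "turkey bird" contexts
# ["bird", "animal", "wild"] and ["bird", "food"] overlap only on "bird":
# context_dice(...) -> (1 + 1) / (3 + 2) = 0.4
```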
Evaluations (1):
- Experiment set 1: disambiguation
- Datasets: 500 photos with the tag "apple"; 500 photos with the tag "turkey"

Use Case 1: Task 1: find 50 photos about Apple electronic products
Use Case 2: Task 2: find 30 photos about the fruit apple
Use Case 3: Task 3: find 50 photos about the country Turkey
Use Case 4: Task 4: find 10 photos about turkey birds

Evaluations (2):
- Experiment set 2: semantic search
- Datasets: about 300 photos for each of the following tags: Beijing, Madrid, Ottawa, Rome, Seoul, Tokyo, Baltimore, New York, Pittsburgh, Washington D.C., Amsterdam, Florence, Venice, Athens (Greece), Athens (Georgia)
- Ontologies: an ontology in the travel domain (partially from Realtravel.com); a modified AKTiveSA project ontology in the geography domain; an ontology in the politics domain (partially from SWETO)

Use Case 5: Task 5: find up to 5 photos for 5 cities in Europe

Evaluation:
- The Most-Desired Senses Ranking approach may involve time overhead in selecting the most wanted photo sense
- Changing a query involves time overhead in thinking and typing
- Overall, users spent significantly less time and effort finding the information they wanted

Conclusions:
We proposed an approach to combining folksonomies and ontologies:
- Index Web resources by senses into sense clusters
- Match sense clusters to ontological concepts
- Semantic search based on ontological relationships
- Most-Desired Senses Ranking approach
- Ranking of multiple ontologies
- Evaluation: users spent significantly less time and effort finding the information they wanted

Demo

Questions and comments

References:
[1] Wal, T.V. Folksonomy Coinage and Definition. 2004. http://vanderwal.net/folksonomy.html
[2] Gruber, T. Ontology of Folksonomy: A Mash-up of Apples and Oranges. International Journal on Semantic Web and Information Systems, 2007. 3(1).
[3] Halpin, H., V. Robu, and H. Shepherd. The Complex Dynamics of Collaborative Tagging. WWW '07: Proceedings of the 16th International Conference on World Wide Web. 2007. ACM.
[4] Wal, T.V. Folksonomy Definition and Wikipedia. 2005. http://www.vanderwal.net/random/entrysel.php?blog=1750
[5] Mika, P. Ontologies are us: A unified model of social networks and semantics. Journal of Web Semantics, 2007. 5(1): pp. 5-15.
[6] Gruber, T.R. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 1993. 5(2): pp. 199-220.
[7] Resource Description Framework (RDF). http://www.w3.org/RDF/
[8] RDF Vocabulary Description Language 1.0: RDF Schema. 2004. http://www.w3.org/TR/rdf-schema/
[9] McGuinness, D.L. and F. van Harmelen. OWL Web Ontology Language. 2004. http://www.w3.org/TR/owl-features/
[10] Kiryakov, A., et al. Semantic annotation, indexing, and retrieval. Web Semantics: Science, Services and Agents on the World Wide Web, 2004. 2(1): pp. 49-79.
[11] Ide, N. and J. Véronis. Word sense disambiguation: The state of the art. Computational Linguistics, 1998. 24(1): pp. 1-40.
[12] Wilks, Y. and M. Stevenson. Sense Tagging: Semantic Tagging with a Lexicon. SIGLEX Workshop on Tagging Text with Lexical Semantics: What, Why and How? 1997. Washington, D.C.
[13] Diab, M. and P. Resnik. An Unsupervised Method for Word Sense Tagging using Parallel Corpora. 40th Annual Meeting of the Association for Computational Linguistics. 2002. Philadelphia, Pennsylvania.
[14] Molina, A., et al. Word Sense Disambiguation using Statistical Models and WordNet. 3rd International Conference on Language Resources and Evaluation. 2002. Las Palmas de Gran Canaria, Spain.
[15] Banerjee, S. and B.P. Mullick. Word Sense Disambiguation and WordNet Technology. Literary and Linguistic Computing, 2007. 22(1): pp. 1-15.
[16] Fellbaum, C. WordNet: An Electronic Lexical Database. 1998. The MIT Press.
[17] Resnik, P. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 1999. 11: pp. 95-130.
[18] Lin, D. An Information-Theoretic Definition of Similarity. International Conference on Machine Learning (ICML). 1998. Madison, Wisconsin, USA.
[19] Sheth, A., et al. Managing Semantic Content for the Web. IEEE Internet Computing, 2002. 6(4): pp. 80-87.
[20] Guha, R., R. McCool, and E. Miller. Semantic Search. 12th International Conference on World Wide Web. 2003.
[21] Arumugam, M., A. Sheth, and I.B. Arpinar. Towards Peer-to-Peer Semantic Web: A Distributed Environment for Sharing Semantic Knowledge on the Web. International Workshop on Real World RDF and Semantic Web Applications. 2002. Hawaii, USA.
[22] Alani, H. and C. Brewster. Ontology ranking based on the analysis of concept structures. 3rd International Conference on Knowledge Capture. 2005.
[23] Zhang, Y., W. Vasconcelos, and D. Sleeman. OntoSearch: An Ontology Search Engine. 24th SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence. 2004. Cambridge, UK.
[24] AKTiveSA. http://sa.aktivespace.org/
[25] Aleman-Meza, B., et al. SWETO: Large-Scale Semantic Web Test-bed. 16th International Conference on Software Engineering & Knowledge Engineering, Workshop on Ontology in Action. 2004.