Automatic Indexing

Information about Automatic Indexing

Published on August 31, 2007

Author: Charlie

Source: authorstream.com

Content

Automatic Indexing & Text Categorization

Dr. Miguel E. Ruiz, School of Informatics, Department of Library and Information Studies. Special presentation for LIS514 Indexing and Surrogation (Dr. June Abbas).

Text Processing

A typical text-processing pipeline takes a document (text plus structure) through structure recognition and a sequence of text operations: normalization of accents and spacing, stopword removal, noun-group detection, and stemming. Automatic or manual indexing then produces the index terms.

Vector Space Model (Logical Representation)

Documents and queries are expressed as vectors whose components are all the possible index terms (t). Each index term has an associated weight that indicates the importance of that term in the document (or query). In other words, the document dj and the query q are represented as t-dimensional vectors.

Documents in Vector Space

For example, in a three-dimensional space documents and queries would look like this: [Figure: documents D1-D11 and a query Q plotted along the term axes t1, t2, t3.]

Document Vectors

[Table: an example document-by-term weight matrix — documents D1-D9 against the terms nova, galaxy, heat, Hollywood, film, role, diet, and fur, with sparse weights ranging from 0.1 to 1.0.]

Vector Space Model

The vector space model proposes to evaluate the degree of similarity of a document dj with regard to the query q as the correlation between the two vectors dj and q. This correlation can be quantified in different ways, for example by the cosine of the angle between the two vectors.

Computing Similarity Scores

[Figure: a two-dimensional example of similarity scores between document vectors and a query.]

Since wi,j ≥ 0 and wi,q ≥ 0, sim(dj, q) varies between 0 and +1.
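The cosine measure just described can be sketched in a few lines of Python. This is a minimal illustration; the three-term vectors below are hypothetical, not the weights from the example table:

```python
import math

def cosine_similarity(doc, query):
    """Cosine of the angle between two t-dimensional weight vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    doc_norm = math.sqrt(sum(d * d for d in doc))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0
    return dot / (doc_norm * query_norm)

# Hypothetical three-term vectors; because all weights are >= 0,
# the similarity score falls between 0 and +1, as noted above.
d1 = [1.0, 0.5, 0.0]
q = [0.8, 0.0, 0.4]
score = cosine_similarity(d1, q)
```

Ranking retrieval then amounts to scoring every document vector against the query vector and sorting by the result.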
The vector space model assumes that the similarity value is an indication of the relevance of the document to the given query; it therefore ranks the retrieved documents by their similarity to the query.

How can we compute the values of the weights wi,j? One of the most popular methods combines two factors: the importance of the index term in the document, and the importance of the index term in the collection.

Importance of the index term in the document: this can be measured by the number of times the term appears in the document, called the term frequency and denoted tf.

Importance of the index term in the collection: an index term that appears in every document in the collection is not very useful, but a term that occurs in only a few documents may indicate that those few documents are relevant to a query that uses the term. This factor is usually called the inverse document frequency, or the idf factor.

Mathematically, the inverse document frequency can be expressed as

    idf_i = log(N / n_i)

where N is the number of documents in the collection and n_i is the number of documents that contain term i.

Combining these two factors, we obtain the weight of index term i in document j:

    w_i,j = tf_i,j × idf_i

This is also called the tf-idf weighting scheme.

Text Processing

Index term selection can be done manually or automatically. A method for automatically selecting index terms was proposed by Luhn in 1958, based on the use of Zipf's Law.

Zipf's Law: let f be the frequency of occurrence of the various words in a given text and r their rank order, that is, the order of their frequency of occurrence. The product f × r is approximately a constant.
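The tf-idf weighting scheme described above can be sketched as follows. The toy corpus of tokenized documents is invented for illustration:

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """For each document j, compute w[i,j] = tf(i,j) * idf(i),
    with idf(i) = log(N / n_i)."""
    N = len(documents)
    # n_i: number of documents that contain term i
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(N / doc_freq[t]) for t in tf})
    return weights

# Invented toy corpus
corpus = [
    ["heart", "disease", "heart", "treatment"],
    ["heart", "surgery"],
    ["library", "indexing"],
]
w = tf_idf_weights(corpus)
# "heart" occurs in 2 of the 3 documents, so it gets a lower idf
# than "disease", which occurs in only 1 document.
```

Note how a term that occurred in every document would get idf = log(N/N) = 0, making its weight vanish — exactly the intuition given above for the idf factor.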
Luhn proposed that if we graph f versus r we can define an upper and a lower cut-off. Words that fall between these two cut-off points are likely to be good predictors of the content of a document.

Zipf's Law Applied to Indexing

[Figure: word frequency plotted against rank order, with upper and lower cut-off points.]

Thesaurus index terms: in its simplest form, a thesaurus consists of a precompiled list of words and, for each word, a set of synonyms. Terms from a thesaurus can be assigned (either manually or automatically) to the set of index terms of a document.

Automatic Text Categorization

Categorization in Linguistics

Rosch's hypothesis: each category has its own structure; some members are better examples of the category than others; and every category has a member called the prototype.

Overview of Automatic Text Categorization

Text categorization is the process of algorithmically analyzing an electronic document to assign a set of categories (or index terms) from a predefined vocabulary that succinctly describes the content of the document.

Machine Learning and Text Categorization

Machine learning is an area of artificial intelligence that studies algorithms that "learn" how to perform specific tasks.

Supervised learning: learning from a set of previously categorized examples (pattern matching).

Unsupervised learning: finding useful relations between members of the target set
(clustering).

Machine learning methods used for text categorization include linear classifiers, neural networks, decision trees, inductive learning, and support vector machines.

Training Process

[Figure: training-process flowchart — feature selection over the training examples, repeated training epochs with subset selection until the error falls below a threshold e, then threshold optimization yielding the trained classifier.]

Text Categorization Process

[Figure: categorization flowchart — a document is turned into a document vector, scores are thresholded, and the resulting categories are assigned.]

Applications of Text Categorization

Document indexing, automatic generation of metadata, classification of patents, document filtering (including adaptive filtering and spam filtering), word sense disambiguation, and essay grading.

Feature Selection

Feature selection consists of choosing the words or phrases that are the best predictors of a given category. Examples include the correlation coefficient and mutual information.

Correlation Coefficient

[Equation: definition of the correlation coefficient, not reproduced in the source.]

Hierarchical Mixture of Experts

Based on the "divide-and-conquer" principle (Jordan and Jacobs, 1993). [Figure: HME architecture — expert networks whose outputs are combined through a hierarchy of gating networks.]

Training Set Selection: Category Zone

[Figure: positive (+) and negative (-) examples in document space, with the category zone drawn around the positives.]

The category zone is defined as the set of positive examples plus significant negative examples. It is inspired by the query zone concept proposed by Singhal, Mitra, & Buckley (1997).

kNN-based Category Zones

For each positive example, the SMART retrieval system returns the top k documents; the example and its top k retrieved documents are added to the category zone.

Experimental Collection

MeSH categories (~23,000 concepts); we selected the Heart Diseases subtree (119 concepts). OHSUMED collection: 233,445 MEDLINE records from 1987 to 1991. Training set: 183,229 records
(1987-1990); test set: 50,216 records (1991).

Sets of Categories

The set of categories in the Heart Diseases subtree (HD-119) comprises a total of 103 categories: high-frequency categories (HD-49) with at least 75 examples in the collection, medium-frequency categories (HD-28) with between 15 and 74 examples, and low-frequency categories (HD-26) with between 1 and 14 examples.

Performance Measure

Recall = a / (a + c)
Precision = a / (a + b)
F1 = (2 × P × R) / (P + R)

where a is the number of categories correctly assigned, b the number assigned incorrectly, and c the number of correct categories that were missed. The categories assigned by human indexers are taken as the gold standard.

Research Questions

Does a hierarchical classifier built on the HME model improve performance when compared to a flat classifier? How does our hierarchical method compare with other text categorization approaches?

Baselines

Rocchio classifier: the classifier is the vector difference between the centroid of the positive examples and the centroid of the negative examples.

Flat neural network classifier: a classifier obtained by combining the outputs of the expert networks. [Figure: Experts 1 through 49 combined into a single flat classifier.]

Final Hierarchical Classifier

[Figure: the final hierarchical classifier.]

Comparison between Classifiers

[Figure: performance comparison of the classifiers.]

Challenges of Text Categorization Using MeSH

The hierarchy of the indexing vocabulary interacts with the indexing rules (e.g., assign the most specific category), the use of qualifiers (or subheadings), and multi-hierarchy: a category can appear in multiple places in the hierarchy. For example,
Acquired Immunodeficiency Syndrome appears both as a virus infection and as a sexually transmitted disease, and there is no specific indication of which sense is being assigned. In addition, only a limited number of categories are assigned manually (only the "most important" ones).

Examples of Operational Automatic Indexing Systems

NLM Indexing Project

The Medical Text Indexer (MTI) is a project that explores automatic indexing tools for current indexing practices at the National Library of Medicine. It can be used as a fully automatic or semi-automatic indexing tool.

Overview of MTI

[Figure: overview of the MTI architecture.]

The initial version of the system generated too many spurious indexing terms. MTI filtering levels:

- Strict: terms recommended by MetaMap and PubMed Related Citations.
- Medium: discard terms that are too general.
- Base filtering: rules used to add, boost, substitute, and remove terms.

Examples: http://ii.nlm.nih.gov/Demo/II_demo.html

Systems That Use Automatic Text Categorization

A good example of commercial software that uses text categorization is Inxight SmartDiscovery, a taxonomy management and categorization tool: http://www.inxight.com/products/smartdiscovery/tc/
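As a closing illustration, the recall, precision, and F1 measures used in the experiments above can be sketched in Python. The contingency counts follow the slides' definitions (a = correctly assigned, b = incorrectly assigned, c = missed); the category sets below are hypothetical:

```python
def evaluate(assigned, gold):
    """Recall, precision, and F1 of assigned categories against the
    human-assigned (gold standard) categories for one document."""
    a = len(assigned & gold)   # categories correctly assigned
    b = len(assigned - gold)   # categories assigned but incorrect
    c = len(gold - assigned)   # correct categories that were missed
    recall = a / (a + c) if a + c else 0.0
    precision = a / (a + b) if a + b else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall) / denom if denom else 0.0
    return recall, precision, f1

# Hypothetical MeSH-style category sets for a single document
gold = {"Heart Diseases", "Myocardial Infarction", "Arrhythmia"}
assigned = {"Heart Diseases", "Myocardial Infarction", "Hypertension"}
r, p, f1 = evaluate(assigned, gold)  # r = p = f1 = 2/3 here
```

In the experiments described above these counts would be accumulated over all test documents rather than computed per document.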
