SVD


Published on December 3, 2007

Author: Talya

Source: authorstream.com

Content

The Mathematics of Information Retrieval

11/21/2005
Presented by Jeremy Chapman, Grant Gelven, and Ben Lakin

Acknowledgments

This presentation is based on the following paper: "Matrices, Vector Spaces, and Information Retrieval" by Michael W. Berry, Zlatko Drmač, and Elizabeth R. Jessup.

Indexing of Scientific Works

Indexing is primarily done using the title, author list, abstract, key word list, and subject classification. These are created in large part so that the documents can be found in a search of the scientific literature. The use of automated information retrieval (IR) has improved both consistency and speed.

Vector Space Model for IR

The basic mechanism of this model is the encoding of each document as a vector. All document vectors are stored in a single matrix. Latent Semantic Indexing (LSI) replaces the original matrix by a matrix of smaller rank that maintains similar information, by use of rank reduction.

Creating the Database Matrix

Each document is stored as a column of the matrix (d is the number of documents), and each term is assigned a row (t is the number of terms). This gives a t x d matrix whose document vectors span the content of the collection.

Simple Example

Let the six terms be as follows:
T1: bak(e, ing)
T2: recipes
T3: bread
T4: cake
T5: pastr(y, ies)
T6: pie

The d = 5 documents are:
D1: How to Bake Bread Without Recipes
D2: The Classical Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes

Entry (i, j) is 1 when term Ti appears in the title of document Dj, so the document matrix becomes:

A =
      D1  D2  D3  D4  D5
T1     1   0   0   1   0
T2     1   0   1   1   1
T3     1   0   0   1   0
T4     0   0   0   1   0
T5     0   1   0   1   1
T6     0   0   0   1   0

The matrix A after Normalization

After normalizing each column of A to unit Euclidean length we get the following:

A =
       D1      D2      D3      D4      D5
T1   0.5774    0       0     0.4082    0
T2   0.5774    0       1     0.4082  0.7071
T3   0.5774    0       0     0.4082    0
T4     0       0       0     0.4082    0
T5     0       1       0     0.4082  0.7071
T6     0       0       0     0.4082    0

Making a Query

Next we use the document matrix to ease our search for related documents. Referring to our example, we make the following query: "baking bread". Using the term definitions given before, the query vector is q = (1 0 1 0 0 0)^T.

Matching the Document to the Query

Matching the documents to a given query is typically done by using the cosine of the angle between the query vector q and each document vector a_j. The cosine is given as follows:

\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}

A Query

Applying the cosine formula to the query "baking bread" gives cos θ_1 ≈ 0.8165, cos θ_4 ≈ 0.5774, and cos θ_j = 0 for the remaining documents. We set our lower limit on the cosine at 0.5; thus the query "baking bread" returns the following two titles:
D1: How to Bake Bread Without Recipes
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes

Singular Value Decomposition

The Singular Value Decomposition (SVD) is used to reduce the rank of the matrix while also giving a good approximation of the information stored in it. The decomposition is written in the following manner:

A = U \Sigma V^T

where U spans the column space of A, Σ is the matrix with the singular values of A along its main diagonal, and V spans the row space of A. U and V are also orthogonal.
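The vector space model above is easy to reproduce numerically. The following is a minimal sketch in Python with NumPy (not part of the original presentation) that builds the example term-document matrix, normalizes its columns, and scores the query "baking bread" against each document:

```python
import numpy as np

# Term-document matrix A (t = 6 terms, d = 5 documents) from the example above.
A = np.array([
    [1, 0, 0, 1, 0],  # T1: bak(e, ing)
    [1, 0, 1, 1, 1],  # T2: recipes
    [1, 0, 0, 1, 0],  # T3: bread
    [0, 0, 0, 1, 0],  # T4: cake
    [0, 1, 0, 1, 1],  # T5: pastr(y, ies)
    [0, 0, 0, 1, 0],  # T6: pie
], dtype=float)

# Normalize each column (document vector) to unit Euclidean length.
A_norm = A / np.linalg.norm(A, axis=0)

# Query "baking bread" hits terms T1 and T3.
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)

# cos(theta_j) = (a_j^T q) / (||a_j||_2 ||q||_2) for each document column a_j.
cosines = (A_norm.T @ q) / (np.linalg.norm(A_norm, axis=0) * np.linalg.norm(q))
print(np.round(cosines, 4))  # [0.8165 0.     0.     0.5774 0.    ]

# Documents clearing the 0.5 cutoff: D1 and D4.
print([f"D{j + 1}" for j in np.where(cosines >= 0.5)[0]])
```

Since the columns of A_norm are already unit vectors, the denominator reduces to ||q||_2; the general formula is kept so the same line works for an unnormalized matrix.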
SVD continued

Unlike the QR factorization, the SVD provides us with a lower-rank representation of both the column space and the row space. We know A_k is the best rank-k approximation to A by Eckart and Young's theorem, which states:

\|A - A_k\|_F = \min_{\operatorname{rank}(B) \le k} \|A - B\|_F

Thus the rank-k approximation of A is given as follows:

A_k = U_k \Sigma_k V_k^T

where U_k is the matrix consisting of the first k columns of U, Σ_k is the k x k diagonal matrix whose diagonal is the set of decreasing singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_k, and V_k^T is the k x d matrix whose rows are the first k rows of V^T.

SVD Factorization

(This slide displays the computed factors U, Σ, and V^T for the example matrix A.)

Interpretation

From the matrix given on the slide before, we notice that Σ has only four nonzero singular values, so A has rank 4. Correspondingly, the nonzero singular values tell us that the first four columns of U give us a basis for the column space of A.

Analysis of the Rank-k Approximations

Using the following formula we can calculate the error between the original matrix and its rank-k approximation:

\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}

Dividing by ||A||_F gives the relative error. Moving from the rank-4 matrix to its rank-3 approximation introduces only a 19% relative error, whereas moving from the rank-4 matrix to a rank-2 approximation introduces a 42% relative error. As expected, these values are less than the errors of the corresponding rank-k approximations from the QR factorization.

Using the SVD for Query Matching

Using the following formula we can calculate the cosines of the angles between the query and the columns of our rank-k approximation of A:

\cos\theta_j = \frac{(A_k e_j)^T q}{\|A_k e_j\|_2 \, \|q\|_2}

Using the rank-3 approximation and the cutoff of 0.5, the first and fourth books are returned again.

Term-Term Comparison

It is possible to modify the vector space model so that, instead of comparing queries with documents, it compares terms with terms. When this is added to a search engine, it can act as a tool to refine the results. First we run our search as before and retrieve a certain number of documents; in the following example, five documents are retrieved. We then create another document matrix from the retrieved information, call it G.

Another Example

The t = 7 terms:
T1: run(ning)
T2: bike
T3: endurance
T4: training
T5: band
T6: music
T7: fishes

The d = 5 retrieved documents:
D1: Complete Triathlon Endurance Training Manual: Swim, Bike, Run
D2: Lake, River, and Sea-Run Fishes of Canada
D3: Middle Distance Running, Training and Competition
D4: Music Law: How to Run Your Band's Business
D5: Running: Learning, Training, Competing

The resulting term-document matrix is:

G =
      D1  D2  D3  D4  D5
T1     1   1   1   1   1
T2     1   0   0   0   0
T3     1   0   0   0   0
T4     1   0   1   0   1
T5     0   0   0   1   0
T6     0   0   0   1   0
T7     0   1   0   0   0

Analysis of the Term-Term Comparison

For this we use the following formula, the cosine of the angle between rows g_i and g_j of G:

\cos\theta_{ij} = \frac{g_i^T g_j}{\|g_i\|_2 \, \|g_j\|_2}

Clustering

Clustering is the process by which terms are grouped when they are related, such as bike, endurance, and training. First the terms are split into related groups; the terms in each group are placed such that their vectors are almost parallel.

Clusters

In this example the first cluster is running; the second cluster is bike, endurance, and training; the third is band and music; and the fourth is fishes.

Analyzing the Term-Term Comparison

We will again use the SVD rank-k approximation, this time of G. The cosine of the angles becomes:

\cos\theta_{ij} = \frac{(G_k^T e_i)^T (G_k^T e_j)}{\|G_k^T e_i\|_2 \, \|G_k^T e_j\|_2}

Conclusion

Through the use of this model, many libraries and smaller collections can index their documents. However, as the next presentation will show, a different approach is used for large collections such as the Internet.
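The rank-reduction machinery above can be checked numerically as well. The sketch below (again Python/NumPy, not part of the original slides; it reuses A_norm and q from the previous sketch) computes the SVD, the relative errors of the rank-k approximations, rank-3 query matching, and the term-term cosines for the second example. The slides quote relative errors of roughly 19% (k = 3) and 42% (k = 2) for this example:

```python
import numpy as np

# A_norm and q as in the previous sketch.
U, s, Vt = np.linalg.svd(A_norm)  # A_norm = U @ diag(s) @ Vt

def rank_k(U, s, Vt, k):
    """Best rank-k approximation A_k = U_k Sigma_k V_k^T (Eckart-Young)."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative error of the rank-k approximation:
# ||A - A_k||_F / ||A||_F = sqrt(sum_{j>k} s_j^2) / sqrt(sum_j s_j^2).
for k in (3, 2):
    rel = np.sqrt(np.sum(s[k:] ** 2)) / np.sqrt(np.sum(s ** 2))
    print(f"rank-{k} relative error: {rel:.2f}")

# Query matching against the rank-3 approximation A_3.
A3 = rank_k(U, s, Vt, 3)
cosines = (A3.T @ q) / (np.linalg.norm(A3, axis=0) * np.linalg.norm(q))
print(np.round(cosines, 4))  # per the slides, D1 and D4 again clear 0.5

# Term-term comparison on the second example: cosines between the rows of G.
G = np.array([
    [1, 1, 1, 1, 1],  # T1: run(ning)
    [1, 0, 0, 0, 0],  # T2: bike
    [1, 0, 0, 0, 0],  # T3: endurance
    [1, 0, 1, 0, 1],  # T4: training
    [0, 0, 0, 1, 0],  # T5: band
    [0, 0, 0, 1, 0],  # T6: music
    [0, 1, 0, 0, 0],  # T7: fishes
], dtype=float)
G_rows = G / np.linalg.norm(G, axis=1, keepdims=True)
term_cosines = G_rows @ G_rows.T  # entry (i, j) compares terms i and j
print(np.round(term_cosines, 2))  # near-parallel rows form the clusters
```

In the term-cosine matrix, bike and endurance are exactly parallel, training sits close to both, and band pairs with music, which reproduces the clusters listed on the slides.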
