sample solution

Information about sample solution

Published on January 9, 2008

Author: Bianca

Source: authorstream.com

Content

CS/Info 430: Information Retrieval:  CS/Info 430: Information Retrieval Sample Midterm Examination Notes on the Solutions Midterm Examination -- Question 1:  Midterm Examination -- Question 1 1(a) Define the terms inverted file, inverted list, posting. Inverted file: a list of the words in a set of documents and the documents in which they appear. Inverted list: All the entries in an inverted file that apply to a specific word. Posting: Entry in an inverted list 1(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval? Q1 (continued):  Q1 (continued) Storage Inverted files are big, typically 10% to 100% the size of the collection of documents. Update performance It must be possible, with a reasonable amount of computation, to: (a) Add a large batch of documents (b) Add a single document Retrieval performance Retrieval must be fast enough to satisfy users and not use excessive resource. Q1 (continued):  Q1 (continued) 1(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.   (i) What file structure(s) would you use?   (ii) How well does your design satisfy the criteria listed in Part (b)? Q1 (continued):  Q1 (continued) Term Pointer to list of postings ant bee cat dog elk fox gnu hog Lists of postings Separate inverted index from lists of postings inverted index postings file Question 1 (continued):  Question 1 (continued) (a) Postings file may be stored sequentially as a linked list. (b) Index file is best stored as a tree. Binary trees provide fast searching but have problems with updating. B-trees are better, with B+-trees as the best. Note: Other answers are possible to this part of the question. Question 1 (continued):  Question 1 (continued) 1(c)(ii) How well does your design satisfy the criteria listed in Part (b)? • Sequential list for each term is efficient for storage and for processing Boolean queries. The disadvantage is a slow update time for long inverted lists. • B-trees combine fast retrieval with moderately efficient updating. • Bottom-up updating is usual fast, but may require recursive tree climbing to the root. • The main weakness is poor storage utilization; typically buckets are only 0.69 full. Midterm Examination -- Question 2:  Midterm Examination -- Question 2 2(b) You have the collection of documents that contain the following index terms: D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta (i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting. Incidence array:  Incidence array D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta alpha bravo charlie delta echo foxtrot golf D1 1 1 1 1 1 1 1 D2 1 1 1 D3 1 1 1 1 D4 1 1 1 1 7 3 4 4 Document similarity matrix:  Document similarity matrix D1 D2 D3 D4 D1 0.65 0.76 0.76 D2 0.65 0.00 0.87 D3 0.76 0.00 0.25 D4 0.76 0.87 0.25 Question 2 (continued):  Question 2 (continued) 2b(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency. Frequency Array :  Frequency Array D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta alpha bravo charlie delta echo foxtrot golf D1 1 1 1 1 1 1 1 D2 1 1 3 D3 3 1 1 1 D4 2 1 1 2 Inverse Document Frequency Weighting:  Inverse Document Frequency Weighting Principle: (a) Weight is proportional to the number of times that the term appears in the document (b) Weight is inversely proportional to the number of documents that contain the term: wik = fik / dk Where: wik is the weight given to term k in document i fik is the frequency with which term k appears in document i dk is the number of documents that contain term k Frequency Array with Weights:  Frequency Array with Weights D1: alpha bravo charlie delta echo foxtrot golf D2: golf golf golf delta alpha D3: bravo charlie bravo echo foxtrot bravo D4: foxtrot alpha alpha golf golf delta alpha bravo charlie delta echo foxtrot golf D1 0.33 0.50 0.50 0.33 0.50 0.33 0.33 D2 0.33 0.33 1.00 D3 1.50 0.50 0.50 0.33 D4 0.67 0.33 0.33 0.67 dk 3 2 2 3 2 3 3 length 1.09 1.11 1.69 1.05 Lengths have been corrected. Document similarity matrix:  Document similarity matrix D1 D2 D3 D4 D1 0.46 0.74 0.58 D2 0.46 0.00 0.86 D3 0.74 0.00 0.06 D4 0.56 0.86 0.06 Question 3:  Question 3 3(a) Define the terms recall and precision. 3(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned. (i) How in a practical experiment would you calculate the precision? Have an expert examine each of the 200 documents and decide whether it is relevant. Precision is number judged relevant divided by 200. (ii) How in a practical experiment would you calculate the recall? It is not feasible to examine 1,000,000 records. Therefore sampling must be used ... Question 3 (continued):  Question 3 (continued) 3(c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant. (i) What is the precision? 50/200 = 0.25 (ii) What is the recall? 50/100 = 0.5 3(d) Explain in general terms the method used by TREC to estimate the recall. Question 3 (continued):  Question 3 (continued) For each query, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant The human expert who set the query looks at every document in the pool and determines whether it is relevant. Documents outside the pool are not examined. [In TREC-8: 7,100 documents in the pool 1,736 unique documents (eliminating duplicates) 94 judged relevant]

Related presentations


Other presentations created by Bianca

ESP
13. 02. 2008
0 views

ESP

GAETC GCSS GALILEO GPS
02. 05. 2008
0 views

GAETC GCSS GALILEO GPS

General Human Rights
09. 01. 2008
0 views

General Human Rights

1a LINE ARTs
10. 01. 2008
0 views

1a LINE ARTs

lofar lagendoen
11. 01. 2008
0 views

lofar lagendoen

arv1 introduction
12. 01. 2008
0 views

arv1 introduction

gravity
14. 01. 2008
0 views

gravity

allergy facts
14. 01. 2008
0 views

allergy facts

ReneMagritte2005
14. 01. 2008
0 views

ReneMagritte2005

sara initiative
14. 01. 2008
0 views

sara initiative

Cultural Considerations lecture
14. 01. 2008
0 views

Cultural Considerations lecture

Retirement Benefits ERS and TRS
16. 01. 2008
0 views

Retirement Benefits ERS and TRS

2007 Masterclass PPoint1
17. 01. 2008
0 views

2007 Masterclass PPoint1

KENNEDY
17. 01. 2008
0 views

KENNEDY

Unit7
18. 01. 2008
0 views

Unit7

Hazardous Waste
18. 01. 2008
0 views

Hazardous Waste

NREL
21. 01. 2008
0 views

NREL

Overview UOP Yields
23. 01. 2008
0 views

Overview UOP Yields

SoftwareEstimation1
19. 01. 2008
0 views

SoftwareEstimation1

stephTSEAwebcast1
04. 02. 2008
0 views

stephTSEAwebcast1

DigitalNatives
04. 02. 2008
0 views

DigitalNatives

20041112JM SC04 glifpanel
04. 02. 2008
0 views

20041112JM SC04 glifpanel

J Morduch
05. 02. 2008
0 views

J Morduch

dooling
05. 02. 2008
0 views

dooling

sd alignment
11. 02. 2008
0 views

sd alignment

AxegÃrd ICEP Biorefinery final
12. 02. 2008
0 views

AxegÃrd ICEP Biorefinery final

AreWeEndangered
15. 01. 2008
0 views

AreWeEndangered

dose response demo
09. 01. 2008
0 views

dose response demo

etracey012407
28. 01. 2008
0 views

etracey012407

CC Nutrition Assessment Data All
28. 01. 2008
0 views

CC Nutrition Assessment Data All

Chapter 03 Problem Solving
29. 01. 2008
0 views

Chapter 03 Problem Solving

bussines models
31. 01. 2008
0 views

bussines models

PHSC3033 Clouds
06. 02. 2008
0 views

PHSC3033 Clouds

of korin
07. 02. 2008
0 views

of korin

nnd
07. 02. 2008
0 views

nnd

cairn energy3
12. 02. 2008
0 views

cairn energy3

pps 316
14. 02. 2008
0 views

pps 316

101live
14. 02. 2008
0 views

101live

DotNET Turma A
21. 02. 2008
0 views

DotNET Turma A

prius 2
05. 02. 2008
0 views

prius 2

Arthritis
27. 02. 2008
0 views

Arthritis

2002 04 19 Viscosity
09. 01. 2008
0 views

2002 04 19 Viscosity

andy warhol
29. 02. 2008
0 views

andy warhol

DG4488
08. 03. 2008
0 views

DG4488

ActuarialProfession0 106
12. 03. 2008
0 views

ActuarialProfession0 106

JForest 911 Impact
14. 03. 2008
0 views

JForest 911 Impact

jotspot long tail sw
16. 03. 2008
0 views

jotspot long tail sw

The truth revealed
20. 03. 2008
0 views

The truth revealed

nsf irnc mar05
31. 03. 2008
0 views

nsf irnc mar05

RobertPBrooks 9 50am
08. 01. 2008
0 views

RobertPBrooks 9 50am

target public
11. 01. 2008
0 views

target public

InterscholasticSports
23. 01. 2008
0 views

InterscholasticSports

Lec09 CHEBYSHEV AND BERNOULLI
13. 02. 2008
0 views

Lec09 CHEBYSHEV AND BERNOULLI

alma11
17. 04. 2008
0 views

alma11

Pichugina
14. 02. 2008
0 views

Pichugina

funding opportunities
18. 04. 2008
0 views

funding opportunities

RED cameron
21. 04. 2008
0 views

RED cameron

1175259417 sb myregionrhodes
22. 04. 2008
0 views

1175259417 sb myregionrhodes

Lisboa Portugalfinalversion
24. 04. 2008
0 views

Lisboa Portugalfinalversion

SummaryStatistics
16. 01. 2008
0 views

SummaryStatistics

Notification MTG 051706 ppt
17. 01. 2008
0 views

Notification MTG 051706 ppt

WTO4
07. 05. 2008
0 views

WTO4

Abbott 6 3 06
08. 05. 2008
0 views

Abbott 6 3 06

dewey shelf numbers
30. 04. 2008
0 views

dewey shelf numbers

Tzimas
02. 05. 2008
0 views

Tzimas

MJennings
19. 03. 2008
0 views

MJennings

100603transportation future
15. 03. 2008
0 views

100603transportation future

GETS 2003Ch
25. 03. 2008
0 views

GETS 2003Ch

APS02
10. 01. 2008
0 views

APS02

FNM0702 prev CD
12. 01. 2008
0 views

FNM0702 prev CD

Sonnerup uppsala2007
22. 01. 2008
0 views

Sonnerup uppsala2007

409awebinarslides
13. 01. 2008
0 views

409awebinarslides

The Perfect Day
05. 02. 2008
0 views

The Perfect Day

Biedermeier
16. 04. 2008
0 views

Biedermeier

VGP Careers
03. 04. 2008
0 views

VGP Careers

1 Medieval PPT 8 31
20. 01. 2008
0 views

1 Medieval PPT 8 31

miis
10. 01. 2008
0 views

miis

12a
26. 01. 2008
0 views

12a

040210ProFailure
15. 04. 2008
0 views

040210ProFailure

1162839394 ITI v5 0
08. 04. 2008
0 views

1162839394 ITI v5 0

11 SLIDES edited 6 7
07. 04. 2008
0 views

11 SLIDES edited 6 7

2007 07
29. 01. 2008
0 views

2007 07