lecture19


Published on January 4, 2008

Author: Ariane

Source: authorstream.com

Content

CS 430 / INFO 430 Information Retrieval
Lecture 19: Web Search 1

Course Administration

No classes: Wednesday, November 16 and Thursday, November 17.

Web Search

Goal: provide information discovery for large amounts of open-access material on the web.

Challenges
• Volume of material -- several billion items, growing steadily
• Items created dynamically or held in databases
• Great variety -- length, formats, quality control, purpose, etc.
• Inexperience of users -- wide range of needs
• Economic models to pay for the service

Strategies

Subject hierarchies
• Use of human indexing -- Yahoo! (original)
Web crawling + automatic indexing
• General -- Infoseek, Lycos, AltaVista, Google, Yahoo! (current)
Mixed models
• Human-directed web crawling and automatic indexing -- iVia/NSDL

Components of a Web Search Service

Components
• Web crawler
• Indexing system
• Search system
• Advertising system
Considerations
• Economics
• Scalability
• Legal issues

Lectures and Classes

Lecture 19: Web Crawling
Discussion 9: Ranking Web documents
Lecture 20: Graphical methods
Lecture 21: Context and performance
Discussion 10: File systems
Lecture 23: User interface considerations

Web Searching: Architecture

[Diagram: web crawlers build a single index to all Web pages, which the search system queries.]
• Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.)
• The central index is implemented as a single system running on a very large number of computers.
Examples: Google, Yahoo!

What is a Web Crawler?

A web crawler (also known as a web spider) is a program for downloading web pages.
• Given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set.
• A focused web crawler downloads only those pages whose content satisfies some criterion.

Simple Web Crawler Algorithm

Let S be the set of URLs to pages waiting to be indexed. Initially S is a set of known seeds.
1. Take an element u of S and retrieve the page, p, that it references.
2. Parse the page p and extract the set of URLs L that it links to.
3. Update S = S + L - u.
4. Repeat as many times as necessary.
[Large production crawlers may run continuously.]
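The basic algorithm above maps almost directly onto code. The following is a minimal, illustrative sketch in Python using only the standard library; the seed URL, page limit, breadth-first queue discipline, and names such as LinkExtractor and crawl are assumptions made for the example, not part of the lecture.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects href attributes from <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)        # S: URLs waiting to be fetched
        seen = set(seeds)              # URLs already scheduled (duplicate elimination)
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()   # breadth-first: take the oldest URL u in S
            try:
                with urlopen(url, timeout=10) as response:
                    if "text/html" not in response.headers.get("Content-Type", ""):
                        continue
                    page = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue               # broken links and time-outs are simply skipped
            fetched += 1
            parser = LinkExtractor()
            parser.feed(page)
            for link in parser.links:  # L: the links extracted from page p
                absolute, _ = urldefrag(urljoin(url, link))   # resolve relative URLs
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)   # S = S + L - u
        return seen

    if __name__ == "__main__":
        # Illustrative seed; a real crawler would also respect robots.txt
        # (see the Robots Exclusion slides below) and politeness limits.
        print(len(crawl(["http://example.com/"], max_pages=5)))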
Not so Simple...

Performance -- How do you crawl 1,000,000,000 pages?
Politeness -- How do you avoid overloading servers?
Legal -- What if the owner of a page does not want the crawler to index it?
Failures -- Broken links, time-outs, spider traps.
Strategies -- How deep do we go? Depth first or breadth first?
Implementations -- How do we store and update S and the other data structures needed?

What to Retrieve

No web crawler retrieves everything. Most crawlers retrieve:
• HTML (leaves and nodes in the tree)
• ASCII clear text (only as leaves in the tree)
Some also retrieve PDF, PostScript, ...
Indexing after the crawl:
• Some index only the first part of long files.
• Do you keep the files (e.g., the Google cache)?

Robots Exclusion

The Robots Exclusion Protocol: a Web site administrator can indicate which parts of the site should not be visited by a robot by providing a specially formatted file on the site, at http://.../robots.txt.

The Robots META tag: a Web author can indicate whether a page may be indexed, or analyzed for links, through the use of a special HTML META tag.

See: http://www.robotstxt.org/wc/exclusion.html

Robots Exclusion: Example file /robots.txt

# Rules for all robots
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

# To allow Cybermapper
User-agent: cybermapper
Disallow:

Extracts from http://www.nytimes.com/robots.txt

# robots.txt, www.nytimes.com 3/24/2005
User-agent: *
Disallow: /college
Disallow: /reuters
Disallow: /cnet
Disallow: /partners
Disallow: /archives
Disallow: /indexes
Disallow: /thestreet
Disallow: /nytimes-partners
Disallow: /financialtimes
Allow: /2004/
Allow: /2005/
Allow: /services/xml/

User-agent: Mediapartners-Google*
Disallow:

The Robots META tag

The Robots META tag allows HTML authors to indicate to visiting robots whether a document may be indexed, or used to harvest more links. No server administrator action is required. Note that currently only a few robots implement this.

In this simple example:
<meta name="robots" content="noindex, nofollow">
a robot should neither index this document, nor analyze it for links.

http://www.robotstxt.org/wc/exclusion.html#meta
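As an illustration of how a crawler might honour the exclusion rules above, here is a small sketch using Python's standard urllib.robotparser module. The function name, the per-host caching scheme, the "ExampleCrawler" user-agent string, and the example URLs are assumptions made for the sketch, not part of the protocol itself.

    from urllib.parse import urlparse, urlunparse
    from urllib.robotparser import RobotFileParser

    _robots_cache = {}   # one parsed robots.txt per (scheme, host)

    def allowed(url, user_agent="ExampleCrawler"):
        """Return True if the host's robots.txt permits fetching this URL."""
        parts = urlparse(url)
        key = (parts.scheme, parts.netloc)
        if key not in _robots_cache:
            rp = RobotFileParser()
            rp.set_url(urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", "")))
            try:
                rp.read()            # fetch and parse robots.txt once per host
            except OSError:
                rp = None            # robots.txt unreachable: treat as allowed here
            _robots_cache[key] = rp
        rp = _robots_cache[key]
        return True if rp is None else rp.can_fetch(user_agent, url)

    # Example usage; the result depends on the live robots.txt at the time you
    # run it. Under the 2005 rules quoted above, a generic robot could fetch
    # /2005/... but not /college.
    print(allowed("http://www.nytimes.com/2005/"))
    print(allowed("http://www.nytimes.com/college"))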
High Performance Web Crawling

The web is growing fast:
• To crawl a billion pages a month, a crawler must download about 400 pages per second.
• Internal data structures must scale beyond the limits of main memory.
Politeness:
• A web crawler must not overload the servers that it is downloading from.

Example: Mercator and Heritrix Crawlers

AltaVista was a research project and production Web search engine developed by Digital Equipment Corporation.

Mercator was a high-performance crawler for production and research, developed by Allan Heydon, Marc Najork, Raymie Stata, and colleagues at the Compaq Systems Research Center (a continuation of the work of Digital's AltaVista group).

Heritrix is a high-performance, open-source crawler developed by Raymie Stata and colleagues at the Internet Archive. (Stata is now at Yahoo!)

Mercator and Heritrix are described together here, but there are major implementation differences.

Mercator/Heritrix: Design Goals

Broad crawling: large, high-bandwidth crawls that sample as much of the Web as possible given the time, bandwidth, and storage resources available.
Focused crawling: small- to medium-sized crawls (usually less than 10 million unique documents) in which the quality criterion is complete coverage of selected sites or topics.
Continuous crawling: crawls that revisit previously fetched pages, looking for changes and new pages, even adapting the crawl rate based on parameters and estimated change frequencies.
Experimental crawling: experiments with crawling techniques, such as the choice of what to crawl, the order in which it is crawled, crawling using diverse protocols, and the analysis and archiving of crawl results.

Mercator/Heritrix: Design Parameters

• Extensible. Many components are plug-ins that can be rewritten for different tasks.
• Distributed. A crawl can be distributed in a symmetric fashion across many machines.
• Scalable. The size of in-memory data structures is bounded.
• High performance. Performance is limited by the speed of the Internet connection (e.g., with a 160 Mbit/sec connection, downloads 50 million documents per day).
• Polite. Options of weak or strong politeness.
• Continuous. Will support continuous crawling.

Mercator/Heritrix: Main Components

Scope: determines which URIs are ruled into or out of a certain crawl. Includes the seed URIs used to start a crawl, plus the rules that determine which discovered URIs are also to be scheduled for download.
Frontier: tracks which URIs are scheduled to be collected and which have already been collected. It is responsible for selecting the next URI to be tried, and prevents the redundant rescheduling of already-scheduled URIs.
Processor Chains: modular processors that perform specific, ordered actions on each URI in turn. These include fetching the URI, analyzing the returned results, and passing discovered URIs back to the Frontier.

Building a Web Crawler: Links are not Easy to Extract and Record

• Relative/absolute URLs
• CGI parameters
• Dynamic generation of pages
• Server-side scripting
• Server-side image maps
• Links buried in scripting code
Keeping track of the URLs that have been visited is a major component of a crawler.

Mercator/Heritrix: Main Components (continued)

• Crawling is carried out by multiple worker threads, e.g., 500 threads for a big crawl.
• The URL frontier stores the list of absolute URLs to download.
• The DNS resolver resolves domain names into IP addresses.
• Protocol modules download documents using the appropriate protocol (e.g., HTTP).
• The link extractor extracts URLs from pages and converts them to absolute URLs.
• The URL filter and duplicate URL eliminator determine which URLs to add to the frontier.

Mercator/Heritrix: The URL Frontier

A repository with two pluggable methods: add a URL, get a URL. Most web crawlers use variations of breadth-first traversal, but...
• Most URLs on a web page are relative (about 80%).
• A single FIFO queue, serving many threads, would send many simultaneous requests to a single server.
Weak politeness guarantee: only one thread is allowed to contact a particular web server at a time.
Stronger politeness guarantee: maintain n FIFO queues, each for a single host, which feed the queues for the crawling threads according to rules based on priority and politeness factors.
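To make the per-host politeness idea concrete, here is a minimal single-threaded sketch of a frontier with one FIFO queue per host and a fixed delay between contacts to the same host. The class name PoliteFrontier, the min_delay parameter, and the interface are illustrative assumptions; the real Mercator/Heritrix frontier is far more elaborate (priorities, politeness factors, many worker threads).

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    class PoliteFrontier:
        def __init__(self, min_delay=1.0):
            self.queues = defaultdict(deque)   # one FIFO queue per host
            self.next_ok = {}                  # earliest time each host may be contacted again
            self.min_delay = min_delay

        def add(self, url):
            self.queues[urlparse(url).netloc].append(url)

        def get(self):
            """Return a URL whose host may be contacted now, or None if all hosts must wait."""
            now = time.monotonic()
            for host, queue in self.queues.items():
                if queue and self.next_ok.get(host, 0.0) <= now:
                    self.next_ok[host] = now + self.min_delay
                    return queue.popleft()
            return None

    # Example usage: the second request to example.com is deferred in favour
    # of a different host, because example.com must wait min_delay seconds.
    frontier = PoliteFrontier(min_delay=2.0)
    frontier.add("http://example.com/a")
    frontier.add("http://example.com/b")
    frontier.add("http://example.org/x")
    print(frontier.get())   # http://example.com/a
    print(frontier.get())   # http://example.org/x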
Mercator/Heritrix: Duplicate URL Elimination

Duplicate URLs are not added to the URL frontier. This requires an efficient data structure to store all URLs that have been seen and to check each new URL against it.
In memory: represent each URL by an 8-byte checksum and maintain an in-memory hash table of URLs. Requires 5 gigabytes for 1 billion URLs.
Disk based: a combination of a disk file and an in-memory cache, with batch updating to minimize disk head movement.

Mercator/Heritrix: Domain Name Lookup

Resolving domain names to IP addresses is a major bottleneck of web crawlers.
Approach:
• Separate DNS resolver and cache on each crawling computer.
• Create a multi-threaded version of the DNS code (BIND).
In Mercator, these changes reduced DNS lookup from 70% to 14% of each thread's elapsed time.

Research Topics in Web Crawling

• How frequently to crawl and what strategies to use.
• Identification of anomalies and crawling traps.
• Strategies for crawling based on the content of web pages (focused and selective crawling).
• Duplicate detection.

Crawling to Build an Historical Archive

Internet Archive: http://www.archive.org
A not-for-profit organization in San Francisco, created by Brewster Kahle, to collect and retain digital materials for future historians. Its services include the Wayback Machine.

Further Reading

Heritrix: http://crawler.archive.org/

Allan Heydon and Marc Najork, Mercator: A Scalable, Extensible Web Crawler. Compaq Systems Research Center, June 26, 1999. http://www.research.compaq.com/SRC/mercator/papers/www/paper.html
