century

Information about century

Published on October 19, 2007

Author: Malden

Source: authorstream.com

Content

Slide1:  Finding Needles in the Internet Haystack Ron K. Cytron Washington University in Saint Louis Department of Computer Science http://www.cs.wustl.edu/~cytron/ Century Club May 2002 Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood, George Varghese (UCSD) Mahesh Jayaram Thanks: Ben Brodie Center for Distributed Object Computing Department of Computer Science Washington University Slide2:  Outline Computers have come a long way Slide3:  Outline Computers have come a long way Today’s computers are never lonely Slide4:  Outline Computers have come a long way Today’s computers are never lonely Volumes and volumes of data Slide5:  Outline Computers have come a long way Today’s computers are never lonely Volumes and volumes of data Fast searching of magnetic media Slide6:  Outline Computers have come a long way Today’s computers are never lonely Volumes and volumes of data Fast searching of magnetic media Internet packet filtering Slide7:  Outline Computers have come a long way Today’s computers are never lonely Volumes and volumes of data Fast searching of magnetic media Internet packet filtering Conclusion Slide8:  A Grandchild’s Gift Slide9:  If cars improved that much in 30 years … $4000 60,000 miles per hour Seats 10,000 people Gets 20,000 miles per gallon Breaks every 70 years The Haystack:  The Haystack The Internet is large and growing Content on the Internet is growing even faster A haystack sits still, but the Internet…. Slide11:  Growth of the Internet (why computers aren’t lonely anymore) Y2K Problem (?): More computers sold than TVs Slide12:  Growth of Internet Content (volumes and volumes of data) Anybody can publish Problem is how to find what you want Slide13:  Page 6B What can tech companies do? Some say they're at a loss, but others offer budding solutions By Kevin Maney On July 7, 1940, as the nation edged toward World War II, IBM put out a statement that made headlines. The company offered all its facilities for national defense, ready to convert to making anything the government needed. Other leaders in the electro-mechanical technology of the day -- Ford Motor, General Motors, General Electric -- also threw their weight into defense efforts. They switched from making cars and washing machines to building tanks, aircraft engines and machine guns. So here we are in 2001, readying for another war. The U.S. technology industry is the best and most innovative in the world. It is the nation's pride and joy. Shouldn't it do something? 9/17/2001 Slide14:  . . . One possibility is in data-mining technology. Data mining is a way to collect millions of pieces of information in a computer system, sift through that data, make sense of them and come up with something useful. ''We (the U.S. tech industry) are experts at data mining and have vast resources of data to mine,'' says Tom Evslin, CEO of Internet communications company ITXC. ''We have used it to target advertising. We can probably use it to identify suspicious activity or potential terrorists.'' . . . Slide15:  Fast searching of magnetic media with Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood Enabling Technology: Disk Drives:  Enabling Technology: Disk Drives Magnetic disk storage areal density vs. year of IBM product introduction (From D. A. Thompson) Almost 10,000,000x increase in 45 years! Cost per Megabyte:  Cost per Megabyte Price history of hard disk product vs. year of product introduction (From D. A. Thompson) Cost decreasing 3% per week! Massive Storage & Data:  ﴀ Storage industry will ship 4,000,000,000,000,000,000 Bytes this year FedEx generated 14 Terabytes of data last year US intelligence collects data equaling the printed collection of the US library every day! Massive Storage & Data Massive Data Sets:  Massive Data Sets Employee records Consumer information Maps/mission/intelligence data Genome maps Data sets now measured in Terabytes, and are dynamic! Genome Application:  Genome Application Genome maps growing expanded daily Wash U sequencing center Each of us has 80,000 genes found among 3 billion characters of DNA (A,C,G,T) Look for matches Identify function Disease: understand, diagnose, detect, medicine, therapy Biofuels, warfare, toxic waste Understand evolution Forensics, organ donors, authentication More effective crops, disease resistance DNA String Matching:  DNA String Matching Looking for CACGTTAGT…TAGC Interested in matches and near matches Search human genome and other gene oceans Need to search entire data sets Bio Computation Problem:  ﴀ Bio Computation Problem *BIG* Genome Databases A C G T G T A C A G DNA pattern DNA sequence Match? Approximate matches are just as useful Finding a needel in a heystuck:  Finding a needel in a heystuck DNA and live text can contain errors We often seek an approximate match, for example needle No match? Try 2-transpositions enedle, needle, nedele, neelde, needel No match? Try 1-deletions eedle, nedle, nedle, neele, neede, needl No match? Try insertions, larger edits, … An exponential number of possibilities How is this done today?:  No How is this done today? Think of every way a word can be misspelled Present each misspelling to the computer for an exact match enedle needle nedele neelde needel How can we do better?:  How can we do better? Data is present on magnetic media Hardware at the disk is Already fault tolerant (more on this later) needel  needle Distributed across all surfaces We win if number of misspellings is large, and the number of false hits is small Another Application:Intelligence Data:  Another Application:Intelligence Data Lots of data Changing constantly Many perturbations Tzar, tsar, czar, . . . Don’t know what we want to look for beforehand Slide27:  Google Search Engine Crawls the web once per month Caches web pages Fast, exact text-based search (see how soon) Image Database Applications:  Image Database Applications Challenging database Unstructured Massive data sets Don’t know what we need to look for in each picture Satellite Data:  Satellite Data Low-orbit fly-over every 90 minutes Look for differences in images Large objects Troops Changes to landscape Flag, transmit these differences immediately National Reconnaissance Office City assessors . . . Washington University:  Washington University Hilltop Campus How do we find what we’re looking for?!:  How do we find what we’re looking for?! Conventional Structured Database:  Conventional Structured Database Word James computer agent Bond Inverted list - pointers <1,2> <1,4> <2> <1,3,4> Madison <3> mobile <2> movie <3,4> Challenges in Searching Massive Databases:  Challenges in Searching Massive Databases Know what to search for need to build index beforehand maintain index as it changes Do not know what to search for need to search the whole database! Conventional Search:  Conventional Search Hard drive Processor Memory I/O bus Memory bus Conventional Search:  Conventional Search Hard drive Processor Memory I/O bus Memory bus find …. Conventional Search Conventional Search:  Conventional Search Hard drive Processor Memory I/O bus Memory bus contents yes, no, no, yes, yes …. Conventional Search Conventional Approach:  ﴀ Conventional Approach WUSTL’s Approach:  ﴀ WUSTL’s Approach Streaming Approach:  ﴀ Hard drive Processor Memory I/O bus Memory Bus Reconfigurable hardware Memory/ processing Streaming Approach Streaming Approach:  ﴀ Hard drive Processor Memory I/O bus Memory Bus Reconfigurable hardware Memory/ processing find Streaming Approach Streaming Approach:  ﴀ Hard drive Processor Memory I/O bus Memory Bus Reconfigurable hardware Memory/ processing find Streaming Approach Streaming Approach:  ﴀ Hard drive Processor Memory I/O bus Memory Bus Reconfigurable hardware Memory/ processing Parallelism through each transducer and drive find yes, no, no, yes, yes Streaming Approach Magnetic Recording Channel Schematic:  Magnetic Recording Channel Schematic Encoder Decoder Detector Input User Data Decoded User Data Channel Bits Head Disk Analog Readback A B C To Bus or Cache Key streaming over Data:  Key streaming over Data Disk Level Implementation:  Disk Level Implementation 100-bit-key matching through a pseudo-random binary series score matches Slide46:  Status: Prototype in progress ﴀ Hard drive Host ATAPI Controller IDE_to_ATM module Slide47:  Internet Packet Filtering with Mahesh Jayaram and George Varghese Slide48:  Finding Needles in a Moving Haystack Cost of Internet Request:  As technology improves, transmission time decreases but latency stays the same Year Cost of Internet Request Time Slide50:  Example: Garden Hose Fire department and gardener suffer the same wait Slide51:  Example: Hot Shower You want this water Latency (time to get hot water) ~ distance Slide52:  Convection circuit continuously circulates hot water Latency ~ 0 Latency-Free Hot Shower Slide53:  Better to receive than to give Cable broadcast Radio broadcast TV guide channel Gate connection announcements in flight Winning lottery number Modern name: push technology Better to receive than to give:  Better to receive than to give How do you get what you want?:  How do you get what you want? Slide56:  Packet Filters Filter F (Weather) Slide57:  Packet Filters Filter F (Weather) Existing Approach:  Existing Approach IBM Quote Weather Flight Schedule Our approach:  Our approach IBM Quote Weather Flight Schedule Composite filter makes just one pass Slide60:  How we do it IBM Quote Weather Flight Schedule Slide61:  TCPConnHeader : EtherType IPHeader TCPPortPair EtherType : #IP_TYPE IPHeader : Vers HlenPlusRest Vers : HalfByte HlenPlusRest : 0 1 0 1 FixedRest | 0 1 1 0 FixedRest OneIPOption | 0 1 1 1 FixedRest TwoIPOption | 1 0 0 0 FixedRest ThreeIPOption | 1 0 0 1 FixedRest FourIPOption | 1 0 1 0 FixedRest FiveIPOption | 1 0 1 1 FixedRest FiveIPOption OneIPOption | 1 1 0 0 FixedRest FiveIPOption TwoIPOption | 1 1 0 1 FixedRest FiveIPOption ThreeIPOption | 1 1 1 0 FixedRest FiveIPOption FourIPOption | 1 1 1 1 FixedRest FiveIPOption FiveIPOption FixedRest : ServiceType TotalLength Identification Flags FragmentOffset TimeToLive Protocol HeaderChecksum IPAddrPair ServiceType : Byte TotalLength : TwoByte Identification : TwoByte Flags : bit bit bit FragmentOffset : bit Byte HalfByte TimeToLive : Byte Protocol : #TCP_PROTOCOL HeaderChecksum : TwoByte IPAddrPair : #IP_SRC_DST_PAIR FiveIPOption : ThreeIPOption TwoIPOption FourIPOption : TwoIPOption TwoIPOption ThreeIPOption : TwoIPOption OneIPOption TwoIPOption : OneIPOption OneIPOption OneIPOption : Option Padding Option : ThreeByte Padding : Byte TCPPortPair : #TCP_PORT_PAIR FourByte : TwoByte TwoByte ThreeByte : TwoByte Byte TwoByte : Byte Byte Byte : HalfByte HalfByte HalfByte : bit bit bit bit bit : 0 | 1 Sample grammar for TCP packet Slide62:  Results The more things you want, the slower existing approaches get Our performance doesn’t degrade Slide63:  Conclusions The Internet and its content are growing explosively Disk storage is abundant, cheap, reliable Technology must provide fast, inexact searching of text and images As more data is hurled at and past us, fast filtering of Internet traffic is a must Slide64:  Questions?

Related presentations


Other presentations created by Malden

Brickner2
02. 05. 2008
0 views

Brickner2

staple foods
04. 10. 2007
0 views

staple foods

2003 prelims
05. 10. 2007
0 views

2003 prelims

Radar7
07. 10. 2007
0 views

Radar7

Challenges in the asian market
09. 10. 2007
0 views

Challenges in the asian market

prefix delegation requirement 01
09. 10. 2007
0 views

prefix delegation requirement 01

CBCH3
10. 10. 2007
0 views

CBCH3

nile croc
11. 10. 2007
0 views

nile croc

320 circulation 2
12. 10. 2007
0 views

320 circulation 2

NIKE
12. 10. 2007
0 views

NIKE

phone int may99
18. 10. 2007
0 views

phone int may99

Planning Your Career
21. 10. 2007
0 views

Planning Your Career

Havlik pres
12. 10. 2007
0 views

Havlik pres

The agricultural revolution
23. 10. 2007
0 views

The agricultural revolution

Lesson 13 War in the Atlantic
23. 10. 2007
0 views

Lesson 13 War in the Atlantic

uml
24. 10. 2007
0 views

uml

BOJapan
09. 10. 2007
0 views

BOJapan

ESYS150 06 lect9
30. 10. 2007
0 views

ESYS150 06 lect9

Graduation Final Project
22. 10. 2007
0 views

Graduation Final Project

PresentaciÃn Coopagua
22. 10. 2007
0 views

PresentaciÃn Coopagua

Neck infect 980225
03. 01. 2008
0 views

Neck infect 980225

Finished project
03. 01. 2008
0 views

Finished project

Holland Hoisting
04. 01. 2008
0 views

Holland Hoisting

Middle School Seminar
04. 01. 2008
0 views

Middle School Seminar

Properties of Water
05. 01. 2008
0 views

Properties of Water

Campaigns Elections presentation
07. 01. 2008
0 views

Campaigns Elections presentation

Sustainable Energy
07. 01. 2008
0 views

Sustainable Energy

dossier level club para eventos
23. 10. 2007
0 views

dossier level club para eventos

MAROKKO 2004
24. 10. 2007
0 views

MAROKKO 2004

Vortrag Journee Partenaires07
17. 10. 2007
0 views

Vortrag Journee Partenaires07

Ahmed Presentation
21. 11. 2007
0 views

Ahmed Presentation

A Exam Sethu
30. 09. 2007
0 views

A Exam Sethu

bresil ecotourisme
20. 11. 2007
0 views

bresil ecotourisme

Chapter10
02. 10. 2007
0 views

Chapter10

85VideoCompress
27. 02. 2008
0 views

85VideoCompress

lis618p03s 08
08. 10. 2007
0 views

lis618p03s 08

0170128 indiv all
16. 03. 2008
0 views

0170128 indiv all

ECOMM2003 conclusion
20. 03. 2008
0 views

ECOMM2003 conclusion

GPTM 20 Aniversary
10. 04. 2008
0 views

GPTM 20 Aniversary

L7
15. 10. 2007
0 views

L7

healthyfamilyppt
04. 03. 2008
0 views

healthyfamilyppt

cdrllc
22. 04. 2008
0 views

cdrllc

Web 2.0 Technologies
30. 10. 2007
0 views

Web 2.0 Technologies

AngelDevil
01. 10. 2007
0 views

AngelDevil

Enagic
07. 05. 2008
0 views

Enagic

April2006Sandra
08. 05. 2008
0 views

April2006Sandra

3017TechThemesPPT
20. 02. 2008
0 views

3017TechThemesPPT

Maintenance
30. 04. 2008
0 views

Maintenance

Ethics23Apr08
02. 05. 2008
0 views

Ethics23Apr08

mis1
02. 05. 2008
0 views

mis1

self study x rayfilm screens
02. 05. 2008
0 views

self study x rayfilm screens

idlingpresentationpr oposed
26. 02. 2008
0 views

idlingpresentationpr oposed

NKKM
02. 05. 2008
0 views

NKKM

bsirl book
02. 05. 2008
0 views

bsirl book

8thForum ClosingPlenaryInclTB G3
12. 03. 2008
0 views

8thForum ClosingPlenaryInclTB G3

NASPL speech without pics
07. 04. 2008
0 views

NASPL speech without pics

1st review
15. 10. 2007
0 views

1st review

docNews404
19. 10. 2007
0 views

docNews404

Leo Visconti 01
16. 11. 2007
0 views

Leo Visconti 01

Chap11
09. 04. 2008
0 views

Chap11

aula eso 1103
24. 10. 2007
0 views

aula eso 1103

biografia
23. 10. 2007
0 views

biografia

5 3
11. 10. 2007
0 views

5 3

IHY ParticiaptingCountri es
19. 10. 2007
0 views

IHY ParticiaptingCountri es

11 Cosmopolitanism
31. 12. 2007
0 views

11 Cosmopolitanism

Online Orientation 8 10 07
06. 12. 2007
0 views

Online Orientation 8 10 07

guida
22. 10. 2007
0 views

guida

ds web host
18. 04. 2008
0 views

ds web host

moni200510
10. 10. 2007
0 views

moni200510

Indicators Billings
07. 01. 2008
0 views

Indicators Billings

seml
29. 09. 2007
0 views

seml

Introduction national 2007
03. 10. 2007
0 views

Introduction national 2007

GUS KOEHLERCalEd4 03
11. 04. 2008
0 views

GUS KOEHLERCalEd4 03

6 Macabrey
17. 10. 2007
0 views

6 Macabrey

YEP05
10. 12. 2007
0 views

YEP05