pfusion kth 28 12 2006

Information about pfusion kth 28 12 2006

Published on April 30, 2008

Author: Tommaso

Source: authorstream.com

Content

Content-Based Search in Internet-Scale Peer-to-Peer Systems:  Content-Based Search in Internet-Scale Peer-to-Peer Systems by Demetris Zeinalipour Visiting Lecturer Department of Computer Science University of Cyprus Thursday, December 28th, 2006 Royal Institute of Technology (KTH) Stockholm, Sweden http://www.cs.ucy.ac.cy/~dzeina/ Presentation Goals:  Presentation Goals To provide an overview of Content-Based Search Algorithms in P2P systems with an emphasis on ISM. To provide an overview of Topologically-Aware overlay construction mechanisms in P2P systems with an emphasis on DDNO. To present other research activities that our group is currently involved in. References Related to this Talk:  References Related to this Talk "pFusion: An Architecture for Internet-Scale Content-Based Search and Retrieval" by D. Zeinalipour-Yazti, V. Kalogeraki, D. Gunopulos, IEEE Transactions on Parallel and Distributed Systems, (IEEE TPDS), accepted, 2006. "Structuring Topologically-Aware Overlay Networks using Domain Names", D. Zeinalipour-Yazti, V. Kalogeraki, Computer Networks (Comnet), Elsevier Publications, Volume 50, Issue 16 , pp. 3064-3082, 2006. "Exploiting Locality for Scalable Information Retrieval in Peer-to-Peer Systems", D. Zeinalipour-Yazti, V. Kalogeraki and D. Gunopulos, Information Systems (InfoSys), Volume 30, Issue 4, Pages 277-298, 2005. Introduction to Peer-to-Peer:  Introduction to Peer-to-Peer The P2P Computing paradigm became a powerful model for developing infrastructure-less Internet-Scale systems. Internet-Scale: Large number of geographically spread nodes. Fascinating Applications : File-sharing (e.g. Napster, Gnutella, eDonkey,...) Internet Telephony (e.g. Skype from Kazaa team) Spam Detection Networks (e.g. SpamNet) Web Caching (e.g. SQUIRREL based on Pastry) Web Anonymizers (e.g. Tarzan) Distributed Computing (e.g. [email protected]) P2P Online Gaming (e.g. Sony Playstation), Content Distribution Networks (e.g. Coral) Slide5:  A) Content-Based Search in P2P Systems Problem Definition:  Problem Definition Setting: A network of peers where each node maintains a collection of documents (vector of keywords | image features, etc) Goals: Effectively query the distributed documents by keywords. Consume the less possible network resources. Search in Unstructured P2P Environments :  Search in Unstructured P2P Environments Agnostic Techniques a) TTL-based Breadth-First-Search (BFS) + Each Node forwards the query to all its neighbors. - Excessive network and resource consumption. b) Random BFS (RBFS) [CIKM’02] + Each Node forwards the query to a random subset of neighbors. - Some important segments may become unreachable. Search in Unstructured P2P Environments :  Search in Unstructured P2P Environments Techniques using Past Statistics c) Most Results in Past Heuristic (>RES) [IPTPS’02] Query the nodes with the most results in the last K queries Usually explores the larger network segments but …. … fails to explore the nodes with the most relevant content d) Intelligent Search Mechanism (ISM) [CIKM’02] Each Node maintains a query(hit) profile for its neighbors Uses the cosine similarity to drive the queries to the results Motivation:  Motivation We crawled the Gnutella P2P Network for 5 hours with 17 workstations. We analyzed 15,153,524 query messages. Observation: High locality of specific queries. We try to exploit this property for more efficient searches Search in Unstructured P2P Environments:  Search in Unstructured P2P Environments Intelligent Search Mechanism (ISM) a) Profile mechanism b) Cosine Similarity – The Similarity Function c) RelevanceRank – Ranking Neighbors by similarity |L|-dim space: {olympic,games,vldb,athens,florida,storm} e.g. If q=“athens olympic” => q (vector of q) = [1,0,0,1,0,0] p1 p2 p4 p3 Current Node Experimental Evaluation:  Experimental Evaluation We developed a distributed News Agency (useful for Citizen Journalism, Video Sharing) using our open source Peerware system. http://www.cs.ucr.edu/~csyiazti/peerware.html Each peer utilizes Apache’s Lucene Information Retrieval system to efficiently organize information locally. We run experiments on 75 workstations (up to 1000 peers connected in a random topology) with datasets from Reuters and TREC Los Angeles Times. We compare Recall Rate vs. Num. of messages Experimental Evaluation:  Experimental Evaluation Experimental Evaluation:  Experimental Evaluation Summary of Results The ISM achieves in some cases 100% Recall Rate while using 40-50% less Messages and 30-40% less Time than BFS. ISM performs extremely well under failures – high churn rates (10% failure rate still yields 85% recall rate) Scales well to large environments (since only local information is utilized) Performs best with high locality of queries Slide14:  B) Topologically-Aware Overlay Networks Network Mismatch:  Network Mismatch P2P Networks are usually network-agnostic. Physical with Overlay Network Mismatch The Random Overlay Network The Topologically Aware Network Network-Efficient Topologies:  Network-Efficient Topologies The network mismatch between the Physical and the Overlay layer results in high latencies and excessive network resource consumption. Smaller Latency => Faster Interaction and Higher Data Transfer rates because of TCP windowing. We will discuss the following topologies RANDOM Topology. (Network-Agnostic) Short-Long (SL) Topology. Binning SL (BinSL) Topology. DDNO Topology (used in pFusion) (Network-Aware) i) Random Topology:  i) Random Topology a) Random Topology Each peer randomly connects to k other peers. This is the technique used in most systems (such as Gnutella v0.4) Advantages Simplicity. Needs only Local Knowledge. Leads to connected topologies if degree > logn Disadvantages Doesn’t take into account the underlying network A Random Graph of 100 nodes DDNO – Distributed Domain Name Order:  DDNO – Distributed Domain Name Order Motivation By our analysis of Gnutella (300,000 IPs) we found that 58% of the network belongs to only 20 ISPs DDNO – Distributed Domain Name Order:  DDNO – Distributed Domain Name Order Task: Locate nodes in the same domain without any global knowledge Solution Naïve Solution: Perform a Random Walk. Improved Search: Deploy a ZoneCache which tells a node towards which direction to move. ii) Short Long Topology:  ii) Short Long Topology b) Short-Long Topology [Infocomm’02] Build a Global latency adjacency matrix Each peer connects to k/2 closest peers (Short Links). It then connects to k/2 random peers (Long Links). The construction of the adjacency matrix requires global knowledge. (e.g. each peer pings its neighbors and sends this info to a centralized index) Note: By choosing only Short Links yields disconnected topologies A Short Graph of 1000 nodes iii) BinSL Topology:  iii) BinSL Topology c) Binning SL Topology (BinSL) Approximate Distances with Distributed Binning Each node calculates the RTT to k well-known landmarks. The numeric ordering of the landmarks defines the bin of a node. Furthermore latencies are divided into level ranges. e.g. Level0=[0,100)ms Level1=[100,200)ms , Level3=rest Each peer then connects to k/2 peers that have the same bin code. It then connects to k/2 random peers. [Infocomm’02] BinCode = Landmarks:Levels = l2l1l3:011 Well-chosen Landmarks landmarks BINSL Drawback: Depends on the Number and Quality of Landmarks:  BINSL Drawback: Depends on the Number and Quality of Landmarks ΔG : the sum of delays across all the edges of the 1000-node Overlay Graph The overlay latencies were taken from the AKAMAI CDN DDNO Advantages :  DDNO Advantages We perform a Query and measure the delay until the expected answer arrive. We observe that a DDNO network minimizes this delay for all search methods (BFS, RBFS, >RES and ISM) by 30% over RANDOM RANDOM GRAPH DDNO GRAPH ~30% Slide24:  Other Research Activities Distributed Top-K Query Processing :  Distributed Top-K Query Processing Problem Example Assume that we have a cluster of n=5 webservers. Each server maintains locally the same m=5 webpages. When a web page is accessed by a client, a server increases a local hit counter by one. Other Research I (in collaboration with IBM Almaden & AT&T Research and Univ. of Toronto) Distributed Top-K Query Processing :  Distributed Top-K Query Processing Problem Example (cont’d) TOP-1 Query: “Which Webpage has the highest number of hits across all servers (i.e. highest Score(oi) )?” Score(oi) can only be calculated if we combine the hit count from all 5 servers. Local score URL TOTAL SCORE Other Research I Distributed Top-K Query Processing :  Distributed Top-K Query Processing Publications: "The Threshold Join Algorithm for Top-k Queries in Distributed Sensor Networks", D. Zeinalipour-Yazti, Z. Vagena, D. Gunopulos, V. Kalogeraki, V. Tsotras, M. Vlachos, N. Koudas, D. Srivastava In ACM DMSN (VLDB'2005), Trondheim, Norway, pp. 61-66, 2005. "Data Acquision in Sensor Networks with Large Memories“, D. Zeinalipour-Yazti, S. Neema, D. Gunopulos, V. Kalogeraki and W. Najjar,, IEEE Intl. Workshop on Networking Meets Databases IEEE NetDB (ICDE'2005), Tokyo, Japan, 2005. "Distributed Spatio-Temporal Similarity Search" D. Zeinalipour-Yazti, S. Lin, D. Gunopulos, ACM 15th Conference on Information and Knowledge Management, (ACM CIKM 2006), November 6-11, Arlington, VA, USA, Other Research I FailRank: Failure Management in Grids:  FailRank: Failure Management in Grids Motivation: Studies have shown that a very large percentage of scheduled jobs fail on EGEE. Task: Design a Failure Management Module that will allow Grid Resource Brokers to divert jobs away from failing Grid Sites Core Idea: Continuously fuse together meta-information that is available for Job Queues trough monitoring services web sites, LDAP, etc. then rank this information. Other Research II FailRank: Failure Management in Grids:  FailRank: Failure Management in Grids Status: We collected a 4GB trace for 2,500 queues (1 month) from a variety of sources (SiteFunctionalTests, LDAP queries on Information Service, etc) We are currently analyzing this trace to quantify failures (i.e., associate grid status state with failures). Publications: "Managing failures in a Grid System using FailRank." D. Zeinalipour-Yazti, K. Neokleous, C. Georgiou, M.D. Dikaiakos, Technical Report TR-2006-04, Department of Computer Science, University of Cyprus, September 2006 "Monitoring and Ranking of Grid Failures using FailRank." D. Zeinalipour-Yazti, K. Neocleous, C. Georgiou, M.D. Dikaiakos, EGEE '06 Conference (poster session), Geneva, September 26, 2006. Other Research II ICGrid: Intensive Care Grid:  ICGrid: Intensive Care Grid ICGrid is a distributed platform that enables the integration, correlation and retrieval of clinically interesting episodes across Intensive Care Units Other Research III ICGrid: Intensive Care Grid:  ICGrid: Intensive Care Grid Demonstrations and Posters: Poster: "ICGrid: Intensive Care Grid." D. Zeinalipour-Yazti, M.D. Dikaiakos, M. Papa, T. Kyprianou, G. Panayi, EGEE '06 Conference (poster session), Geneva, September 26, 2006. Demo: "ICGrid: Intensive Care Grid." D. Zeinalipour-Yazti, H. Gjermundrod, M. D. Dikaiakos, G. Panayi and Th. Kyprianou, CoreGRID Industrial Confernece (part of the [email protected] series of events), Sophia-Antipolis, Nov. 30-Dec. 1, 2006 (best demonstration award). Invited Demo: "ICGrid: Intensive Care Grid." D. Zeinalipour-Yazti, H. Gjermundrod, M. D. Dikaiakos, G. Panayi and Th. Kyprianou, IST Conference, Helsinki, Finland, Nov. 17, 2006. Other Research III Data Management in Sensor Networks:  Data Management in Sensor Networks Motivation: Transmitting over the radio is the most expensive function in a Wireless Sensor Network. Core Idea (The In-Situ Storage Paradigm): Store the data locally at each sensor Access this information with On-demand Queries rather than continuously. Main Contribution The RiversideSensor System Provides index structures for giga-scale storage and retrieval in sensor systems Other Research IV (in collaboration with University of California, Riverside) Data Management in Sensor Networks:  Data Management in Sensor Networks We designed the Microhash Index Structure that enables efficient access to values stored on Flash Media (the most prevalent storage media of sensor systems). Publications: MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices", D. Zeinalipour-Yazti, S. Lin, V. Kalogeraki, D. Gunopulos and W. Najjar, The 4th USENIX Conference on File and Storage Technologies (FAST’05), 2005. "Efficient Indexing Data Structures for Flash-Based Sensor Devices", S. Lin, D. Zeinalipour-Yazti, V. Kalogeraki, D. Gunopulos, W. Najjar ACM Transactions on Storage (TOS), ACM Press, In Press, 2006. Other Research IV Content-Based Search in Internet-Scale Peer-to-Peer Systems:  Content-Based Search in Internet-Scale Peer-to-Peer Systems by Demetris Zeinalipour Visiting Lecturer Department of Computer Science University of Cyprus Department of Computer Science - University of Cyprus Thursday, December 28th, 2006 Royal Institute of Technology (KTH) Stockholm, Sweden http://www.cs.ucy.ac.cy/~dzeina/ Thank you! Happy New Year! TJA Step 1 (LB Phase):  TJA Step 1 (LB Phase) Each node sends its top-k results to its parent. Each intermediate node performs a union of all received lists (denoted as τ): Query: TOP-1 TJA Step 1 (HJ Phase):  TJA Step 1 (HJ Phase) Disseminate τ to all nodes Each node sends back everything with score above all objectIDs in τ. Before sending the objects, each node tags as incomplete scores that could not be computed exactly (upper bound) } Complete Incomplete TJA Step 1 (CL Phase):  TJA Step 1 (CL Phase) Have we found K objects with a complete score? Yes: The answer has been found! No: Find the complete score for each incomplete object (all in a single batch phase) CL ensures correctness! This phase is rarely required in practice.

Related presentations


Other presentations created by Tommaso

GPS Introduction
04. 02. 2008
0 views

GPS Introduction

san francisco california
21. 03. 2008
0 views

san francisco california

OSH Act and RA
18. 01. 2008
0 views

OSH Act and RA

03 Learning and Memory
14. 01. 2008
0 views

03 Learning and Memory

ceramics
09. 01. 2008
0 views

ceramics

the Glass Ceiling
09. 01. 2008
0 views

the Glass Ceiling

WEEKLY VERSION The Marble Champ
10. 01. 2008
0 views

WEEKLY VERSION The Marble Champ

si redesign san diego
10. 01. 2008
0 views

si redesign san diego

Hawiian Religion V1 S Shintonism
10. 01. 2008
0 views

Hawiian Religion V1 S Shintonism

ProstateCancer
14. 01. 2008
0 views

ProstateCancer

Relocatenursehomeres 2007
11. 01. 2008
0 views

Relocatenursehomeres 2007

solarsystem
16. 01. 2008
0 views

solarsystem

chapter1 Strategies for Success
19. 01. 2008
0 views

chapter1 Strategies for Success

Lecture 10
21. 01. 2008
0 views

Lecture 10

waste reduction
22. 01. 2008
0 views

waste reduction

Lurie 2
09. 01. 2008
0 views

Lurie 2

GSCI Strategic June 2004
23. 01. 2008
0 views

GSCI Strategic June 2004

Managing Facilitating Goods
23. 01. 2008
0 views

Managing Facilitating Goods

Northwest Coast
23. 01. 2008
0 views

Northwest Coast

Steps to Producing Cotton
24. 01. 2008
0 views

Steps to Producing Cotton

Ydstie
24. 01. 2008
0 views

Ydstie

Lifelinks Core ppt
04. 02. 2008
0 views

Lifelinks Core ppt

Hammerly
05. 02. 2008
0 views

Hammerly

Edu FullStory0308
05. 02. 2008
0 views

Edu FullStory0308

2ndReview
25. 01. 2008
0 views

2ndReview

AC028
28. 01. 2008
0 views

AC028

ForumPresentation
07. 02. 2008
0 views

ForumPresentation

poetry terms 2007
07. 02. 2008
0 views

poetry terms 2007

CPB729 LEC
12. 02. 2008
0 views

CPB729 LEC

Charity Walk 2006
15. 02. 2008
0 views

Charity Walk 2006

ISV Hosting Solution for SaaS
21. 02. 2008
0 views

ISV Hosting Solution for SaaS

AllCarSeats
22. 02. 2008
0 views

AllCarSeats

Disorder f
17. 01. 2008
0 views

Disorder f

Fever
29. 02. 2008
0 views

Fever

13 megacz2a
03. 03. 2008
0 views

13 megacz2a

Ramasetu20July2007
08. 03. 2008
0 views

Ramasetu20July2007

Phaedo1Spring
29. 01. 2008
0 views

Phaedo1Spring

9432513
15. 03. 2008
0 views

9432513

timdye
16. 03. 2008
0 views

timdye

3 John Jackson
31. 03. 2008
0 views

3 John Jackson

PresentacionPeopleSo ft
03. 04. 2008
0 views

PresentacionPeopleSo ft

Progetto Cono Sud
28. 03. 2008
0 views

Progetto Cono Sud

ch16 sec3 5792618
15. 04. 2008
0 views

ch16 sec3 5792618

csb13 gr
16. 04. 2008
0 views

csb13 gr

overdrive
17. 04. 2008
0 views

overdrive

COS all purpose 07
05. 02. 2008
0 views

COS all purpose 07

B3U4A
18. 04. 2008
0 views

B3U4A

Social Responsibility
21. 04. 2008
0 views

Social Responsibility

IIPEC ACC Lakheri 7
11. 02. 2008
0 views

IIPEC ACC Lakheri 7

fmrel
10. 03. 2008
0 views

fmrel

trosow
07. 05. 2008
0 views

trosow

World Dairy Summit
08. 05. 2008
0 views

World Dairy Summit

dw ccn bren
02. 05. 2008
0 views

dw ccn bren

CropLife
14. 02. 2008
0 views

CropLife

ctcnet2001
13. 01. 2008
0 views

ctcnet2001

Prospectus version5 october2006
14. 03. 2008
0 views

Prospectus version5 october2006

WED am Isaac 5 13
13. 02. 2008
0 views

WED am Isaac 5 13

Peanut Program
25. 01. 2008
0 views

Peanut Program

special waste 05
04. 02. 2008
0 views

special waste 05

InternetCareerResear ch
23. 01. 2008
0 views

InternetCareerResear ch

DOEPPDGScottVisit hbn052202s
25. 03. 2008
0 views

DOEPPDGScottVisit hbn052202s

Courtney Slide Show
14. 02. 2008
0 views

Courtney Slide Show

APEuro
22. 01. 2008
0 views

APEuro

PRESENT91
17. 01. 2008
0 views

PRESENT91

LaserLink July 2005
27. 03. 2008
0 views

LaserLink July 2005

Previsioni Media Sociali 2009
30. 12. 2008
0 views

Previsioni Media Sociali 2009

2 HGIStandards
08. 04. 2008
0 views

2 HGIStandards

Redistricting 1
11. 01. 2008
0 views

Redistricting 1

TWC G9 Team3 TheRightOne
29. 01. 2008
0 views

TWC G9 Team3 TheRightOne

The United Cities 7 18 06
30. 01. 2008
0 views

The United Cities 7 18 06

Antolini Galimberti Valsecchi
24. 01. 2008
0 views

Antolini Galimberti Valsecchi

UltimateSportsAdvent ure Full
11. 03. 2008
0 views

UltimateSportsAdvent ure Full

Understandings for Tragedy
12. 02. 2008
0 views

Understandings for Tragedy

RMC Group
11. 01. 2008
0 views

RMC Group