2006 09 13DeepakAgarwal

Information about 2006 09 13DeepakAgarwal

Published on October 23, 2007

Author: Tarzen

Source: authorstream.com

Content

On Mining Massive Dynamic Data :  On Mining Massive Dynamic Data Deepak Agarwal Yahoo! Research SF Bay Area Chapter, ACM 13th September, 2006 Background:  Background PhD in statistics, university of Connecticut ’01 Multi-level hierarchical Bayesian models to study deforestation in Madagascar. Working with massive data in GIS got me interested in Data Mining Joined Statistics and Data Mining at AT&T Massive Dynamic networks; monitoring massive streams. Intrigued by internet advertising; joined Y! in 2006 Yahoo! Research:  Yahoo! Research Head: Prabhakar Raghavan; 2005 Search Economics Machine Learning Statistics and Data Mining http://research.yahoo.com/ Context:  Context Massive amounts of data Internet wave, telecommunications; pose new statistical challenges Computer science →progress in managing data, computing summary statistics efficiently Dynamic nature of data Ubiquitous, methods for static data not optimal Methods for time series, point processes germane Challenge: building scalable methods Focus on three problems:  Focus on three problems Estimating “delta” through time Monitoring massive streams Mining massive dynamic social networks Effective but lossy representation Sequential sampling for learning optimize the explore-exploit tradeoff Different from fixed sample size design Other research areas:  Other research areas Detecting “hotspots” in massive spatial data Scan statistics (SODA ’06, KDD ’06) Forecasting long-term and short-term Applications at Yahoo! Data squashing Scaling down data to facilitate statistical modeling (KDD ’03) Hierarchical Bayesian modeling Shrinkage estimation for massive data (KDD ’02; SDM ’04; ICDM ’05) Problem 1: Monitoring massive streams:  Problem 1: Monitoring massive streams Estimating “delta” through time; several apps. Network monitoring: Traffic volume in SNMP etc. Bio-surveillance : Emergency room data; events on websites spam detection: e-mail spam; web spam etc. Business intelligence: traffic pattern to customer care centers; augments usual dashboard type statistics Challenges:  Challenges Estimate accurate baseline model Change detection with good sensitivity/specificity Adjusting for multiple testing; global features Adaptive procedure; easy to update Incorporating correlations among series Application:  Application Question: Can we detect social disruption events in China before they get reported in the mainstream media? Our Answer: Probably yes, if we knew what data to use Slide10:  Word patterns Slide11:  We would like to thank Simon Urbanek for providing the plot How did we get the patterns?:  How did we get the patterns? Patterns: emerged from retrospective analysis of biological events (West Nile, SARS); foreseen as potential indicators and warnings of social disruption Direct indicators (e.g, news reports of outbreaks) Indirect indicators (school and factory closings etc.) Patterns selected manually by experts. Contingency table obtained daily: Websites (about 40) x patterns (about 25). Data Collection, Reporting.:  Data Collection, Reporting. Crawler Crawl a max of 1K pages per website Parser Each webpage parsed into sentences by a parser Index Converted to UTF-8 and indexed incrementally, lucene empowered indexing and search software Anomalies reported in a newsletter form on a web portal every morning. Slide15:  Number of crawled pages show variation, monitor rate of occurrence per page for each pattern. Notations and transformation.:  Notations and transformation. 40 days of initial data:  40 days of initial data No short term seasonal effects in the rates:  No short term seasonal effects in the rates Baseline model: State space approach:  Baseline model: State space approach QQ-plots of standardized residuals to test the conditional independence assumption in the observation equation of the baseline model.:  QQ-plots of standardized residuals to test the conditional independence assumption in the observation equation of the baseline model. State Equation, model update:  State Equation, model update Estimating Variance components.:  Estimating Variance components. Estimating “forgetting” factors:  Estimating “forgetting” factors Detecting anomalies, intervention strategy.:  Detecting anomalies, intervention strategy. Q-charting, monitor the EWMA of normal scores of p-values (Liu and Lambert, 2006). CUSUM based approach using Bayes factor (West and Harrison, 1997; Gargallo and Salvador,2003) In this application: Only detected spikes Other Issues:  Other Issues What if a spike/drop is detected? Don’t use the data point in model updating, increase the prior variance by a factor of c (c=9 has worked well for this application) Missing data: Occurs when we download very few (or no pages). Draw an observation randomly from the predictive distribution and proceed as usual. Deleting uninformative series, adding new ones: Delete a series if 95th percentile < 1/1000. Multiple testing: Bayesian Approach.:  Multiple testing: Bayesian Approach. Monitoring large number of independent streams need to correct for multiple testing Main idea: Derive empirical null based on observed deviations Flag interesting cases after adjusting for global characteristics in the system Bayesian approach: shrink residuals Shrinkage automatically builds in penalty for conducting multiple tests (Scott and Berger, 2003) Bayesian procedure.:  Bayesian procedure. Estimating hyperparameters:  Estimating hyperparameters Large number of time series Approximate likelihood: data squashing likelihood approximated by weighted likelihood MAP estimate used as plug-in Moderate number of time series Fully Bayesian Inference using Gauss-Hermite Quadrature Slide29:  Distribution of normal scores on a randomly selected day Putting it all together:  Putting it all together Build a baseline model, any model that provides a p-value of observed relative to predictive is appropriate. The state space approach provides a rich class. Declare anomalies after adjusting for multiple testing. We use a Bayesian procedure but other approaches like FDR may be used Delete time series that are uninformative (based on a user defined criteria). Add new series to the monitoring process. For missing data, draw an observation randomly from the predictive distribution. When an anomaly occurs, make appropriate adjustments to maintain the correct variance. Update the baseline distribution with arrival of new data. The updating process should be quick for large applications. Dotted solid lines: Days when reports appeared in mainstream media Dotted gray lines: Days when our system found spikes related to the reports that appeared later.:  Dotted solid lines: Days when reports appeared in mainstream media Dotted gray lines: Days when our system found spikes related to the reports that appeared later. Rough validation using actual media reports.:  Rough validation using actual media reports. July 24th : mystery illness kills 17 people in China, we noticed several spikes on July 17th and 18th Sept 29th and Dec 7th : On Sept 29th , news reports of China carving out emergency plans to fight bird flu and prevent it from spreading to humans. On Dec 7th , a confirmed case of bird flu in humans reported. We reported several spikes on Sept 12th and 14th, Nov 2nd, 7th, 11th, and 16th mostly for the pattern influenza, flu, pneumonia, meningitis. On Nov 21st , four big spikes on bf3.syd.com.cn on influenza, flu, pneumonia, meningitis; emergency, disaster, crisis; prevention and quarantine. Ongoing work:  Ongoing work Monitoring hierarchical streams Applications at Yahoo! Correlation structure induced partly by hierarchy Concise description of anomalies ICDM ’05: for contingency tables ICDE ’06: for hierarchical data; under submission Other applications:  Other applications Monitoring emergency room visits by symptom and location (JSM, 2005). Monitoring calls to customer care centers to augment usual slice and dice dashboard statistics E.g. there was a 10% increase in Hang-ups for calls from Maryland (ICDM, 2005) Relevant articles:  Relevant articles Agarwal, D., Feng, J. , Torres, V. (2006) “Monitoring massive streams simultaneously: A holistic Approach”, Interface (refereed section) Agarwal, D. (2005) “Empirical Bayes Approach to Detect Anomalies in Dynamic Multidimensional Arrays”, ICDM, Houston Agarwal, D., DuMouchel, W , Goodall, G (2005) “KFC: A Kalman filtering appraoch to detect anomalies in Massive contingency tables”, JSM, Minneapolis Problem 2: Building Efficient Representation for Massive Dynamic Networks:  Problem 2: Building Efficient Representation for Massive Dynamic Networks Context:  Context Transactional data: Time stamped records of interaction between pairs of entities, e.g., telephone calls, credit card purchases, e-mail exchanges, hyperlinks etc. Gives rise to a dynamic graph, nodes represent transactors, edges represent interactions over time. Goal: Find a lossy but efficient representation No unique solution, depends on objective Our application:  Our application Directed graph: phone calls on AT&T’s network. Massive : millions of nodes and edges Dynamic: lose nodes and edges, get new ones Heterogeneous: biz,res,cell,800 etc. Sparse: the 80/20 rule; power law Incomplete: don’t see all calls, miss calls originating in competitors’ network, cell calls, local calls etc Applications:  Applications Fraud detection ( fraudsters compromising 800 numbers, international numbers etc.) Marketing (viral marketing: market to people with strong network influence) Repetitive debtors (catching subscription fraud) Key observations: Analysis at local transactor level useful; global not needed Facilitate near-real time applications Representation:  Representation Create local graph centered around each node captures interaction with the rest of the network Approximation: Graph at time t union of local sub-graphs Capturing dynamics of graph over time (Cortes et al) Smoothing each local subgraph over time Smoothed local subgraph around node X: Community of interest (COI) of X. Smoothing :  Smoothing Based only on yesterday’s data? Too narrow Union of all time periods? Too broad A moving average of the t most recent time periods? Better but does not capture slow drifts well, logistically difficult Exponentially weighted moving average (EWMA): G(t)=θG(t-1)+(1- θ)g(t) Gives more weight to recent data Easy to maintain and update Weighting past calls: choosing the forgetting factor:  Weighting past calls: choosing the forgetting factor Calls fade out over time; The larger θ is , the longer the call has non-negligible weight Selecting θ: Standard problem in time series: Derive estimates from Kalman filter or ARIMA (0,1,1) But what’s our loss function here? Graph provided by Chris Volinsky Reducing redundancy :  Reducing redundancy Only a small fraction have high degrees Introduce a parameter k (positive integer) COI of X is smoothed subgraph centered around X Top k called by X + other Top k calling X + other Weights on the edges are those derived from EWMA. Still not done: this will lead to gain in more and more edges over time: introduce a third parameter ε such that an edge below this threshold gets dropped. Helps with noise reduction; storage savings. COI of X :  COI of X X Other outbound Other inbound How to select parameters?:  How to select parameters? Select L pre and post periods, maximize an average similarity measure Similarity functions:  Similarity functions p2 pother Pre-period Post-period p1 TN1 TN2 TN1 TN2 Slide49:  Hellinger Wdice Selecting theta and k Selecting epsilon.:  Selecting epsilon. Repetitive Debtor:  Repetitive Debtor Thomas Hanley 62 Rio Robles, San Jose CA …………… Disconnected; non-payment ……………… Deborah Hanley 62 Rio Robles, San Jose CA …………… ……………… ?? Key: Calling patterns don’t change Compare new connections with fraudulent numbers using a similarity function. Validation with labels from experts:  Validation with labels from experts Enhancing COI: My friends’ friends’:  Enhancing COI: My friends’ friends’ Impute calls not seen on the network exploiting social structure Other issues Quantify social characteristics like tendency to call; tendency to receive calls; tendency to return calls for each node. Extended COI :  Extended COI X other other X x d=0 d=1 d=2 d=3 Approach:  Approach Extended COI -> social network representing the node’s interaction with the network Developed a rich class of statistical models to do inference. Parameters:  Parameters Node i: αi: expansiveness (tendency to call) βi: attractiveness (tendency of being called) Global parameters: θ: density of COI (reduces with increasing sparseness) ρ: reciprocity of COI (tendency to return calls) λs: “caller” specific effect λr: “cal lee” specific effect γ: “call” specific effect Slide57:  COI : {(wij,wji): i < j} Covariates: Σαβ density Coefficients for edge covariates Hyperparameters k wij wji tij tij Imputing tij’s Example:  Example COI with 117 nodes, 172 edges. 14 missing edges, local calls from14 non ATT local customers to seed node . One node attribute:biz/cell/res gives rise to 9 edge attributes. cell->biz, cell->cell, cell->res collapsed into one block. M1: uniform reciprocity, M2: differential reciprocity Latter gives better fit, edge covariates were statistically significant. Results in Table 2 of paper. Relevant Papers:  Relevant Papers S.Hill, D.Agarwal, R.Bell, C.Volinsky (2006) “Building an Effective Representation of Dynamic Networks”, Journal of Computational and Graphical Statistics(to appear) C.Cortes, D.Pregibon, C. Volinsky (2003) “Computational Methods for Dynamic Graphs”, Journal of Computational and Graphical Statistics, 12, 950-970 D.Agarwal, D. Pregibon (2004) “Enhancing Communities of Interest using Bayesian Stochastic Blockmodels”, Siam Data Mining Conference, Orlando

Related presentations


Other presentations created by Tarzen

FastFood
24. 02. 2008
0 views

FastFood

Money Banking
14. 04. 2008
0 views

Money Banking

Nature nurture
04. 09. 2007
0 views

Nature nurture

tolkein presentation
04. 09. 2007
0 views

tolkein presentation

chapter08
16. 06. 2007
0 views

chapter08

Data Storage 1
16. 06. 2007
0 views

Data Storage 1

Best Practices
17. 04. 2008
0 views

Best Practices

Community Workshop Seminar
17. 04. 2008
0 views

Community Workshop Seminar

mohamed ariff and greg lopez
13. 04. 2008
0 views

mohamed ariff and greg lopez

20011105e1
10. 04. 2008
0 views

20011105e1

s2 dewulf
09. 04. 2008
0 views

s2 dewulf

AAAS
07. 04. 2008
0 views

AAAS

D200510 06 GermanETRforJapan
30. 03. 2008
0 views

D200510 06 GermanETRforJapan

sez10
28. 03. 2008
0 views

sez10

SouthAfrica
21. 09. 2007
0 views

SouthAfrica

Scanning Tunneling Microscope
21. 09. 2007
0 views

Scanning Tunneling Microscope

chem lab bonding 05
12. 10. 2007
0 views

chem lab bonding 05

ClassifyAnimals
12. 10. 2007
0 views

ClassifyAnimals

desc fashion en
18. 10. 2007
0 views

desc fashion en

36 Isaiah and the Mountains
04. 09. 2007
0 views

36 Isaiah and the Mountains

Lecture 6
05. 10. 2007
0 views

Lecture 6

Berlin Conference
26. 10. 2007
0 views

Berlin Conference

Vietnam
07. 12. 2007
0 views

Vietnam

mobisys tutorial hardware
29. 10. 2007
0 views

mobisys tutorial hardware

ne tutorial
01. 11. 2007
0 views

ne tutorial

Koranteng MFRD
04. 09. 2007
0 views

Koranteng MFRD

korea
07. 11. 2007
0 views

korea

Salma Hayek
12. 11. 2007
0 views

Salma Hayek

global stanton
22. 10. 2007
0 views

global stanton

Success risk factor
16. 11. 2007
0 views

Success risk factor

0738001
19. 11. 2007
0 views

0738001

Funny Not So
18. 08. 2007
0 views

Funny Not So

earthhistory
03. 10. 2007
0 views

earthhistory

Justice and Equality
14. 12. 2007
0 views

Justice and Equality

peda2
22. 10. 2007
0 views

peda2

Gavin cholera risk prediction
04. 09. 2007
0 views

Gavin cholera risk prediction

Oak Hill2
04. 01. 2008
0 views

Oak Hill2

informedegestion2005
22. 10. 2007
0 views

informedegestion2005

ClosingSession2005
07. 01. 2008
0 views

ClosingSession2005

vogel games
07. 11. 2007
0 views

vogel games

Obesity and Risk Factor
07. 08. 2007
0 views

Obesity and Risk Factor

Linda Simoni Wastila
07. 08. 2007
0 views

Linda Simoni Wastila

mbti students intro only team
07. 08. 2007
0 views

mbti students intro only team

Module14
07. 08. 2007
0 views

Module14

mmo 42
07. 08. 2007
0 views

mmo 42

Localization Week 2
27. 11. 2007
0 views

Localization Week 2

Clinica
28. 12. 2007
0 views

Clinica

GMV DP Wizard
21. 09. 2007
0 views

GMV DP Wizard

Milner China IndiaGEP
16. 10. 2007
0 views

Milner China IndiaGEP

Lecture 31 Power Point
28. 12. 2007
0 views

Lecture 31 Power Point

sessionr 9 12 13 1107
24. 02. 2008
0 views

sessionr 9 12 13 1107

Making Your Home Safer
26. 02. 2008
0 views

Making Your Home Safer

contextualized learning
28. 02. 2008
0 views

contextualized learning

disarmament
04. 03. 2008
0 views

disarmament

ogot
07. 08. 2007
0 views

ogot

installing globus
19. 06. 2007
0 views

installing globus

JPW Talk AHM2005 MAIN
19. 06. 2007
0 views

JPW Talk AHM2005 MAIN

jensen rockwell seesymp02
19. 06. 2007
0 views

jensen rockwell seesymp02

ipv6 workshop plzak 28oct02
19. 06. 2007
0 views

ipv6 workshop plzak 28oct02

ipfrr kvalbein
19. 06. 2007
0 views

ipfrr kvalbein

Clast Mind L M 2210
19. 06. 2007
0 views

Clast Mind L M 2210

brochure 2
19. 06. 2007
0 views

brochure 2

Brandmeldeanlage
19. 06. 2007
0 views

Brandmeldeanlage

birnecker cms
19. 06. 2007
0 views

birnecker cms

belleville ppthistory
19. 06. 2007
0 views

belleville ppthistory

belleville ppt2K
19. 06. 2007
0 views

belleville ppt2K

launchevent final fnr
15. 11. 2007
0 views

launchevent final fnr

IETF 68 P2PSIP Agenda
19. 06. 2007
0 views

IETF 68 P2PSIP Agenda

MSM Slides2005
07. 08. 2007
0 views

MSM Slides2005

lacnic V APNIC Report
19. 06. 2007
0 views

lacnic V APNIC Report

learning style roles 2007 05 15
07. 08. 2007
0 views

learning style roles 2007 05 15

icse
19. 06. 2007
0 views

icse

Global Johnny Appleseed Project
02. 11. 2007
0 views

Global Johnny Appleseed Project

belleville flash
19. 06. 2007
0 views

belleville flash

NaTang Inverse
07. 08. 2007
0 views

NaTang Inverse

Ramos Lab Tour
04. 10. 2007
0 views

Ramos Lab Tour

interims 2003b
19. 06. 2007
0 views

interims 2003b

kao tai BASS 2004
07. 08. 2007
0 views

kao tai BASS 2004

Rhew02
03. 01. 2008
0 views

Rhew02

Dale Tongue
16. 06. 2007
0 views

Dale Tongue

CS598STK Terra
16. 06. 2007
0 views

CS598STK Terra

Computer Essentials
16. 06. 2007
0 views

Computer Essentials

Coming SL 7 Overview
16. 06. 2007
0 views

Coming SL 7 Overview

CLI220 Mark Minasi
16. 06. 2007
0 views

CLI220 Mark Minasi

bpm keynote
16. 06. 2007
0 views

bpm keynote

beyond soa
16. 06. 2007
0 views

beyond soa

Intro OO Neuf
19. 06. 2007
0 views

Intro OO Neuf

PBOCFrance051219alsd1
23. 11. 2007
0 views

PBOCFrance051219alsd1

MOS384 Sel 07 test
07. 08. 2007
0 views

MOS384 Sel 07 test

NT 512B 2
07. 08. 2007
0 views

NT 512B 2

CP1 Commo
16. 06. 2007
0 views

CP1 Commo

Mark Carpenter
07. 08. 2007
0 views

Mark Carpenter

ma chap8
07. 08. 2007
0 views

ma chap8

Basic2
03. 10. 2007
0 views

Basic2

IPAM concluding workshop
19. 06. 2007
0 views

IPAM concluding workshop

michael cantor
07. 08. 2007
0 views

michael cantor

ietf 63 enum validation v01
19. 06. 2007
0 views

ietf 63 enum validation v01

KSU Speech1
07. 08. 2007
0 views

KSU Speech1

Mumulus Presentation
07. 08. 2007
0 views

Mumulus Presentation

bethel
19. 06. 2007
0 views

bethel

Mgt 201 Personality Differences
07. 08. 2007
0 views

Mgt 201 Personality Differences

ISSUES presentation
19. 06. 2007
0 views

ISSUES presentation

internet sec
19. 06. 2007
0 views

internet sec

Sathaye
24. 11. 2007
0 views

Sathaye

kline
07. 08. 2007
0 views

kline

wg2 t6
23. 10. 2007
0 views

wg2 t6

KNIT
20. 11. 2007
0 views

KNIT

Multicam2
07. 01. 2008
0 views

Multicam2

bilingual poll march2004
19. 06. 2007
0 views

bilingual poll march2004

NASC1 Lucken
07. 08. 2007
0 views

NASC1 Lucken

Corel Corporation
16. 06. 2007
0 views

Corel Corporation