Published on September 25, 2014
Mobile Subscriber Fingerprinting: A Big Data Approach

Jobin Wilson
R&D, Flytxt Technology Pvt. Ltd.
Trivandrum, India
jobin.wilson@flytxt.com

Vikram Garg
R&D, Flytxt Technology Pvt. Ltd.
Trivandrum, India
vikram.garg@flytxt.com

Abstract- Mobile advertising campaigns must use subscriber data to target subscribers relevant to the product or service being recommended. However, the matching of recommendations to subscribers is most often inexact, and no straightforward "attribute-value" matching algorithm suffices. Arbitrary matching degrades subscriber experience and results in lower conversion rates. One viable solution is to cluster subscribers by common attributes and then match recommendations with the same attributes to subscribers in the associated cluster. This approach presents a few significant challenges for the data architecture: the clusters are extremely large and need not have exact boundaries, and their representation must facilitate real-time recommendations. We describe here a novel architecture which we call subscriber "fingerprinting". The proposed system is capable of analyzing extremely large volumes of data, on the order of terabytes, using large-scale distributed Extract-Transform-Load (ETL) operations followed by distributed data analytics involving statistical models. The insights generated from this process can be used to serve personalized recommendations in real time. The proposed system uses big data analytics built over a Hadoop ecosystem and leverages a private cloud infrastructure for deployment. The system also includes a simple, secure, and lightweight integration API using REST protocols.

Keywords- recommendation systems; big data analytics; distributed computing; cloud computing; content recommendations; service personalization; mobile advertisements

I. INTRODUCTION

Mobile telecommunication service providers are in a unique position to provide their subscribers quality of service in the form of accurate service personalization, customized personal service recommendations, and contextual advertisements. The service provider may also want to predict and influence subscriber behavior, for example in churn prediction and management. This requires the operator to maintain a 360-degree view of the subscriber, including his/her behavior, preferences, and service usage patterns, which need to be mined and learned by the system automatically [1], [2], while ensuring strict privacy of subscribers. Rich, noise-free metadata in the form of transaction logs, generated from a subscriber's interaction with a telecom network, is an apt source of input for this process. Further, mobile devices provide multiple modalities of real-time content delivery, such as SMS, MMS, outbound calls, and the mobile internet, making mobile phones a promising advertisement channel. PC-based internet usage, in contrast, is inherently hard to link to the subscriber identity. It therefore provides limited and noisy metadata for personalization and has only limited modalities for real-time advertisement delivery, making it a less preferred channel than mobile for the advertiser [9].

The subscriber fingerprinting model that we present here assists the mobile service provider in personalization. It is used in the following fashion:

a) Classifying subscribers and generating a holistic view of them automatically in near-real time.
b) Providing a secure application programming interface to target micro segments as well as for seamless integration with third-party systems.
c) Allowing marketers the flexibility to define the classification criteria and market segments that they are interested in, based on their specific requirements.
d) Providing service personalization and recommendations based on the subscriber's preferences and behavior.
e) Ensuring strict privacy for the subscribers.

The work presented here is based on a live system that we have developed for one of the major telecom service providers. The presented system addresses privacy concerns by using a non-invertible hashing-based approach to generate the insights. These insights cannot be used to deduce the raw facts that resulted in a specific classification of any subscriber. Because real-world data is confidential, we have masked identities wherever necessary.

The paper is organized as follows. Section II describes the proposed solution. Section III presents a live example of our system in use. Section IV presents our conclusions.

II. OUR SOLUTION

We propose a system which provides near-real-time subscriber intelligence and service personalization to any Touch-point¹ System. Using continuously updating streams of data from the service provider's network and business infrastructure, our system maintains a real-time unified profile for each subscriber. This profile consists of both the static information regarding the subscriber and learned insights such as his/her socio-economic profile, data usage pattern, personas, and general preferences. The proposed system consists of the components shown in Figure 1.
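As a rough illustration of the unified profile described above, the following Python sketch keys each profile by a one-way keyed hash of the subscriber number and keeps static information separate from learned insights. The paper does not specify its hashing scheme or data layout, so the use of HMAC-SHA256, the key `SECRET_KEY`, the sample subscriber number, and the names `pseudonymize`, `SubscriberProfile`, and `record_insight` are all assumptions made for illustration only.

```python
import hashlib
import hmac
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical operator-held secret; not taken from the paper.
SECRET_KEY = b"example-secret-key"


def pseudonymize(msisdn: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed one-way hash (HMAC-SHA256) of a subscriber number."""
    return hmac.new(key, msisdn.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass
class SubscriberProfile:
    """Unified profile: static facts plus learned insights (assumed layout)."""
    subscriber_key: str
    static_info: Dict[str, Any] = field(default_factory=dict)
    insights: Dict[str, Any] = field(default_factory=dict)


# In-memory stand-in for the profile store, keyed by the hashed identifier.
profiles: Dict[str, SubscriberProfile] = {}


def record_insight(msisdn: str, name: str, value: Any) -> SubscriberProfile:
    """Create or update a profile without storing the raw subscriber number."""
    key = pseudonymize(msisdn)
    profile = profiles.setdefault(key, SubscriberProfile(subscriber_key=key))
    profile.insights[name] = value
    return profile


# Made-up subscriber number and insight values.
profile = record_insight("919800000001", "data_usage_pattern", "heavy")
record_insight("919800000001", "persona", "night-time caller")
print(profile.subscriber_key[:12], profile.insights)
```

Because the same number always hashes to the same key, successive insight updates land on one profile, while the stored key alone does not reveal the number to anyone who lacks the secret.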
Figure 1. Product Architecture

A. Data Stream Uploader Engine (DSUE)

The uploader engine connects to live data streams, pulls raw data files over FTP/SFTP, and stores the relevant data in the distributed data warehouse. This task can be time-consuming since the files can be large, in the range of a few hundred gigabytes. The DSUE is a distributed engine with a pool of worker nodes running on a private cloud infrastructure [4], [10]. The master identifies a free worker node at the time of a data upload and assigns the task to it. In case of failure, the task is reassigned to the next free worker node. Apache Hive [6] serves as the distributed data warehouse where we run data warehousing workloads during insight generation.

B. Continuous Insight Engine (CIE)

The CIE is the intelligent component of the system, which generates meaningful insights about the subscriber in near-real time. It consists of models that continuously analyze data using a massively distributed cluster of nodes deployed over a private cloud infrastructure. The CIE is massively scalable and flexible enough to adapt to changing file formats and new requirements. Models are triggered by a data-driven framework, which means that a model runs only when a fresh data set is made available to it.

An important aspect of the CIE is managing job schedules and the actions within a job. We use the concept of job workflows to generate subscriber insights. A workflow is a directed acyclic graph of actions, which are executed over a Hadoop [3] cluster as one or more map-reduce jobs. Each action performs a task such as data pre-processing and validation, advanced analytics and statistical modeling, or persistence of insights into a data store. Yahoo Oozie [7] is used as the workflow and scheduling engine.

1) Canonical Models
These models are basically content-based filtering models.
The insights generated using these models are deterministic in nature and require rule-based calculations.

2) Custom Models: Context-based clustering
These are more complex content-based filtering models. The insights generated using these models are probabilistic in nature, and we use an unsupervised machine learning algorithm to generate them. After analyzing the telecom Call Data Records (CDR), we introduced an unsupervised Gaussian Mixture Model (GMM) technique as one of our custom models to find the natural clusters present in the telecom data. A MapReduce-based multivariate GMM was designed and implemented over Hadoop, along the lines of the Mahout libraries [5], to address the problem of scalability. The feature vector comprises subscribers' call and data usage parameters, such as the average monthly revenue generated by the user, the average number of SMSs per day, minutes of usage per day or night, the amount of GPRS usage, etc. The soft clusters thus generated allow a user to be mapped into a specific market segment along with a confidence measure. This insight can be leveraged to provide personalized recommendations and campaigns.

2.1) Map-Reduce paradigm for Gaussian Mixture Model
A GMM is generally learned using the expectation-maximization (EM) algorithm [5], [11]. We found that the expectation and maximization steps in this process can be mapped directly onto the map and reduce phases, respectively, of the Map-Reduce paradigm. Assume we have a data set $x_1, \dots, x_N$ consisting of $N$ observations of a random $D$-dimensional variable $x$. The goal is to maximize the likelihood function with respect to the parameters of the GMM.

1. Initialize the means $\mu_k$, covariance matrices $\Sigma_k$, and mixing coefficients $\pi_k$ for $k = 1, \dots, K$, and evaluate the initial value of the log likelihood. The means act as the keys and the data points act as the values in MapReduce's key/value framework.

2. Expectation/Map step: Evaluate the posterior probabilities using the current parameters, as defined below,

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \tag{2.1}$$

where $\mathcal{N}$ is a multivariate Gaussian pdf.

3. Maximization/Reduce step: Re-estimate the parameters using the current posterior probabilities, using the equations given below,

$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \tag{2.2}$$

$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{T} \tag{2.3}$$

$$\pi_k^{\text{new}} = \frac{N_k}{N} \tag{2.4}$$

where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.

4. Evaluate the log likelihood (2.5) and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

$$\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{2.5}$$

¹ Touch-points: Systems with which a subscriber interacts, e.g. the WAP portal, self-care portal, IVR systems, etc.

Figure 2. An example GMM over monthly revenue from subscribers

C. Tag Store

A tag is a concise piece of information consisting of attributes such as a name, a value, a timestamp, and an associated confidence measure. A subscriber's fingerprint consists of such tags. The tag store is a distributed NoSQL columnar database running on a cloud. It is decentralized and extremely scalable. HBase [12] serves as the foundation for the tag store. It provides a low-latency, scalable model of data access along with versioning. Since the number of attributes differs between subscribers, a relational model of data organization would be neither scalable nor well suited to this problem. Consistency and availability are ensured, along with failover mechanisms. DRBD [8] is leveraged with Hadoop to counter the issue of the name node being a single point of failure. Data replication is handled at the Hadoop level.

D. Tag Serving Cluster

The tag serving cluster is an array of web servers behind a load balancer (software or hardware), which provides external touch-point systems with secure access to our system API. An incoming request for tags and recommendations can arrive as an HTTP REST call or a SOAP call.

E. Policy Manager

The Tag Policy Manager authenticates and authorizes all incoming requests.

III. A LIVE EXAMPLE

The system provides a short-code-based number that the user can call to ask for the best p recharge offers for him/her, given his/her network usage trends, out of the k recharge plans² the system is offering at that time. This way, a busy user does not need to browse through all the recharge plans to make a decision.

We used live data files generated by a subscriber base of 50 million. It was observed that nearly 4 million subscribers recharge every day. We process the recharge-CDR files to get the recharge values of these subscribers. Using Gaussian Mixture based clustering, we find the best recharge options for each subscriber as follows.

1. We find the k-mode GMM and map users into each such segment with a confidence measure.
2. The confidence measure is assessed based on the probability of that subscriber being a member of a specific recharge segment.
3. These confidence measures are used to order the recharge options, and the best p of the k options are given as a personalized recommendation.

Figure 2 shows one of our experimental results in applying a GMM to recommend the best recharge plan when the provider was offering only 3 plans, i.e. where p = 1 and k = 3.

IV. CONCLUSIONS

In this paper we demonstrated our subscriber fingerprinting model, a mobile service personalization and recommendation system built on a distributed framework. This distributed shared-nothing architecture scales to large volumes of subscriber data, on the order of tens of millions of subscribers, and is capable of delivering low-latency, real-time recommendations. The underlying cloud infrastructure makes the platform elastic and future-proof, able to accommodate workloads of varying complexity. We also showed the utility of statistical models such as Gaussian Mixture Models for recommendation systems, and provided empirical results using this model on real-world data to demonstrate an improved matching of recommendations to subscribers.
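To make the Expectation/Map and Maximization/Reduce steps of Section II and the ranking of Section III more concrete, here is a minimal single-process Python sketch. It is not the authors' Hadoop implementation: it assumes a univariate GMM instead of the multivariate one, uses the component index rather than the mean as the map key, and runs on synthetic recharge amounts with made-up plan values. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, params):
    """Posterior probability of each component for one point, as in (2.1)."""
    weighted = [pi * gaussian_pdf(x, mu, var) for pi, mu, var in params]
    total = sum(weighted)
    return [w / total for w in weighted]


def map_step(x, params):
    """Expectation/Map: emit (component index, partial statistics) pairs."""
    for k, g in enumerate(responsibilities(x, params)):
        yield k, (g, g * x, g * x * x)


def reduce_step(partials, n_points):
    """Maximization/Reduce: summed statistics -> (pi, mean, variance)."""
    n_k = sum(p[0] for p in partials)
    mean = sum(p[1] for p in partials) / n_k
    # E[x^2] - mean^2 is the 1-D form of (2.3); floored to avoid collapse.
    var = max(sum(p[2] for p in partials) / n_k - mean ** 2, 1e-6)
    return n_k / n_points, mean, var


def log_likelihood(data, params):
    """Log likelihood of the data under the mixture, as in (2.5)."""
    return sum(
        math.log(sum(pi * gaussian_pdf(x, mu, var) for pi, mu, var in params))
        for x in data
    )


def fit_gmm(data, params, max_iter=200, tol=1e-6):
    """Alternate map and reduce phases until the log likelihood converges."""
    previous = log_likelihood(data, params)
    for _ in range(max_iter):
        grouped = defaultdict(list)  # stands in for the shuffle phase
        for x in data:
            for k, stats in map_step(x, params):
                grouped[k].append(stats)
        params = [reduce_step(grouped[k], len(data)) for k in sorted(grouped)]
        current = log_likelihood(data, params)
        if abs(current - previous) < tol:
            break
        previous = current
    return params


def rank_segments(x, params, p=1):
    """Order segments by membership confidence and keep the best p."""
    conf = responsibilities(x, params)
    order = sorted(range(len(conf)), key=lambda k: conf[k], reverse=True)
    return [(k, conf[k]) for k in order[:p]]


# Synthetic recharge amounts around three made-up plan values (k = 3).
random.seed(0)
data = [random.gauss(m, 3.0) for m in (20.0, 50.0, 100.0) for _ in range(200)]
initial = [(1 / 3, 10.0, 100.0), (1 / 3, 60.0, 100.0), (1 / 3, 120.0, 100.0)]
params = fit_gmm(data, initial)
print(rank_segments(52.0, params, p=1))
```

Emitting per-point partial sums from the map phase keeps the reduce phase a simple aggregation per component, which is what would let the same structure be distributed across worker nodes.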
REFERENCES

[1] Ho and Ho, "The Attraction of Personalized Service for Users in Mobile Commerce: An Empirical Study," ACM SIGecom Exchanges, Vol. 3, No. 4, January 2003, pp. 10-18.
[2] Kurkovsky and Harihar, "Using ubiquitous computing in interactive mobile marketing," Personal and Ubiquitous Computing, Vol. 10, No. 4, May 2006, pp. 227-240.
[3] Dean and Ghemawat, "MapReduce: simplified data processing on large clusters," Operating Systems Design and Implementation, 2004, pp. 10-10.
[4] Ananthanarayanan et al., "Cloud analytics: Do we really need to reinvent the storage stack," Workshop on Hot Topics in Cloud Computing, 2009.
[5] Xu and Jordan, "On convergence properties of the EM algorithm for Gaussian mixtures," Neural Computation, 8:129-151, 1996.
[6] Thusoo et al., "Hive: a warehousing solution over a Map-Reduce framework," VLDB, 2009.
[7] Nguyen and Halem, "A MapReduce workflow system for architecting scientific data intensive applications," Workshop on Software Engineering for Cloud Computing, 2011.
[8] Philipp R., "DRBD," UNIX en High Availability, 2001, Ede, pp. 93-104.
[9] Ducoffe R., "Advertising value and advertising on the Web," Journal of Advertising Research, 1996, pp. 21-35.
[10] Vogels, "Head in the Clouds - The Power of Infrastructure as a Service," Workshop on Cloud Computing and Its Applications, 2008.
[11] Bishop C. M., "Pattern Recognition and Machine Learning," Springer, 2006.
[12] A. Khetrapal and V. Ganesh, "HBase and Hypertable for large scale distributed storage systems," Dept. of Computer Science, Purdue University, 2008.

² Recharge Plan: A prepaid scheme provided by service providers to allow subscribers to pay for their products and services.