Published on January 2, 2018
1. Stevens Institute of Technology School of Business Business Intelligence & Analytics Program A Snapshot of Data Science Student Poster Presentations Corporate Networking Event – November 28, 2017
2. This document reproduces the posters presented by students of the Business Intelligence & Analytics (BI&A) program at a Corporate Networking event held at Stevens Institute on November 28, 2017. The event was attended by over 80 company representatives and approximately 150 students and faculty members. The posters were presented by students at all stages of their academic programs, from their first semester through their final semester. The research described in each poster was conducted under the guidance of a faculty member. The broad range of research topics and methodologies exhibited by the posters in this document reflects the diversity of faculty research interests and the practical nature of our program. For background, the first poster describes the BI&A program. Founded in spring 2012 with just 4 students, the program now has over 220 full-time and part-time master of science students and 80 graduate certificate students, and is ranked 7th in the nation by The Financial Engineer. As illustrated in the first poster, a distinctive feature of the program is its three-layer structure. In the professional skills layer, business and communication skills are developed through workshops, talks by industry leaders and an active student club. In the second layer, the 12-course curriculum covers the concepts and tools associated with database management, data warehousing, data and text mining, web mining, social network analytics, optimization and risk analytics. The curriculum culminates in a capstone course in which students work on a research project – often in conjunction with industry associates. Finally, in the technical skills layer, students attend a series of free weekend boot camps that provide training in industry-standard software packages, such as SQL, R, SAS, Python and Hadoop. The 76 student posters in this document represent a broad array of research projects.
We are proud of the quality and innovativeness of our students’ research and of their hard work and enthusiasm, without which this event would have been impossible.

Chris Asakiewicz, Ted Stohr and Alkis Vazacopoulos
Business Intelligence & Analytics Program
Stevens Institute of Technology
www.stevens.edu/business/bia

Foreword
3. INDEX TO POSTERS
* Indicates the poster was accompanied by a live demo
No. Title (Student Authors)
0 BI&A Curriculum (The faculty)
1* Google Online Marketing Challenge 2017 – True Mentors AdWords Campaign (Philippe Donaus, Rush Kirubi, Salvi Srivastava, Thushara Elizabeth Tom, Archana Vasanthan)
2 Multivariate Testing to Improve a Non-Profit’s Home Page (Rush Kirubi, Thushara Elizabeth Tom)
3 Analyzing the Impact of Earthquakes (Fulin Jiang, Ruhui Luan, Zhen Ma, Gordon Oxley)
4* Real Time Health Monitoring System (Ankit Kumar, Khushali Dave, Shruti Agarwal, Nirmala Seshadri)
5 Zillow Home Value Prediction (Yilin Wu, Zhihao Chen, Jiaxing Li, Zhenzhen Liu, Ziyun Song)
6 Analysis of Opioid Prescriptions and Drug-related Overdoses (Nishant Bhushan, Sunoj Karunanithi, Pranjal Gandhi, Raunaq Thind)
7 UK Traffic Flow & Accidents Analysis (Xiaoming Guo, Weiyi Chen, Dingmeng Duan, Jian Xu, Jiahui Tang)
8 US Permanent Visa Application Process (Jing Li, Qidi Ying, Runtian Song, Jianjie Gao, Chang Lu)
9 Climate Change Since 1770 (Yilin Wu, Ziyun Song, Zhihao Chen, Jiaxing Li, Zhenzhen Liu)
10 Predicting Customer Conversion Rate for an Insurance Company (Yalan Wang, Cong Shen, Junyuan Zheng, Yang Yang)
11 Determining Attractive Laptop Features Among College Students (Liwei Cao, Gordon Oxley, Salman Sigari, Haoyue Yu)
12 Clustering Large Cap Stocks During Different Phases of the Economic Cycle (Nikhil Lohiya, Raj Mehta)
13 Predicting Interest Level in Apartment Listings on RentHop (Cristina Eng, Ying Liu, Haoyue Yu, Salman Sigari)
14 Deep Learning vs. Traditional Machine Learning Performance on an NLP Problem (Abhinav S Panwar)
15 Wine Recognition (Biying Feng, Ting Lei, Jin Xing)
16 Project Assignment Optimization Based on Human Resource Analysis (Jiarui Li, Siyan Zhang, Siyuan Dang, Xin Lin)
17* Short Term Load Forecasting Using Artificial Neural Networks (Bhargav Kulkarni, Ephraim Schoenbrun)
18 Optimizing Travel Routes (Ruoqi Wang, Yicong Ma, Jinjin Wang, Shuo Jin)
19 Exploring and Predicting Airbnb Prices in NYC (Ruoqi Wang, Yicong Ma, Jiahui Bi, Xin Chen)
20* The Public’s Opinion of Obamacare: Twitter Data Analyses (Saeed Vasebi)
21 S&A Processed Seafood Company (Redho Putra, Neha Bakhre, Harsh Bhotika, Ankit Dargad, Dinesh Muchandi)
22 Portfolio Optimization of Cryptocurrencies (Yue Jiang, Ruhui Luan, Jian Xu, Shaoxiu Zhang, Tianwei Zhang)
23 Fantasy Premier League Soccer Team Optimization (Haoran Du, Xiang Yang, Ruiwen Shi)
4. 24 Predicting Movie Rating and Box Office Gross by PCA and LR Model (Yunfeng Liu, Erdong Xia, Yash Naik)
25 Student Performance Prediction (Abineshkumar, Sai Tallapally, Vikit Shah)
26 Classifying Iris Flowers (Xi Chen, Shan Gao, Lan Zhang)
27 Using Data Analytics to Retain Human Resources (Aditya Pendyala, Rhutvij Savant)
28 Using Financial Models to Construct Efficient Investment Portfolios (Cheng Yu, Tsen-Hung Wu, Ping-Lun Yeh)
29 Pima Indians Diabetes Analysis (Junjun Zhu, Jiale Qin, Yi Zhang)
30 Google Online Marketing Challenge: Aether Game Café (Jaya Jayakumar, Sunoj Karunanithi, Saketh Patibandla, Ephraim Schoenbrun)
31 Experiment for Apartment Rental Advertisements (Shavari Joshi, Nirmala Seshadri, Nishant Bhushan)
32 A Tool for Discovering High Quality Yelp Reviews (Zijing Huang, Po-Hsun Chen, Hao-Wei Chen, Chao Shu)
33* World’s Best Fitness Assistant: Cognitive Computing (Anand Rai, Jaya Prasad Jayakumar, Saketh Patibandla)
34 Zillow’s Home Value Prediction (Wenzhuo Lei, Chang Xu, Juncheng Lu)
35 Web Traffic Time Series Forecasting (Jujun Huang, Peimin Liu, Luyao Lin)
36* Portfolio Optimization with Machine Learning (Chao Cui, Huili Si, Luotian Yin, Qidi Ying, Yinchen Jiang)
37 Developing a Supply Chain Methodology for an Innovative Product (Akshay Sanjay Mulay)
38 Bike Sharing Optimization (Jiahui Bai, Yuankun Nai, Yuyan Wang, Yanru Zhou)
39 Crime in the U.S. (Minyan Shao, Yuyan Wang, Yuankun Lin)
39A Porto Seguro Safe Driver Prediction (Xiaoming Guo, Weiyi Chen, Dingmeng Duan, Jian Xu, Jiahui Tang)
40 Identifying Mushrooms: Safe to Eat or Deadly Poison (Xiang Yang, Haoran Du, Shikuang Gao, Ruiwen Shi)
41* Student Grade Prediction (Gaurav Sawant, Vipul Gajbhiye, Vikram Singh)
42 NBA 2018 Play-offs & Champion Prediction (Amit Kumar, Jayesh Mehta, Xiaohai Su)
43 AI-integrated Interactive Search Interface for Biomedical Literature Search (Akshay Kumar Vikram, Vishnu Pillai, Satya Peravali)
44* Text Mining of Intellectual Contribution Information from a Corpus of CVs (Nishant Bhushan, Neha Mansinghka, Nirmala Seshadri, Arpit Sharma)
45 Analysis of Chase Bank Deposits (Lulu Zhu, Xinlian Huang, Junxin Xia, Rahul Nair)
46 Predicting Song Hits on the Billboard Chart (Siyan Zhang, Yifeng Liu, Yuejie Li)
5. 47 Predicting Results of Premier League Contests (Hantao Ren, Lanyu Yu, Siyuan Dang, Jiarui Li)
48 NLP Meets Yelp: Recommendation System for Restaurants (Rui Song)
49 WSDM — KKBox’s Churn Prediction Challenge (Caitlyn Garger, Yina Dong, Shuo Jin)
50 Data Mining of Video Game Sales (Xin Lin, Fanshu Li, Jingmiao Shen)
51 DengAI: Predicting Disease Spread (Vicky Rana, Pradeepkumar Prabakaran)
52 Predicting Tesla Model 3 Production Volume (Wangming Situ, Liwei Cao, Bohong Chen, Tianyu Hao)
53 Vehicle Routing Problem Using NYC TLC Data (Adrash Alok, Garvita Malhan, Ephraim Schoenbrun, Abhir Yadava)
54 Credit Rating for Lending Club (Rui Song, Huili Si, Xiao Wan, Lulu Hu)
55 Routify: Personalized Trip Planning (Minzhe Huang, Bowan Lu, Jingmiao Shen, Xiaohai Su, Abhitej Kodali)
56 Uncovering World Happiness Patterns (Rui Song, Xiao Wan, Xiaoyu Zhang)
57* Data Centers – Where to Locate? (Smriti Vimal, Sanjay Pattanayak, Kumar Bipulesh, Nitin Gullah, Souravi Sudamme)
58 Drone Optimization in the Delivery Industry (Ni Man, Xinlian Huang, Xuanyan Li)
59 Performance Evaluation of Machine Learning Algorithms on Big Data Using Spark (Neha Mansinghka, Madhuri Koti, Prathamesh Parchure)
60* Duck Wisdom: A Personal Portfolio Optimization Tool (Taranpreet Singh, Shivakumar Barathi, Ramona Lasrado, Nikhil Lohiya)
61 Porto Seguro’s Safe Driver Prediction (Boren Lu, Lanshi Li, Xiaoming Guo, Dingmeng Duan)
62 Hospital Recommendation System for Patients (Abdullah Khanfor, Danilo Brandao and Pedro Sa)
63 Customer Segmentation for B2B Sales of Fitness Gear (Juhi Gurbani, Arpit Sharma, Neha Mansinghka)
64 Predicting Vehicle Collisions & Dynamic Assignment of Ambulances in NYC (Divya Rathore, Dhaval Sawlani, Nitasha Sharma, Shruti Tripathi)
65 Iceberg Classifier Challenge (Chang Lu, Jing Li, Luotian Yin, Runtian Song)
66 Predicting Movie Success (Jialiang Liu, Huaqing Xie, Xiaohai Su, Liang Ma, Lanjun Hui)
67 Stock Prediction Based on News Titles (Jianuo Xu, Minghao Guo, Simin Liang, Yudong Cao, Yunzhe Xu)
68 How Consumer Reviews Affect a Star’s Ratings (Jinjin Li, Prabhjot Singh, Yutian Zhou, Xuetong Wei, Xiaoyu Zhang)
6. 69 Student Alcohol Consumption: Predicting Final Grades (Ping-Lun Yeh, Zhuohui Jiang, Gaurang Pati)
70 Mobile Banking Fraud Detection (Junyuan Zheng, Ke Cao, Miaochao Wang, Tuo Han)
71 Subway Delay Dilemma (Smit Mehta, Nishita Gupta, Matthew Miller, Jianfeng Shi)
72 Integrated Digital Marketing Studies on a Hoboken Local Restaurant (Shuting Zhang, Yalan Wang, Haoyue Yu, Liyu Ma, Christina Eng)
73* AI Academic Advisor (Vaibhav Desai, Piyush Bhattad)
74 JFK Airport – Flight Delay Analysis (Praveen Thinagarajan, Arun Krishnamurthy, Thushara Elizabeth Tom, Sunoj Karunanithi)
75 Machine Learning on a Highly Imbalanced Manufacturing Data Set (Liyu Ma)
76* Duck Finder (Salman Sigari, Shankar Raju and Team)
7. Master of Science in Business Intelligence & Analytics
http://www.stevens.edu/bia

CURRICULUM
Organizational Background: Financial Decision Making
Data Management: Strategic Data Management; Data Warehousing & Business Intelligence; Data and Information Quality*
Optimization and Risk Analysis: Optimization & Process Analytics; Risk Management Methods & Apps.*
Data Mining: Knowledge Discovery in Databases; Statistical Learning & Analytics*
Statistics: Multivariate Data Analytics; Experimental Design
Social Network Analytics: Network Analytics; Web Mining
Management Applications: Marketing Analytics*; Supply Chain Analytics*
Big Data Technologies: Data Stream Analytics*; Big Data Seminar*; Cognitive Computing*
Practicum: Projects with industry
* Electives – choose 2 out of 8

PROGRAM ARCHITECTURE
Social Skills: Written & Oral Skills Workshops; Team Skills; Job Skills Workshops; Industry speakers; Industry-mentored projects
Disciplinary Knowledge: Curriculum; Practicum; MOOCs
Technical Skills: SQL, SAS, R, Python, Hadoop; Software “Boot” Camps; Course Projects; Industry Projects
Infrastructure (Laboratory Facilities): Hadoop, SAS, DB2, Cloudera; Trading Platforms: Bloomberg; Data Sets: Thomson-Reuters, Custom

Demographics        2013F  2014F  2015F  2016F  2017F
Applications          101    157    351    591    725
Accepted               48     84    124    287    364
Rejected               34     34    186    257    307
In system/other        19     39     41     46     53

Admissions: Full-time 201; Part-time 21
Gender: Female 41%, Male 59%

Placement
Starting salaries (without signing bonus): $65-140K range, $84K average ($90K in finance and consulting)
Data Scientists: 23%; Data Analysts: 30%; Business Analysts: 47%
Our students have accepted jobs at, for example: Apple, Bank of America, BlackRock, Cablevision, Dun & Bradstreet, Ernst & Young, Genesis Research, Jefferies, Leapset, Morgan Stanley, New York Times, Nomura, PricewaterhouseCoopers, RunAds, TIAA-CREF, Verizon Wireless

Hanlon Lab --
Hadoop for Professionals

The Master of Science in Business Intelligence and Analytics (BI&A) is a 36-credit STEM program designed for individuals who are interested in applying analytical techniques to derive insights and predictive intelligence from vast quantities of data. The first of its kind in the tri-state area, the program has grown rapidly. We now have approximately 222 master of science students and another 79 students taking 4-course graduate certificates. The program has increased rapidly in quality as well as size. The average test scores of our student body are above the 75th percentile. We are ranked #7 among business analytics programs in the U.S. by The Financial Engineer.

STATISTICS

PROGRAM PHILOSOPHY/OBJECTIVES
• Develop a nurturing culture
• Race with the MOOCs
• Develop innovative pedagogy
• Migrate learning upstream in the learning value chain
• Continuously improve the curriculum
• Use analytics competitions
• Improve placement
• Partner with industry
8. Google Online Marketing Challenge 2017: True Mentors AdWords Campaign
Team: Philippe Donaus, Rush Kirubi, Salvi Srivastava, Thushara Elizabeth Tom, Archana Vasanthan
Instructor: Theano Lianidou
Business Intelligence & Analytics, November 28, 2017 (Poster 1)

Motivation
• The Google Online Marketing Challenge is a unique opportunity for students to build online marketing campaigns on Google AdWords for a business or a non-profit. A $250 budget is provided by Google to run these campaigns live for 3 weeks.
• We worked with TRUE Mentors, a non-profit based in Hoboken, NJ, and built a marketing strategy on Google AdWords to achieve goals such as creating brand awareness and promoting fundraising events, volunteer opportunities and donations.
• Technologies used: Google AdWords, Google Analytics, Google Search Console, Facebook Insights

Design of Campaigns
• Conducted market analysis: competitors, current market position and platforms used, USP
• Analyzed existing data available in Google Analytics, Google Search Console and Facebook Insights, and established marketing goals
• Designed campaigns for Search and Display ads

Performance Results
• 23 ad groups with 206 ads and 700 keywords were used in total
• Text ads appeared for people’s search terms across Google Search and search partner sites
• Display ads appeared on relevant pages across the display network
• The team finished as a “Finalist” in the Social Impact Award category and as a “Semi-Finalist” in the Business Award category
• Ranked among the top 10 teams in the Social Impact Award category, the top 15 teams in the Business Award category, and the top 5 teams in the Americas region
• The results can be found at: https://www.google.com/onlinechallenge/past/winners-2017.html
• Team ID: 234-571-4266

Targeting and Bidding (target goals were set before running the campaigns)
• Campaigns ran from 24 April 2017 to 14 May 2017.
• KPIs were monitored and optimized continuously over the 3 weeks using insights drawn from various AdWords reports, search term reports, Google Analytics and keyword reviews.

Campaign level (TM_Brand, TM_Events, TM_Donations, TM_Volunteers, TM_DisplayCampaign):
• Location: Hudson County, NJ for all campaigns; TM_Brand and TM_DisplayCampaign also targeted New York County, NY
• Bidding strategy: Manual CPC for all campaigns
• Daily budget: set for all campaigns

Ad group level:
• Max CPC: set for all ad groups
• Demographics: only TM_Volunteers used demographic targeting (male and female were targeted separately)

Keyword level:
• Max CPC: set for all campaigns except TM_DisplayCampaign
• Topics: only TM_DisplayCampaign used topic targeting (Charity & Philanthropy, and Fast Foods)
9. Improving a Non-Profit’s Home Page
Team: Rush Kirubi, Thushara Elizabeth Tom
Instructor: Chihoon Lee
Business Intelligence & Analytics, November 23, 2017 (Poster 2)

Motivation
• Goal: optimize TRUE Mentors’ homepage to reduce the drop-off rate.
• In turn, this improves the quality score of the AdWords bids, leading to more ad exposures at the same or lower expense.

Participating in Google’s Online Marketing Challenge, we selected a nonprofit to run a digital marketing campaign. Part of our efforts involved optimizing the organization’s home page to boost user engagement as measured by drop-off rates. We set up a full factorial experiment (2^3) with time of day as the blocking variable. Put simply, we tested the donate button color, the presence of a slider, and the type of testimonial. The type of testimonial was either predominantly text or simply a photo with a caption. All three factors were blocked on the time of day (daytime or nighttime).

Experiment Design
• Methodology: full factorial design with blocking.
• Factors & levels: donate button color, presence of a slider, type of testimonial.
• Response: drop-off rate.

Data
Time (the blocking factor) was confounded with the three-factor interaction ABC, which was therefore assumed to be negligible.

Result
• No factor stood out as statistically significant (effect test and normal plot).

Conclusion
• The best setting, relative to the others, is a purple donate button and no slider.
• However, this difference is not significant at the 5% level.
• Since none of the factors are significant, we opted for the settings that minimize page loading time: no slider, testimonial with text, purple donate button.

Limitations
• We did not have enough time for replication due to the competition deadlines.
• The blocking variable was difficult to accommodate; we manually recorded the values at certain times of the day and night.
Empirically, the best setting was no slider with a purple donate button, although these effects were not robust enough to pass a statistical inference test. Since the slider does not measurably affect drop-off rates, we kept the no-slider option, which also reduces page load time.
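The blocking scheme described above (time of day confounded with the ABC interaction) can be sketched in a few lines of Python. The factor roles follow the poster; the -1/+1 coding and the day/night assignment rule are the standard textbook construction, shown here as an illustration:

```python
from itertools import product

# 2^3 full factorial: A = donate-button color, B = slider present,
# C = testimonial type, each coded -1 (low) / +1 (high).
runs = list(product([-1, 1], repeat=3))

# Confound the block (time of day) with the three-factor interaction ABC:
# runs with A*B*C = +1 form one block, the remaining runs form the other.
design = [{"A": a, "B": b, "C": c,
           "block": "day" if a * b * c == 1 else "night"}
          for a, b, c in runs]

for run in design:
    print(run)
```

With this assignment, any day-vs-night difference is absorbed by the ABC interaction, which is why the analysis must assume ABC is negligible.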
10. Analyzing the Impact of Earthquakes
Team: Fulin Jiang, Ruhui Luan, Zhen Ma, Gordon Oxley
Instructor: Prof. Alkis Vazacopoulos
Business Intelligence & Analytics (Poster 3)

Motivation
• Earthquakes are one of the most destructive natural forces in the world, ravaging entire cities with little notice.
• We wanted to analyze and visualize patterns in how earthquakes have historically struck and damaged specific locations, in order to provide information on high-risk and under-prepared areas of the world.
• Earthquake features used include magnitude, source, focal depth, date, and type (nuclear or tectonic activity).
• We also measured an earthquake’s damaging effects using damage in US dollars, deaths, the number of houses damaged and destroyed, and injuries.

Technology Utilized
• Tableau was used for our analysis of earthquakes.
• We found Tableau especially useful when visualizing the latitude and longitude data, clearly identifying trends in the way earthquakes affect certain parts of the world.
• With the creation of 8 dashboards, we were able to analyze and visualize many different features of earthquakes, including depth, source, and more.

Findings (Casualties due to Earthquakes; Financial Cost of Earthquakes; Earthquake Map of the World)
• Based on the analysis of the number of deaths due to earthquakes, it is clear that a majority of high-casualty events happen in coastal regions, with many in the Indian subcontinent.
• We see a peak in deaths in 2010 due to the unfortunate number of casualties in Haiti, underlining the fact that certain underdeveloped regions will suffer increased casualties.
• We observe the same trend in high-cost areas, although the largest-cost event occurred in Japan, where the tsunami caused massive damage in 2011.

Source: National Earthquake Information Center
11. Real Time Health Monitoring System
Team: Ankit Kumar, Khushali Dave, Shruti Agarwal, Nirmala Seshadri
Instructor: Prof. David Belanger
Business Intelligence & Analytics (Poster 4)

Problem Statement
To create a user-specific, real-time health monitoring system using sensors from a smart watch and/or an Arduino device. The application should monitor health features like heart rate, step count and body temperature in real time, and should warn the user or emergency services of any undesired or serious condition.

Architectural Approach
The architecture consists of the following steps:
1. Data is generated and stored in a file
2. Data is streamed using Apache Kafka
3. A real-time data visualization is set up using a visualization tool

Tools Used
1. JSON file parsing for initial data analysis
2. Apache Kafka for real-time streaming
3. Arduino programming for pulling temperature data in real time
4. Python for data cleaning
5. Tableau for visualization

Variables
1. Heart rate – through smart watch sensor
2. Step count – through smart watch
3. Body temperature – through Arduino

Trigger Cases
1. Fever – high temperature, low heart rate and step count
2. Long-term unconsciousness – low heart rate, body temperature and step count
3. Heart attack – high heart rate in a very short time interval

Results
The last part of the project is to visualize results in real time, plot streams of data, and show a trigger result if any abnormal use case occurs.

Business Impact
Our health data analysis application can pull data in real time from device sensors. Using this system, authorities, friends and relatives can easily monitor the health of a loved one and respond immediately in case of an emergency. From a business perspective, this can attract customers suffering from a medical condition.
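The three trigger cases above can be expressed as simple threshold rules on each streamed reading. This is a minimal sketch: the function name and all numeric thresholds are our own illustrative assumptions, not values from the poster:

```python
def triage(heart_rate, body_temp_f, steps, prev_heart_rate=None):
    """Classify one sensor reading against the poster's three trigger
    cases. All numeric thresholds here are illustrative assumptions."""
    # Heart attack: heart rate rising sharply over a short interval.
    if prev_heart_rate is not None and heart_rate - prev_heart_rate > 40:
        return "heart attack"
    # Fever: high temperature with low heart rate and low step count.
    if body_temp_f > 100.4 and heart_rate < 60 and steps < 100:
        return "fever"
    # Long-term unconsciousness: low heart rate, temperature and steps.
    if heart_rate < 45 and body_temp_f < 95.0 and steps == 0:
        return "unconscious"
    return "normal"
```

In the streaming setup, a rule like this would run on each record consumed from Kafka before the result reaches the Tableau dashboard or triggers an alert.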
12. Zillow’s Home Value Prediction
Team: Yilin Wu, Zhihao Chen, Jiaxing Li, Zhenzhen Liu, Ziyun Song
Instructor: Prof. Alkiviadis Vazacopoulos
Business Intelligence & Analytics (Poster 5)

Motivation
• The Zillow Prize challenges the data science community to help push the accuracy of the Zestimate even further (improving the median margin of error).
• Our task in this competition was to develop an algorithm that predicts the logerror for the months in Fall 2017.

Technology
• Python, Watson and Tableau for exploratory data analysis (EDA).
• Python for data preprocessing and feature engineering.
• Python to build the model.

Competition Process
EDA → data preprocessing → feature engineering → use 2016 data to train the model → test the predicted logerror on 2016 data → improve the feature engineering → figure out the best model and adjust its parameters → use both 2016 and 2017 data to train the model → several improvements → final submission.

Feature Engineering
Feature engineering was the most important part of this competition. It is essential to measure feature importance in order to keep valuable features and drop useless ones. We also created new features that might make the machine learning algorithms work better.

Modeling
Compared with XGBoost and LightGBM, we found that CatBoost, another gradient boosting machine, worked remarkably well on this problem. CatBoost builds oblivious decision trees and includes an ordered boosting scheme that reduces the bias of the residuals to prevent overfitting. It also uses a different scheme for calculating values in leaves and supports several options for converting categorical features based on counting statistics. In general, CatBoost is presented as an algorithm that can work with categorical features without preprocessing, is resistant to overfitting, and can be used without spending much time and effort on hyperparameter selection. Most interestingly, it can also be more accurate.
Training, however, is still very slow: on average, 7-8 times longer than LightGBM and 2-3 times longer than XGBoost. Before training the model, it is necessary to declare the categorical features; in our case, 26 categorical features were passed to CatBoost through its Pool interface. We then adjusted the parameters to get a better (though not the best) result.

Conclusion
The final submission ranked in the top 11%. It is a pity that we came so close to the bronze medal, which goes to the top 10%. In future work we can experiment further with the categorical features and keep tuning the model.
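For reference, the competition's target and score can be computed as below. This is a minimal sketch assuming the Zillow Prize definitions (logerror = log(Zestimate) - log(actual sale price), with submissions scored by mean absolute error); the function names are our own:

```python
import math

def logerror(zestimate, sale_price):
    # Zillow Prize target: difference of the log prices.
    return math.log(zestimate) - math.log(sale_price)

def mean_abs_error(predicted, actual):
    # Submissions are scored on the mean absolute logerror.
    return sum(abs(p, ) if False else abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# A Zestimate 10% above the sale price gives a positive logerror:
err = logerror(110_000, 100_000)
```

A positive logerror means the Zestimate overshot the sale price; models in the competition predict this residual directly rather than the price itself.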
13. Analysis of Opioid Prescriptions and Deaths
Team: Pranjal Gandhi, Nishant Bhushan, Sunoj Karunanithi, Raunaq Thind
Instructor: Alkis Vazacopoulos
Business Intelligence & Analytics (Poster 6)

Objective & Motivation
The objective is to find the correlation between prescriptions of drugs containing opioids and drug-related deaths in the USA. What are opioids? Opioids are a class of drugs that include the illicit drug heroin as well as licit prescription pain relievers such as oxycodone, hydrocodone, codeine, morphine and fentanyl.

Tools Used
• Tableau for visualizations.
• R and Excel for cleaning the data and exploratory analysis.
• The dataset is a subset of data sourced from cms.gov and contains prescription summaries of 250 common opioid and non-opioid drugs written by medical professionals in 2014.

Facts & Figures
• The number of deaths due to drug overdoses exceeds deaths due to car accidents by a staggering 11,102, as per a report by the DEA.
• In 2014, there were 4.3 million people aged 12 years or older using opioid-based painkillers without prescriptions.
• This led to substance abuse among almost 50% of these consumers.
• 94% of respondents in a 2014 survey of people in treatment for opioid addiction said they chose to use heroin because prescription opioids were “far more expensive and harder to obtain.”

Results and Conclusion
• We found that opioid prescriptions were highest among prescribers in the following specialties: female nurse practitioners, female physician assistants, female and male family practices, female and male internal medicines, and male dentists.
• The top 5 states with the highest percentage of deaths due to overdoses are California, Ohio, Pennsylvania, Florida and Texas. All of them had significantly high prescriptions of Hydrocodone-Acetaminophen, followed by Oxycodone-Acetaminophen.
14. UK Traffic Flow & Accident Analysis
Team: Xiaoming Guo, Weiyi Chen, Dingmeng Duan, Jian Xu, Jiahui Tang
Instructor: Alkis Vazacopoulos
Business Intelligence & Analytics (Poster 7)

Motivation
• Visualization of a dataset of 1.6 million accidents and 16 years of traffic flow.

Technology
• Python for integrating the data for analysis.
• Tableau for data visualization and extracting data insights.

Current & Future Work
• Generating different plots from the data and discovering relationships between variables.
• We plan to find relationships between traffic flow and accidents.
15. US Permanent Visa Application Visualization
Team: Jing Li, Qidi Ying, Runtian Song, Jianjie Gao, Chang Lu
Instructor: Alkiviadis Vazacopoulos
Business Intelligence & Analytics (Poster 8)

Introduction
We developed a descriptive analysis of US permanent visa application data from 2012 to 2017 in Tableau, providing insights into visa decisions. The data covers 374,363 applicants from 203 countries in 22 occupations.

Descriptive Analysis
• Applications by state and applications by country.
• Employer & economic sector: top 10 companies that submit permanent visa applications.

Education & Occupation
• Comparing certification and denial rates across education degrees, applicants with a high-school degree have a significantly higher denial rate, while a doctorate has the lowest.
• Applicants with master’s and bachelor’s degrees mostly work in computer and mathematical fields, the occupations with the highest certification rates; high-school applicants mostly work in production occupations, which have certification rates lower than 5%.

Nationality & Occupation
Applicants in different occupations have different nationalities. Taking the computer science and construction occupation maps as examples, certified applicants in computer occupations mainly come from India and China, while construction applicants mainly come from Mexico.

Conclusion
Our team used different attributes to analyze the relationships between visa applications and certification rates, directly and indirectly. Application decisions correlate with many factors, such as education level, income and occupation. In conclusion, applicants with higher education (bachelor’s, master’s, doctorate) mostly work in computer and mathematical areas, which have higher incomes and are more likely to be certified. Applicants from countries that dominate an occupation have higher certification rates when applying for the related jobs. We also found that the certification rate increased from 2012 to 2017.
16. Climate Change Since 1770
Team: Yilin Wu, Ziyun Song, Zhihao Chen, Jiaxing Li, Zhenzhen Liu
Instructor: Alkis Vazacopoulos
Business Intelligence & Analytics (Poster 9)

Motivation
• Some say climate change is the biggest threat of our age, while others say it is a myth based on dodgy science.
• Personally speaking, we feel the climate change problem has become much more severe in recent years; global warming is, to some extent, responsible for the recent disastrous hurricanes.
• So we set out to build descriptive data visualizations of how the climate has changed since 1770 and to analyze the results.

Technology
• Excel for data cleaning and data filtering.
• Tableau for the data visualizations and interactive graphs.
• Tableau and Watson to build regression models and conduct the analysis.

Current & Future Work
• Illustrate the world’s climate change trend starting from the 18th century in a line chart (is the worldwide mercury really rising?).
• Specify the trend of climate change for each country and its average temperature over the whole period.
• Extract data from the original file to show how much each country’s temperature has increased, and compare these percentage changes with one another.
• Customize the period to show trends of climate change in a certain period of time (for example, the most recent 100 years).
• Try to obtain more data sources to dig deeper into the factors leading to climate change.
17. Predicting Customer Conversion Rate for an Insurance Company Team: Yalan Wang, Cong Shen, Juanyuan Zheng, Yang Yang Instructors: Alkis Vazacopoulos and Feng Mai Business Intelligence & Analytics 10 Motivation •Use the dataset which contained contact information to predict the customer who would like to purchase the insurance •Help insurance company understand the characteristics of their customers in making the purchasing decisions of their insurance Technology •Used Python to analyze imbalanced data on customers from insurance company •Applied Synthetic Minority Over-sampling Technique (SMOTE) algorithm to balance the data •Built predictive models (Logistic Regression, Random Forest and XGBoost) to predict conversion rate Data Summary Learning Model •Logistic Regression: This was chosen because it is known to serve as a benchmark with which other algorithms are compared. •Random Forests Classifier: Random Forests Classifier is a type of decision tree algorithm. •XGBoost: This model is short for “Extreme Gradient Boosting” The tree ensemble model: which is a set of classification and regression trees (CART) Tree Ensemble: which sums the prediction of multiple trees together. Raw Data • Dataset shape: 1,892,888 records and 50 variables in the dataset • Features Type: 5 columns are int64, 12 columns are float64, and 33 are object • Missing Value: 42 columns contains NA values Clean Data • Convert data format to train the model • According to correlation matrix to eliminate features highly correlated but irrelevant to target label • Applied SMOTE Algorithm to balance the dataset Imbalance data: A dataset is imbalanced if the classes are not approximately equally represented Correlation Matrix 0: Contacting without purchase 1: Contacting with Purchase Processed Dataset(1885774,135) Raw Dataset(1892888,50)Consider a sample (6,4)and let (4,3)be its nearest neighbor. (6,4)is the sample for which k-nearest neighbors are being identified. 
(4, 3) is one of those k-nearest neighbors. Let:
f1_1 = 6, f2_1 = 4, so f2_1 − f1_1 = −2
f1_2 = 4, f2_2 = 3, so f2_2 − f1_2 = −1
The new sample is generated as (f1′, f2′) = (6, 4) + rand(0–1) × (−2, −1), where rand(0–1) generates a random number between 0 and 1.
Results
The processed data is split into a training set (75%) and a testing set (25%); each model is fitted on the training set and its predictions are evaluated on the testing set. After training, the three models achieve the following accuracy:
Model  Accuracy
LGR    83.6%
RFR    94.6%
XGB    79.8%
Feature Importance
From the Random Forest we extract the top 50 features that play a significant role in the model, including: 'RQ_Flag', 'Original_Channel_Broker', 'First_Contact_Date_month', 'First_Contact_Time_Hour', 'PDL_Special_Coverage', 'RQ_Date_month', 'Inception-First_Contact', 'Original_Channel_Internet', 'PPA_Coverage', 'Inception_Date_month', 'Mileage', 'Region_(03)関東', 'Original_Channel_Phone', 'License_Color_(02) Blue', 'Previous_Insurer_Category_(02)
Conclusion & Future Work
• Comparing the accuracy of the three models, we choose the Random Forest as our final model.
• Guided by the feature importance, we will dig into the business insights behind these features and suggest which customer characteristics drive the decision to purchase insurance.
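The SMOTE interpolation step in the worked example above can be sketched as a small Python function. This is a toy illustration of the single interpolation formula, not the team's pipeline; in practice a library implementation such as imbalanced-learn's SMOTE would handle neighbor search and sampling across the whole minority class.

```python
import random


def smote_sample(x, neighbor, rng=random.random):
    """Generate one synthetic sample by interpolating between a
    minority-class sample x and one of its k-nearest neighbors,
    exactly as in the poster's worked example:
    (f1', f2') = x + rand(0-1) * (neighbor - x).
    """
    gap = rng()  # one random number in [0, 1), shared by all features
    return tuple(xi + gap * (ni - xi) for xi, ni in zip(x, neighbor))


# Poster example: sample (6, 4) with nearest neighbor (4, 3).
# The synthetic point lies on the line segment between the two.
random.seed(0)
new_point = smote_sample((6, 4), (4, 3))
```

Because a single random gap is shared across features, every synthetic point falls on the segment joining the sample and its neighbor, which is what keeps SMOTE samples inside the minority-class region.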
18. Determining Attractive Laptop Features for College Students
Team: Liwei Cao, Gordon Oxley, Haoyue Yu, Salman Sigari
Instructor: Chihoon Lee
Motivation
• Laptops have become a staple of our lives; we use them for work, entertainment, and other daily activities.
• From a marketing perspective, it is critical to identify the factors that interest consumers in order to produce and sell a successful laptop.
• A survey conducted by Pearson found that 66% of undergraduates use their laptop every day in college.
• We wanted to find out what drives laptop demand among college students.
Experiment Design
Response: Probability of Purchasing (0–100%)
Stage 1: Plackett-Burman design. Objective: identify the most important factors early in the experimentation.
Stage 2: Fractional factorial design (run in blocks). Objective: study the effects and interactions that several factors have on the response.
Data Collection
We handed out survey slips to Stevens students at random and recorded their responses: 32 observations in Stage 1 and 64 observations in Stage 2. The results were examined with an effect test, a Pareto plot, and a normal plot.
Results
The fitted model from Stage 2 is
Probability of Purchasing = 51.77 + (11.36/2)·Price − (7.23/2)·Price·OperatingSystem
                          = 51.77 + 5.68·Price − 3.615·Price·OperatingSystem
Conclusion
• Price and operating system play the most important roles in the laptop purchase decision.
• To maximize the probability of purchasing, price is set at the plus level (< $750) and operating system at the minus level (Windows). The maximum predicted probability is
Probability of Purchasing = 51.77 + (11.36/2)·(+1) − (7.23/2)·(+1)·(−1) = 61.065
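The fitted effects model above is easy to evaluate directly. The sketch below encodes it as a Python function using the poster's factor coding (price = +1 for under $750, operating system = −1 for Windows); the function name is ours, not from the poster.

```python
def predicted_purchase_probability(price, operating_system):
    """Fitted two-factor effects model from the poster's Stage 2 design.

    Coded levels (per the poster's conclusion):
      price            = +1 for under $750, -1 otherwise
      operating_system = -1 for Windows,    +1 otherwise
    """
    return 51.77 + (11.36 / 2) * price - (7.23 / 2) * price * operating_system


# Best combination found on the poster: cheap price, Windows OS.
best = predicted_purchase_probability(price=+1, operating_system=-1)
```

Setting both factors to 0 recovers the grand mean of 51.77, and flipping the signs walks through the other corners of the design, which is a quick sanity check on a two-level factorial fit.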
19. Clustering Large Cap Stocks During Different Phases of the Economic Cycle
Students: Nikhil Lohiya, Raj Mehta
Instructor: Amir H. Gandomi
Introduction
OBJECTIVE: Provide sets of securities that behave similarly during a particular phase of the economic cycle. For this project, sub-asset classes are created only for large-cap stocks.
BACKGROUND: Over time, developed economies such as the US have become more volatile, and hence the underlying risk of securities has risen. This project aims to identify the risks and potential returns associated with different securities and to cluster similar stocks by their Sharpe ratio, volatility, and average return for better portfolio analysis.
Data Acquisition
• Data on large-cap stocks and US Treasury bonds is gathered directly via APIs.
• The data covers two time frames: a recessionary and an expansionary economy.
Data Preprocessing
• Formulas are applied to calculate the required parameters (Eq. 1–4).
Analysis
• K-means clustering is performed on the 500 large-cap stocks with K = 22.
• The clustered securities are then further tested for correlation among the sub-asset classes.
Results
• The K-means plots show the stocks clustered by similarity in Sharpe ratio, volatility, and average return. There are 9 graphs in total; the two displayed above are for the recovery and recession phases. The x-axis shows the S&P 500 ticker/symbol and the y-axis shows the cluster number; hovering over a point displays the ticker, its cluster number, and the variable used for clustering.
• We used the silhouette score and visual inspection of the data points to find the optimal value of k, which turns out to be 22.
Results (continued)
• The sizes of the resulting K-means clusters vary over the range 9 to 45 stocks.
• There were some outliers in our analysis as well.
Project Flow (diagram)
Conclusion & Future Scope
• With the above methodology, we have been able to develop a set of classes that behave in a similar fashion during each phase of the economic cycle.
• The same methodology can be extended to other asset classes available online.
• Applying neural networks could significantly reduce the error in cluster formation.
• Additional parameters, such as valuation, solvency, or growth-potential factors, could be included for clustering.
• Next, we plan to add leading economic indicator data to identify the economic trend and perform the relevant analysis.
Mathematical Modelling
• We take daily returns for all 500 securities.
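The clustering step can be sketched in plain NumPy. The feature values below are simulated placeholders (the team used real S&P 500 data pulled via APIs), and the Sharpe ratio here omits the risk-free rate from Treasury yields for brevity; in practice scikit-learn's KMeans and silhouette_score would typically be used to run the clustering and choose k.

```python
import numpy as np


def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means (Lloyd's algorithm) in plain NumPy."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers (keep the old center if a cluster empties).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers


# Simulated per-stock features for 500 large-cap stocks:
# average daily return, volatility (std of daily returns), Sharpe ratio.
rng = np.random.default_rng(42)
ret = rng.normal(0.0005, 0.0003, 500)
vol = np.abs(rng.normal(0.02, 0.005, 500)) + 1e-4
X = np.column_stack([ret, vol, ret / vol])   # Sharpe ~ return / volatility
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize features

labels, _ = kmeans(X, k=22)                  # k = 22, as on the poster
sizes = np.bincount(labels, minlength=22)    # stocks per cluster
```

Standardizing first matters because the three features live on very different scales; without it, volatility would dominate the Euclidean distances and the clusters would effectively ignore return and Sharpe ratio.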