**Assignment module 5**

Overview: In this assignment, you will learn how to apply supervised data mining techniques in business cases.

Prompt: For this assignment, you will analyze the three case studies below and address the questions associated with each.

For all the cases, you will first partition data sets into 50% training, 30% validation, and 20% test and use 12345 as the default random seed. If the predictor variable values are in the character format, then treat the predictor variable as a categorical variable. Otherwise, treat the predictor variable as a numerical variable.

**Case 1: For this case, first download the data: Social Media data (available in Blackboard).**

Next review the following case study:

A social media marketing company is conducting consumer research to see how the income level and age might correspond to whether consumers respond positively to a social media campaign. Aliyah Turner, a new college intern, is assigned to collect data from past marketing campaigns. She compiled data on 284 consumers who participated in the marketing campaigns in the past, including income (in $1,000s), age, and whether everyone responded to the campaign (1 if yes, 0 otherwise).

Then complete the actions below and record your answers in a Microsoft Word document.

Note: For step-by-step instructions on how to use Excel and Analytic Solver to estimate and predict with KNN method and how to interpret results, refer to the following videos from Lesson

1: Supervised Data Mining – KNN Algorithm:

● K – Nearest Neighbors (KNN) – Introduction (7:51)

● KNN Model with Analytic Solver (12:57)

1. Perform KNN analysis to estimate a classification model for the social media campaign using the data. What is the optimal value of k?

2. Report the overall accuracy, specificity, sensitivity, and precision rates for the test data set (for Analytic Solver) or validation dataset (for R) using the cutoff value of 0.50. Explain each of them with one sentence.

3. What is the area under the ROC curve (or the AUC value)?

4. Comment on the performance of the KNN classification model. Is the KNN method an effective way to predict whether a consumer responds positively? Note: Interpret the results of ROC curve, lift chart, and decile-wise lift chart.

5. What is the predicted outcome for the first new consumer record?

**Case 2: For this case, first download the data: Mobile Banking data (available in Blackboard). **

Next review the following case study: Sunnyville Bank wants to identify customers who may be interested in its new mobile banking app. The worksheet called Mobile Banking Data contains 500 customer records collected from a previous marketing campaign for the bank’s mobile banking app. Each observation in the data set contains the customer’s age (Age), gender (Male/Female), education level (Edu, ranging from 1 to 3), income (Income in $1,000s), whether the customer has a certificate of deposit account (CD), and whether the customer downloaded the mobile banking app (App equals 1 if downloaded, 0 otherwise). Create a classification tree model for predicting whether a customer will download the mobile banking app.

Then complete the actions below and record your answers in a Microsoft Word document.

Note: For step-by-step instructions on how to use Excel and Analytic Solver to estimate and predict with a classification tree, and how to interpret results, refer to the following video from

Lesson 2: Supervised Data Mining – Decision Trees: Using Analytic Solver to Build a Classification Tree (8:17).

1. How many leaf nodes are in the best-pruned tree? What are the predictor variables and split value for the root node of the best-pruned tree?

2. What are the accuracy rate, specificity, sensitivity, and precision of the best-pruned tree on the test data? Explain each of them with one sentence.

3. Generate the ROC curve. What is the area under the ROC curve?

4. Score the 20 customers in the data set you have downloaded using the best-pruned tree. How many of the 20 new customers will likely download the mobile banking app based on your classification model? What is the probability of the first new customer downloading the app?

**Case 3: For this case, first download the data: Electricity data (available in Blackboard). **

Next review the following case study:

Kyle Robson, an energy researcher for the U.S. Energy Information Administration, is trying to build a model for predicting annual electricity retail sales for states. Kyle has compiled a data set for the 50 states and the District of Columbia that contains average electricity retail price (Price in cents/kWh), per capita electricity generation (Generation), median household income (Income), and per capita electricity retail sales (Price in MWh). Create a regression tree model for predicting per capita electricity retail sales (Sales).

Then complete the actions below and record your answers in a Microsoft Word document.

Note: For step-by-step instructions on how to use Excel and Analytic Solver to estimate and predict with a regression tree and how to interpret results, refer to the following video from Lesson 2: Supervised Data Mining – Decision Trees: Using Analytic Solver to Build a Prediction Tree (5:53). 2 BUA 6315: Business Analytics for Decision Making

1. How many leaf nodes are in the best-pruned tree and minimum error tree?

2. What are the predictor variables and split value for the first split of the best-pruned tree? What are the rules that can be derived from the root node?

3. What are the RMSE and MAD of the best-pruned tree on the test data?

4. What is the predicted per capita electricity retail sales for a state with the following values: Price = 11, Generation = 25, and income = 65,000?

__In these cases, you will learn how to apply three unsupervised data mining techniques using country-level health and population measures data and social media usage patterns data.__

**Case 4: For this case, first download the data: Health Population data (available in Blackboard).**

Next review the following case study:

The data set Health Population contains country-level health and population measures for 38 countries from the World Bank’s 2000 Health Nutrition and Population Statistics database. For each country, the measures include death rates per 1,000 people (Death Rate, %), health expenditure per capita (Health Expend, in US$), life expectancy at birth (Life Exp, in years), male adult mortality rate per 1,000 male adults (Male Mortality), female adult mortality rate per 1,000 female adults (Female Mortality), annual population growth (Population Growth, in %), female population (Female Pop, in %), male population (Male Pop, in %), total population (Total Pop), size of labor force (Labor Force), births per woman (Fertility Rate), birth rate per 1,000 people (Birth Rate), and gross national income per capita (GNI, in US$)

Then complete the actions below and record your answers in a Microsoft Word document.

Note: For step-by-step instructions on how to use Excel and Analytic Solver to estimate and predict with both clustering methods and how to interpret results, refer to the following videos from the module’s lesson:

● Hierarchical Cluster Analysis – Introduction (5:14)

● Using Analytic Solver to Perform Agglomerative Clustering (4:42)

● Using Analytic Solver to Perform K-Means Clustering (3:15)

**Section 1: Hierarchical Clustering:**

1. Perform agglomerative clustering to group 38 countries according to their health measures listed below. Use the Euclidean distance and the average linkage clustering method (Group average linkage) to cluster the data into three clusters. Is data standardization necessary in this case?

**Explain.** Use the following measures only: Death Rate, Health Expend, Life Exp, Male Mortality, and Female Mortality, Fertility Rate and Birth Rate.

Describe the characteristics of each cluster by comparing the averages of GNI per capita, Population Growth, Labor Force, Fertility Rate and Birth Rate of each group and report your findings in a table

**Case 5: For this case, first download the data: Social Media Usage data (available in Blackboard).**

Next review the following case study:

Adrian Brown is a researcher studying social media usage patterns. In his research on him, he noticed that people tend to use multiple social media applications, and he wants to find out which popular social media applications are often used together by the same user. I surveyed 100 users about which social media applications they use on a regular basis.

Then complete the actions below and record your answers in a Microsoft Word document.

**Note:** For step-by-step instructions on how to use Excel and Analytic Solver to estimate and predict with association rule and how to interpret results, refer to the following videos from the module’s lesson:

● Association Rule Analysis (9:11)

● Using Analytic Solver to Perform Association Rule Analysis (3”41)

1. When you select the data ignore User ID and select the data with labels. Do not forget to select. “First Row Contains Headers” option. Generate association rules with a minimum support of 20 and minimum confidence of 60%. How many rules are generated?

2. Sort the rules by lift ratio. What is the top rule? Report and interpret the lift ratio of the top rule.

Note About Grading:

Submission Guidelines: Your completed assignment must be submitted as a Microsoft Word document, 1-2 pages in length, double spacing, 12-point Times New Roman font, and 1-inch margins. The submission must be accompanied by three Microsoft Excel spreadsheets showing your work. Only the Word document will be assessed for grading purposes, however the case spreadsheets are required and must be submitted to show your work. Note: No tables or charts need to be included in your Word document for this assignment. Note About Grading: This assignment will be assessed based on the accuracy of your responses to each question in the worksheet.