




Bank Institution Term Deposit Predictive Model

You have successfully completed a rigorous interview process with the Bank of Portugal and joined as a machine learning researcher. The investment and portfolio department wants to identify customers who are likely to subscribe to its term deposits. Marketing managers are increasingly interested in tuning their directed campaigns through a rigorous selection of contacts, so your employer's goal is a model that predicts which future clients will subscribe to a term deposit. An effective predictive model would increase campaign efficiency: the bank could identify likely subscribers and direct its marketing efforts at them, making better use of its resources (e.g. human effort, phone calls, time).

The Bank of Portugal has therefore collected a large amount of data, including the profiles of customers who subscribed to a term deposit and of those who did not. As the newly hired machine learning researcher, you are asked to build a robust predictive model that identifies which customers will or will not subscribe to a term deposit in the future.

Your main tasks as a machine learning researcher are data exploration, data cleaning, feature extraction, and the development of robust machine learning models to support the department.

Figure: the first five rows of the dataset, shown in two parts.
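As a minimal sketch of how such a preview can be produced, the snippet below reads a small CSV excerpt and calls head(). The three rows are made up for illustration, and the semicolon separator is an assumption about how the real file is stored; neither comes from the article.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for the real CSV file;
# the values are made up for illustration only.
sample_csv = io.StringIO(
    "age;job;marital;y\n"
    "56;housemaid;married;no\n"
    "37;services;married;no\n"
    "41;blue-collar;single;yes\n"
)

# Assumption: the file uses ';' as its field separator.
df = pd.read_csv(sample_csv, sep=";")

# head() returns the first five rows by default
# (fewer when the frame is shorter, as it is here).
print(df.head())
```

With the real file, the StringIO object would be replaced by the path to the CSV.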

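According to the data description below, the numerical variables are age, duration, campaign, pdays, previous, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m and nr.employed.

The categorical variables are job, marital, education, default, housing, loan, contact, month, poutcome and the target y.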
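One way to obtain this split programmatically is select_dtypes(). The sketch below uses a tiny made-up frame with a few of the column names from the description; it is an illustration, not the article's own code.

```python
import pandas as pd

# Toy frame with made-up values; only the column names follow the description.
df = pd.DataFrame({
    "age": [56, 37, 41],
    "euribor3m": [4.857, 4.857, 1.299],
    "job": ["housemaid", "services", "blue-collar"],
    "y": ["no", "no", "yes"],
})

# Columns with a numeric dtype (integers and floats).
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```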

Data Description:

# bank client data:

1 - age (numeric)

2 - job: type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')

3 - marital: marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')

5 - default: has credit in default? (categorical: 'no', 'yes', 'unknown')

6 - housing: has housing loan? (categorical: 'no', 'yes', 'unknown')

7 - loan: has personal loan? (categorical: 'no', 'yes', 'unknown')

# related with the last contact of the current campaign:

8 - contact: contact communication type (categorical: 'cellular', 'telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', …, 'nov', 'dec')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
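The note on duration can be turned into two feature sets: one that keeps the column for benchmarking and one that drops it for a realistic model. The sketch below assumes a toy frame with made-up values, and the names X_benchmark and X_realistic are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values; a duration of 0 goes with y='no'.
df = pd.DataFrame({
    "age": [56, 37, 41],
    "duration": [261, 0, 1042],
    "campaign": [1, 2, 1],
    "y": ["no", "no", "yes"],
})

# Target column.
y = df["y"]

# Benchmark features: duration is kept, so scores will look optimistic.
X_benchmark = df.drop(columns=["y"])

# Realistic features: duration is unknown before the call, so it is removed.
X_realistic = df.drop(columns=["y", "duration"])

print(list(X_realistic.columns))
```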

# other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')
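Because pdays uses 999 as a marker for "not previously contacted" rather than as a real number of days, averaging the raw column would be misleading. One possible way to handle it, shown here on made-up values with hypothetical column names previously_contacted and pdays_clean, is to split the marker from the real values.

```python
import pandas as pd

# Made-up values: 999 marks clients who were not previously contacted.
df = pd.DataFrame({"pdays": [999, 6, 999, 3]})

# Boolean flag that captures the meaning of the 999 marker.
df["previously_contacted"] = df["pdays"] != 999

# Keep real day counts and turn the marker into a missing value (NaN).
df["pdays_clean"] = df["pdays"].where(df["pdays"] != 999)

# The mean now ignores the marker instead of being pulled toward 999.
print(df["pdays_clean"].mean())
```

This is only one option; whether to keep the marker, the flag, or both depends on the model being used.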

# social and economic context attributes

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

21 - y: has the client subscribed to a term deposit? (binary: 'yes', 'no')
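Most learning algorithms expect a numeric target, so the 'yes'/'no' labels are usually mapped to 1 and 0 before modelling. A minimal sketch on made-up labels follows; the names y_binary and subscription_rate are hypothetical.

```python
import pandas as pd

# Made-up target labels in the same 'yes'/'no' form as the description.
y = pd.Series(["no", "yes", "no", "no"], name="y")

# Map the two labels to integers: 1 means the client subscribed.
y_binary = y.map({"no": 0, "yes": 1})

# The mean of a 0/1 column is the share of positive cases.
subscription_rate = y_binary.mean()

print(y_binary.tolist(), subscription_rate)
```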

The .dtypes attribute shows the data type of each variable in the dataset.

We can get the size of the dataset, as (rows, columns), using the .shape attribute.
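Both are attributes rather than methods, so they are read without parentheses. A small sketch on a made-up frame:

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "age": [56, 37, 41],
    "euribor3m": [4.857, 4.857, 1.299],
    "job": ["housemaid", "services", "blue-collar"],
})

# One dtype per column: integer, float, and a text dtype for strings.
print(df.dtypes)

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```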

The pandas describe() method shows basic summary statistics, such as the count, mean, standard deviation, minimum, percentiles and maximum, for a data frame or a series of numeric values. Among other things, it gives the count of non-null values for each numeric variable.
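A short sketch of describe() on made-up values, showing that only numeric columns are summarised by default:

```python
import pandas as pd

# Toy frame with made-up values: one numeric and one text column.
df = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "job": ["services", "student", "retired", "admin."],
})

# By default describe() summarises numeric columns only.
summary = df.describe()
print(summary)

# Individual statistics can be read from the result by row label.
mean_age = summary.loc["mean", "age"]
median_age = summary.loc["50%", "age"]
```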

When we import a dataset from a CSV file, blank cells are read into the data frame as null (NaN) values, which can cause problems in later operations. The pandas isnull() method is used to detect these null values.

We can see that there are no null values in the dataset.
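A common way to run this check is isnull().sum(), which counts nulls per column. The toy frame below deliberately contains one blank value so that a non-zero count is visible. It also counts the 'unknown' label from the data description, since isnull() does not treat that string as missing; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: one true null and one 'unknown' label.
df = pd.DataFrame({
    "age": [56.0, np.nan, 41.0],
    "job": ["housemaid", "unknown", "services"],
})

# Number of null (NaN) values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# 'unknown' is an ordinary string, so it has to be counted separately.
unknown_counts = df.select_dtypes(exclude="number").eq("unknown").sum()
print(unknown_counts)
```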

Histograms are one of the most common graphs used to display numeric data. There are two important things we can learn from a histogram:

Let's plot a histogram for the 'age' feature in our dataset.

Here, the distribution is skewed to the right.
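Right skew can also be checked numerically: the bin counts thin out towards high values, the mean sits above the median, and the skewness is positive. The sketch below uses made-up ages, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up ages with a long tail towards older clients.
ages = pd.Series(
    [25, 28, 30, 31, 33, 35, 36, 38, 41, 45, 52, 60, 75, 88], name="age"
)

# The numbers behind a histogram: a count per bin plus the bin edges.
# With matplotlib installed, ages.plot.hist(bins=5) would draw the same bins.
counts, bin_edges = np.histogram(ages, bins=5)
print(counts, bin_edges)

# A positive skewness indicates a tail to the right.
skewness = ages.skew()
print(skewness, ages.mean(), ages.median())
```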

A count plot can be thought of as a histogram across a categorical, instead of numeric, variable. It is used to find the frequency of each category.

Count to see if the clients have subscribed to a term deposit:

For the education variable, we can see that 'university.degree' has the highest count.

Count of Ages in the dataset:

Marital status count in the dataset:
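The frequencies that a count plot displays can be computed directly with value_counts(). The sketch below uses made-up rows; drawing the plot itself, for example with seaborn's countplot, is an assumption about tooling that the article does not confirm.

```python
import pandas as pd

# Toy frame with made-up values for two categorical columns.
df = pd.DataFrame({
    "education": [
        "university.degree", "high.school", "university.degree",
        "basic.9y", "university.degree", "high.school",
    ],
    "y": ["no", "no", "yes", "no", "no", "yes"],
})

# Frequency of each category, sorted from most to least common.
education_counts = df["education"].value_counts()
print(education_counts)

# Label of the most frequent category.
top_category = education_counts.idxmax()

# The same call shows how balanced the target is.
target_counts = df["y"].value_counts()
print(target_counts)
```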

A box plot is a visual representation of the statistical summary of a data set: the minimum, first quartile, median, third quartile and maximum.
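The quantities a box plot draws can be computed with quantile(). The sketch below uses made-up ages and assumes the common convention that points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers; the article does not state which rule its plots use.

```python
import pandas as pd

# Made-up ages with one unusually high value.
ages = pd.Series([25, 30, 35, 40, 45, 50, 55, 95], name="age")

# Quartiles: the edges of the box and the line inside it.
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])

# Interquartile range and the assumed 1.5 * IQR fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences would be drawn as individual outliers.
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(ages.min(), q1, median, q3, ages.max())
print(outliers.tolist())
```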
