Classification.

Let’s now apply a classification method called SVC (Support Vector Classification) to a simplified credit scoring scenario.

 

In this scenario, you need to work out if an applicant should be approved for credit based on

 

•               Savings

•               Earnings

•               Acceptance/rejection data for previous applications

Let’s first look at the data in the matrix below.

Figure 47. Data for use in an SVC model

We can see that we have training data given as x and y coordinates. We then have a binary category – 0 for not approved, 1 for approved.

 

So how do we take this data and turn it into an application that could be used to make credit worthiness decisions?

 

First, we should visualise the data that we are dealing with.

 

Then, we need to categorise the training data into the right groups.

 

Next, we can show the "best fit" dividing line between the categories.

 

Finally, we need to predict whether an application should be approved or not by inputting test data, graphing and printing the prediction.

Step 1. Visualise the Training Data.

Open folder 5.3 Credit Scoring, then 1. Credit scoring 1.1.py

 

We can see the x, y and z arrays correspond with the structure of the table set out in the ‘island’ example.

 

x = np.array((1, 1, 2, 5, 8, 9))   # savings scores

y = np.array((1, 2, 2, 8, 8, 9))   # earnings scores

z = np.array((0, 0, 0, 1, 1, 1))   # outcome category

 

We next need to combine the x and y arrays into columns to enable the data to be plotted.

 

X = np.column_stack((x, y))

 

We call this combination of x and y ‘capital X’. The rest of the code is there just to plot the data.

Looking at it all together -

 

Figure 48. Training data plotted

Figure 49. Training data plot output
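Putting step 1 together, here is a minimal, self-contained sketch of the visualisation. The axis labels and output file name are my own additions; the data and the use of np.column_stack follow the snippets above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Training data: savings (x), earnings (y) and the outcome category (z)
x = np.array((1, 1, 2, 5, 8, 9))
y = np.array((1, 2, 2, 8, 8, 9))
z = np.array((0, 0, 0, 1, 1, 1))

# Combine x and y into one two-column matrix -- 'capital X'
X = np.column_stack((x, y))

# Colour each point by its category and plot
plt.scatter(X[:, 0], X[:, 1], c=z)
plt.xlabel("savings")
plt.ylabel("earnings")
plt.savefig("training_data.png")  # use plt.show() instead when running in IDLE
```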

We can now build on this code and add a dividing line to separate the ‘approved’ and ‘not approved’ categories.

Step 2. Separate the Data with the SVC Algorithm.

Open and run 2. Credit scoring 1.2.py

 

Note these lines of code

 

w = clf.coef_[0]    # weight vector of the fitted classifier

a = -w[0] / w[1]    # slope of the dividing line

 

‘w’ is the weight vector of the fitted classifier and ‘a’ is the slope derived from it; together they are needed to get the y positions of the line.

 

The line’s x values are first bounded between 0 and the highest x value in the data.

 

The line’s y values are then defined as a function of w, a and xx.

 

Let’s look at a bit more code in detail –

Figure 50. Credit score algorithm code analysis

Looking at the code all together in IDLE –

Figure 51. SVC credit scoring model code

Running this code we can plot the boundary line itself and give it a label.

Figure 52. SVC classification showing the best fit hyperplane

Looking at it all together –

Figure 53. SVC classification code, analysis and outcome
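As a recap, step 2 can be sketched end to end as below. This assumes scikit-learn's SVC with a linear kernel, which is what the clf.coef_ line implies; the variable names follow the snippets above.

```python
import numpy as np
from sklearn.svm import SVC

# Training data, as in step 1
X = np.column_stack(((1, 1, 2, 5, 8, 9), (1, 2, 2, 8, 8, 9)))
z = np.array((0, 0, 0, 1, 1, 1))

# Fit a linear support vector classifier
clf = SVC(kernel="linear")
clf.fit(X, z)

w = clf.coef_[0]    # weight vector of the fitted hyperplane
a = -w[0] / w[1]    # slope of the dividing line

# x values bounded between 0 and the highest x value;
# y values computed as a function of w, a and xx
xx = np.linspace(0, X[:, 0].max())
yy = a * xx - clf.intercept_[0] / w[1]
```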

This concludes what we need to do with the training data.

 

Now we have trained the credit score classifier, we can start feeding it with test data to determine whether an application will be approved or not.

Step 3. Predict Outcome for Applicants.

Figure 54. Credit scoring video


 

cs1 and cs2 are the data for the applicant. This is the test data.

 

Their scores for earnings are plotted on the ’y’ axis, and their scores for savings are plotted along the ’x’ axis.

 

Let’s look at some of the code in detail -

Figure 55. Adding test data for an applicant

Outcome -

Figure 56. Prediction outcome

This application lands close to the dividing line, but on the correct side of it to be accepted.

 

Try changing the application data by changing the numbers next to cs1 and cs2.

 

cs1 = (you decide)

cs2 = (you decide)
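A minimal sketch of the prediction step. The cs1 and cs2 values here are just examples – substitute your own:

```python
import numpy as np
from sklearn.svm import SVC

# Train the classifier on the same data as before
X = np.column_stack(((1, 1, 2, 5, 8, 9), (1, 2, 2, 8, 8, 9)))
z = np.array((0, 0, 0, 1, 1, 1))
clf = SVC(kernel="linear").fit(X, z)

cs1 = 4   # applicant's savings score (you decide)
cs2 = 7   # applicant's earnings score (you decide)

# Predict the category for this applicant
P = clf.predict([[cs1, cs2]])[0]
print("Approved" if P == 1 else "Not approved")
```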

Credit Scoring With More Data.

In reality we'd need to work with many more datapoints.

 

Let’s now take this scenario further by using a lot more training data.

 

Previously we wrote the training data into the program in x, y and z arrays. However, the more data we can use for training, the more accurate the program’s predictions are going to be.

 

Also, training data may need to be changed frequently as more data becomes available, and credit worthiness levels shift with prevailing conditions.

 

Therefore, a better method for handling the training data is to bring it into the program from a data file. Let’s now illustrate this.

Credit scoring imported data

 

Open and run 4. Credit scoring imported data.py

 

Before looking at the scatterplot, let’s look at the data and the code.

 

Data

 

The data, which is in the file “creditscoredata.txt”, is presented as numbers in a text (.txt) file.

 

Figure 58. Credit score data in a .txt file

Figure 59. Code for SVC model with imported data

Let's look at some of the code in detail.

Figure 60. Code analysis

The main differences between the code in ‘4. Credit scoring imported data.py’ and the previous file – ‘3. Credit scoring 1.3.py’ are:

 

1.    The code includes instructions to get external data.

2.    It visually differentiates both the training data and the candidate data, using shape and colour to show rejected and accepted applications.

 

First, notice below that we have a new library import, which allows us to load a .txt data file into the program.

 

from numpy import loadtxt, where

 

Next, we tell the program where to get the .txt file data from, and transpose it (the .T at the end) from horizontal to vertical format, so that each variable occupies its own row.

 

data = np.genfromtxt("creditscoredata.txt", delimiter=",").T

 

Once imported, we need to tell the program what each line of data means, so we assign each row of the transposed data to a variable.

 

x = data[0]

y = data[1]

z = data[2]
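To see what the import and the transpose do, here is a tiny stand-alone demonstration. It writes a small stand-in file rather than using the course’s creditscoredata.txt, and the values in it are illustrative only:

```python
import numpy as np

# A stand-in for creditscoredata.txt: one applicant per line,
# in the order savings, earnings, outcome
with open("creditscoredata_demo.txt", "w") as f:
    f.write("1,1,0\n2,2,0\n8,8,1\n9,9,1\n")

# .T transposes the array, so each variable ends up in its own row
data = np.genfromtxt("creditscoredata_demo.txt", delimiter=",").T

x = data[0]   # savings
y = data[1]   # earnings
z = data[2]   # outcome

print(x)   # [1. 2. 8. 9.]
```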

 

Then we print the first 5 rows of the x and y axis data to check that it’s what we require. If it’s correct, then we can progress.

 

print(X[[0,1,2,3,4],:])

 

When you run the code, the resulting plot has different coloured training data points for ”approved” and “not approved”.

 

We can see that ‘Approved’ data points are blue o’s and ‘Not approved’ are red x’s.

Figure 61. Prediction outcome

Let's look at some more of the code in detail.

 

Next, we do something similar for the candidate data, using an ’if statement’.

 

if P == 1:

  plt.scatter(cand[:, 0], cand[:, 1], s=100, c='g')   # approved: green

else:

  plt.scatter(cand[:, 0], cand[:, 1], s=100, c='r')   # not approved: red

 

Here we can see that the size of the data point is 100 and that, depending on whether the prediction is 1 (approved) or not, the colour will be either green or red.
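Pulling the whole imported-data workflow together, here is a self-contained sketch. It writes its own small stand-in data file (the course uses creditscoredata.txt), assumes scikit-learn's SVC with a linear kernel, and the candidate values are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from numpy import where
from sklearn.svm import SVC

# Stand-in for creditscoredata.txt: savings, earnings, outcome per line
with open("creditscoredata_demo.txt", "w") as f:
    f.write("1,1,0\n1,2,0\n2,2,0\n5,8,1\n8,8,1\n9,9,1\n")

data = np.genfromtxt("creditscoredata_demo.txt", delimiter=",").T
x, y, z = data[0], data[1], data[2]
X = np.column_stack((x, y))

clf = SVC(kernel="linear").fit(X, z)

# Training data: blue o's for approved, red x's for not approved
plt.scatter(x[where(z == 1)], y[where(z == 1)], marker="o", c="b")
plt.scatter(x[where(z == 0)], y[where(z == 0)], marker="x", c="r")

# Candidate: green if the prediction is 1 (approved), red otherwise
cand = np.array([[4, 7]])
P = clf.predict(cand)[0]
plt.scatter(cand[:, 0], cand[:, 1], s=100, c="g" if P == 1 else "r")
plt.savefig("credit_scoring.png")  # use plt.show() instead when running in IDLE
```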
