Machine Learning Applications.

Now that we have learned about data handling and manipulation in Python, let’s turn to the application of machine learning principles and the methods these applications employ.

 

We will focus on two methods – Regression and Classification

 

Then we’ll look at three different application scenarios – classifying words and images, and credit scoring.

 

1.    Methods

·      Regression

·      Classification

 

2.    Applications

·      Credit Scoring

·      Classifying Images

·      Classifying Words

Regression.

A regression line is simply a single line that best fits the data.

 

Linear regression is a method for predicting values by fitting a straight line to existing data.

 

Let’s suppose that we have a set of prices for the past 5 months, and we want to estimate what the price will be this month, i.e. month 6.

 

We can use linear regression to make this prediction.

 

Using arrays, we can first put our month numbers on the x axis, then the prices on the y axis, and display them in a time-series graph.

 

From this we can use the data as a basis for inferring a trend. To infer a trend, we draw a line that fits the data we have.

Figure 37. Linear regression


Let’s now explore how we can draw a regression line. We start with some datapoints then draw a horizontal line.

Figure 38. Datapoints

If we measure the gap between the datapoints and the line we’ve drawn, we can see that most points are far from the line, and the distances get progressively longer.

Figure 39. Distance between datapoints and X axis

Clearly this isn’t going to work as a prediction tool.

 

However, let’s add up all the distances between the points and the line and call this sum ‘s’. Now, let’s try to make the distances between the points and the line as small as possible, by drawing a line as close as we can to all the data points.

 

Finally, we add up all the distances between the points and the new line and see whether the total is greater or less than our previous sum ‘s’.

 

Clearly, the closer our line gets to all the points, the smaller ‘s’ will be.

 

So, the goal of drawing a linear prediction line is to make the sum of the distances between the points and the regression line as small as possible.
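The comparison above can be sketched in a few lines of Python. The data points here are illustrative assumptions, not the course data: we total the vertical distances from the points to a flat horizontal line and to a sloped line, and the sloped line gives the smaller sum.

```python
# Illustrative data points (x, y) - assumed for this sketch, not the course data
points = [(1, 2), (2, 4), (3, 5), (4, 8), (5, 9)]

def total_distance(m, c, points):
    """Sum of vertical distances between each point and the line y = m*x + c."""
    return sum(abs(y - (m * x + c)) for x, y in points)

flat = total_distance(0, 2, points)    # horizontal line y = 2
sloped = total_distance(2, 0, points)  # sloped line y = 2x

print(flat, sloped)  # the sloped line's total is much smaller
```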

 

Now, a simple linear regression line could be drawn on a graph with a pencil and ruler. More complex graphs can be produced in Excel.

 

But what happens when you have massive numbers of datapoints to process, or you want to incorporate the method into a larger software process or solution? This is where programming a solution makes sense.

 

In Python, we can use the “least squares” method by calling the Python ‘polyfit’ and ‘polyval’ functions. The goal of the algorithm is to minimise the sum of the squared vertical gaps between the sample points and the line. The best-fit line is the one for which this sum is at its lowest.

 

Conceptually, a ‘least squares’ algorithm keeps drawing and redrawing lines until it arrives at the fit with the lowest sum of squared distances.

Figure 40. Datapoints with least squared algorithm applied
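That redraw-and-compare idea can be sketched by brute force (again with assumed illustrative data): scan a grid of candidate slopes and intercepts and keep the pair with the smallest sum of squared distances. This is only to show the principle – polyfit computes the best fit directly rather than searching.

```python
# Brute-force sketch of the redraw-and-compare idea (data assumed for illustration)
points = [(1, 2), (2, 4), (3, 5), (4, 8), (5, 9)]

def squared_error(m, c):
    """Sum of squared vertical distances to the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in points)

# Try every slope 0.00..4.00 and intercept -2.00..2.00 in steps of 0.01
best = min(
    ((m / 100, c / 100) for m in range(0, 401) for c in range(-200, 201)),
    key=lambda mc: squared_error(*mc),
)
print(best)  # (slope, intercept) of the best-fitting candidate line
```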

So, let’s now put the theory into action. Let’s start with some data, plug it in, and apply the method.

In the diagram below, first we have an array with the numbers 1-11 on the x axis, representing 11 months.

 

Then we have a set of 11 numbers in the y axis that represent price data for each of the 11 months along the x axis.

Figure 41. Datapoints for 11 months

With these numbers in an array, we can use Python’s ‘polyfit’ function to find the line that has the lowest sum of squared distances between the data points and the line.

 

This line can be seen in blue in the diagram below.

Figure 42. Datapoints with polyfit function applied

So the first part of the puzzle is solved – we have a line to work from. We now need to extrapolate this line.

 

We can do this using the formula ‘y = mx + c’.

 

Here,

·      c is the ‘y intercept’, i.e. the value of y when x=0.

·      m is the slope of the line - the change in y over 1 unit of x.

 

E.g. a line rising at 45 degrees from x=0, y=0 to x=1, y=1 has a slope of 1, whereas a line with a slope of 0.5 rises at roughly 26.6 degrees (the slope is the tangent of the line’s angle).
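The slope-to-angle relationship can be checked with Python’s built-in math module: the slope of a line is the tangent of its angle.

```python
import math

print(math.tan(math.radians(45)))    # slope of a 45-degree line, ~1.0
print(math.degrees(math.atan(1.0)))  # angle for slope 1, ~45 degrees
print(math.degrees(math.atan(0.5)))  # angle for slope 0.5, ~26.6 degrees
```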

 

If we know the slope and starting point of the line on the y axis, we can now plot a point on the y axis for any point on the x axis.

 

If we are dealing with time, the x axis is just the measure of time we are using – in this case months - incrementing in single units.

 

So, here, m is 5.58 and c is 4.5.

 

But we only have 11 months, so we need to extrapolate the line by another month. To do this we introduce two new values, x2 and y2.

 

We don’t know what y2 is yet, but we know that x runs to 11 months, so adding one month on makes it 12 months. x2 is now 12.

 

To find y2, then, we multiply the slope m (which is 5.58) by x2 (which is 12) and add c (which is 4.5).

This gives us a value of approximately 71.

 

So, given the data we have got, we can anticipate that the price in month 12 is approximately 71.

 

Following this argument in an algorithm we have –

 

Data -

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

y = [11, 12, 25, 21, 31, 40, 48, 55, 54, 60, 61]

 

Apply ‘polyfit’ in the code to get a best-fit line

Extrapolate the line using y = mx + c 

where c is the value of y at x=0,

and 

m is the slope (change in y for each increment of x)

Call the extrapolated point x2 and y2

Add 1 (month) to x: 11+1 = 12, so x2=12

y2 = m*x2 + c

 

Plug in the values

 

y2 = 5.58*12 + 4.5 = 71.46 ≈ 71

 

Therefore the price in month 12 is predicted to be approximately 71
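The same numbers can be checked without pylab at all. For a straight line, least squares has a closed-form solution, which is what polyfit computes for degree 1. Plugging in the price data above:

```python
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
y = [11, 12, 25, 21, 31, 40, 48, 55, 54, 60, 61]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least-squares slope and intercept for y = m*x + c
m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
c = mean_y - m * mean_x

# Extrapolate one month beyond the data
x2 = 12
y2 = m * x2 + c
print(round(m, 2), round(c, 2), round(y2, 1))
```

This reproduces the course’s m ≈ 5.58 and c ≈ 4.5, and predicts a price of roughly 71.5 for month 12.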

 

Let’s now see how this works in Python code. 

Open the folder 5.1 Regression

 

Open and run 1. Simple_predict.py 

 

Figure 44. 12th month value predicted

Exploring the code line by line:

 

First, import all libraries from pylab (which includes 'polyfit')

from pylab import *

 

Add the data

x = [1,2,3,4,5,6,7,8,9,10,11]   

y = [11,12,25,21,31,40,48,55,54,60,61]   

 

Define x and y as a scatter plot

 

scatter(x, y)

 

Call the polyfit function, which takes the two variables and a degree as input. Here, the degree is 1 for a linear function. The results go into the two variables m (the slope) and c (the value of y at x=0) in the equation y = mx + c.

 

(m,c)=polyfit(x,y,1) 

 

Tell it what to print

 

print("Slope (m):", m)

print("y-intercept (c):", c)

 

Use 'polyval' to draw the line, calculating a y value for every x data point from the newly calculated slope (m) and intercept (c) values.

 

yp=polyval([m,c],x) 

 

Once we have our line defined, we now need to set up two new variables to extrapolate from it – x2 and y2.

 

To this point, x runs to 11 months, so we need to add one month on, which makes it 12 months. So x2 is now given as 12.

 

We calculate y2 by re-applying the line formula y = mx + c, this time plugging in x2.

 

x2 = 12

y2 = m*x2 + c

 

The rest of the code handles plotting, laying out the graph, labelling the grid, and printing key data.

Putting it all together.

Figure 45. Polyfit linear regression algorithm code analysis
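The non-plotting parts of the script can be sketched with a plain numpy import instead of pylab’s star import (assuming numpy is installed; the matplotlib plotting calls are left out so the sketch stays self-contained):

```python
import numpy as np

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
y = [11, 12, 25, 21, 31, 40, 48, 55, 54, 60, 61]

# Fit a degree-1 polynomial: returns slope m and intercept c
m, c = np.polyfit(x, y, 1)
print("Slope (m):", m)
print("y-intercept (c):", c)

# y value on the fitted line for every existing x (used to draw the line)
yp = np.polyval([m, c], x)

# Extrapolate one month ahead
x2 = 12
y2 = m * x2 + c
print("Predicted price for month 12:", round(y2, 1))
```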


Result.

The red dot in the top right-hand corner shows the price at approximately 71 for month 12.

Figure 46. Linear regression prediction output

Your Task.

Analyse the code in 5.1 Regression, 1. Simple_predict.py and try different values for y. 

 

Next, we'll take a look at Classification.
