Learning from data

There are two main ways in which machine learning can learn from data. The first is “Supervised”. We took a supervised learning approach to the facial recognition scenario earlier. Here the machine learns from labelled training examples: the desired output is known in advance, and is based on labelled features, arranged in columns, in the dataset.

 

The main types of supervised learning are:

 

Classification Analysis – this is used for tasks such as image analysis, fraud detection, and spam filtering. Classification algorithms are used for predicting responses that can take only a few known values, such as ‘married’, ‘single’, or ‘divorced’.
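As a minimal sketch (using scikit-learn, with a synthetic dataset standing in for real labelled data), a classifier learns from labelled examples and then predicts the class of unseen records:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labelled dataset with three known classes
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)                      # learn from labelled examples
print("accuracy:", clf.score(X_test, y_test))  # predict the known-value classes
```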

 

Regression Analysis – this is used to define the dependency between variables, and can, for example, predict prices, or forecast how people or systems will behave. Regression algorithms can predict one or more continuous variables, such as profit or loss.
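A comparable sketch for regression (again with synthetic data standing in for real observations) predicts a continuous target:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data with a continuous target, such as a price
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", reg.score(X_test, y_test))
```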

 

The other learning method is “Unsupervised”.

 

Unsupervised learning infers hidden structures from "unlabelled" data, without any known outputs to train against.

 

For example, Cluster Analysis is used to find hidden patterns or groupings within data, such as clusters of customers with similar buying behaviour. These patterns can, for example, reveal relationships between the following (a code sketch follows the list):

 

• Socioeconomic tiers
• Psychographics
• Social networks
• Purchasing patterns
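As a minimal sketch (k-means from scikit-learn, with synthetic points standing in for customer data), clustering groups unlabelled records by similarity:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for unlabelled customer data (e.g. spend vs frequency)
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

# k-means discovers four groupings without ever seeing a label
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster assigned to the first ten points
```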

 

The opposite of finding patterns is finding items that don’t fit the patterns in a dataset. This is called Anomaly Detection – another use of unsupervised learning. Uses of anomaly detection include spotting structural defects, medical problems, or errors in a text.
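A minimal anomaly detection sketch, assuming an Isolation Forest (one of several possible algorithms) and synthetic data with a few injected outliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 2))          # mostly "normal" points
outliers = rng.uniform(-6, 6, size=(5, 2))  # a few injected anomalies
X = np.vstack([normal, outliers])

# Isolation Forest flags the rare points that don't fit the overall pattern
iso = IsolationForest(contamination=0.03, random_state=0).fit(X)
pred = iso.predict(X)                       # -1 = anomaly, 1 = normal
print("anomalies found:", int((pred == -1).sum()))
```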

Figure 24. Four main kinds of machine learning models

Let’s explore a practical example.

 

What method(s) could be used for fraud detection?

 

Answer: fraud detection could use classification, clustering, or anomaly detection.

 

If we have fully labelled transaction data and a known outcome against each transaction, we could classify transactions as fraudulent or otherwise.

 

If our datasets are extremely large and less clearly labelled, we could use clustering to identify large groups of transaction types. And if fraud shows up as a relative rarity in the data, we could use anomaly detection.

Preparing Data.

Data preparation typically takes 60 to 80 percent of the effort in the analytical pipeline of a machine learning project. Tasks include selecting, preparing, cleaning, and validating data.

 

A key question is: how do we get data into the machine learning environment?

 

There are three principal methods of getting data into a machine learning environment:

• Cloud
• Live data feeds
• Data files from PCs or network drives – which is what we will focus on here.

Figure 25. Different ways of getting data into a machine learning system

As the figure below shows, prepared data is stored as a table. If you’re lucky, the raw data you’re working with is already stored in a table, as in a relational database system. If you’re not, the raw data might be stored in several different ways. Transforming this raw data into prepared data means reading the raw data from multiple data sources (step 1), then running it through various data pre-processing modules to create prepared data.
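As an illustrative sketch (with pandas, and two small in-memory stand-ins for real data sources), reading from multiple raw sources and combining them into one table might look like this:

```python
import io
import pandas as pd

# Hypothetical raw sources: a CSV extract and a JSON feed (in-memory here)
csv_raw = io.StringIO("customer_id,amount\n1,9.99\n2,42.00\n")
json_raw = io.StringIO('[{"customer_id": 3, "amount": 7.50}]')

# Step 1: read each raw source, then combine into a single prepared table
frames = [pd.read_csv(csv_raw), pd.read_json(json_raw)]
prepared = pd.concat(frames, ignore_index=True)
print(prepared)
```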

Figure 26. All data should end up in a table

Data Processing Steps.

Identify The Problem.

You first need to establish what problem you are trying to solve, and whether to approach it with Classification, Regression, Clustering, Anomaly Detection, or another method.

Mechanise Data Collection.

Establish a strategy to gather data efficiently from the different sources that you need to use. Internal, third-party, and open-source data need to be brought together into a meaningful structure. This may involve 'mechanising' data collection through Robotic Process Automation.

Data Wrangling.

Another important task is “data wrangling” – the process of transforming data from one format into another. For example, you may need to simplify multiple formats of database and spreadsheet data files by “munging” data from different file types together into, say, .CSV (comma-separated values) format.
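A minimal wrangling sketch, assuming a hypothetical spreadsheet named transactions.xlsx that we want in .CSV format:

```python
import pandas as pd

# Hypothetical input file; reading .xlsx requires the openpyxl package
df = pd.read_excel("transactions.xlsx")

# Write the same table back out in the common .CSV format
df.to_csv("transactions.csv", index=False)
```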

Standardise.

If data comes in from multiple sources, make sure that all the features are labelled consistently. For example, currency values or dates can be represented in many ways, so standardise how these will be represented in the data.
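For instance, a sketch of standardising dates that arrive in mixed formats (format="mixed" needs pandas 2.0 or later):

```python
import pandas as pd

# The same date, arriving in three different formats (assumed example)
raw = pd.Series(["2023-01-31", "31/01/2023", "Jan 31, 2023"])

# Parse everything into one canonical datetime representation
standardised = pd.to_datetime(raw, format="mixed", dayfirst=True)
print(standardised)
```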

Reduce.

Just because big data is available, you don’t necessarily need to use it all. Choose the attributes that are going to add the most value to your solution, and weed out the features and records that add the least value.
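As a small sketch, reducing a hypothetical table by dropping a low-value column with pandas:

```python
import pandas as pd

# Hypothetical table; suppose internal_ref adds little predictive value
df = pd.DataFrame({"amount": [10.0, 20.0],
                   "merchant": ["grocer", "airline"],
                   "internal_ref": ["x1", "x2"]})

reduced = df.drop(columns=["internal_ref"])   # weed out the low-value feature
print(reduced)
```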

Clean.

It’s very important to fill in the blanks where there is missing data. Assumed or approximated values are “more right” for an algorithm than just missing ones. This could involve substituting missing values with mean numbers, most frequently occurring items, zeros, or ‘n/a’.
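A sketch of such substitutions with pandas, using a tiny hypothetical table with gaps:

```python
import pandas as pd

# Hypothetical table with missing values in numeric and categorical columns
df = pd.DataFrame({"age": [34, None, 29],
                   "status": ["married", None, "single"]})

df["age"] = df["age"].fillna(df["age"].mean())              # mean number
df["status"] = df["status"].fillna(df["status"].mode()[0])  # most frequent item
print(df)
```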

Decompose.

At this stage you may need to add new attributes. For example, you may want to find out what happens in a particular month of the year, in which case you’d call out the month, extracted from the date, as an additional feature.
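A sketch of decomposing a date column into an additional month feature with pandas:

```python
import pandas as pd

# Hypothetical table with a single date column
df = pd.DataFrame({"date": pd.to_datetime(["2023-01-15", "2023-06-30"])})

df["month"] = df["date"].dt.month   # call out the month as its own feature
print(df)
```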

Normalise.

To prevent a single feature skewing the output prediction, you will need to normalise the numbers in your dataset. For example, currency rates versus the dollar can vary wildly in scale. Here you could rescale each feature into a fixed range based on its minimum and maximum values (min-max scaling).
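A minimal min-max scaling sketch in plain NumPy (scikit-learn's MinMaxScaler does the same job):

```python
import numpy as np

# A feature whose raw values vary wildly in scale
x = np.array([120.0, 85.0, 310.0, 42.0])

# Min-max scaling: map the values into the range [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)
```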

Split, Train and Evaluate.

When your data is ready to be used by the machine learning system, it needs to be split into training and evaluation subsets, usually with a ratio of 60-80 percent for training and 20-40 percent for evaluation.

The machine learning system uses the training data to train models to recognise patterns, and uses the evaluation data to assess the accuracy of the trained model.
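A sketch of such a split with scikit-learn, holding out 30 percent (within the 20 to 40 percent range above) for evaluation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared, labelled dataset
X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 30 percent of the rows for evaluating the trained model
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(len(X_train), "training rows,", len(X_eval), "evaluation rows")
```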

Figure 27. Stages in data preparation
