Data Formats

In theory, any digital data can be used in machine learning.

 

We are going to focus on Python-based machine learning and the file formats that can be used with it.

 

When choosing data formats to work with, you need to ask the following questions:

 

• What tools will you be using to work on your data?

• Will your data structure change over time?

• How important is file format “splittability”?

• Does block compression matter?

• How big are your files?

File formats for Python-based machine learning.

Figure 28. File formats that can be used with Python

As the figure shows, Python can handle a wide range of data formats, which makes it well suited to machine learning.

 

Let’s now look at some of the main data formats in more depth.

CSV - Tables.

‘CSV’ (comma-separated values) is a simple file format used to store tabular data.

 

Files in the CSV format can be imported into and exported from programs that store data in tables, arrays and DataFrames.
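As a minimal sketch of what this looks like in Python, the example below reads a small piece of CSV text with the standard library's csv module. The column names and values are made up for illustration, and the text is held in a string rather than a file so that the example runs on its own.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file on disk.
csv_text = "sample,height_cm,weight_kg\nA,165,60\nB,170,65\n"

# csv.reader splits each line on commas and yields one list of strings per row.
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]    # the column names
records = rows[1:]  # the data rows (values are still strings at this point)

print(header)
print(records)
```

Note that every value is read as a string, so numeric columns need to be converted before they can be used in calculations.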

Figure 29. Data in a CSV file

.txt - Text.

Text files are identified with the .txt extension. A text file is a computer file that is structured as a sequence of lines of electronic text. Text files contain only text and have no special formatting such as bold text, italic text or images.

 

The ASCII character set is a long-established encoding for English-language text files, and is often assumed to be the default in many situations. Another encoding is UTF-8, which is backward compatible with ASCII, so every ASCII text file is also a valid UTF-8 text file.
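The relationship between ASCII and UTF-8 can be checked with a short sketch using Python's built-in string encoding. The sample strings are chosen purely for illustration.

```python
text = "Hello, data!"

# Encode the same plain English text with both encodings.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For text that only uses ASCII characters, the bytes are identical.
same_bytes = ascii_bytes == utf8_bytes

# So bytes written as ASCII can be read back as UTF-8 without any change.
decoded = ascii_bytes.decode("utf-8")

# The reverse is not always true: an accented character is not in ASCII.
accented = "café"
try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 stores the accented letter using two bytes instead of one.
utf8_length = len(accented.encode("utf-8"))

print(same_bytes, decoded, ascii_ok, utf8_length)
```

This is why opening a file with encoding="utf-8" is usually a safe choice for both plain ASCII files and files containing other characters.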

Figure 30. Data in a .txt file

Serial.

Sensors generate huge amounts of data, and one of the most common ways of receiving that data on a computer is over a serial connection.

 

Serial communication is the process of sending data one bit at a time, sequentially, over a communication channel.

 

Serial data from sensors can arrive at a computer live via wireless or cable, or it can be stored locally first, then transferred.

 

In its raw state, serial data appears as a stream of numbers, often mixed with labels or tags.

Figure 31. Serial data
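To give a feel for how a stream of labelled numbers might be turned into something a program can use, here is a small sketch. The line format (label, colon, value, separated by commas) and the labels TEMP and HUM are assumptions made for this example; real sensors each use their own format. Reading directly from a serial port needs an extra library and is not shown here.

```python
# Hypothetical lines of serial data, as they might arrive from a sensor.
raw_lines = [
    "TEMP:21.5,HUM:40.2\n",
    "TEMP:21.7,HUM:39.8\n",
]


def parse_line(line):
    """Turn one 'LABEL:value,LABEL:value' line into a dictionary of floats."""
    reading = {}
    for field in line.strip().split(","):
        label, value = field.split(":")
        reading[label] = float(value)
    return reading


# Parse every line so each reading becomes a dictionary keyed by label.
readings = [parse_line(line) for line in raw_lines]

print(readings)
```

Once the labels are separated from the numbers, the values can be stored in a list or table like any other data.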

Figure 32. Data describing an image in a .png file

Import Data Into Python.

Let’s demonstrate how we can bring data into a program by loading a simple text file into a ‘list’ in Python. Whilst what we see here is not a machine learning program as such, the principle is the same, and the code we see here could easily be incorporated into a larger ML program.

Practical Work.

Figure 33. Loading Data video

Open the folder ‘4. Loading Data’.

 

Note that there are 4 files here (close and reopen if it doesn’t show the files first time).

 

Open and run ‘1. Load text.py’. This will read and print the data in the file called ‘textfile.txt’.

Figure 34. Loading data into a Python algorithm
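If you don't have the course files to hand, the sketch below shows one way of loading a text file into a list. It is not the course script itself. It first writes a small sample file (with made-up contents) to a temporary folder so that the example runs on its own, then reads it back line by line.

```python
import tempfile
from pathlib import Path

# Create a small sample file so this sketch is self-contained.
# The file name matches the exercise, but the contents are made up.
path = Path(tempfile.mkdtemp()) / "textfile.txt"
path.write_text("apple\nbanana\ncherry\n", encoding="utf-8")

# Read the file line by line into a list, removing the trailing newlines.
with open(path, encoding="utf-8") as f:
    lines = [line.strip() for line in f]

print(lines)
```

Each line of the file becomes one string in the list, ready to be passed on to the rest of a program.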

Now let’s do the same for numbers.

 

Open ‘2. Load numbers.py’. This will read and print the data in the file called ‘numbersfile.csv’.

Again, whilst this is not a machine learning program per se, the code that we use to load numbers from a CSV file into a machine learning program is the same.

Figure 35. Printing the contents of a data file
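The same idea works for numbers. The sketch below is not the course script, just one possible approach using the standard csv module. It assumes a CSV file with no header row that contains only numbers, and it creates a small sample file with made-up values so that it runs on its own.

```python
import csv
import tempfile
from pathlib import Path

# Create a small sample file so this sketch is self-contained.
# The file name matches the exercise, but the values are made up.
path = Path(tempfile.mkdtemp()) / "numbersfile.csv"
path.write_text("1,2,3\n4.5,5.5,6.5\n", encoding="utf-8")

# Read each row and convert every value from a string to a float.
numbers = []
with open(path, newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        numbers.append([float(value) for value in row])

print(numbers)
```

The result is a list of rows, each holding floating-point numbers, which is a common starting point before handing data to a machine learning library.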
