Classifying Words
If we can bring images into the machine learning environment, we can do the same for other forms of data.
Sound, for instance, can be converted into a digital profile (a sequence of sampled values), which can in turn be stored in an array and manipulated.
Figure 78. Converting sound to numbers
In this way, sounds can be treated much like images: sound recognition works in a very similar way to the image recognition we have just explored.
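As a rough illustration of sound becoming a list of numbers, the sketch below samples a pure tone at regular intervals. The sample rate, frequency, and number of samples are assumed values chosen for the example, not taken from the figure.

```python
import math

sample_rate = 8000   # samples per second (assumed value)
frequency = 440.0    # tone frequency in Hz (assumed value)
num_samples = 80     # 10 milliseconds of sound at 8000 samples per second

# Each sample is the height of the sound wave at one instant in time
samples = [
    math.sin(2 * math.pi * frequency * n / sample_rate)
    for n in range(num_samples)
]

print(len(samples))                         # how many numbers describe the sound
print([round(s, 3) for s in samples[:5]])   # the first few values
```

Once the sound is a list of numbers like this, it can be reshaped, normalised, and passed to a model in much the same way as the pixel values of an image.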
It is also worth noting that, once in digital format, data can be converted into binary:
Figure 79. Converting sound values into binary. Image source: Wikipedia
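A minimal sketch of this conversion, assuming sound values in the range -1.0 to 1.0 and an 8-bit scale (both are assumptions for illustration, and the sample values are made up rather than read from the figure):

```python
# Hypothetical sound values between -1.0 and 1.0
samples = [-1.0, -0.5, 0.5, 1.0]

def quantize_8bit(value):
    """Map a value in [-1.0, 1.0] to a whole number in [0, 255]."""
    return round((value + 1.0) / 2.0 * 255)

levels = [quantize_8bit(s) for s in samples]
binary = [format(level, "08b") for level in levels]

print(levels)   # [0, 64, 191, 255]
print(binary)   # ['00000000', '01000000', '10111111', '11111111']
```

Real audio formats typically use more bits per sample, but the principle is the same: each measured value becomes a fixed-length pattern of 0s and 1s.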
So naturally, we can do the same kind of digital processing for written words.
One method for turning words into data that can be used by machine learning is called One Hot Encoding.
This method takes each word and converts it into a binary sequence in which a single 1 marks the word and every other position is 0, and then puts those sequences into an array.
Once the representation of each word is in an array, a full range of machine learning methods can be applied to it.
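To make the idea concrete before turning to a library, here is a small hand-written sketch of one-hot encoding. The word list and the helper name `one_hot` are illustrative choices, not part of any library.

```python
words = ["to", "be", "or", "not", "to", "be"]

# Give every distinct word its own position
vocabulary = sorted(set(words))                 # ['be', 'not', 'or', 'to']
index_of = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a list of 0s with a single 1 at the word's position."""
    vector = [0] * len(vocabulary)
    vector[index_of[word]] = 1
    return vector

encoded = [one_hot(word) for word in words]
for word, vector in zip(words, encoded):
    print(word, vector)
# to [0, 0, 0, 1]
# be [1, 0, 0, 0]
# or [0, 0, 1, 0]
# not [0, 1, 0, 0]
# to [0, 0, 0, 1]
# be [1, 0, 0, 0]
```

Each row has exactly one 1, which is where the name "one hot" comes from, and repeated words always produce the same row.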
This method can be applied as part of ‘Text Mining’ processes, where it supports tasks such as studying word frequency distributions, sentiment analysis, tagging/annotation, and information extraction.
Figure 80. 'One Hot Encoding' code
Figure 81. Words encoded into a binary sequence, from which an analysis can be performed
Let's examine the code:
The first three lines import the two encoders from the library and create one of each: a label encoder and a one-hot (binary) encoder
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
lbl = LabelEncoder()
enc = OneHotEncoder()
These are the words that we want to encode into binary
qualitative = ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
This gives each distinct word an integer label and sets the labels up as an array with 10 rows and 1 column
labels = lbl.fit_transform(qualitative).reshape(10,1)
This one-hot encodes the labels and outputs the resulting array
print(enc.fit_transform(labels).toarray())
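The printed array is easier to read once you know which word each column stands for. The sketch below repeats the walkthrough as one runnable script and adds a few inspection steps; the variable names `one_hot` and `decoded` are additions for illustration, and `reshape(-1, 1)` is used in place of `reshape(10, 1)` so the row count is inferred automatically.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

qualitative = ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']

lbl = LabelEncoder()
enc = OneHotEncoder()

# One integer label per word, arranged as a single column
labels = lbl.fit_transform(qualitative).reshape(-1, 1)
one_hot = enc.fit_transform(labels).toarray()

print(lbl.classes_)    # the word behind each column, in sorted order
print(one_hot.shape)   # (number of words, number of distinct words)

# Recover the original words from the position of the 1 in each row
decoded = lbl.inverse_transform(one_hot.argmax(axis=1))
print(list(decoded))
```

Note that the encoder is case-sensitive, so 'To' and 'to' receive separate columns and the array has 9 columns rather than 8. Converting the words to lower case first would merge them.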
Once we have words converted into values in an array, we can perform a range of machine learning tasks such as frequency analysis, feature extraction, capturing meaning through 'vectors', and combination analysis. The resulting AI applications could include, for example, translation, information retrieval, sentiment analysis, information extraction, and question answering.
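As one small example of such a task, frequency analysis falls out of the one-hot array almost for free: adding up each column counts how often that word appears. The sketch below builds the array by hand so that it runs without any libraries; lower-casing the words first is an assumption made here so that 'To' and 'to' count as the same word.

```python
from collections import Counter

words = ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
words = [w.lower() for w in words]   # assumption: ignore capitalisation

vocabulary = sorted(set(words))
one_hot = [[1 if w == v else 0 for v in vocabulary] for w in words]

# Summing a column counts the rows in which that word is "hot"
frequencies = {
    v: sum(row[i] for row in one_hot)
    for i, v in enumerate(vocabulary)
}

print(frequencies)
# The column sums agree with a direct count of the words
print(frequencies == dict(Counter(words)))
```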