Looking at my posts so far, it seems that a little “What is … ?” series is emerging (“What is AGI?”, “What are conceptual spaces?”). Today I’d like to add another post to this series – this time about the term “machine learning” and about three different types of machine learning algorithms one can distinguish.
As discussed earlier, “good old-fashioned AI” is based on manually writing rules and having some sort of inference system that applies these rules in a given situation. Machine learning, in contrast, is about discovering such rules from a (usually quite large) number of examples.
One can distinguish three types of machine learning: supervised, unsupervised and semi-supervised.
In supervised machine learning, we have a so-called “labeled” data set, i.e., a category label is attached to each example. For instance, if you wanted to distinguish cat pictures from dog pictures, you could create a data set of images and annotate each image with its corresponding category: all cat pictures would get the label “cat”, and all dog pictures the label “dog”. Whenever we want to learn a rule for such a distinction, we talk about a “classification” task: finding some rule or function that takes an example as input and returns its category as output. In our example, we would look for a function that takes a picture as input and returns either “cat” or “dog” as output. There are many supervised machine learning algorithms, ranging from “k nearest neighbors” (where each example is classified by looking at the k most similar known examples) through “decision trees” (where category membership is determined by a sequence of yes/no questions) to “artificial neural networks” (which are inspired by networks of biological neurons).

In order to evaluate a supervised machine learning algorithm, one can use a second labeled data set, the so-called “test set”: the algorithm is “trained” on the first data set (i.e., the examples from the first data set are used to infer the classification function mentioned above) and is then tested on the second one. This testing is done by feeding each example to the algorithm and comparing its response with the example’s actual label. In our example, we would present the algorithm with additional pictures of cats and dogs and check whether it outputs “cat” for the cat pictures and “dog” for the dog pictures. We can then compare two supervised machine learning algorithms by counting how many images in the test set each of them classified correctly – the more, the better, of course.
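To make this a bit more concrete, here is a minimal sketch of “k nearest neighbors” in plain Python. The 2-D feature vectors and their values are made up purely for illustration – real cat and dog images would first have to be turned into numeric feature vectors somehow:

```python
from collections import Counter
import math

# Toy labeled data: made-up 2-D feature vectors standing in for images.
train = [((1.0, 1.2), "cat"), ((0.8, 1.0), "cat"), ((1.1, 0.9), "cat"),
         ((3.0, 3.2), "dog"), ((3.1, 2.9), "dog"), ((2.8, 3.0), "dog")]
# A separate labeled "test set" used only for evaluation.
test = [((0.9, 1.1), "cat"), ((3.0, 3.1), "dog")]

def knn_predict(x, train, k=3):
    """Return the majority label among the k nearest training examples."""
    nearest = sorted(train, key=lambda ex: math.dist(x, ex[0]))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Evaluation: fraction of test examples whose predicted label matches
# the actual label.
correct = sum(knn_predict(x, train) == y for x, y in test)
accuracy = correct / len(test)
print(accuracy)  # 1.0 on this tiny, well-separated toy data
```

The parameter k is a design choice: larger values make the prediction smoother but can blur the boundary between categories.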
In unsupervised machine learning, on the other hand, we deal with an “unlabeled” data set, i.e., a data set without any category labels. The main goal of clustering algorithms (the most common type of unsupervised machine learning algorithm) is to discover groups of similar examples in the data set. These groups are called “clusters”, hence the name “clustering”. One could say that in clustering we try to find the categories that, in supervised machine learning, would already be given as labels. Evaluating clustering algorithms is a bit more difficult – often, there are many possible groupings that are meaningful. If we took the cats-and-dogs data set from above, removed all the labels and handed it to a clustering algorithm, we would probably expect it to find exactly two clusters – one for cats and one for dogs. However, it might also be fine if it ended up with three clusters, e.g., one for cats, another one for big dogs (like Golden Retrievers) and yet another one for small dogs (like Chihuahuas). It is also a bit difficult to compare two different clustering algorithms: if, in the above case, one algorithm creates two clusters and the other one three, which of them should be considered better?
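As an illustration, here is a bare-bones version of the classic k-means clustering algorithm in plain Python, again with made-up 2-D points instead of real images. Note that we have to tell it the number of clusters k up front – which is exactly the kind of choice the evaluation problem above is about:

```python
import math
import random

# Unlabeled toy data: two well-separated groups of made-up 2-D points.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
          (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]

def kmeans(points, k, iterations=10, seed=0):
    """Plain k-means: alternate between assigning each point to its
    nearest centroid and recomputing each centroid as the mean of
    the points assigned to it."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize with k random points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster
            else centroids[i]  # keep old centroid if its cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return clusters

clusters = kmeans(points, k=2)
print([len(c) for c in clusters])  # two groups of three similar points each
```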
Finally, semi-supervised machine learning lies between supervised and unsupervised machine learning: labels exist for some examples, but not for others. On the one hand, there is semi-supervised classification, where we face a classification task and want to use additional unlabeled data in order to improve our performance. In our example, we would still be interested in automatically distinguishing cats from dogs, but in addition to our labeled data set we might want to use some additional pictures that do not have any labels attached.
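One very simple way to exploit such unlabeled data is “self-training”: use a classifier trained on the labeled examples to give pseudo-labels to the unlabeled ones. The sketch below (plain Python, made-up points, with nearest-labeled-neighbor as a stand-in classifier) shows one round of this idea:

```python
import math

# A few labeled examples plus several unlabeled ones (label None),
# mirroring the semi-supervised setting: made-up 2-D points.
data = [((1.0, 1.2), "cat"), ((3.0, 3.2), "dog"),   # labeled
        ((0.8, 1.0), None), ((1.1, 0.9), None),     # unlabeled
        ((3.1, 2.9), None), ((2.8, 3.0), None)]

def self_train(data):
    """One round of self-training: give each unlabeled point the label
    of its nearest labeled neighbor (a very simple pseudo-labeling rule)."""
    labeled = [(x, y) for x, y in data if y is not None]
    result = []
    for x, y in data:
        if y is None:
            _, y = min(labeled, key=lambda ex: math.dist(x, ex[0]))
        result.append((x, y))
    return result

augmented = self_train(data)
print([y for _, y in augmented])
# ['cat', 'dog', 'cat', 'cat', 'dog', 'dog']
```

The augmented, fully labeled data set could then be used to train an ordinary supervised classifier such as the kNN sketch above.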
On the other hand, there is also semi-supervised clustering, where we want to discover clusters in the data set and use the few available labels to improve our set of clusters. For instance, one could try to make sure that two data points with the same label always end up in the same cluster. Basically, both semi-supervised classification and semi-supervised clustering aim at the same thing: using both labeled and unlabeled examples in order to find a function that assigns examples to groups. However, they put a different emphasis on what they want to achieve, depending on whether they are closer to supervised or unsupervised machine learning.
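The constraint mentioned above – points with the same label must end up in the same cluster – is often called a “must-link” constraint. Here is a small helper (plain Python, made-up points) that checks whether a given clustering respects it:

```python
def satisfies_must_link(clusters, labeled_points):
    """Check that any two points sharing a label fall into the same
    cluster (the 'must-link' constraint derived from the labels)."""
    cluster_of = {p: i for i, cluster in enumerate(clusters) for p in cluster}
    clusters_per_label = {}
    for p, label in labeled_points:
        clusters_per_label.setdefault(label, set()).add(cluster_of[p])
    # Every label must map to exactly one cluster.
    return all(len(ids) == 1 for ids in clusters_per_label.values())

clusters = [[(1.0, 1.2), (0.8, 1.0)], [(3.0, 3.2), (3.1, 2.9)]]
labels = [((1.0, 1.2), "cat"), ((0.8, 1.0), "cat"), ((3.0, 3.2), "dog")]
print(satisfies_must_link(clusters, labels))  # True
```

A semi-supervised clustering algorithm could use such a check to reject (or repair) candidate clusterings that contradict the known labels.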
Of course, there is a lot more to say about machine learning than the distinction into the three types described above. Maybe I’ll dive a bit deeper into this topic in the future.