Applying Logic Tensor Networks (Part 2)

In my last LTN blog post, I introduced the overall setting of my experiment. Before I can report on first results, I need to describe how we can evaluate the performance of the classifiers in this multi-label classification setting. This is what I’m going to do today.

The learning problem we consider is multi-label classification. This means that every data point is assigned to at least one of several classes, and it may be assigned to more than one. When evaluating the performance on such a task, classical performance metrics like accuracy cannot be used right away: Accuracy assumes that a prediction is either correct or incorrect, but in multi-label classification a prediction can also be partially correct (for instance, when only one of two ground truth labels is predicted).

In order to illustrate how different metrics work, we’ll use a simplified version of our real setting where we assume that there are only four labels (namely, “Action”, “Comedy”, “Drama”, and “Fantasy”).

Metrics from the Literature

In his overview paper on multi-label classification [1], Sorower lists different performance metrics. From his paper, we adopt the following metrics (all of which are of course averaged across all data points):

One Error

The one error measures how often the highest-ranked prediction (i.e., the label to which the classifier assigned the highest confidence) is incorrect. We clearly want to avoid a large one error. However, a one error of zero does not necessarily tell us that our classifier performs perfectly: Let us assume that a given data point has two labels (e.g., “Action” and “Comedy”) and that our classifier outputs the following confidence values for this data point:

1) “Action”: 0.9238    “Comedy”: 0.1234    “Drama”: 0.5801    “Fantasy”: 0.0025

As the prediction with the highest confidence is correct, we have a one error of zero. However, the classifier ranks “Drama” higher than “Comedy”, which is a mistake that the one error does not capture.
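As a rough illustration, the one error for a single data point could be computed along the following lines. The function name and the representation of predictions as a dictionary from label to confidence are assumptions made for this sketch, not something prescribed by [1].

```python
def one_error(confidences, ground_truth):
    """Return 0 if the top-ranked label is a ground truth label, else 1.

    confidences: dict mapping label -> confidence (hypothetical representation)
    ground_truth: set of labels that are correct for this data point
    """
    # The label with the highest confidence is the top-ranked prediction.
    top_label = max(confidences, key=confidences.get)
    return 0 if top_label in ground_truth else 1


prediction_1 = {"Action": 0.9238, "Comedy": 0.1234, "Drama": 0.5801, "Fantasy": 0.0025}
ground_truth = {"Action", "Comedy"}

print(one_error(prediction_1, ground_truth))  # 0
```

Averaging this value over all data points gives the one error of the classifier on the whole data set.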

Coverage

The coverage computes the average position of the lowest-ranked ground truth label. It basically answers the question “How far do you need to go down the ordered list of predictions in order to find all ground truth labels?”. Clearly, a smaller value for the coverage is preferable. Let us again take our example data point with the labels “Action” and “Comedy” and the following two predictions:

1) “Action”: 0.9238    “Comedy”: 0.1234    “Drama”: 0.5801    “Fantasy”: 0.0025
2) “Action”: 0.3355    “Comedy”: 0.2486    “Drama”: 0.8824    “Fantasy”: 0.1870

For both predictions, “Comedy” is the ground truth label with the lowest position in the ranking (namely, position 3). In both cases, the coverage is therefore 3. However, the first prediction is preferable, because it puts “Action” on top of the list, whereas the second prediction has the highest confidence for “Drama” which is not in the ground truth labels. Therefore, coverage alone is also not enough for evaluating a multi-label classifier.
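The coverage of a single data point can be sketched as follows. This sketch uses 1-based positions so that it matches the numbers in the example above; some formulations in the literature subtract one from this value. The function name and the dictionary representation are assumptions made for illustration.

```python
def coverage(confidences, ground_truth):
    """Return the 1-based rank of the lowest-ranked ground truth label."""
    # Order labels from highest to lowest confidence.
    ranking = sorted(confidences, key=confidences.get, reverse=True)
    # The deepest position we must reach to have seen all ground truth labels.
    return max(ranking.index(label) + 1 for label in ground_truth)


prediction_1 = {"Action": 0.9238, "Comedy": 0.1234, "Drama": 0.5801, "Fantasy": 0.0025}
prediction_2 = {"Action": 0.3355, "Comedy": 0.2486, "Drama": 0.8824, "Fantasy": 0.1870}
ground_truth = {"Action", "Comedy"}

print(coverage(prediction_1, ground_truth))  # 3
print(coverage(prediction_2, ground_truth))  # 3
```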

Ranking Loss

The ranking loss measures the fraction of label pairs that are incorrectly ordered according to the classifier’s predictions. It answers the question “How often does a false label have a higher confidence than a label from the ground truth?”. The ranking loss always lies in the interval [0,1], where a ranking loss of zero indicates perfect performance.

Let us again consider the two predictions from above for the data point with ground truth labels “Action” and “Comedy”:

1) “Action”: 0.9238    “Comedy”: 0.1234    “Drama”: 0.5801    “Fantasy”: 0.0025
2) “Action”: 0.3355    “Comedy”: 0.2486    “Drama”: 0.8824    “Fantasy”: 0.1870

For the first prediction, there is one pair of labels that is incorrectly ordered: “Drama” has a higher confidence than “Comedy”, but it should be the other way around. All other pairs of labels are correctly ordered. One out of six pairs was incorrectly ordered, which leads to a ranking loss of 1/6.

For the second prediction, there are two pairs that are incorrectly ordered: “Action”-“Drama” and “Comedy”-“Drama”. Two out of six pairs give a ranking loss of 1/3. As you can see, ranking loss prefers prediction 1) over prediction 2) because it makes fewer mistakes.
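The counting in this example can be reproduced with a short sketch. It follows the normalization used above, i.e., it divides by the number of all label pairs (six for four labels). Note that this is an assumption taken from the worked example; other formulations of the ranking loss divide by the number of (ground truth label, false label) pairs instead, which would give different values. All names below are hypothetical.

```python
from itertools import product


def ranking_loss(confidences, ground_truth):
    """Fraction of all label pairs in which a false label outranks a true label."""
    true_labels = [l for l in confidences if l in ground_truth]
    false_labels = [l for l in confidences if l not in ground_truth]
    # Count (true, false) pairs where the false label has the higher confidence.
    wrong = sum(
        1
        for t, f in product(true_labels, false_labels)
        if confidences[f] > confidences[t]
    )
    # Normalize by the number of all unordered label pairs, as in the example.
    n = len(confidences)
    return wrong / (n * (n - 1) // 2)


prediction_1 = {"Action": 0.9238, "Comedy": 0.1234, "Drama": 0.5801, "Fantasy": 0.0025}
prediction_2 = {"Action": 0.3355, "Comedy": 0.2486, "Drama": 0.8824, "Fantasy": 0.1870}
ground_truth = {"Action", "Comedy"}

print(round(ranking_loss(prediction_1, ground_truth), 4))  # 0.1667
print(round(ranking_loss(prediction_2, ground_truth), 4))  # 0.3333
```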

Average Precision

The average precision looks at each ground truth label and answers the question “What fraction of the labels ranked at least as high as this label is in the ground truth?”. It gives a higher weight to labels toward the top of the ranking, thereby rewarding correct orderings at the top. Average precision always lies in the interval [0,1], where an average precision of one indicates perfect performance.

Let us again consider the two predictions from above for the data point with ground truth labels “Action” and “Comedy”:

1) “Action”: 0.9238    “Comedy”: 0.1234    “Drama”: 0.5801    “Fantasy”: 0.0025
2) “Action”: 0.3355    “Comedy”: 0.2486    “Drama”: 0.8824    “Fantasy”: 0.1870

In both cases, we have two ground truth labels that we need to take care of.

Let us look at the first prediction: For “Action”, we observe that all of the labels with a confidence at least as large as “Action” (which is only the label “Action”) are in the ground truth. For “Comedy”, this only holds for two thirds of the labels (“Action” and “Comedy”, but not “Drama”). The average precision is the mean of these two values, (1 + 2/3) / 2 = 0.8333.

Let us now look at the second prediction: For “Action”, we find that 50% of the labels with a confidence at least as large as the one for “Action” are in the ground truth (namely, “Action”, but not “Drama”). For “Comedy”, this again holds for two thirds of the labels, as above. Overall, the average precision is (1/2 + 2/3) / 2 = 0.5833. The average precision therefore also prefers prediction 1).
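The computation for a single data point can be sketched like this. The function name and the dictionary representation are assumptions made for illustration; the logic simply mirrors the verbal description above.

```python
def average_precision(confidences, ground_truth):
    """Mean, over ground truth labels, of the fraction of labels ranked
    at least as high that are themselves in the ground truth."""
    precisions = []
    for label in ground_truth:
        # All labels whose confidence is at least as large as this label's.
        at_least_as_high = [
            l for l in confidences if confidences[l] >= confidences[label]
        ]
        hits = sum(1 for l in at_least_as_high if l in ground_truth)
        precisions.append(hits / len(at_least_as_high))
    return sum(precisions) / len(precisions)


prediction_1 = {"Action": 0.9238, "Comedy": 0.1234, "Drama": 0.5801, "Fantasy": 0.0025}
prediction_2 = {"Action": 0.3355, "Comedy": 0.2486, "Drama": 0.8824, "Fantasy": 0.1870}
ground_truth = {"Action", "Comedy"}

print(round(average_precision(prediction_1, ground_truth), 4))  # 0.8333
print(round(average_precision(prediction_2, ground_truth), 4))  # 0.5833
```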

In addition to the metrics I took from the literature, I have devised three more ways of measuring classification performance on our data set:

Exact Match Prefix

The exact match prefix counts how often all ground truth labels are ranked higher than all other labels. It takes values in the range [0,1] with larger values being better. Let us again look at the data point with labels “Action” and “Comedy” and at the following two predictions:

3) “Action”: 0.7658    “Comedy”: 0.8203    “Drama”: 0.5484    “Fantasy”: 0.1185
4) “Action”: 0.7658    “Comedy”: 0.5484    “Drama”: 0.8203    “Fantasy”: 0.1185

As you can see, they are almost identical: only the confidences for “Comedy” and “Drama” have been switched. For prediction 3), the exact match prefix is one, because the two highest-ranked labels (i.e., “Comedy” and “Action”) are exactly the ground truth labels. For prediction 4), the exact match prefix is zero, because the two highest-ranked labels (namely “Drama” and “Action”) are not the two ground truth labels.

On a single data point, the exact match prefix is binary, but by averaging over all data points in the data set, we get a number between zero and one.
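A minimal sketch of this metric is shown below. It checks whether the lowest confidence among the ground truth labels is strictly higher than the highest confidence among the remaining labels, which is one way to express “all ground truth labels are ranked higher than all other labels”. Treating ties as failures is an assumption of this sketch, as are all names used.

```python
def exact_match_prefix(confidences, ground_truth):
    """Return 1 if every ground truth label outranks every other label, else 0."""
    lowest_true = min(confidences[l] for l in ground_truth)
    # default covers the corner case where every label is in the ground truth.
    highest_false = max(
        (c for l, c in confidences.items() if l not in ground_truth),
        default=float("-inf"),
    )
    return 1 if lowest_true > highest_false else 0


def mean_exact_match_prefix(dataset):
    """Average over a list of (confidences, ground_truth) pairs."""
    return sum(exact_match_prefix(c, g) for c, g in dataset) / len(dataset)


prediction_3 = {"Action": 0.7658, "Comedy": 0.8203, "Drama": 0.5484, "Fantasy": 0.1185}
prediction_4 = {"Action": 0.7658, "Comedy": 0.5484, "Drama": 0.8203, "Fantasy": 0.1185}
ground_truth = {"Action", "Comedy"}

print(exact_match_prefix(prediction_3, ground_truth))  # 1
print(exact_match_prefix(prediction_4, ground_truth))  # 0
print(mean_exact_match_prefix([(prediction_3, ground_truth),
                               (prediction_4, ground_truth)]))  # 0.5
```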

Cross Entropy Loss

The cross entropy loss is a typical loss function that is minimized for classification tasks. It operates directly on the confidence values, unlike all the other metrics discussed above, which only use the ranking induced by these confidence values.

The standard cross entropy loss punishes small confidence values for labels that are in the ground truth. We have added a second term to the formula which also punishes large confidence values for labels not in the ground truth. This is necessary because in our multi-label classification setting the confidences do not need to sum up to one, which is usually assumed when cross entropy loss is used. Overall, the cross entropy loss measures how far away the confidence values are numerically from a perfect binary classification. For the cross entropy loss, smaller values are preferable.

Let us look at our old example predictions 1) and 2) again:

1) “Action”: 0.9238    “Comedy”: 0.1234    “Drama”: 0.5801    “Fantasy”: 0.0025
2) “Action”: 0.3355    “Comedy”: 0.2486    “Drama”: 0.8824    “Fantasy”: 0.1870

Their cross entropy losses are 4.3884 and 6.9704, respectively. Again, prediction 1) is preferred.
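The two values above can be reproduced with the following sketch, which sums the negative logarithm of the confidence for each ground truth label and the negative logarithm of one minus the confidence for each other label. Using the logarithm to base 2 and summing (rather than averaging) over labels are assumptions of this sketch; they were chosen because they match the numbers reported above.

```python
import math


def cross_entropy_loss(confidences, ground_truth):
    """Two-term cross entropy summed over all labels (base-2 logarithm assumed)."""
    loss = 0.0
    for label, confidence in confidences.items():
        if label in ground_truth:
            # Punish small confidences for labels in the ground truth.
            loss -= math.log2(confidence)
        else:
            # Punish large confidences for labels not in the ground truth.
            loss -= math.log2(1.0 - confidence)
    return loss


prediction_1 = {"Action": 0.9238, "Comedy": 0.1234, "Drama": 0.5801, "Fantasy": 0.0025}
prediction_2 = {"Action": 0.3355, "Comedy": 0.2486, "Drama": 0.8824, "Fantasy": 0.1870}
ground_truth = {"Action", "Comedy"}

print(round(cross_entropy_loss(prediction_1, ground_truth), 4))  # 4.3884
print(round(cross_entropy_loss(prediction_2, ground_truth), 4))  # 6.9704
```

Note that this sketch raises an error for confidences of exactly 0 or 1 on the wrong side; a practical implementation would probably clip the confidences to a small interval inside (0,1) first.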

Label-Wise Precision

The label-wise precision is computed for each label individually. It considers only the data points for which this label was part of the ground truth and counts how often this label had a strictly higher confidence than the highest-ranked invalid label. It basically answers the question “How good are we at spotting this label if it belongs to the ground truth?”. In order to assess the overall performance of a classifier, we use the minimum and the average value across all labels. The label-wise precision ranges from zero to one, with higher numbers being preferable.

Let us again consider our example data point and the prediction 1):

1) “Action”: 0.9238    “Comedy”: 0.1234    “Drama”: 0.5801    “Fantasy”: 0.0025

In this case, the label-wise precision for “Action” is one (it is ranked before both “Drama” and “Fantasy”, i.e., all invalid labels). The label-wise precision for “Comedy” however is zero, because it is ranked below “Drama” which does not belong to the ground truth. The label-wise precision for “Drama” and “Fantasy” cannot be computed on this data point, because they don’t belong to the ground truth.

The minimum label-wise precision is in this case zero, and the average is 0.5. Again, as with the exact match prefix, we have to aggregate across all data points in order to get meaningful numbers.
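The aggregation across data points could look roughly like this. The data set is represented as a list of (confidences, ground truth) pairs, which is an assumption made for this sketch, as are the function and variable names. Labels that never occur in the ground truth are left out of the result, since their label-wise precision cannot be computed.

```python
def label_wise_precision(dataset, labels):
    """For each label, the fraction of data points having this label in the
    ground truth where it strictly outranks the highest-ranked invalid label."""
    result = {}
    for label in labels:
        hits, total = 0, 0
        for confidences, ground_truth in dataset:
            if label not in ground_truth:
                continue  # only data points where the label is valid count
            highest_invalid = max(
                (c for l, c in confidences.items() if l not in ground_truth),
                default=float("-inf"),
            )
            total += 1
            hits += confidences[label] > highest_invalid
        if total > 0:
            result[label] = hits / total
    return result


labels = ["Action", "Comedy", "Drama", "Fantasy"]
prediction_1 = {"Action": 0.9238, "Comedy": 0.1234, "Drama": 0.5801, "Fantasy": 0.0025}
dataset = [(prediction_1, {"Action", "Comedy"})]

per_label = label_wise_precision(dataset, labels)
print(per_label)                                 # {'Action': 1.0, 'Comedy': 0.0}
print(min(per_label.values()))                   # 0.0
print(sum(per_label.values()) / len(per_label))  # 0.5
```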

Outlook

This was a quick overview of the different evaluation metrics I will use in the classification setting. I have deliberately not shown the underlying mathematical equations – this would have made the text unnecessarily long without adding much clarity.

Now we’re all good to go for evaluating the performance of both the kNN and the LTN on this multi-label classification task. So stay tuned for my next LTN blog post!

References

[1] Sorower, M. S.: “A literature survey on algorithms for multi-label learning”. Oregon State University, Corvallis, Citeseer, 2010, 18