Last time, I introduced the evaluation metrics used for the LTN classification task. Today, I will show some first results of the k nearest neighbor (kNN) classifier, which will serve as a baseline for our LTN results.
But first, let me introduce another very simple baseline. Why do we need another baseline, you ask? Well, we want to make sure that the conceptual space of movies indeed contains some structure that helps to make predictions. In order to check whether this is true, we compare the results of kNN (which is allowed to use the feature vectors from the conceptual space) to a simple baseline which only operates on the labels and ignores the conceptual space completely.
This baseline simply computes the frequency of every label in the training set and uses this frequency as its prediction. It makes the same constant prediction for every data point. As it simply reflects the distribution of labels in the training set, I call this the “distribution” baseline. If the kNN classifier can make more accurate and more fine-grained predictions than this naive approach, this indicates that the conceptual space contains meaningful structure.
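To make this concrete, here is a minimal sketch of how such a distribution baseline can be computed, assuming the labels come as a binary indicator matrix (the variable names and the toy data are purely illustrative):

```python
import numpy as np

def distribution_baseline(y_train):
    """Predict each label's relative frequency in the training set,
    ignoring the feature vectors entirely. The same prediction vector
    is returned for every data point.

    y_train: binary label matrix of shape (n_samples, n_labels).
    """
    return y_train.mean(axis=0)

# Toy example: 4 movies, 3 genre labels.
y_train = np.array([[1, 0, 0],
                    [1, 1, 0],
                    [0, 1, 0],
                    [1, 0, 0]])
prediction = distribution_baseline(y_train)
# prediction is [0.75, 0.5, 0.0]: label 0 occurs in 3 of 4 movies, etc.
```

Since the prediction only depends on the label counts, it is a pure "no structure" reference point: any classifier that uses the conceptual space has to beat it to justify its added complexity.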
I have evaluated the kNN classifier on all eight of the different spaces provided by Derrac and Schockaert, but I don’t want to bore you with too many numbers, so I’ll only show the results on the 50-dimensional space here.
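For multi-label prediction, the kNN classifier can be sketched as follows: score each label for a query point by averaging the label vectors of its k nearest training points. This is a minimal illustration assuming Euclidean distance and a binary label matrix; it does not reproduce the actual spaces or data of the experiments:

```python
import numpy as np

def knn_scores(X_train, y_train, X_query, k):
    """Multi-label kNN: for each query point, find the k nearest
    training points (Euclidean distance) and average their binary
    label vectors. Returns per-label scores in [0, 1] of shape
    (n_queries, n_labels)."""
    dists = np.linalg.norm(X_train[None, :, :] - X_query[:, None, :], axis=2)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    return y_train[neighbors].mean(axis=1)

# Toy example in a 1-dimensional "space" with 3 labels.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([[1, 0, 0],
                    [1, 1, 0],
                    [0, 1, 0],
                    [0, 0, 1]])
scores = knn_scores(X_train, y_train, np.array([[0.1]]), k=2)
# scores[0] is [1.0, 0.5, 0.0]: both neighbors carry label 0,
# only one of them carries label 1.
```

Because the scores are label frequencies among the neighbors, they can be ranked or thresholded, which is what the evaluation metrics below operate on.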
Table 1: Optimal kNN results and baseline results.
| Metric | Distribution Baseline | Optimal kNN Performance | Optimal Value of k |
|--------|-----------------------|-------------------------|--------------------|
| One Error | 0.4804 | 0.1452 | k = 17 |
| Coverage | 9.1011 | 4.9012 | k = 151 |
| Ranking Loss | 0.2097 | 0.0245 | k = 1 |
| Cross Entropy Loss | 9.7393 | 6.7444 | k = 503 |
| Average Precision | 0.5378 | 0.8134 | k = 67 |
| Exact Match Prefix | 0.0877 | 0.1945 | k = 4 |
| Minimal Label-Wise Precision | 0.0000 | 0.2124 | k = 15 |
| Average Label-Wise Precision | 0.0646 | 0.5103 | k = 30 |
Table 1 compares the performance of the distribution baseline and the k nearest neighbor classifier on the validation set. For each evaluation metric, it shows the best performance achievable for any value of k, together with the value of k that achieves it.
What do we observe? Well, on the one hand, the kNN classifier beats the baseline with respect to all of the evaluation metrics – and always by a relatively large margin. That’s good news, because it means that the conceptual space contains enough information to make reasonable predictions. On the other hand, we also observe that the optimal value of k differs quite drastically between the evaluation metrics – there seems to be no single value of k that yields optimal performance with respect to all or even most of them. That’s a bummer, because we need to fix k to one specific number when we want to use a k nearest neighbor classifier for making predictions.
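The per-metric sweep behind Table 1 can be sketched like this, using one error as a concrete example metric (the one-error of a prediction is the fraction of instances whose top-scored label is not actually relevant); the helper names and the toy data are illustrative, not the actual experimental code:

```python
import numpy as np

def one_error(y_true, scores):
    """One error: fraction of instances whose top-scored label is not
    among the relevant labels (lower is better)."""
    top = np.argmax(scores, axis=1)
    return 1.0 - y_true[np.arange(len(y_true)), top].mean()

def knn_scores(X_train, y_train, X_query, k):
    """Score labels by averaging the label vectors of the k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train[None, :, :] - X_query[:, None, :], axis=2)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    return y_train[neighbors].mean(axis=1)

def best_k(X_train, y_train, X_val, y_val, candidate_ks):
    """Return the candidate k with the lowest one error on the
    validation set; repeating this per metric gives a table like Table 1."""
    errors = {k: one_error(y_val, knn_scores(X_train, y_train, X_val, k))
              for k in candidate_ks}
    return min(errors, key=errors.get)
```

Running this sweep once per metric (with the respective metric function plugged in) is exactly what produces a different optimal k per row.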
Therefore, I looked for promising values of k that yield good (but not necessarily optimal) performance with respect to most of the metrics. For the 50-dimensional space, one of the good candidates is k = 30. Table 2 shows the performance of this “30 nearest neighbor” classifier in comparison to the distribution baseline and in comparison to the optimal performance on the validation set.
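One hypothetical way to automate this kind of trade-off (not necessarily how k = 30 was chosen here) is to rank the candidate values of k under each metric and pick the candidate with the best average rank:

```python
import numpy as np

def best_tradeoff_k(ks, metric_tables, higher_is_better):
    """Pick the candidate k with the best (lowest) average rank
    across several metrics.

    ks: list of candidate k values.
    metric_tables: dict metric_name -> list of scores, aligned with ks.
    higher_is_better: dict metric_name -> bool, direction of each metric.
    """
    total_ranks = np.zeros(len(ks))
    for name, scores in metric_tables.items():
        s = np.asarray(scores, dtype=float)
        # Sort so that the best score comes first, then assign rank 0, 1, ...
        order = np.argsort(-s if higher_is_better[name] else s)
        ranks = np.empty(len(ks))
        ranks[order] = np.arange(len(ks))
        total_ranks += ranks
    return ks[int(np.argmin(total_ranks))]

# Illustrative (made-up) sweep results for three candidate values of k.
ks = [1, 10, 30]
tables = {"average_precision": [0.5, 0.7, 0.8],   # higher is better
          "ranking_loss":      [0.02, 0.05, 0.03]}  # lower is better
best = best_tradeoff_k(ks, tables, {"average_precision": True,
                                    "ranking_loss": False})
```

This is just one possible aggregation scheme; any method that balances the metrics against each other would serve the same purpose.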
Table 2: Performance of a selected k that does well on multiple evaluation metrics.

| Metric | Distribution Baseline | k = 30 | Optimal kNN Performance |
|--------|-----------------------|--------|-------------------------|
| Cross Entropy Loss | 9.7393 | 78.2626 | 6.7444 |
| Exact Match Prefix | 0.0877 | 0.1711 | 0.1945 |
| Minimal Label-Wise Precision | 0.0000 | 0.2045 | 0.2124 |
| Average Label-Wise Precision | 0.0646 | 0.5103 | 0.5103 |
As one can see, for k = 30 we beat the baseline on almost all metrics; the only exception is the cross entropy loss. Moreover, we are reasonably close to the optimum for most of the evaluation metrics. The largest gaps concern the coverage and the cross entropy loss. This is not a big surprise, given that the optimal values for these metrics required quite large values of k. A similar argument applies to the ranking loss, where our selected k is much larger than the optimal choice of k = 1. Overall, however, our choice of k = 30 seems to be a good trade-off between the individual metrics.
How well does this configuration of k = 30 generalize to the test set? This is shown in Table 3, where we compare the performance on the two sets.
Table 3: Generalization of the k = 30 classifier from the validation set to the test set.

| Metric | Validation Set Performance | Test Set Performance |
|--------|----------------------------|----------------------|
| Cross Entropy Loss | 78.2626 | 82.3033 |
| Exact Match Prefix | 0.1711 | 0.1672 |
| Minimal Label-Wise Precision | 0.2045 | 0.1931 |
| Average Label-Wise Precision | 0.5103 | 0.4931 |
For all of the metrics, performance on the test set is a bit worse than on the validation set. This is to be expected: it usually happens when you optimize a machine learning algorithm on one data set and then apply it to another. The important point is that the difference is relatively small, which shows that our kNN classifier is not overfitting and that our choice of k is reasonable. If the test set performance were much worse than the validation set performance, the k we chose on the validation set would not be very useful on the test set – and we could then not expect any k that performs well on our validation set to also yield good results on new, unseen examples. As we did not observe much overfitting, we can safely assume that the kNN classifier with k = 30 will also perform reasonably well on unseen data points.
Overall, we have seen that the kNN classifier is better than a simple baseline, so there is some learnable structure in the conceptual space. We are able to choose one configuration (i.e., one value for k) that does reasonably well with respect to most metrics on the validation set. The performance of this configuration does generalize to the test set, so it seems to be a good choice in general.
Now the remaining part of my quest consists in training the Logic Tensor Network and showing that its performance is at least comparable to the k nearest neighbor classifier. I’m currently running the experiments – which unfortunately take quite long, as LTNs are neural networks trained by gradient descent and the number of hyperparameters to tune is quite large. So it might take some time until I’m back with some updates. Stay tuned!