Applying Logic Tensor Networks (Part 4)

After having already written a lot about Logic Tensor Networks, today I will finally share some first results on how they perform in a multi-label classification task on the conceptual space of movies.

Remember from this post and this post that I want to experiment with a total of five different LTN membership functions:

  • The original membership function
  • A modification of the original membership function that ensures convexity
  • A radial basis function
  • A prototype-based membership function
  • A membership function based on my mathematical formalization of conceptual spaces

That’s actually quite a large number of experiments and it will take some time (probably months) until I am done with everything. So far, I only have results for the first bullet point from above – the original membership function used in the vanilla form of LTN. This is what I’m going to focus on today. Results for the other membership functions will follow over time.
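
As a quick refresher, this original membership function grounds each predicate (in our case: each genre) as a sigmoid on top of a small tensor network. Here is a minimal numpy sketch of this function as I understand it from the original LTN paper by Serafini and Garcez; the parameter names and toy dimensions are my own, and the actual implementation of course lives inside the LTN code:

```python
import numpy as np

def ltn_membership(v, W, V, b, u):
    """Original LTN membership function, sigma(u^T tanh(v^T W v + V v + b)),
    where W is a k x n x n tensor (k parallel bilinear forms). Sketch only;
    the parameter names are mine, not the ones from the LTN library."""
    quadratic = np.einsum('i,kij,j->k', v, W, v)  # v^T W^[l] v for each layer l
    hidden = np.tanh(quadratic + V @ v + b)       # k-dimensional hidden layer
    return 1.0 / (1.0 + np.exp(-u @ hidden))      # sigmoid -> degree in [0, 1]

# Toy usage: one point of the 50-dimensional movie space, tensor depth k = 4.
n, k = 50, 4
rng = np.random.default_rng(0)
v = rng.normal(size=n)
degree = ltn_membership(v, rng.normal(size=(k, n, n)), rng.normal(size=(k, n)),
                        rng.normal(size=k), rng.normal(size=k))
```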

As the LTN has quite a large number of hyperparameters (e.g., the number of epochs and the learning algorithm), I have conducted a grid search over the hyperparameter space: For each hyperparameter, I selected at least three options. Then, I generated all possible combinations of these hyperparameter settings. Each of these configurations was trained and evaluated 10 times, and its performance was averaged in order to approximate the expected value.
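
To make this procedure concrete, here is a sketch of the grid search loop. The hyperparameter names and value ranges are placeholders, and train_and_evaluate is a stub standing in for the actual LTN training and evaluation:

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)

def train_and_evaluate(config):
    """Stub standing in for the actual LTN training and evaluation run;
    returns one (noisy) value per evaluation metric."""
    return {'One Error': rng.uniform(0.1, 0.5),
            'Ranking Loss': rng.uniform(0.0, 0.2),
            'Average Precision': rng.uniform(0.5, 0.9)}

# Hypothetical hyperparameter grid; the real names and value ranges differ.
grid = {'epochs': [100, 500, 1000],
        'optimizer': ['sgd', 'rmsprop', 'adam'],
        'tensor_layers': [1, 2, 4]}

results = {}
for values in itertools.product(*grid.values()):
    config = dict(zip(grid, values))
    # Train and evaluate each configuration 10 times and average the metrics
    # to approximate the expected performance of this configuration.
    runs = [train_and_evaluate(config) for _ in range(10)]
    results[tuple(config.items())] = {m: float(np.mean([r[m] for r in runs]))
                                      for m in runs[0]}
```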

Table 1: How good can an LTN maximally get?

| Metric | Distribution baseline | Optimal kNN performance | Optimal LTN performance |
|---|---|---|---|
| One Error | 0.4804 | 0.1452 | 0.1989 |
| Coverage | 9.1011 | 4.9012 | 4.9845 |
| Ranking Loss | 0.2097 | 0.0245 | 0.0593 |
| Cross Entropy Loss | 9.7393 | 6.7444 | 11.0847 |
| Average Precision | 0.5378 | 0.8134 | 0.7925 |
| Exact Match Prefix | 0.0877 | 0.1945 | 0.2100 |
| Minimal Label-Wise Precision | 0.0000 | 0.2124 | 0.4794 |
| Average Label-Wise Precision | 0.0646 | 0.5103 | 0.6458 |

Table 1 shows the optimal performance obtainable by the LTN, the kNN, and the distribution baseline on the validation set for the 50-dimensional movie space. The numbers for the distribution baseline and the kNN have been shown before and are shown again as a frame of reference. Remember that the numbers in this table show the optimal performance achievable by any of the tested LTN configurations when optimizing only a single evaluation metric. The numbers in the row “one error” therefore are the answer to the following question: “If we wanted to optimize only the one error and if all other evaluation metrics did not matter, how good can we get with the LTN?”

In practice, however, we want to find a hyperparameter configuration of the LTN that works well for most of the evaluation metrics, not a single one. The numbers in Table 1 therefore only give us an upper bound on the performance achievable by the LTN.
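
Deriving this upper bound from the averaged grid-search results is straightforward; continuing the sketch from above (the split into "lower is better" and "higher is better" metrics follows from the metric definitions):

```python
# The per-metric optimum is simply the best value any configuration achieved
# for that metric, taken in isolation from all the other metrics.
LOWER_IS_BETTER = {'One Error', 'Coverage', 'Ranking Loss', 'Cross Entropy Loss'}

def optimal_per_metric(results):
    metrics = next(iter(results.values()))
    return {m: (min if m in LOWER_IS_BETTER else max)
               (r[m] for r in results.values())
            for m in metrics}
```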

Now what do we observe?

Like the kNN, the LTN is able to clearly beat the distribution baseline on almost all evaluation metrics. The only exception is the cross entropy loss, where the LTN is slightly worse than the distribution baseline.

Moreover, we can observe that the kNN is better than the LTN with respect to one error, coverage, ranking loss, cross entropy loss, and average precision. On the other hand, the LTN performs better with respect to exact match prefix, minimal label-wise precision, and average label-wise precision. If you need to look up what these evaluation metrics measure, take a look at this old blog post.

A higher one error means that the LTN makes more mistakes at the very top of the list than the kNN. This of course also influences coverage, ranking loss, and average precision, because the ground truth labels will be found further down in the list. On the other hand, the considerably better values for minimal and average label-wise precision indicate that the LTN is better able to deal with rare genres than the kNN: Rare genres receive a higher weight in these two evaluation metrics than in the other ones. This comes as no big surprise as the LTN models each movie genre by its own dedicated membership function, whereas the kNN builds a global model of the genre distribution.
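
To make this contrast concrete, here is a sketch of the one error and the label-wise precisions, following my reading of the usual multi-label definitions (the old blog post linked above contains the exact variants used in my experiments):

```python
import numpy as np

def one_error(scores, labels):
    """Fraction of examples whose single top-ranked genre is not a true genre.
    scores: (num_movies, num_genres) predicted membership degrees;
    labels:  same shape, binary ground truth. Lower is better."""
    top = np.argmax(scores, axis=1)
    return float(np.mean(labels[np.arange(len(labels)), top] == 0))

def label_wise_precisions(pred, labels):
    """One precision value per genre. Because every genre contributes its own
    number (later averaged or minimized over genres), a rare genre counts just
    as much as a frequent one, unlike in the example-averaged metrics.
    pred, labels: binary arrays of shape (num_movies, num_genres)."""
    precisions = []
    for g in range(labels.shape[1]):
        selected = pred[:, g] == 1
        precisions.append(float(labels[selected, g].mean()) if selected.any()
                          else 0.0)
    return precisions  # mean -> average, min -> minimal label-wise precision
```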

Table 2: Performance of the best "allrounder" LTN configuration

| Metric | Distribution baseline | kNN performance | LTN performance | Optimal LTN performance |
|---|---|---|---|---|
| One Error | 0.4804 | 0.1499 | 0.2018 | 0.1989 |
| Coverage | 9.1011 | 5.4412 | 5.0598 | 4.9845 |
| Ranking Loss | 0.2097 | 0.0534 | 0.0741 | 0.0593 |
| Cross Entropy Loss | 9.7393 | 78.2626 | 11.7190 | 11.0847 |
| Average Precision | 0.5378 | 0.8116 | 0.7917 | 0.7925 |
| Exact Match Prefix | 0.0877 | 0.1711 | 0.1527 | 0.2100 |
| Minimal Label-Wise Precision | 0.0000 | 0.2045 | 0.4581 | 0.4794 |
| Average Label-Wise Precision | 0.0646 | 0.5103 | 0.6406 | 0.6458 |

After having looked at the best performance achievable by any of the LTN configurations for individual evaluation metrics, let us now look at a single LTN configuration that is able to score well with respect to most evaluation metrics. This configuration was selected based on validation set performance. Its performance on the validation set is shown in Table 2.
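
One simple way to operationalize such an "allrounder" selection is to rank all configurations on every metric and pick the configuration with the best average rank. The sketch below (again based on the results dict from the grid-search sketch) illustrates this idea; it is one plausible selection rule, not necessarily the exact criterion behind Table 2:

```python
import numpy as np

def select_allrounder(results, lower_is_better):
    """Rank every configuration on every metric (rank 0 = best) and return
    the configuration with the lowest mean rank across all metrics."""
    configs = list(results)
    metrics = next(iter(results.values()))
    mean_rank = np.zeros(len(configs))
    for m in metrics:
        scores = np.array([results[c][m] for c in configs])
        if m not in lower_is_better:
            scores = -scores  # flip so that smaller is always better
        mean_rank += scores.argsort().argsort()  # rank position of each config
    return configs[int(np.argmin(mean_rank))]
```

With the LOWER_IS_BETTER set from above, calling select_allrounder(results, LOWER_IS_BETTER) would then yield the "allrounder" candidate.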

Let us first compare the performance of this best LTN configuration (column “LTN performance”) with the optimal LTN performance from Table 1 (column “Optimal LTN performance”). For most of the evaluation metrics, our best LTN configuration yields a performance that is quite close to the optimal performance. This means that we were able to find an “allrounder” configuration that performs well with respect to most aspects at the same time. The only exceptions are the ranking loss and the exact match prefix, where our “allrounder” LTN is not very close to the optimum. Further investigations into what is going on there are underway.

For all metrics but the cross entropy loss, our selected LTN configuration easily beats the distribution baseline. When comparing the selected LTN configuration to the best kNN configuration (column “kNN performance”) we discussed in the last blog post, we make the following observations:

The kNN is better with respect to one error, ranking loss, average precision, and exact match prefix. On the other hand, the LTN is better with respect to coverage, cross entropy loss, minimal label-wise precision, and average label-wise precision. This again tells the story of the LTN messing up the top of the ranking more frequently, but being more precise on the rare genres.

Although the kNN can in theory outperform the LTN with respect to coverage and cross entropy loss (as seen in Table 1), the kNN configuration whose value of k gave the best overall performance across all metrics is worse than the selected LTN configuration on these two metrics. Conversely, while the LTN can in the optimal case yield a higher exact match prefix than the kNN, the selected LTN configuration fails to do so. This highlights again that the numbers from Table 1 only describe an upper performance bound; what really counts is the performance of a single configuration across all metrics.

Table 3: How well does this generalize to the test set?

| Metric | LTN (validation set) | LTN (test set) | kNN (test set) |
|---|---|---|---|
| One Error | 0.2018 | 0.2150 | 0.1597 |
| Coverage | 5.0598 | 5.1972 | 5.5876 |
| Ranking Loss | 0.0741 | 0.0806 | 0.0558 |
| Cross Entropy Loss | 11.7190 | 11.9926 | 82.3033 |
| Average Precision | 0.7917 | 0.7772 | 0.8014 |
| Exact Match Prefix | 0.1527 | 0.1439 | 0.1672 |
| Minimal Label-Wise Precision | 0.4581 | 0.4546 | 0.1931 |
| Average Label-Wise Precision | 0.6406 | 0.6291 | 0.4931 |

Of course, we now need to check how well the selected LTN configuration generalizes from the validation set to the test set. This is shown in Table 3.

For all of the metrics, we observe a slight performance decrease when going from the validation set to the test set. This is, however, what you would expect in a machine learning setting: We selected the LTN configuration by looking at the best performance we could find on the validation set. Some small part of the observed performance might simply have been based on random noise in the validation data – maybe the validation set contained a handful of data points that were especially easy to classify for the given LTN configuration. The values obtained for the evaluation metrics might then have been a bit too optimistic. On the other hand, the test set might contain some data points with which the selected LTN configuration cannot deal very well, thus lowering the observed performance. Overall, as the performance drop is relatively mild, we can say that the LTN generalizes reasonably well from the validation set to the test set.

Also shown in Table 3 is the test set performance of our optimal kNN configuration. A comparison between LTN and kNN again confirms what we have observed on the validation set: kNN is better on one error, ranking loss, average precision, and exact match prefix, whereas the LTN is better on coverage, cross entropy loss, minimal label-wise precision, and average label-wise precision.

Bottom Line

So what’s the bottom line of all of this? What do all of these numbers tell us?

Well, they unfortunately don’t tell us that LTN is always better than kNN. But on the other hand, they also don’t tell us that LTN is considerably worse than kNN. It seems that both classifiers perform on a competitive level – in some aspects kNN works better, in other aspects the LTN seems to have an advantage.

So far, it seems that I can’t recommend using an LTN over a kNN classifier – but I can conclude that using an LTN is a reasonable option. Maybe this statement will change after looking at the other membership functions – we’ll see.
