After having already written a lot about Logic Tensor Networks, today I will finally share some first results of how they perform in a multi-label classification task on the conceptual space of movies.

Remember from this post and this post that I want to experiment with a total number of *five* different LTN membership functions:

- The original membership function
- A modification of the original membership function that ensures convexity
- A radial basis function
- A prototype-based membership function
- A membership function based on my mathematical formalization of conceptual spaces

That’s actually quite a large number of experiments and it will take some time (probably months) until I am done with everything. So far, I only have results for the first bullet point from above – the original membership function used in the vanilla form of LTN. This is what I’m going to focus on today. Results for the other membership functions will follow over time.

As the LTN has quite a large number of hyperparameters (e.g., the number of epochs and the learning algorithm), I have conducted a grid search over the hyperparameter space: For each hyperparameter, I selected at least three options. Then, I generated all possible combinations of different hyperparameter settings. Each of these configurations was trained and evaluated 10 times, and its performance was averaged in order to approximate the expected value.
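The overall procedure can be sketched as follows. Note that the hyperparameter names and values below are illustrative placeholders, not the actual grid used in the experiments:

```python
import itertools
import statistics

# Hypothetical hyperparameter grid (names and values are illustrative,
# not the exact grid used in the experiments described above).
GRID = {
    "epochs": [100, 500, 1000],
    "optimizer": ["rmsprop", "adam", "sgd"],
    "learning_rate": [0.001, 0.01, 0.1],
}

def grid_search(train_and_evaluate, repetitions=10):
    """Train and evaluate every configuration several times, then average
    the scores to approximate the expected performance."""
    results = {}
    keys = list(GRID)
    for values in itertools.product(*GRID.values()):
        config = dict(zip(keys, values))
        scores = [train_and_evaluate(config) for _ in range(repetitions)]
        results[tuple(values)] = statistics.mean(scores)
    return results
```

With three options per hyperparameter, this grid already yields 27 configurations, each trained 10 times, which illustrates why the full set of experiments will take months.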

## Table 1: How good can an LTN maximally get?

Metric | Distribution baseline | Optimal *k*NN performance | Optimal LTN performance
---|---|---|---
One Error | 0.4804 | 0.1452 | 0.1989
Coverage | 9.1011 | 4.9012 | 4.9845
Ranking Loss | 0.2097 | 0.0245 | 0.0593
Cross Entropy Loss | 9.7393 | 6.7444 | 11.0847
Average Precision | 0.5378 | 0.8134 | 0.7925
Exact Match Prefix | 0.0877 | 0.1945 | 0.2100
Minimal Label-Wise Precision | 0.0000 | 0.2124 | 0.4794
Average Label-Wise Precision | 0.0646 | 0.5103 | 0.6458

Table 1 shows the optimal performance obtainable by the LTN, the *k*NN, and the distribution baseline on the validation set for the 50-dimensional movie space. The numbers for the distribution baseline and the *k*NN have been shown before and are repeated here as a frame of reference. Remember that the numbers in this table show the optimal performance achievable by any of the tested LTN configurations when optimizing only a *single* evaluation metric. The numbers in the row “one error” therefore answer the following question: “If we wanted to optimize only the one error and all other evaluation metrics did not matter, how good could we get with the LTN?”

In practice, however, we want to find a hyperparameter configuration of the LTN that works well for most of the evaluation metrics, not a single one. The numbers in Table 1 therefore only give us an upper bound on the performance achievable by the LTN.
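This per-metric upper bound is easy to compute once all configurations have been evaluated: for each metric, simply take the best value that any single configuration achieves. A minimal sketch (metric names and the lower-is-better flags are assumptions for illustration; the data structures are not taken from the actual experiment code):

```python
# Which direction counts as "better" differs per metric: losses and error
# rates should be minimized, precision-style metrics maximized.
LOWER_IS_BETTER = {
    "one_error": True,
    "coverage": True,
    "ranking_loss": True,
    "average_precision": False,
}

def per_metric_optimum(results):
    """For each metric, return the best value achieved by any single
    configuration -- i.e., the upper performance bound from Table 1."""
    optima = {}
    for metric, lower in LOWER_IS_BETTER.items():
        values = [scores[metric] for scores in results.values()]
        optima[metric] = min(values) if lower else max(values)
    return optima
```

Note that each entry of the result may come from a *different* configuration, which is exactly why these numbers are only an upper bound and not the performance of any single LTN.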

Now what do we observe?

Like the *k*NN, the LTN is able to clearly beat the distribution baseline on almost all evaluation metrics. The only exception is the cross entropy loss, where the LTN is slightly worse than the distribution baseline.

Moreover, we can observe that the *k*NN is better than the LTN with respect to one error, coverage, ranking loss, cross entropy loss, and average precision. On the other hand, the LTN performs better with respect to exact match prefix, minimal label-wise precision, and average label-wise precision. If you need to look up what these evaluation metrics measure, take a look at this old blog post.

A higher one error means that the LTN makes more mistakes at the very top of the list than the *k*NN. This of course also influences coverage, ranking loss, and average precision, because the ground truth labels will be found further down in the list. On the other hand, the considerably better values for minimal and average label-wise precision indicate that the LTN is better able to deal with rare genres than the *k*NN: Rare genres receive a higher weight in these two evaluation metrics than in the other ones. This comes as no big surprise as the LTN models each movie genre by its own dedicated membership function, whereas the *k*NN builds a global model of the genre distribution.
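To make the ranking-based metrics from this discussion concrete, here is a sketch of one error, coverage, and ranking loss for a single example, following their standard definitions. Conventions can differ slightly between implementations (e.g., 0- vs. 1-based coverage, tie handling), so this may not match the exact code behind the tables:

```python
def one_error(scores, true_labels):
    """1 if the top-ranked label is not a true label, else 0."""
    top = max(range(len(scores)), key=lambda i: scores[i])
    return 0.0 if top in true_labels else 1.0

def coverage(scores, true_labels):
    """How far down the ranking we must go to cover all true labels
    (0-based rank of the worst-ranked true label)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = {label: r for r, label in enumerate(order)}
    return max(rank[label] for label in true_labels)

def ranking_loss(scores, true_labels):
    """Fraction of (true, false) label pairs that are ordered wrongly,
    i.e., where the false label scores at least as high as the true one."""
    false_labels = [i for i in range(len(scores)) if i not in true_labels]
    pairs = [(t, f) for t in true_labels for f in false_labels]
    wrong = sum(1 for t, f in pairs if scores[t] <= scores[f])
    return wrong / len(pairs) if pairs else 0.0
```

These per-example values are then averaged over the whole data set. The sketch makes the dependence visible: a mistake at the very top of the list (one error) pushes true labels down the ranking, which directly inflates coverage and ranking loss as well.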

## Table 2: Performance of the best "allrounder" LTN configuration

Metric | Distribution baseline | *k*NN performance | LTN performance | Optimal LTN performance
---|---|---|---|---
One Error | 0.4804 | 0.1499 | 0.2018 | 0.1989
Coverage | 9.1011 | 5.4412 | 5.0598 | 4.9845
Ranking Loss | 0.2097 | 0.0534 | 0.0741 | 0.0593
Cross Entropy Loss | 9.7393 | 78.2626 | 11.7190 | 11.0847
Average Precision | 0.5378 | 0.8116 | 0.7917 | 0.7925
Exact Match Prefix | 0.0877 | 0.1711 | 0.1527 | 0.2100
Minimal Label-Wise Precision | 0.0000 | 0.2045 | 0.4581 | 0.4794
Average Label-Wise Precision | 0.0646 | 0.5103 | 0.6406 | 0.6458

After having looked at the best performance achievable by *any* of the LTN configurations for *individual* evaluation metrics, let us now look at a *single* LTN configuration that is able to score well with respect to most evaluation metrics. This configuration was selected based on validation set performance. Its performance on the validation set is shown in Table 2.
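The post does not spell out the exact selection rule, so the following is only one plausible way such an “allrounder” could be picked (an assumption, not the author’s actual method): normalize every metric to [0, 1] across configurations, flip the higher-is-better metrics, and choose the configuration with the lowest average “badness”:

```python
def pick_allrounder(results, lower_is_better):
    """Hypothetical selection rule: choose the configuration whose
    min-max-normalized scores are best on average across all metrics."""
    metrics = list(lower_is_better)
    spans = {}
    for m in metrics:
        vals = [scores[m] for scores in results.values()]
        spans[m] = (min(vals), max(vals))

    def badness(scores):
        total = 0.0
        for m in metrics:
            lo, hi = spans[m]
            norm = (scores[m] - lo) / (hi - lo) if hi > lo else 0.0
            # For higher-is-better metrics, a high normalized value is good.
            total += norm if lower_is_better[m] else 1.0 - norm
        return total / len(metrics)

    return min(results, key=lambda cfg: badness(results[cfg]))
```

Any aggregation scheme like this involves an implicit weighting of the metrics, which is why a single “allrounder” can still lag noticeably behind the per-metric optima on some metrics.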

Let us first compare the performance of this best LTN configuration (column “LTN performance”) with the optimal LTN performance from Table 1 (column “Optimal LTN performance”). For most of the evaluation metrics, our best LTN configuration yields a performance that is quite close to the optimum. This means that we were able to find an “allrounder” configuration that performs well with respect to most aspects at the same time. The only exceptions are the ranking loss and the exact match prefix, where our “allrounder” LTN is not very close to optimal. In order to understand what’s going on there, further investigations are under way.

For all metrics but the cross entropy loss, our selected LTN configuration easily beats the distribution baseline. When comparing the selected LTN configuration to the best *k*NN configuration (column “*k*NN performance”) we discussed in the last blog post, we make the following observations:

The *k*NN is better with respect to one error, ranking loss, average precision, and exact match prefix. On the other hand, the LTN is better with respect to coverage, cross entropy loss, minimal label-wise precision, and average label-wise precision. This again tells the story of the LTN messing up the top of the ranking more frequently, but being more precise on the rare genres.

Although the *k*NN can in theory outperform the LTN with respect to coverage and cross entropy loss (as seen in Table 1), the *k*NN configuration that yielded the best overall performance across all metrics is worse than the selected LTN configuration on these two metrics. On the other hand, while the LTN can in the optimal case yield a higher exact match prefix than the *k*NN, the selected LTN configuration fails to do so. This highlights again that the numbers from Table 1 describe only an upper performance bound. What really counts, however, is the performance of a single configuration.

## Table 3: How well does this generalize to the test set?

Metric | LTN validation set performance | LTN test set performance | *k*NN test set performance
---|---|---|---
One Error | 0.2018 | 0.2150 | 0.1597
Coverage | 5.0598 | 5.1972 | 5.5876
Ranking Loss | 0.0741 | 0.0806 | 0.0558
Cross Entropy Loss | 11.7190 | 11.9926 | 82.3033
Average Precision | 0.7917 | 0.7772 | 0.8014
Exact Match Prefix | 0.1527 | 0.1439 | 0.1672
Minimal Label-Wise Precision | 0.4581 | 0.4546 | 0.1931
Average Label-Wise Precision | 0.6406 | 0.6291 | 0.4931

Of course, we now need to check how well the selected LTN configuration generalizes from the validation set to the test set. This is shown in Table 3.

For all of the metrics, we observe a slight performance decrease when going from the validation set to the test set. This is, however, what you would expect in a machine learning setting: We selected the LTN configuration by looking at the best performance we could find on the validation set. Some small part of the observed performance might simply have been based on random noise in the validation data – maybe the validation set contained a handful of data points that were especially easy to classify for the given LTN configuration. Then the values obtained for the evaluation metrics might have been a bit too optimistic. On the other hand, the test set might contain some data points with which the selected LTN configuration cannot deal very well, thus lowering the observed performance. Overall, as the performance drop is relatively mild, we can say that the LTN generalizes reasonably well from the validation set to the test set.

Also shown in Table 3 is the test set performance of our optimal *k*NN configuration. A comparison between LTN and *k*NN again confirms what we have observed on the validation set: *k*NN is better on one error, ranking loss, average precision, and exact match prefix, whereas the LTN is better on coverage, cross entropy loss, minimal label-wise precision, and average label-wise precision.

## Bottom Line

So what’s the bottom line of all of this? What do all of these numbers tell us?

Well, they unfortunately don’t tell us that LTN is always better than *k*NN. But on the other hand, they also don’t tell us that LTN is considerably worse than *k*NN. It seems that both classifiers perform on a competitive level – in some aspects *k*NN works better, in other aspects the LTN seems to have an advantage.

So far, it seems that I can’t recommend using an LTN over a *k*NN classifier – but I can conclude that using an LTN is a reasonable option. Maybe this statement will change after looking at the other membership functions – we’ll see.
