In my last blog post, I gave an overview of the experiments I intend to conduct. Before that, I had described the data set and the network architecture. Today, I finally report the first results. However, instead of first talking about the baseline for the mapping task (as originally intended), I will start with the classification results. The reason for this is that the baseline and my own transfer learning results are much easier to compare when discussed together. Before I can talk about my transfer learning results, however, I first need to introduce the classification network on which they are based. So let’s focus today on the classification results I was able to obtain on the sketch data sets.
In order to obtain sketch-based features, I trained my network exclusively on the classification task. I started my experiments by considering a default setup of the hyperparameters, which is based directly on Sketch-a-Net [1] and AlexNet [2]: I used a weight decay of 0.0005, dropout in the first fully connected layer, and 512 neurons for the bottleneck layer. Moreover, I used a relatively small amount of salt-and-pepper noise (10%). As part of the cross-validation procedure, I trained five independent versions of the network and computed the average performance across all folds.
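For later reference, the default setup can be summarized as a small configuration dictionary. The key names are my own illustrative choices, not code from the actual experiments:

```python
# Default hyperparameter setup, following Sketch-a-Net and AlexNet.
# Key names are illustrative, not taken from the original code.
DEFAULT_CONFIG = {
    "weight_decay": 0.0005,   # L2 regularization strength
    "dropout": True,          # dropout in the first fully connected layer
    "bottleneck_size": 512,   # number of neurons in the bottleneck layer
    "noise_level": 0.10,      # fraction of pixels hit by salt-and-pepper noise
    "folds": 5,               # independent training runs for cross-validation
}
```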
As it turns out, I was able to obtain classification accuracies of about 63% and about 79% on TU Berlin and Sketchy, respectively. This is notably lower than both human performance (which was reported at about 73% for TU Berlin [3]) and the performance of Sketch-a-Net [1] (which reached up to 78% on TU Berlin). The difference to Sketch-a-Net cannot be attributed to the network architecture (which is pretty much identical) and is thus probably rooted in the different augmentation techniques: my augmentation pipeline is less sophisticated and produces fewer additional examples than the augmentation procedure used in the original Sketch-a-Net study. Since I am not interested in state-of-the-art performance on sketch recognition anyway, the performance level I reached seems satisfactory.
When measuring the correlation between distances in the feature space learned by the network and the original dissimilarity ratings for the Shapes stimuli, the default setting results in τ ≈ 0.27, which is considerably lower than the correlation of τ ≈ 0.39 I observed for the pre-trained inception-v3 network [4] in our Shapes study.
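The correlation itself can be computed by comparing pairwise distances in the learned feature space with the human dissimilarity ratings. A minimal sketch using SciPy (function and variable names are mine, and I assume the ratings are given in the same condensed pair order that `pdist` produces):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import kendalltau

def feature_space_correlation(features, dissimilarities):
    """Kendall's tau between pairwise feature distances and human ratings.

    features:         (n_stimuli, n_dims) array of bottleneck activations
    dissimilarities:  condensed vector of n_stimuli-choose-2 ratings,
                      ordered like the output of scipy's pdist
    """
    distances = pdist(features, metric="euclidean")
    tau, _ = kendalltau(distances, dissimilarities)
    return tau
```

As a sanity check, feeding the feature distances in as their own "ratings" should yield a perfect correlation of τ = 1.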
Varying Individual Hyperparameters
Based on this default setup, I considered varying individual hyperparameters in order to change the strength of regularization and to improve the network’s performance on the test set.
For the weight decay, the default setting of 0.0005 is optimal with respect to classification performance. A weight decay of 0.001 leads to a notable improvement in the correlation without the heavy drop in prediction accuracy observed for other values.
Disabling dropout led to two somewhat contradictory observations: on the one hand, it increases the correlation to the dissimilarity ratings; on the other hand, classification performance drops considerably.
An increase of the noise level to 25% or 55% improves the correlation to human dissimilarity ratings. On the other hand, classification accuracy suffers from stronger input noise. While this effect is relatively mild for 25% noise, it is quite pronounced for 55% noise. I therefore consider 10% and 25% noise as promising candidates.
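Salt-and-pepper noise on a grayscale sketch can be implemented by setting a random fraction of pixels to black or white. A minimal NumPy sketch (the function name and the 50/50 split between salt and pepper are my assumptions, not necessarily the exact pipeline used here):

```python
import numpy as np

def salt_and_pepper(image, noise_level, seed=None):
    """Corrupt a random fraction of pixels with 0 (pepper) or 1 (salt).

    image:        2D array with values in [0, 1]
    noise_level:  fraction of pixels to corrupt, e.g. 0.10 or 0.25
    """
    rng = np.random.default_rng(seed)
    noisy = image.copy()
    mask = rng.random(image.shape) < noise_level            # pixels to corrupt
    noisy[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(noisy.dtype)
    return noisy
```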
Somewhat surprisingly, changes to the size of the bottleneck layer do not influence classification performance much, all the way down to a bottleneck with only 64 units. The correlation to the dissimilarity ratings, however, seems to be more sensitive to the size of the bottleneck layer. I therefore consider 512 units and 256 units as interesting candidate settings. A smaller bottleneck results in a smaller feature space and may thus be beneficial for the mapping task, since overfitting is less likely to occur.
A Small Grid Search
After having selected the two most promising settings for each hyperparameter, I conducted a small grid search based on two settings per hyperparameter (thus looking at 2⁴ = 16 combinations in total). The results are somewhat disappointing, since the best classification performance is still obtained by our default setup. However, a considerably higher correlation to the dissimilarity ratings of τ ≈ 0.33 can be obtained by disabling dropout and increasing the weight decay.
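With two candidate values per hyperparameter, such a grid can be enumerated with `itertools.product`. A sketch of this bookkeeping (the grid values follow the candidates discussed above; `train_and_evaluate` is a hypothetical stand-in for the actual training run):

```python
from itertools import product

# Two candidate settings per hyperparameter, as selected above.
GRID = {
    "weight_decay": [0.0005, 0.001],
    "dropout": [True, False],
    "noise_level": [0.10, 0.25],
    "bottleneck_size": [512, 256],
}

def all_configurations(grid):
    """Yield every combination of hyperparameter settings (2^4 = 16 here)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# for config in all_configurations(GRID):
#     accuracy, tau = train_and_evaluate(config)  # hypothetical training run
```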
The results of the grid search also show that correlations of τ ≥ 0.3 are only achieved if dropout is disabled, which tends to lead to a very small number of epochs. This effect may simply be based on the missing regularization effect from dropout: The network quickly starts to overfit the training data, hence the lowest loss value on the validation set is observed quite early. The increased correlation reported for these configurations may thus be an artifact of simply terminating the training procedure much earlier.
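The early termination mentioned above follows the usual early-stopping pattern: training stops once the validation loss has not improved for a given number of epochs, so a quickly overfitting network ends up with very few epochs. A minimal sketch of this logic (class name and patience value are my own; this is not the exact training loop used here):

```python
class EarlyStopping:
    """Stop training when the validation loss stops improving."""

    def __init__(self, patience=5):
        self.patience = patience               # epochs to wait for improvement
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```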
It seems that as soon as dropout is disabled, classification performance drops considerably (independent of the number of epochs used during training). Higher noise levels also lead to a slight deterioration of classification performance. As the grid search revealed, halving the size of the bottleneck layer only slightly decreases performance.
Overall, I will consider the following configurations in the subsequent experiments:
- Default: My default setup (bottleneck size 512, weight decay 0.0005, dropout, 10% noise), which has yielded the best classification accuracies.
- Correlation: The configuration with the highest correlation to the dissimilarity ratings, using a bottleneck size of 512, a weight decay of 0.001, no dropout, and 10% noise.
- Small: Bottleneck size of 256, otherwise identical to the default setup. A smaller bottleneck should reduce the risk of overfitting in downstream tasks.
- Large: Bottleneck size of 2048, otherwise identical to the default setup. Intended as a “fair” comparison to the feature space extracted by the inception-v3 network [4], which also uses 2048 units and will serve as our baseline.
It is probably worth noting that also for the Large configuration, performance did not differ much from the Default setting. When making our comparisons on the transfer learning task later, the main difference between the configurations Default, Large, and Small is therefore the size of the extracted feature space.
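Since all four configurations are just variations of the default setup, they can be written down compactly (dictionary names are mine, for illustration only):

```python
# The four configurations selected above, expressed as variations of
# the default setup. Names are illustrative, not from the original code.
DEFAULT = {"bottleneck_size": 512, "weight_decay": 0.0005,
           "dropout": True, "noise_level": 0.10}

CONFIGURATIONS = {
    "default":     dict(DEFAULT),
    "correlation": dict(DEFAULT, weight_decay=0.001, dropout=False),
    "small":       dict(DEFAULT, bottleneck_size=256),
    "large":       dict(DEFAULT, bottleneck_size=2048),  # matches inception-v3
}
```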
Well, what have we achieved and learned today? We’ve trained our encoder network on the sketch classification task and achieved reasonable classification accuracies of 63% on TU Berlin (human level: 73%, SOTA: ≥78%) and 79% on Sketchy. Our default setting of the hyperparameters proved to be optimal with respect to classification performance. However, by disabling dropout and increasing the weight decay, we were able to improve the correlation to the psychological dissimilarities to τ ≈ 0.33 (which is, however, still lower than the correlation of τ ≈ 0.39 reported earlier for the inception-v3 network). The size of the bottleneck layer does not seem to have a large impact on classification performance, but reducing it too much results in poorer correlation values.
So far, we have not yet considered the mapping task, but we have selected four promising network configurations based on the classification task. These configurations will then compete against the pre-trained inception-v3 network in both a transfer learning and a multi-task learning setting. The results of the transfer learning experiments will be shared in the next blog post of this series.
[1] Yu, Q.; Yang, Y.; Liu, F.; Song, Y.-Z.; Xiang, T. & Hospedales, T. M.: Sketch-a-Net: A Deep Neural Network that Beats Humans. International Journal of Computer Vision, Springer, 2017, 122, 411–425.
[2] Krizhevsky, A.; Sutskever, I. & Hinton, G.: ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, Curran Associates, Inc., 2012, 25, 1097–1105.
[3] Eitz, M.; Hays, J. & Alexa, M.: How Do Humans Sketch Objects? ACM Transactions on Graphics, Association for Computing Machinery, 2012, 31.
[4] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J. & Wojna, Z.: Rethinking the Inception Architecture for Computer Vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 2818–2826.