Learning To Map Images Into Shape Space (Part 3)

It’s been a while since my last blog post on this subject. The reason for that is simply that the neural network did not give me the results I wanted. But now it seems that I’m on a better track, so let me give you a quick update on what has changed and an overview of my next steps.

What has changed?

In my first blog post on this study, I talked about the data augmentation strategy I used. There, I decided to use input images of size 128 x 128. Moreover, the only augmentation steps I applied to the images were resizing and translating the object.

As it turned out, the input size was not large enough for the network to learn well, so I went back to the original input size used by AlexNet [1] and Sketch-a-Net [2], namely 224 x 224 pixels. This considerably improved classification performance and thus seems to be necessary.

Originally, I did not use horizontal flips, rotations, and shears as part of my augmentation pipeline since I was afraid that this would impair the mapping task: for instance, there’s an ORIENTATION direction in the similarity space, and if we rotate the input, we would expect its location in the space to change along this ORIENTATION direction. However, for the sake of simplicity, we use the same target coordinates for all augmented inputs that are based on the same original image. Thus, introducing rotations may make the mapping task much harder to learn, since the ground truth coordinates no longer match the augmented image.

However, I realized that while I should not apply flips, rotations, and shears to the inputs for the mapping task, I can still apply them to the TU Berlin and Sketchy inputs, which are only used in the classification and reconstruction tasks. Doing so further increases the variety in the data set, which seems to be desirable.
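A minimal sketch of this split, assuming images are stored as lists of pixel rows: inputs with target coordinates are only translated, while all other inputs may additionally be flipped. The function names, the shift range, and the white fill value are hypothetical, and rotations and shears are left out to keep the example short.

```python
import random

def horizontal_flip(image):
    """Mirror a 2D image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def translate(image, dx, fill=255):
    """Shift the image dx pixels to the right (left if dx < 0).

    Vacated pixels are padded with `fill` (assumed white background).
    Assumes abs(dx) is smaller than the image width.
    """
    width = len(image[0])
    shifted = []
    for row in image:
        if dx >= 0:
            shifted.append([fill] * dx + row[:width - dx])
        else:
            shifted.append(row[-dx:] + [fill] * (-dx))
    return shifted

def augment(image, has_coordinates, rng, max_shift=2):
    """Translate every image; flip only images without target coordinates."""
    image = translate(image, rng.randint(-max_shift, max_shift))
    if not has_coordinates and rng.random() < 0.5:
        image = horizontal_flip(image)
    return image
```

Calling `augment(image, has_coordinates=True, rng=random.Random(0))` therefore never changes the orientation of an image whose ground truth coordinates are fixed.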

In part 2 of my current series, I presented the architecture of my neural network. The encoder was presented as a down-sized version of Sketch-a-Net [2], motivated by the smaller input size. However, since I’m now using input images with a size of 224 x 224 pixels, I decided to use the original Sketch-a-Net architecture with its 8.5 million parameters as an encoder. Of course, the architecture of the decoder also needs to be adapted to the increased image size. For now, however, I’m focusing exclusively on the classification task, so I’ll postpone sharing the updated decoder structure to the point where we actually need it.

I also already mentioned that I use salt and pepper noise on the inputs right before they are fed into the encoder. Computing this noise on the fly ensures that the network never sees the exact same example twice and can thus counteract overfitting (i.e., memorizing the inputs without being able to generalize). In my first trial runs, I also applied noise during evaluation, but I noticed that this makes the evaluation results nondeterministic – which is not something one typically wants. I’m therefore only going to use salt and pepper noise on the training set, but not on the validation or test set.
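As a rough illustration, salt and pepper noise that is switched off during evaluation could look like the following. This is only a sketch: it assumes pixel values in [0, 1] and corrupts each pixel independently with probability `noise_level`, so the proportion of corrupted pixels is only matched in expectation. The function name and the `training` flag are hypothetical.

```python
import random

def salt_and_pepper(image, noise_level, rng, training=True):
    """Return a noisy copy of a 2D image (rows of values in [0, 1]).

    During evaluation (training=False) the image is returned unchanged,
    which keeps validation and test results deterministic.
    """
    if not training:
        return [row[:] for row in image]
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            if rng.random() < noise_level:
                # Corrupt the pixel: pepper (0.0) or salt (1.0).
                new_row.append(rng.choice([0.0, 1.0]))
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy
```

Because the noise is drawn anew on every call, the same training image yields a different input in every epoch.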

How to train

Okay, now that we’ve covered the major changes of the setup described so far, let’s finally take a look at how the network will be trained. Overall, I’m going to use the Adam optimizer with a minibatch size of 128 and train the network for up to 200 epochs (i.e., passes over the complete data set). The loss function being minimized is simply a linear combination of the individual losses for the three tasks of classification, reconstruction, and mapping.

For classification, we minimize the categorical cross-entropy over all 277 class outputs (i.e., the difference between the probability distribution output by the model and the actual label assignment). For reconstruction, we use the binary cross-entropy for each of the output pixels. We can do this, since our images are only greyscale and most pixels are either white or black, thus being relatively close to being binary. Finally, for the mapping task, we use the mean squared error between the predictions and the actual coordinates.
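To make the three loss terms and their combination concrete, here is a small per-example sketch in plain Python. The weights `w_class`, `w_recon`, and `w_map` are hypothetical names, their default values are placeholders rather than the settings used in the experiments, and the clipping constant `EPS` is an assumption added for numerical safety.

```python
import math

EPS = 1e-7  # assumed clipping constant to avoid log(0)

def categorical_cross_entropy(probs, label):
    """Negative log-probability the model assigns to the true class."""
    return -math.log(max(probs[label], EPS))

def binary_cross_entropy(pred_pixels, true_pixels):
    """Mean binary cross-entropy over all output pixels (values in [0, 1])."""
    total = 0.0
    for p, t in zip(pred_pixels, true_pixels):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred_pixels)

def mean_squared_error(pred, target):
    """Mean squared difference between predicted and true coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def combined_loss(l_class, l_recon, l_map,
                  w_class=1.0, w_recon=1.0, w_map=1.0):
    """Linear combination of the three task losses."""
    return w_class * l_class + w_recon * l_recon + w_map * l_map
```

Setting a weight to zero removes the corresponding task from training, e.g. `w_recon=0.0` and `w_map=0.0` for a classification-only run.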

How to evaluate

For each of the three tasks, we need some evaluation metrics that tell us how good we are. For the classification task, we simply compute the accuracy with respect to both the TU Berlin and the Sketchy data set. By reporting separate accuracy values, we can easily compare our performance to other approaches from the literature. For the reconstruction task, we re-use the binary cross-entropy loss that has also been used for training the network. For the mapping task, we report the mean squared error, the mean Euclidean distance between the prediction and the ground truth, and the coefficient of determination R² (cf. the prior study on the NOUN data set).
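The three mapping metrics can be sketched as follows. Note one assumption: R² is computed per dimension of the similarity space and then averaged, which is one common convention but not necessarily the one used in the NOUN study. The function name is hypothetical.

```python
import math

def mapping_metrics(predictions, targets):
    """Return (MSE, mean Euclidean distance, R²) for lists of coordinate vectors."""
    n = len(targets)
    dims = len(targets[0])

    # Mean squared error over all examples and dimensions.
    squared_errors = [(p - t) ** 2
                      for pred, tgt in zip(predictions, targets)
                      for p, t in zip(pred, tgt)]
    mse = sum(squared_errors) / (n * dims)

    # Mean Euclidean distance between predicted and true points.
    med = sum(math.dist(pred, tgt)
              for pred, tgt in zip(predictions, targets)) / n

    # Coefficient of determination, averaged over dimensions (assumption).
    r2_per_dim = []
    for d in range(dims):
        mean_d = sum(t[d] for t in targets) / n
        ss_res = sum((p[d] - t[d]) ** 2 for p, t in zip(predictions, targets))
        ss_tot = sum((t[d] - mean_d) ** 2 for t in targets)
        r2_per_dim.append(1.0 - ss_res / ss_tot)
    r2 = sum(r2_per_dim) / dims

    return mse, med, r2
```

A model that always predicts the mean of the targets gets R² = 0, and a perfect model gets R² = 1, which makes this metric easier to interpret than the raw mean squared error.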

Since we have only 60 images which are labeled with coordinates in the similarity space, I decided to use five-fold cross-validation for evaluating performance: the data set is split into five parts before we apply the image augmentation, thus ensuring that two inputs based on the same original image always end up in the same fold. We then train the network on three of the five folds, use one fold as the validation set (in order to pick the epoch with the best generalization performance), and use the remaining fold as the test set. This is done five times (such that each fold is used exactly once for testing and exactly once for validation), and the results we report are averaged across all of these runs.
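The fold rotation can be sketched as below. The split is made on the identifiers of the original images, so augmented copies inherit the fold of their source image. Pairing each test fold with the next fold as validation set is an assumption; any rotation that uses each fold exactly once in each role would do. Function names and the seed are hypothetical.

```python
import random

def make_folds(image_ids, n_folds=5, seed=0):
    """Shuffle the original (un-augmented) image ids and deal them into folds."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def cross_validation_splits(folds):
    """Yield (train, validation, test) id lists, one triple per run."""
    n = len(folds)
    for test_idx in range(n):
        val_idx = (test_idx + 1) % n  # assumed rotation scheme
        train = [img
                 for i, fold in enumerate(folds)
                 if i not in (test_idx, val_idx)
                 for img in fold]
        yield train, folds[val_idx], folds[test_idx]
```

With 60 original images this gives 36 training, 12 validation, and 12 test images per run, before augmentation is applied within each part.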

Hyperparameters for regularization

Neural networks in general come with a large number of hyperparameters. While we keep the overall structure of the network as well as the input data set fixed, I experiment with various settings for the following hyperparameters, which can be used to “regularize” the model (i.e., prevent it from overfitting):

Weight decay adds another term to the overall loss function which is based on the overall size of the network’s weights. We can control how much large weights are penalized and hence incite the network to find solutions with small weights.
Dropout is used in the first fully connected layer of the encoder. This technique randomly turns off 50% of the units in this layer for each training example, thus requiring the network to learn a somewhat redundant representation that does not depend on individual neurons.
The noise level (i.e., the proportion of pixels to corrupt) for the salt and pepper noise can also serve as another level of regularization, since it increases the variance of the inputs.
Finally, the size of the bottleneck layer also constrains how much information the network is able to represent. A smaller bottleneck layer forces the network to find a more efficient representation, thus inciting it to better compress the input.
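Two of these regularizers are easy to write down explicitly. The sketch below assumes an L2 penalty for weight decay (the text above only says the term depends on the size of the weights) and the common "inverted" dropout variant, which rescales the surviving units during training; both choices, as well as the function names, are assumptions for illustration.

```python
import random

def l2_penalty(weights, decay):
    """Weight decay term added to the loss: decay * sum of squared weights."""
    return decay * sum(w * w for w in weights)

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the units during training.

    Surviving units are scaled by 1 / (1 - rate) so that the expected
    activation stays the same; at evaluation time nothing is dropped.
    Assumes 0 <= rate < 1.
    """
    if not training:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

With `rate=0.5`, roughly half of the units in the layer are set to zero for each training example, and a larger `decay` penalizes large weights more strongly.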

Experimental steps

Okay, that’s already been quite a lot of information. But which exact experiments am I going to run? Well, here’s a preliminary list, along with a short explanation for each of the steps:

1. Transfer learning from ImageNet: Use a neural network that has been trained on ImageNet, take its hidden representation, and train a linear regression on top of that. This is essentially the approach used in the NOUN study and should serve as a lower performance bound.
2. Classification on sketches: Train the network by only using the classification task. I don’t expect to reach state-of-the-art performance levels, but the accuracies reached in this setup can tell me whether the overall setup is sound (i.e., whether the network is able to learn something and performs reasonably well).
3. Transfer learning from sketches: Use the network trained in step 2, take its hidden representation, and train a linear regression on top of that. This is the same setup as in step 1, but this time we use a network that has been trained on data much more similar to the line drawings (namely, sketches instead of photographs). The difference between the results of step 3 and those of step 1 tells us something about the impact of the data set used for pre-training.
4. Multi-task learning with sketches: Train the network with both the classification and the mapping objective at the same time. A comparison of the results to those from step 3 can tell us how much we profit from joint training in contrast to two-phase training.
5. Generalize to other target spaces: Remember from my blog posts about the shape spaces that we have target spaces of different dimensionality. In all of the above experiments, we only consider the four-dimensional target space. In this final experimental step, we can now check to what extent performance depends on the number of dimensions that the similarity space has.

Sounds like a plan – but what about the reconstruction task? Well, for now, I’m going to leave it out. If time permits, I’ll follow a similar procedure for the reconstruction task as well (reconstruction only, transfer learning, multi-task learning, generalization to other target spaces). But I think there’s already enough to do with respect to the classification task, so I’ll first try to get some results for the above-mentioned experiments before starting to consider the reconstruction task. Stay tuned!

References

[1] Krizhevsky, A.; Sutskever, I. & Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, Curran Associates, Inc., 2012, 25, 1097-1105.

[2] Yu, Q.; Yang, Y.; Liu, F.; Song, Y.-Z.; Xiang, T. & Hospedales, T. M. Sketch-a-Net: A Deep Neural Network that Beats Humans. International Journal of Computer Vision, Springer, 2017, 122, 411-425.