Fully Convolutional Networks
Long et al. showed that state-of-the-art convolutional neural network architectures traditionally used for image classification could be adapted into networks capable of per-pixel labelling of images, achieving better than state-of-the-art segmentation performance at a fraction of the inference time. This finding has been central to high-performance semantic segmentation since its publication.
Semantic Segmentation vs Image Classification
The traditional image classification problem involves ascribing to an image a single label that best names the object it contains. Semantic segmentation takes this one step further, extending the labelling to every pixel in the image. The output of such a classifier is one image per label, indicating whether or not each pixel belongs to that label. These per-class channels can then be combined into a final image mapping each pixel of the original image to a class, generally displayed with a different colour for each label. The increased complexity of semantic segmentation comes both from the multi-label nature of the problem and from the requirement that the network not only indicate that an object is present but also retain information about where that object is in the picture.
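As a minimal sketch of that last step, the snippet below (NumPy, with a made-up three-class output and palette chosen purely for illustration) collapses per-class score maps into a single label map and colours it with one colour per class:

```python
import numpy as np

# Hypothetical per-class score maps: one HxW map per class (3 classes here).
# In practice these would be the segmentation network's output channels.
scores = np.random.rand(3, 4, 4)   # shape: (num_classes, height, width)

# Each pixel is assigned the class whose score map gives it the highest value.
label_map = scores.argmax(axis=0)  # shape: (height, width), values in {0, 1, 2}

# A simple palette maps each class index to a display colour (RGB).
palette = np.array([[0, 0, 0],     # class 0: background -> black
                    [255, 0, 0],   # class 1 -> red
                    [0, 255, 0]])  # class 2 -> green
coloured = palette[label_map]      # shape: (height, width, 3)
print(label_map.shape, coloured.shape)
```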
From CNN to FCN
Adapting a convolutional neural network into a fully convolutional network requires modifying the classification layers of the original network. This is achieved by removing the fully connected layers and replacing them with convolution layers. In doing so the restriction on input size is lifted, as a convolution can be applied to a tensor of any spatial size, generating an output whose size is determined by the kernel size and stride. This modification is applied to a pre-trained image classification model, exploiting transfer learning to provide a prior of relevant feature-extracting filters that are then fine-tuned to the semantic segmentation task.
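A rough sketch of this conversion is shown below, using torchvision's AlexNet (assuming a recent torchvision) and a hypothetical 21-class output; for brevity the new convolutions are randomly initialised rather than reshaping the original fully connected weights:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 21  # e.g. PASCAL VOC; an assumption for illustration

# Start from a pre-trained classifier (AlexNet, as in the example that follows).
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)

# Keep the convolutional feature extractor unchanged (the down-sampling path).
features = alexnet.features

# Replace the fully connected classifier with equivalent convolutions:
# fc6 (256*6*6 -> 4096) becomes a 6x6 convolution, fc7 a 1x1 convolution,
# and the final scoring layer a 1x1 convolution with one channel per class.
fcn_head = nn.Sequential(
    nn.Conv2d(256, 4096, kernel_size=6), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Conv2d(4096, num_classes, kernel_size=1),
)

# The network now accepts inputs larger than 224x224 and produces a spatial
# grid of per-class scores instead of a single prediction vector.
x = torch.randn(1, 3, 500, 500)
coarse_scores = fcn_head(features(x))
print(coarse_scores.shape)  # torch.Size([1, 21, 9, 9]) for a 500x500 input
```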
As the input is passed through successive convolution and pooling layers, the tensor's spatial size shrinks; for this reason the forward pass is often referred to as the down-sampling path. This pathway is predominantly responsible for determining what objects are in the image. The down-sampled output of the fully convolutional network is then subjected to an up-sampling path to recover tensors matching the size of the original image, restoring information about the localisation of objects in the image. These tensors are then combined to form a labelled version of the original image, with each pixel assigned to a class.
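One quick way to see this shrinking is to trace the feature map sizes through the backbone; the sketch below does so for torchvision's AlexNet features with an assumed 500x500 input:

```python
import torch
from torchvision import models

# Trace the spatial size of the feature maps through the down-sampling path.
features = models.alexnet(weights=None).features

x = torch.randn(1, 3, 500, 500)
for layer in features:
    x = layer(x)
    if isinstance(layer, (torch.nn.Conv2d, torch.nn.MaxPool2d)):
        print(type(layer).__name__, tuple(x.shape))
# Height/width shrink from 500 down to 14 while channel depth grows,
# trading spatial ("where") resolution for semantic ("what") information.
```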
Up-sampling and Skips
The up-sampling path can be as simple as feeding the final convolution layer's output to a transposed convolution layer. Transposed convolution acts as a learned form of interpolated up-sampling, with filters that produce a larger output when applied to a smaller input. The technique goes by several names, including up-convolution and deconvolution, and a special case of it is fractionally strided convolution. By feeding the transposed convolution layer the last layer's output, it is responsible for up-sampling from the smallest output in the network back to the original image size. In the FCN derived from AlexNet described above, the transposed convolution layer has to interpolate by 32x in order to form the desired labelled output. Long et al. found that segmented images generated with this naive approach were often coarse and over-represented the region occupied by objects in the scene.
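A minimal sketch of such a single-step up-sampling, assuming a 21-class score map and a kernel size, stride and padding chosen (as one common choice, not taken from the paper) to give an exact 32x enlargement:

```python
import torch
import torch.nn as nn

num_classes = 21  # illustrative

# Coarse per-class score map, e.g. the output of the fully convolutional head.
coarse = torch.randn(1, num_classes, 16, 16)

# A single transposed convolution that learns to up-sample the scores 32x.
upscore = nn.ConvTranspose2d(num_classes, num_classes,
                             kernel_size=64, stride=32, padding=16, bias=False)

full = upscore(coarse)
print(full.shape)  # torch.Size([1, 21, 512, 512]): 32x larger spatially
```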
To remedy this, addition with intermediate layer outputs is performed before the final transposed convolution is applied. For AlexNet this created three variants, FCN-32, FCN-16 and FCN-8, where the suffix number indicates the amount of up-sampling the final transposed convolution layer is responsible for. This allows some of the localisation information contained in shallower outputs to be combined with the deeper outputs, which are more decided about class labels. It is referred to as a skip because the shallower outputs skip over later layers to be combined in the up-sampling path: the deeper output is up-sampled by 2x and then added elementwise to the skipped output, as sketched below.
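A sketch of one such skip, assuming hypothetical score maps taken from a shallower pooling stage and from the final layer:

```python
import torch
import torch.nn as nn

num_classes = 21  # illustrative

# Hypothetical score maps: 'pool4_scores' from a shallower layer (finer grid)
# and 'deep_scores' from the final layer (coarser grid, half the resolution).
pool4_scores = torch.randn(1, num_classes, 32, 32)
deep_scores = torch.randn(1, num_classes, 16, 16)

# Up-sample the deeper scores by 2x with a transposed convolution...
upscore2 = nn.ConvTranspose2d(num_classes, num_classes,
                              kernel_size=4, stride=2, padding=1, bias=False)

# ...then add them elementwise to the skipped, shallower scores.
fused = upscore2(deep_scores) + pool4_scores
print(fused.shape)  # torch.Size([1, 21, 32, 32])

# A final transposed convolution then only needs to up-sample 16x
# (rather than 32x), giving the FCN-16 style variant.
```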
In later research it has been found that, for simple segmentation set-ups where there is only one transposed convolution layer per class, replacing the learned up-sampling with fixed bilinear interpolation does not affect segmentation performance.
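Under that simplification the final learned up-sampling can be swapped for a fixed bilinear resize; a sketch using PyTorch's F.interpolate, with the same illustrative 21-class score map as above:

```python
import torch
import torch.nn.functional as F

num_classes = 21  # illustrative
coarse = torch.randn(1, num_classes, 16, 16)

# Fixed bilinear interpolation as a drop-in replacement for a learned
# transposed convolution in the final 32x up-sampling step.
full = F.interpolate(coarse, scale_factor=32, mode="bilinear",
                     align_corners=False)
print(full.shape)  # torch.Size([1, 21, 512, 512])
```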
References
Jonathan Long, Evan Shelhamer, Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CoRR, abs/1411.4038, 2014. arXiv:1411.4038.