On the Utility of Generic ConvNets Visual Representations
We consider the utility of global image descriptors given by Deep Convolutional Networks (ConvNets) for visual recognition tasks. Given a ConvNet that has been trained on a large labeled data set, the activations of the feed-forward units at a certain layer can be used as a generic representation of a new input image for a target task. We will highlight three aspects of this common scenario in the context of transfer learning. We will first visit several factors affecting transferability, including learning factors such as network design and the distribution of training data, as well as post-learning factors such as the choice of layer in the trained ConvNet. By optimising these factors, we see that significant improvements can be achieved on various standard visual recognition tasks. Then, we will explore what information resides in such representations; interestingly, we find that they implicitly carry strong spatial information, which was unexpected in a network trained for classification problems. Finally, we will introduce an efficient pipeline for visual instance retrieval in which spatial search is enabled by ConvNet representations. The work presented was performed with the computer vision group of CVAP.
Atsuto Maki is a Docent at the Royal Institute of Technology (KTH), Sweden. He received the BE degree from Kyoto University, the ME degree from the University of Tokyo, and the PhD degree from KTH (in 1996). After serving as a researcher at Toshiba Corporate R&D Center, a senior researcher at Toshiba Research Cambridge, U.K., and an associate professor at Kyoto University, he moved to KTH in 2013. His research interests cover a broad range of topics in machine learning and computer vision, including motion and object recognition, clustering, subspace analysis, stereopsis, and representation learning. He is currently a board member of the Swedish Society for Automated Image Analysis (SSBA).