Thomas Mensink | Efstratios Gavves | Zeynep Akata | Cees Snoek

We live in the age of Big Data, featuring huge image and video datasets. Despite their size, however, we cannot guarantee sufficient annotations for all possible concepts. Moreover, while annotations are easy to obtain for common object concepts, such as ball or helicopter, this is not straightforward for more exotic concepts like a “lagerphone” (a percussion musical instrument): not only do the available images not suffice, but often the annotations can be made only by experts. In the absence of annotations we promote zero-shot learning, where the combination of a) existing classifiers and b) semantic, cross-concept mappings between these classifiers allows for building novel classifiers without resorting to any visual examples. From a more philosophical point of view, zero-shot learning relates to the ability to “learn new things” and to “reason over what is learned”. While a DeepNet can reason (almost) perfectly over the 1,000 concepts it is trained on, it cannot reason over any new concept, nor explain novel concepts in terms of what is already known. In this tutorial we focus on zero-shot learning for Computer Vision.
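The combination of existing classifiers and semantic cross-concept mappings described above can be sketched in a few lines. In this minimal illustration (all numbers and names are hypothetical, not from the tutorial), a zero-shot classifier for a novel concept is built as a convex combination of known linear classifiers, weighted by semantic similarity between the novel concept and the known ones:

```python
import numpy as np

# Hypothetical setup: K known concepts with trained linear classifiers
# over D-dimensional image features.
rng = np.random.default_rng(0)
K, D = 5, 10
W_known = rng.normal(size=(K, D))   # one classifier per known concept

# Semantic cross-concept mapping: how strongly the novel concept relates
# to each known concept (e.g. from attributes or word embeddings;
# the values here are made up for illustration).
sim = np.array([0.6, 0.1, 0.0, 0.3, 0.0])
sim = sim / sim.sum()               # normalize to convex weights

# Zero-shot classifier for the novel concept: a similarity-weighted
# combination of the known classifiers -- built without a single
# visual example of the novel concept.
w_novel = sim @ W_known             # shape (D,)

# Score a test image feature against the novel concept.
x = rng.normal(size=D)
score = w_novel @ x
```

This follows the general recipe of combination-based zero-shot methods; actual approaches differ in where the semantic weights come from and how the combination is formed.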

We start the tutorial with an in-depth discussion of visual knowledge transfer, followed by a discussion of different application domains for zero-shot learning, such as classification, localisation, retrieval, and interaction. While these applications have been studied by different communities (machine learning, computer vision, and multimedia), future progress depends on combining their insights.