Weakly Supervised Machine Learning Algorithms for Object Recognition in the Wild and Entity Linking in Videos
Author:
Keywords:
Entity linking, Vision & Language, Multimedia Indexing, Weakly supervised machine learning algorithms, Object recognition
Abstract:
With the proliferation of video-rich data on the Internet, there is a pressing need for search tools that can retrieve not only relevant videos from a corpus, but also relevant snippets within a video. For retrieving relevant videos, current search technologies hinge on labor-intensive manual annotation of tags, which are subjective and often incomplete. To fully automate search and retrieval systems, we need tools that can understand the content presented in videos and automatically generate labels that accurately describe them. Towards that goal, we consider a video with subtitles and focus on two problems: a) who or what appears in the video key frames, and b) what the textual entity mentions in the subtitles refer to. State-of-the-art methods largely adopt a supervised paradigm, relying on expensive, manually created training examples to indicate the mapping between the visual and textual entities. In contrast, we address these questions using a weakly supervised paradigm, where the text may provide clues about the visual content and vice versa. We apply this paradigm to the problem of wildlife recognition in nature documentaries.

In a weakly supervised setting, the problem of recognizing entities in vision and language presents a host of challenges for vision, for text, and for associating the two. On the vision side, we deal with a scenario where there are no visual markers to indicate the location of an animal; in fact, it is not even known whether a given key frame contains any animals at all. Additionally, since the animals are shot in their natural habitat, there are challenges due to self-occlusion, camouflage, illumination changes, and so on. On the textual side, while we have tools to detect entity mentions in the text, not all of them are pertinent to animals. Even when the mentions do refer to animals, they are often so ambiguous that they cannot be resolved correctly without a holistic understanding of the context. As for linking text and vision, the absence of visual markers, coupled with the ambiguity of the textual mentions, makes it harder to reliably tie together the entities in the two modalities. In other words, there are no ready-made examples of the association in a limited, diverse dataset.

In this thesis, we present three major contributions that address these challenges. First, we present a multi-modal domain adaptation framework for multi-label classification. Here, we propose an algorithm that learns from an external labeled source dataset and iteratively adapts to a target dataset by leveraging the weakly associated textual subtitles that come with the video. We show that this approach significantly outperforms a) a purely vision-based approach, b) a purely text-based approach, c) an approach that uses both text and vision but no labeled examples, and d) an approach that uses both text and vision together with labeled (out-of-domain) examples, but without the adaptive learning. Next, we investigate image representations and object recognition models learned from video documentaries using the weak supervision of the textual subtitles. In particular, we study a support vector machine on top of the activations of a pre-trained convolutional neural network, as well as a Naive Bayes framework on a ‘bag-of-activations’ image representation, where each element of the bag is considered separately.
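The exact ‘bag-of-activations’ formulation is developed in the thesis; purely as an illustration of the general idea, the sketch below assumes a multinomial Naive Bayes model over the indices of the most strongly activated units of a pre-trained CNN, so that each element of the bag contributes an independent per-class term. The to_bag helper, the top-k cut-off, and the single-label prediction are illustrative assumptions, not the thesis's actual construction.

import numpy as np

def to_bag(activation_vector, k=32):
    # Assumed bag construction: keep the indices of the k strongest CNN
    # activations as the "words" of the bag.
    return np.argsort(activation_vector)[-k:]

class BagOfActivationsNB:
    def __init__(self, num_units, num_classes, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing
        self.counts = np.zeros((num_classes, num_units))
        self.class_counts = np.zeros(num_classes)

    def fit(self, bags, labels):
        # bags: one iterable of activated-unit indices per image
        # labels: one set of (possibly weak, multi-label) class indices per image
        for bag, classes in zip(bags, labels):
            for c in classes:
                self.class_counts[c] += 1
                for unit in bag:
                    self.counts[c, unit] += 1
        return self

    def log_scores(self, bag):
        # Per-class score: log prior plus independent per-unit log likelihoods,
        # exactly as in multinomial Naive Bayes for bag-of-words text.
        prior = np.log(self.class_counts + self.alpha)
        prior -= np.log(self.class_counts.sum() + self.alpha * len(self.class_counts))
        like = np.log(self.counts + self.alpha)
        like -= np.log(self.counts.sum(axis=1, keepdims=True)
                       + self.alpha * self.counts.shape[1])
        return prior + like[:, list(bag)].sum(axis=1)

    def predict(self, bag):
        # Single-label decision for simplicity; the scores could instead be
        # thresholded per class in a multi-label setting.
        return int(np.argmax(self.log_scores(bag)))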
When the models were tested on a target dataset shot under entirely different conditions, we found that the ‘bag-of-activations’ model outperformed the classical models by a large margin. The third and final contribution capitalizes on inherent characteristics of the video, such as the temporal coherence of video frames and the dependencies within and across the visual and textual modalities. We show that this integrated modelling yields significantly better performance than text-only and vision-only approaches. In particular, textual mentions that cannot be resolved using text-only methods are resolved correctly by our method. The methods proposed here take us a step closer to object recognition in the wild and automatic video indexing. While they have been validated on wildlife documentaries, they are quite generic and can be applied to many other genres: beyond wildlife, beyond subtitles, and even beyond video documentaries.
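The thesis develops a full joint model over vision and text; the toy sketch below only illustrates, under assumptions of our own, one simple way to combine per-frame visual log-scores, subtitle-derived textual log-scores, and temporal coherence via Viterbi-style dynamic programming. The additive fusion, the constant switching penalty, and the function name are illustrative choices, not the thesis's actual formulation.

import numpy as np

def decode_labels(visual_scores, text_scores, switch_penalty=1.0):
    # visual_scores, text_scores: arrays of shape (num_frames, num_labels)
    # holding per-frame log-scores from each modality; returns one label
    # per frame.
    unary = visual_scores + text_scores        # within-frame, cross-modal term
    T, L = unary.shape
    best = unary[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # Temporal coherence: keeping the same label across consecutive
        # frames is free, switching labels costs `switch_penalty`.
        trans = best[:, None] - switch_penalty * (1 - np.eye(L))
        back[t] = trans.argmax(axis=0)
        best = trans.max(axis=0) + unary[t]
    labels = np.zeros(T, dtype=int)
    labels[-1] = best.argmax()
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return labels

In this toy setting, a larger switch_penalty enforces smoother label sequences across consecutive key frames, which is the intuition behind exploiting temporal coherence alongside the cross-modal scores.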