Exploiting Appearance-based Representations for Recognition

Publication date: 2016-12-16

Author:

Ghodrati, Amir

Keywords:

computer vision, image representation, convolutional neural networks, action recognition, image generation, object proposals, viewpoint estimation, fisher vector, PSI_VISICS

Abstract:

It is widely accepted that the success of vision algorithms depends to a large extent on the chosen image or video representation. Different representations capture different semantic factors of the data and provide robustness to nuisance factors such as image noise, clutter, and blur. In this thesis, we concentrate our efforts on finding state-of-the-art appearance-based image and video representations for several computer vision tasks, namely action recognition, viewpoint estimation, object proposal generation, and image generation. We start the thesis by making use of bag-of-words, one of the most commonly used representations in object and video recognition, for the task of action recognition. We propose and evaluate different ways to integrate motion segmentation and action recognition. We obtain state-of-the-art results on two benchmarks and show that these two tasks are interdependent, with an iterative optimization of the two giving the best results. Afterwards, we exploit the next generation of representations, based on Fisher encoding, to estimate the viewpoint of objects. We show how solely using such 2D representations, if properly tuned, allows us to effectively bypass 3D models and obtain promising results in estimating the viewpoint of faces, cars, and general objects. Next, we make use of data-driven, learning-based representations for generating a set of object proposals in an image. In particular, we propose an efficient coarse-to-fine cascade over multiple layers of a deep convolutional neural network that focuses on object and action locations. In most cases, our method is comparable to or better than state-of-the-art approaches in terms of both accuracy and computational cost. Finally, building on the power of the deep learning paradigm, we tackle the problem of image generation to evaluate to what extent a representation can be inverted back into an image. To this end, we define the new problem of generating modified images using a deep encoder-decoder architecture. We obtain good qualitative and quantitative results on a face dataset on three sub-tasks: rotating faces, changing illumination, and image inpainting.
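For readers unfamiliar with the bag-of-words representation mentioned above, the following minimal Python sketch illustrates the general idea only; it is not the pipeline used in the thesis (which builds on stronger motion features and encodings), and the descriptor dimensions, codebook size, and helper names are illustrative assumptions. Local descriptors are quantised against a learned codebook, and each image or video clip is summarised as a fixed-length histogram of visual words.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=256, seed=0):
    """Cluster local descriptors (e.g. SIFT or dense-trajectory features)
    into k visual words. `train_descriptors` is an (N, D) array pooled
    from many training images or videos."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(train_descriptors)

def bow_encode(descriptors, codebook):
    """Encode one image/video clip as an L1-normalised histogram of visual words."""
    words = codebook.predict(descriptors)  # nearest visual word for each descriptor
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)     # normalise so clips of different length are comparable

# Toy usage with random arrays standing in for real local descriptors.
rng = np.random.default_rng(0)
codebook = build_codebook(rng.normal(size=(5000, 64)), k=128)
clip_representation = bow_encode(rng.normal(size=(300, 64)), codebook)
print(clip_representation.shape)  # (128,) fixed-length vector, ready for a classifier such as an SVM
```

The resulting fixed-length histogram is what makes the representation convenient for recognition: clips with different numbers of local descriptors all map to vectors of the same dimensionality.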