Self-Supervised Visual Representation Learning: Exploiting Spatial Structure and Bootstrapping Representations
Author:
Abstract:
While supervised deep learning has revolutionized computer vision, its reliance on labeled datasets presents critical challenges, including high annotation costs, inherent biases, and limited transferability of learned representations. Self-supervised learning offers a compelling alternative, in which meaningful representations are learned directly from unannotated data. By leveraging structural patterns and correlations within the data itself, self-supervised methods unlock the potential to create more adaptable, scalable, and less biased learning algorithms. This thesis introduces novel approaches that improve self-supervised learning by extracting richer supervision signals from raw data. Specifically, it explores two key strategies in the context of visual data: (1) leveraging spatial structure and (2) bootstrapping past representations, hence the subtitle of this thesis. Broadly, machine learning pipelines consist of three fundamental components: the dataset, the hypothesis class, and the learning algorithm. While improvements can be made to any of these elements, this manuscript focuses exclusively on the learning algorithm, assuming a fixed dataset and hypothesis class. Each chapter, together with its corresponding appendix, presents a new learning algorithm, varying both the data combinations used for self-supervision and the granularity at which self-supervision is applied.