
Structured Representations for Joining Visual and Linguistic Data

Publication date: 2024-05-24

Authors:

Milewski, Victor
Moens, Marie-Francine

Abstract:

In the last couple of years, the capabilities of Artificial Intelligence (AI) models have increased significantly due to the release of large language models and foundation models. Thanks to this increase, and because the models have been made widely available through web interfaces, people have started to use and probe them for all types of knowledge. The most common example is ChatGPT, where people can converse with the language model. However, people expect these models to have a contextual or structural understanding of the world, which most models do not have. By structure, we mean a representation that describes something using nodes (or objects) and edges (or relationships) between them. In this dissertation, we explore structural representations of visual data and investigate how to join them with structural representations of textual data. Our goal is to help the field of computer science move towards a joint structural representation between the modalities.

First, we explore how to use structural representations and contextual knowledge to guide language generation. We do this globally by using a Scene Graph Generation (SGG) model to generate a scene graph (SG) for an image. Using the SG, our proposed captioning model can aggregate information in the graph, allowing the captions to better describe the relationships in the image. We also explore the local syntax of the objects to generate a unique description for each object of the same class in the image. By extracting the attributes of each object and providing them to the expression generation model, it can selectively use the attributes to generate unique descriptions.

Second, we explore how well large pretrained vision-language multimodal models have learned to encode the latent structures of the image in their embeddings. To explore this, we propose the scene tree, a novel hierarchical structure describing the objects in the image. It is constructed by mapping a caption's dependency tree to image regions.
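The abstract does not spell out the construction procedure, but a minimal sketch can illustrate the idea of turning a caption's dependency tree into a hierarchy over image regions. In the sketch below, `nouns`, `heads`, and `regions` are hypothetical inputs (grounded noun token indices, per-token dependency heads, and a noun-to-region grounding), and the assumption that each grounded noun's nearest noun ancestor in the dependency tree becomes its parent is ours, not necessarily the dissertation's exact method.

```python
def scene_tree(nouns, heads, regions):
    """Sketch of a scene-tree construction: arrange image regions in a
    hierarchy that mirrors the dependency relations between the nouns
    grounded to them.

    nouns   : caption token indices that are grounded nouns
    heads   : dependency head index for every caption token (root points to itself)
    regions : dict mapping each grounded noun index to a region identifier
    """
    noun_set = set(nouns)
    parent_of = {}  # region -> parent region (None for top-level regions)
    for n in nouns:
        k, parent = n, None
        # Walk towards the dependency root until another grounded noun is found.
        while heads[k] != k:
            k = heads[k]
            if k in noun_set:
                parent = regions[k]
                break
        parent_of[regions[n]] = parent
    return parent_of
```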
Third, we work on the task of SGG to generate a structured representation of the content of an image. We propose that the model should create an SG as close as possible to the ground-truth human-constructed graph, since we want the graph to describe as much relevant information as possible. To this end, we create an evaluation setup in which the model can remove objects and relationships from the graph that it finds unnecessary. We design several losses that use a soft alignment with language descriptions of the image to guide the model towards the most relevant content.

Finally, we investigate the existence of structures across modalities and fields. We draw inspiration from the fields of psychology and neuroscience, since humans tend to apply syntactic and semantic processing to many modalities. Furthermore, humans assign hierarchical grammar-like structures to the physical world, e.g., when placing objects in a room. Based on this knowledge, we perform initial correlation studies, comparing distances between nouns in image captions to pixel distances in an image. The distance in the caption is determined from the dependency or constituency tree. We also compute the correlations of the image with our scene tree structure and the scene graph. These studies show that a positive correlation exists between every structure and the image.

In conclusion, we have found that understanding structures and generating them for visual data is a difficult task. When correctly applied, a visual structure can guide language generation, but it requires a high-quality structure without too much noise. Given the rise of large pretrained models, we explored multimodal models and found that they are not trained in a manner that promotes the learning of latent structures in the image. Others in the community have also identified this issue, and the topic is receiving increasing attention with various newly proposed benchmarks. Finally, we found that humans tend to organize the physical world, like rooms, in hierarchical grammar-like structures. Furthermore, we discovered a correlation between the distances in linguistic graphical representations and the distances in the image. We hypothesize that a causal relation exists where human language imposes a structure on the physical world, or vice versa. Hence, we hope that collaborations will form between computer science and other fields such as psychology, which can lead to the discovery of such causal effects. We hope this can lead to a better joint structural representation between the visual and linguistic modalities.
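To make the kind of correlation study described above concrete, here is a minimal sketch. It assumes that dependency-tree distances between grounded nouns are compared against Euclidean pixel distances between the centres of their image regions and summarized with a Spearman rank correlation; the function and variable names (`tree_distance`, `structure_image_correlation`, `heads`, `centers`) are illustrative and not taken from the dissertation.

```python
import itertools

import numpy as np
from scipy.stats import spearmanr


def tree_distance(heads, i, j):
    """Number of edges between tokens i and j in a dependency tree.

    heads maps every token index to its head index; the root points to itself.
    """
    def path_to_root(k):
        path = [k]
        while heads[k] != k:
            k = heads[k]
            path.append(k)
        return path

    path_i, path_j = path_to_root(i), path_to_root(j)
    common = set(path_i) & set(path_j)
    # Both minima are reached at the lowest common ancestor of i and j.
    return min(path_i.index(a) for a in common) + min(path_j.index(a) for a in common)


def structure_image_correlation(nouns, heads, centers):
    """Spearman correlation between tree distances of noun pairs and pixel
    distances between the centres of the regions they are grounded to.

    nouns   : caption token indices of grounded nouns
    heads   : dependency head index for every caption token (root points to itself)
    centers : dict mapping each grounded noun index to its region centre (x, y)
    """
    tree_d, pixel_d = [], []
    for i, j in itertools.combinations(nouns, 2):
        tree_d.append(tree_distance(heads, i, j))
        pixel_d.append(float(np.linalg.norm(np.subtract(centers[i], centers[j]))))
    return spearmanr(tree_d, pixel_d)  # (correlation, p-value)
```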