Title: Complex Semantic Concept Detection in Video (Complex semantische concept herkenning in video)
Other Titles: Complex Semantic Concept Detection in Video
Authors: Poulisse, Gerardus; S0196970
Issue Date: 7-Nov-2012
Abstract: There are various approaches to gaining a semantic understanding of video. Approaches include gaining a better understanding of the underlying video structure through video segmentation, summarizing the video to report the most salient events, identifying the concept classes (persons, objects, scenes) present, and identifying the sequences of actions that characterize events. A substantial portion of the thesis focuses on the segmentation of video into scenes, where each scene contains a central idea or theme that serves as a component of the larger narrative constituting the entire video. Once identified, scenes can be indexed for later retrieval, summarized for quicker access, or used as reference points when browsing. A typical application domain is the segmentation of news broadcast video.
Video structural analysis is known as story segmentation when it refers specifically to the domain of news broadcasts, and as scene segmentation when the domain is unrestricted. Such structural analysis is typically content-based; that is, a change between stories is identified by a recognized change in context as one topic segues into another. Prior knowledge of the domain permits domain-specific knowledge to be included in the structural video analysis. In news broadcast video segmentation, such knowledge may take the form of anchor detection, jingle detection, or speaker silence detection. In the literature, a variety of multi-modal cues contribute to the segmentation decision. However, features and techniques originating in text-only segmentation research have sometimes been neglected in video analysis. Part of the work presented in this thesis describes a supervised approach to the segmentation of news broadcast video using text features in combination with commonly used audio and visual features, and investigates the contribution of each individual feature.
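The sketch below illustrates one kind of text-only feature mentioned above. It computes a lexical cohesion score in the spirit of Hearst's TextTiling. For each gap between transcript sentences, it measures word overlap between the sentences just before the gap and those just after it. A dip in overlap suggests a topic change. This is an illustrative assumption, not the feature set used in the thesis. The function names, the small stop-word list, the window size and the threshold are all hypothetical. A supervised system would more likely feed such scores into a classifier together with audio and visual cues than threshold them directly.

```python
from collections import Counter
import math
import re

# Hypothetical, deliberately tiny stop-word list; real systems use a full list and stemming.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "for", "to", "its", "after"}

def tokenize(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cohesion_scores(sentences, window=2):
    """Score the gap before each sentence i (i >= 1) by comparing the `window`
    sentences before the gap with the `window` sentences after it."""
    bags = [Counter(tokenize(s)) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores

def candidate_boundaries(sentences, window=2, threshold=0.1):
    """Indices i at which a new story is hypothesised to start (low cohesion)."""
    return [gap for gap, score in enumerate(cohesion_scores(sentences, window), start=1)
            if score < threshold]

# Toy transcript with two topics: an election story followed by a flooding story.
transcript = [
    "the election results were announced tonight",
    "voters turned out for the election in record numbers",
    "the election winner thanked voters",
    "heavy rain flooded the river valley",
    "the river burst its banks after rain",
]
print(candidate_boundaries(transcript))  # → [3]
```

Only the gap between the last election sentence and the first flooding sentence shares no content words, so it is the only gap that falls below the threshold.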
The thesis then focuses on scene segmentation, that is, the segmentation of generic video without domain restriction. The problem is formulated as a segmentation task on a long television broadcast of mixed Olympic Games coverage lasting over 4 hours. The coverage spans multiple sports, so no sport-specific detectors could be applied. It also contains visually similar races of the same sport in uninterrupted succession, and the goal is to isolate each race as a separate segment in the final segmentation. This requires a different approach to feature selection, and the choice is made to develop an unsupervised method that is sufficiently robust to be applied in other domains. A summarization component labels the resulting scene segments by identifying the persons present, salient keywords, and the type of sport, the latter by means of a classifier trained on Wikipedia articles. The next part of the thesis determines salient events within each scene by performing temporal sequence analysis of the underlying shots via string kernels. In the context of the Olympic Games, a swimming race may have underlying sequences corresponding to the swimmers lining up before the race, the actual swimming back and forth in the lanes, the conclusion of the race, and the final times. Identifying such events immediately provides a deeper semantic understanding of what is taking place, and can also be used to find similar event sequences in other video broadcasts from the same domain. Within the context of a video browser, stories and scenes are further analysed in other modules to create additional indexes for cross-referencing, so that scenes that discuss the same topic, share the same people, or have a common location can be identified. In this thesis, scenes are mined for common, semantically relevant event sequences that describe the overall scene.
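The string-kernel idea can be illustrated with the p-spectrum kernel, one of the simplest string kernels. It compares two sequences by counting the contiguous length-p subsequences they share. The sketch assumes that each shot has already been mapped to a discrete label, for example by clustering shot features. The label names below are invented for illustration, and the thesis may use a different string kernel or labelling scheme.

```python
from collections import Counter
import math

def spectrum(seq, p):
    """Counts of all contiguous length-p subsequences (p-grams) of a label sequence."""
    return Counter(tuple(seq[i:i + p]) for i in range(len(seq) - p + 1))

def spectrum_kernel(s, t, p=2):
    """p-spectrum kernel: inner product of the p-gram count vectors of s and t."""
    cs, ct = spectrum(s, p), spectrum(t, p)
    return sum(n * ct[g] for g, n in cs.items() if g in ct)

def normalized_kernel(s, t, p=2):
    """Cosine-normalised kernel in [0, 1], so long and short scenes are comparable."""
    denom = math.sqrt(spectrum_kernel(s, s, p) * spectrum_kernel(t, t, p))
    return spectrum_kernel(s, t, p) / denom if denom else 0.0

# Hypothetical shot-label sequences for two swimming races and a studio interview.
race_a = ["lineup", "start", "lane", "lane", "turn", "lane", "finish", "times"]
race_b = ["lineup", "start", "lane", "turn", "lane", "finish", "times"]
interview = ["studio", "athlete", "studio", "athlete"]

print(spectrum_kernel(race_a, race_b))              # → 6
print(round(normalized_kernel(race_a, race_b), 3))  # → 0.926
print(normalized_kernel(race_a, interview))         # → 0.0
```

The two races share most of their shot transitions and score close to 1. The interview shares none and scores 0. A matrix of such kernel values over all scenes could then be passed to a kernel-based clustering or classification method to find recurring event sequences. That last step is an assumption about how such a kernel is typically used, not a description of the thesis pipeline.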
The work presented in this thesis on scene and story segmentation partitions a video into semantically coherent units, which serve as the basic units for browsing and retrieval. Substantive contributions are made to unsupervised semantic event detection for indexing and retrieval. The research was conducted in the context of two projects. The first is the AMASS++ (Advanced Multimedia Alignment and Structured Summarization) research project, which aimed to develop an advanced video archive browser and retrieval system. The second is the recently launched research project on task-oriented search and content annotation for media production (TOSCA-MP).
Publication status: published
KU Leuven publication type: TH
Appears in Collections: Research Unit KU Leuven Centre for IT & IP Law (CiTiP)
Electrical Engineering - miscellaneous
ESAT - PSI, Processing Speech and Images
Faculty of Law
Informatics Section

Files in This Item:
File                    Status     Size      Format
thesis-10-22-2012.pdf   Published  10069 Kb  Adobe PDF


All items in Lirias are protected by copyright, with all rights reserved.