Video Interpretation: from Classification to Online Detection

Publication date: 2018-01-09

Author:

De Geest, Roeland

Keywords:

computer vision, action recognition, video interpretation, online action detection, object recognition, PSI_VISICS, PSI_4259

Abstract:

The amount of video data has grown exponentially in recent years; analyzing all this data by hand is no longer feasible. Video interpretation methods can automatically find concepts in videos in order to summarize them or extract the relevant parts. In this thesis, we present methods for action recognition, object recognition and online action detection.

For action recognition, we design two methods based on 2D dense interest points: one looks for 3D dense interest points, while the other uses trajectories that start at dense interest points. These methods are a suitable alternative to standard dense sampling methods when a lower sampling density is required.

For object recognition, we show that it is beneficial to use motion features instead of simply applying still-image recognition methods to key frames: the characteristic motion of objects helps their recognition. In particular, we demonstrate the effectiveness of dense trajectories, commonly used for action recognition, on datasets with animal classes and means of transportation.

In online action detection, the input is a video stream, and after every frame a decision needs to be made about which action is currently happening. Because the system cannot wait until it has seen the whole action, recognizing the early stages of an action becomes more important. We collect a dataset and demonstrate that current standard methods are insufficient to solve this problem. An LSTM seems well suited for online action detection: it processes the input frame by frame, and it can model both long-term and short-term patterns. In practice, however, its detection accuracy is low. We therefore experiment with a series of techniques that could help the LSTM learn long-term dependencies between actions. A two-stream feedback network, in which one stream focuses on interpreting the input and the other on discovering the temporal patterns, works better than a standard LSTM on both artificial and real-life data.
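
As a rough illustration of the dense-trajectory idea mentioned above, the sketch below samples points on a regular grid and advances them with Farnebäck optical flow (OpenCV). It covers only the tracking core: the descriptors computed along the tracks (HOG, HOF, MBH) and the filtering of static or erratic tracks are omitted, and the grid step, trajectory length and flow parameters are illustrative rather than the thesis settings.

    # Sketch of the dense-trajectory tracking core: sample points on a regular
    # grid and advance them with Farnebäck optical flow. Parameter values are
    # illustrative, not the thesis settings.
    import cv2
    import numpy as np

    def dense_trajectories(frames, step=8, length=15):
        h, w = frames[0].shape[:2]
        ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
        tracks = [np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)]
        prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
        for frame in frames[1:length + 1]:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            # displace each point by the flow sampled at its current position
            idx = tracks[-1].round().astype(int)
            idx[:, 0] = idx[:, 0].clip(0, w - 1)   # x
            idx[:, 1] = idx[:, 1].clip(0, h - 1)   # y
            tracks.append(tracks[-1] + flow[idx[:, 1], idx[:, 0]])
            prev = gray
        return np.stack(tracks, axis=1)  # (num_points, length + 1, 2)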
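
The per-frame decision loop of online action detection can be made concrete with a minimal LSTM classifier. The PyTorch sketch below is only an illustration under assumed dimensions (feature size, hidden size and class count are placeholders, and the frame features are simulated); it shows how a recurrent model emits a decision after every incoming frame without access to future frames.

    # Minimal per-frame online action detector: an LSTM consumes one frame
    # feature at a time and emits a class decision immediately. Dimensions,
    # class count and the input features are placeholders.
    import torch
    import torch.nn as nn

    class OnlineActionDetector(nn.Module):
        def __init__(self, feat_dim=512, hidden_dim=256, num_classes=11):
            super().__init__()
            self.lstm = nn.LSTMCell(feat_dim, hidden_dim)  # one step per frame
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, frame_feat, state):
            state = self.lstm(frame_feat, state)           # returns (h, c)
            return self.classifier(state[0]), state

    detector = OnlineActionDetector()
    state = (torch.zeros(1, 256), torch.zeros(1, 256))
    for frame_feat in torch.randn(100, 1, 512):            # simulated stream
        logits, state = detector(frame_feat, state)
        current_action = logits.argmax(dim=1)              # no future frames used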
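
The two-stream feedback network is described here only at a high level, so the wiring below is one plausible reading of the abstract, not the exact thesis architecture: an interpretation LSTM reads the frame features together with feedback from a temporal LSTM, and the temporal LSTM models patterns over the interpretation stream's hidden state.

    # One plausible wiring of a two-stream feedback network (an assumption,
    # not the exact thesis architecture): an interpretation LSTM reads the
    # frame features plus feedback from a temporal LSTM, which in turn models
    # patterns over the interpretation stream's hidden state.
    import torch
    import torch.nn as nn

    class TwoStreamFeedback(nn.Module):
        def __init__(self, feat_dim=512, hidden_dim=256, num_classes=11):
            super().__init__()
            self.interpret = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
            self.temporal = nn.LSTMCell(hidden_dim, hidden_dim)
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, frame_feat, i_state, t_state):
            # feedback: the temporal hidden state is appended to the input
            x = torch.cat([frame_feat, t_state[0]], dim=1)
            i_state = self.interpret(x, i_state)
            t_state = self.temporal(i_state[0], t_state)
            return self.classifier(t_state[0]), i_state, t_state

As in the single-LSTM loop above, both hidden states start at zero and are carried forward from frame to frame.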