
Australian Eye-tracker Conference, Date: 2018/04/26 - 2018/04/28, Location: Sydney, Australia

Publication date: 2018-04-28

Authors: Callemein, Timothy; Van Beeck, Kristof; Brône, Geert; Goedemé, Toon

Abstract:

Mobile eye-trackers have been available for quite a while now and are becoming more popular in different research fields such as entertainment, marketing, sociology and linguistics. While the hardware is evolving at a decent pace, the software still lacks robustness and functionality (Brône et al., 2011). Because no software is capable of extracting the necessary information from eye-tracker data, researchers resort to time-consuming manual annotation of huge amounts of data during experiments. The cost of this manual processing step severely limits the extent of each experiment.

This paper focuses on the post-analysis of eye-tracking recordings of human-human interaction, specifically how long the subject looks at the hands and the head of their interlocutor during conversations. Hand and head detectors that rely on markers or coloured gloves (Wang and Popović, 2009) show good results for this task. However, in human-human interaction studies the use of such instruments heightens the awareness of being recorded and is considered an obstruction or source of bias. We want to limit this awareness by using only the hardware already present on the eye-tracker (the RGB overview scene camera). While face detectors based on RGB images have been researched thoroughly, their use is often constrained to frontal faces. During conversations we are not always facing the other person and, depending on their pose, may only see the back of the head. Hands pose an additional challenge because of their varying poses and orientations. Several hand detection techniques based on skin segmentation and DPM models are useful (De Beugher et al., 2016), but still require manual intervention to achieve the best results.

Our approach replaces these computer vision algorithms with recently demonstrated CNN-based human pose estimators. Indeed, the human pose can be found by using such an estimator, which maps the anatomical key-points (head, shoulders, arms, legs, ...). The publicly available OpenPose framework, combining the work of (Cao et al., 2017; Simon et al., 2017; Wei et al., 2016), provides accurate bottom-up pose estimation based only on 2D video. This state-of-the-art technique provides off-the-shelf frame-by-frame pose estimations in the form of key-points. We use this technique as a robust way to find hand and head locations in eye-tracker videos. By looking at the relative location of these points we determine a bounding box for the head and one for each hand (fig. 1). When the gaze overlaps with one of these boxes, we store the box label as the annotation for that frame.

Our approach showed excellent results on data from three eye-tracker datasets, as demonstrated in table 1. In figure 2 we visualize some failure cases of the approach of (De Beugher et al., 2016) without manual intervention, alongside the output of our approach with OpenPose. Next to this improvement in accuracy, our technique processes images at 5 FPS, as opposed to the other techniques, which need at least 36.67 seconds. We are, to our knowledge, the first to use pose estimators in post-analysis tools for human-human interaction eye-tracker data, and we will make this tool publicly available as open source.
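To make the box-labelling step concrete, the sketch below shows one possible way to turn per-frame pose key-points into head and hand bounding boxes and to check whether the gaze point falls inside one of them. It is a minimal illustration, not the authors' implementation: the key-point names, the box-sizing heuristics, and the helper functions are our own assumptions, and it presumes OpenPose output has already been parsed into a dictionary of named 2D points.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box with a semantic label (e.g. 'head')."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def boxes_from_keypoints(kp: dict, scale: float = 1.5) -> list:
    """Derive head/hand boxes from 2D pose key-points.

    `kp` maps key-point names (e.g. 'nose', 'neck', 'left_wrist') to (x, y)
    coordinates; missing key-points are skipped. Box sizes are illustrative
    heuristics based on limb lengths, not the values used in the paper.
    """
    boxes = []
    # Head box: centred on the nose, sized relative to the nose-neck distance.
    if 'nose' in kp and 'neck' in kp:
        nx, ny = kp['nose']
        half = scale * abs(kp['neck'][1] - ny) / 2
        boxes.append(Box('head', nx - half, ny - half, nx + half, ny + half))
    # Hand boxes: centred on each wrist, sized relative to the forearm length.
    for side in ('left', 'right'):
        wrist, elbow = f'{side}_wrist', f'{side}_elbow'
        if wrist in kp and elbow in kp:
            wx, wy = kp[wrist]
            ex, ey = kp[elbow]
            half = 0.5 * ((wx - ex) ** 2 + (wy - ey) ** 2) ** 0.5
            boxes.append(Box(f'{side}_hand',
                             wx - half, wy - half, wx + half, wy + half))
    return boxes


def annotate_frame(gaze: tuple, keypoints: dict):
    """Return the label of the first box the gaze point falls in, or None."""
    for box in boxes_from_keypoints(keypoints):
        if box.contains(*gaze):
            return box.label
    return None
```

Running `annotate_frame` on every frame of an eye-tracker recording would yield a per-frame label stream ('head', 'left_hand', 'right_hand', or None) that can replace the manual annotation step described above.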