Deep homography estimation in dynamic surgical scenes for laparoscopic camera motion extraction

ABSTRACT Current laparoscopic camera motion automation relies on rule-based approaches or focuses solely on surgical tools. Imitation Learning (IL) methods could alleviate these shortcomings, but have so far only been applied to oversimplified setups. In this work, we instead introduce a method that extracts a laparoscope holder's actions from videos of real laparoscopic interventions. We synthetically add camera motion to a newly acquired dataset of camera motion free da Vinci surgery image sequences through a novel homography generation algorithm. The synthetic camera motion serves as a supervisory signal for camera motion estimation that is invariant to object and tool motion. We perform an extensive evaluation of state-of-the-art (SOTA) Deep Neural Networks (DNNs) across multiple compute regimes, finding that our method transfers from the camera motion free da Vinci surgery dataset to videos of laparoscopic interventions, outperforming classical homography estimation approaches both in precision, by 41%, and in CPU runtime, by 43%.


Introduction
The goal in IL is to learn an expert policy from a set of expert demonstrations. IL has been slow to transition to interventional imaging. In particular, the slow transition of modern IL methods into automating laparoscopic camera motion is due to a lack of state-action-pair data (Kassahun et al. 2016; Esteva et al. 2019). The need for automated laparoscopic camera motion (Pandya et al. 2014; Ellis et al. 2016) has, therefore, historically sparked research in rule-based approaches that aim to reactively center surgical tools in the field of view (Agustinos et al. 2014; Da Col et al. 2020). DNNs could contribute to this work by facilitating SOTA tool segmentation and automated tool tracking (Garcia-Peraza-Herrera et al. 2017, 2021; Gruijthuijsen et al. 2021).
Recent research contextualizes laparoscopic camera motion with respect to (w.r.t.) the user and the state of the surgery. DNNs could facilitate this contextualization, as indicated by research in surgical phase and skill recognition (Kitaguchi et al. 2020). However, current contextualization is achieved through handcrafted rule-based approaches (Rivas-Blanco et al. 2014, 2017), or through stochastic modeling of camera positioning w.r.t. the tools (Weede et al. 2011; Rivas-Blanco et al. 2019). While the former do not scale well and are prone to fail in nonlinear interventions, the latter only consider surgical tools. However, clinical evidence suggests that camera motion is also caused by the surgeon's desire to observe tissue (Ellis et al. 2016). Non-rule-based, i.e. IL, attempts that consider both tissue and tools as sources of camera motion exist (Ji et al. 2018; Su et al. 2020; Wagner et al. 2021), but they utilize oversimplified setups, require multiple cameras, or rely on tedious annotations.
In current laparoscopic camera motion automation, DNNs merely solve auxiliary tasks. Consequently, current laparoscopic camera motion automation is rule-based and disregards tissue. While modern IL approaches could alleviate these issues, clinical data of laparoscopic surgeries remains unusable for IL. Therefore, SOTA IL attempts rely on artificially acquired data (Ji et al. 2018; Su et al. 2020; Wagner et al. 2021).
In this work, we aim to extract camera motion from videos of laparoscopic interventions, thereby creating state-action-pairs for IL. To this end, we introduce a method that isolates camera motion (actions) from object and tool motion by relying solely on the observed images (states): DNNs are trained supervisedly to estimate camera motion while disregarding object and tool motion. This is achieved by synthetically adding camera motion, via a novel homography generation algorithm, to a newly acquired dataset of camera motion free da Vinci surgery image sequences. In this way, object and tool motion reside within the image sequences, and the synthetically added camera motion can be regarded as the only source of, and therefore the ground truth for, camera motion estimation. Extensive experiments are carried out to identify the modern network architectures that perform best at camera motion estimation. The DNNs trained in this manner are found to generalize well across domains, in that they transfer to vast laparoscopic datasets. They are further found to outperform classical camera motion estimators.

Related Work
Supervised deep homography estimation was first introduced in (DeTone et al. 2016) and improved through hierarchical homography estimation in (Erlik Nowruzi et al. 2017). It was adopted in the medical field in (Bano et al. 2020). All three approaches generate a limited set of homographies, train only on static images, and use non-SOTA VGG-based network architectures (Simonyan and Zisserman 2014).
Unsupervised deep homography estimation has the advantage of being applicable to unlabelled data, e.g. videos. It was first introduced in (Nguyen et al. 2018), and applied to endoscopy in (Gomes et al. 2019). The loss in image space, however, cannot account for object motion, and only static scenes are considered in these works. Consequently, recent work seeks to isolate object motion from camera motion through unsupervised incentives. Closest to our work is Le et al. (2020), where the authors generate a dataset of camera motion free image sequences. However, due to tool and object motion, their data generation method is not applicable to laparoscopic videos, since it relies on motion free image borders. Zhang et al. (2020) provide the first work that does not need a synthetically generated dataset. Their method works fully unsupervised, but constraining what the network minimizes is difficult to achieve.
Only (Le et al. 2020) and (Zhang et al. 2020) train DNNs on object motion invariant homography estimation. Contrary to their works, we train DNNs supervisedly. We do so by applying the data generation of DeTone et al. (2016) to image sequences rather than single images. We further improve on their method by introducing a novel homography generation algorithm that continuously generates synthetic homographies at runtime, and by using SOTA DNNs.

Theoretical Background
Two images are related by a homography if both view the same plane from different angles and distances. Points on the plane, as observed by the camera from different viewpoints and expressed in homogeneous coordinates p_i = (u_i, v_i, 1)^T, are related by a projective homography G (Malis and Vargas 2007)

    α_g p'_i = G p_i.    (1)

Since the points p_i and p'_i are only observed in the 2D image, depth information is lost, and the projective homography G can only be determined up to the scale α_g. The distinction between the projective homography G and the homography in Euclidean coordinates H = K^{-1} G K, with the camera intrinsics K, is often not made for simplicity, but is nonetheless important for control purposes. The eight unknown parameters of G can be obtained from a set P = {(p_i, p'_i)} of N ≥ 4 matching points by stacking the constraints of (1) into a linear system of equations

    A(P) g = 0,

where g holds the entries of G as a column vector. The ninth constraint, by convention, is usually to set ||g||_2 = 1. Classically, P is obtained through feature detectors, but it may also be used as a means to parameterize the spatial transformation. Recent deep approaches indeed set the p_i in P to the four corners of an image and predict the corner displacements Δp_i = p'_i − p_i. This is also known as the four point homography G_4point = (Δp_0, Δp_1, Δp_2, Δp_3)^T, which relates to G through

    A(P) g = 0, with p'_i = p_i + Δp_i,    (2)

i.e. G is recovered from G_4point by solving the same linear system with the displaced corners.
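As a concrete illustration, the relation between the four point homography and G can be computed directly for the four image corners. The sketch below is our own, not the paper's implementation; the function name and the normalisation g_33 = 1 (instead of ||g||_2 = 1) are our choices, which resolve the same scale ambiguity:

```python
import numpy as np

def four_point_to_G(corners, deltas):
    """Recover the 3x3 projective homography G from the four point
    parameterisation: corner points p_i and displacements dp_i = p'_i - p_i.
    Solves the 8-unknown DLT system with the convention g33 = 1."""
    A, b = [], []
    for (u, v), (du, dv) in zip(corners, deltas):
        up, vp = u + du, v + dv          # p'_i = p_i + dp_i
        A.append([u, v, 1, 0, 0, 0, -up * u, -up * v]); b.append(up)
        A.append([0, 0, 0, u, v, 1, -vp * u, -vp * v]); b.append(vp)
    g = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(g, 1.0).reshape(3, 3)
```

For zero displacements this yields the identity, and a constant displacement of all four corners yields a pure translation, as expected.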

Data Preparation
Similar to Le et al. (2020), we first find camera motion free image sequences and synthetically add camera motion to them. In our work, we isolate camera motion free image sequences from da Vinci surgeries and learn homography estimation supervisedly. We acquire publicly available laparoscopic and da Vinci surgery videos.
An overview of all datasets is shown in Fig. 1. Synthetic and publicly unavailable datasets are excluded. Da Vinci surgery datasets and laparoscopic surgery datasets require different pre-processing steps, which are described below.

Da Vinci Surgery Data Pre-Processing
Many of the da Vinci surgery datasets are designed for tool or tissue segmentation tasks and are therefore published at a frame rate of 1 fps, see Fig. 1a. We merge all high frame rate (HFR) datasets into a single dataset and manually remove image sequences with camera motion, which amount to 5% of all HFR data. We crop the remaining data to remove status indicators, and scale the images to 306 × 408 pixels, later to be cropped by the homography generation algorithm to a resolution of 240 × 320.

Laparoscopic Surgery Data Pre-Processing
Laparoscopic images are typically observed through a Hopkins telescope, which causes a black circular boundary in the view, see Fig. 2. This boundary does not exist in da Vinci surgery recordings. For inference on the laparoscopic surgery image sequences, the most straightforward approach is to crop the view. To this purpose, we determine the center and radius of the circular boundary, which is only partially visible. We detect it by randomly sampling N points p_i = (u_i, v_i)^T on the boundary. This is similar to the work in (Münzer et al. 2013), but instead of computing an analytical solution, we fit a circle by means of a least squares solution. Rewriting the circle equation (u_i − x_0)^2 + (v_i − x_1)^2 = r^2 as the linear system

    [2u_i  2v_i  1] (x_0, x_1, c)^T = u_i^2 + v_i^2,

where the circle's center is (x_0, x_1) and its radius is r = (c + x_0^2 + x_1^2)^{1/2}, allows center and radius to be recovered through a pseudo-inverse. We then crop the view centrally around the circle's center, and scale it to a resolution of 240 × 320. An implementation is provided on GitHub.
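The least squares circle fit described above can be sketched as follows. The function and variable names are ours, and the paper's GitHub implementation may differ in detail:

```python
import numpy as np

def fit_circle(points):
    """Least squares circle fit to N boundary samples p_i = (u_i, v_i).
    The circle equation (u - x0)^2 + (v - x1)^2 = r^2 is rewritten as
    the linear system [2u  2v  1] (x0, x1, c)^T = u^2 + v^2,
    with c = r^2 - x0^2 - x1^2, and solved via the pseudo-inverse."""
    pts = np.asarray(points, float)
    A = np.column_stack([2 * pts[:, 0], 2 * pts[:, 1], np.ones(len(pts))])
    b = (pts ** 2).sum(axis=1)
    x0, x1, c = np.linalg.lstsq(A, b, rcond=None)[0]
    return (x0, x1), float(np.sqrt(c + x0 ** 2 + x1 ** 2))
```

Because the system is linear in (x_0, x_1, c), the fit works even when the sampled points cover only a partial arc of the boundary circle.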

Ground Truth Generation
At train time, one can simply use the synthetically generated camera motion as ground truth. For inference on the laparoscopic dataset, this is not possible. We therefore generate ground truth data by randomly sampling 50 image sequences of 10 frames each from the Cholec80 dataset. In these image sequences, we find characteristic landmarks that are subject to neither tool nor object motion, see Fig. 2b. Tracking these landmarks over time allows one to estimate the camera motion between consecutive frames through (2).

Deep Homography Estimation
In this work we exploit the static camera in da Vinci surgeries, which allows us to isolate camera motion free image sequences. The processing pipeline is shown in Fig. 3.
Image pairs are sampled from image sequences of the HFR da Vinci surgery dataset of Fig. 1a. An image pair consists of an anchor image I_n and an offset image I_{n+t}. The offset image is sampled uniformly from an interval t ∈ [−T, T] around the anchor. The HFR da Vinci surgery dataset is relatively small compared to the laparoscopic datasets, see Fig. 1b. Therefore, we apply image augmentations to the sampled image pairs. These include conversion to grayscale, horizontal and vertical flipping, cropping, changes in brightness and contrast, Gaussian blur, fog simulation, and random combinations thereof. Camera motion is then added synthetically to the augmented image I^aug_{n+t} via the homography generation algorithm from Sec. 3.4. A DNN with a backbone then learns to predict the homography G_4point between the augmented image and the augmented image with synthetic camera motion at time step n + t.
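The anchor/offset sampling step can be sketched as below. This is a minimal illustration with hypothetical names; how the paper's sampler treats sequence borders is an assumption on our part (here, the offset is clipped):

```python
import random

def sample_image_pair(sequence, T):
    """Sample an anchor frame I_n and an offset frame I_{n+t}, with the
    offset t drawn uniformly from [-T, T] and clipped to the sequence."""
    n = random.randrange(len(sequence))
    t = random.randint(-T, T)
    m = min(max(n + t, 0), len(sequence) - 1)
    return sequence[n], sequence[m]
```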

Homography Generation Algorithm
At its core, the homography generation algorithm is based on the work of DeTone et al. (2016). However, where DeTone et al. crop the image with a safety margin, our method allows image crops to be sampled across the entire image. Additionally, our method computes feasible homographies at runtime. This allows us to continuously generate synthetic camera motion, rather than training on a fixed set of precomputed homographies. The homography generation algorithm is summarized in Alg. 1, and visualized in Fig. 3.
Initially, a crop polygon P_c is generated for the augmented image I^aug_n. The crop polygon is defined through a set of points P_c = {p^c_i, i ∈ [0, 3]} in the augmented image, which span a rectangle. The top left corner p^c_0 is randomly sampled such that the crop polygon P_c resides within the image border polygon P_b, hence p^c_0 ∈ [0, w_b − w] × [0, h_b − h], where h and w are the height and width of the crop, and h_b and w_b those of the border. A four point homography G_4point is then randomly sampled with corner displacements Δu_i, Δv_i ∈ [−ρ, ρ], where ρ is the edge deviation. The border polygon P_b is perspectively transformed with G^{-1} into P'_b, and the intersection matrix DE9IM(P'_b, P_c) is computed. If the intersection matrix matches the template encoding that P_c lies within P'_b, the homography G^{-1} is returned; otherwise, a new four point homography G_4point is sampled. Therein, * indicates that the intersection matrix may hold any value, and T, F indicate that the intersection matrix must be true or false at the respective position. In the unlikely case that no homography is found after the maximum number of rollouts, the identity G_4point = 0 is returned. Once a suitable homography is found, a crop of the augmented image Crop(I^aug_n, P_c) is computed, as well as a crop of the warped augmented image at time n + t, Crop(Warp(I^aug_{n+t}, G^{-1}), P_c). This keeps all computationally expensive operations outside the loop.
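A minimal sketch of this generation loop is given below. It assumes the warped border polygon remains convex for small edge deviations, so a simple cross-product sign test stands in for the DE9IM intersection-matrix check; all function names are ours, not the paper's:

```python
import random
import numpy as np

def _dlt(src, dst):
    """3x3 homography mapping the four src points onto the four dst points
    (DLT with the normalisation g33 = 1)."""
    A, b = [], []
    for (u, v), (up, vp) in zip(src, dst):
        A.append([u, v, 1, 0, 0, 0, -up * u, -up * v]); b.append(up)
        A.append([0, 0, 0, u, v, 1, -vp * u, -vp * v]); b.append(vp)
    g = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(g, 1.0).reshape(3, 3)

def _warp_points(G, pts):
    """Apply homography G to an (N, 2) array of points."""
    q = G @ np.column_stack([pts, np.ones(len(pts))]).T
    return (q[:2] / q[2]).T

def _inside(p, quad):
    """Point-in-quadrilateral test via consistent edge cross-product signs;
    valid while the warped border stays convex (small edge deviation rho)."""
    s = []
    for i in range(4):
        e = quad[(i + 1) % 4] - quad[i]
        d = p - quad[i]
        s.append(e[0] * d[1] - e[1] * d[0])
    return all(x >= -1e-9 for x in s) or all(x <= 1e-9 for x in s)

def generate_homography(img_h, img_w, crop_h, crop_w, rho, max_rollouts=50):
    """Sample a crop polygon P_c inside the border polygon P_b, then sample
    four point homographies until the crop lies within the warped border."""
    u0 = random.randint(0, img_w - crop_w)   # top left corner p_c0
    v0 = random.randint(0, img_h - crop_h)
    crop = np.array([[u0, v0], [u0 + crop_w, v0],
                     [u0 + crop_w, v0 + crop_h], [u0, v0 + crop_h]], float)
    border = np.array([[0, 0], [img_w, 0], [img_w, img_h], [0, img_h]], float)
    for _ in range(max_rollouts):
        deltas = np.random.uniform(-rho, rho, (4, 2))   # corner displacements
        G = _dlt(crop, crop + deltas)
        warped_border = _warp_points(np.linalg.inv(G), border)
        if all(_inside(p, warped_border) for p in crop):
            return G, crop                # feasible homography found
    return np.eye(3), crop                # identity fallback
```

Note that, as in the paper's algorithm, only polygon arithmetic happens inside the loop; the expensive image warping and cropping happen once, after a feasible homography is found.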

Experiments
We train DNNs on an 80% train split of the HFR da Vinci surgery dataset from Fig. 1a. The 20% test split is referred to as the test set in the following. Inference is performed on the ground truth set from Sec. 3.2.3. We compute the Mean Pairwise Distance (MPD) of the predicted value for G_4point from the desired one. We then compute the Cumulative Distribution Function (CDF) of all MPDs. We evaluate the CDF at different thresholds t_i, i ∈ {30, 50, 70, 90}, e.g. 30% of all homography estimations lie below an MPD of t_30. We additionally evaluate the compute time on a GeForce RTX 2070 GPU and an Intel Core i7-9750H CPU.
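The MPD and CDF-threshold metrics can be sketched as follows. The function names are ours; we read the thresholds t_i off the empirical CDF as quantiles, which is an assumption about the paper's exact procedure:

```python
import numpy as np

def mpd(duv_pred, duv_true):
    """Mean Pairwise Distance between a predicted and a ground-truth
    four point homography, each given as (4, 2) corner displacements."""
    return float(np.linalg.norm(duv_pred - duv_true, axis=-1).mean())

def cdf_thresholds(mpds, fractions=(0.3, 0.5, 0.7, 0.9)):
    """Thresholds t_30 ... t_90: the MPD value below which the given
    fraction of all homography estimations fall (empirical CDF quantiles)."""
    return {int(100 * f): float(np.quantile(mpds, f)) for f in fractions}
```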

Backbone Search
In this experiment, we aim to find the best performing backbone for homography estimation. We therefore run the same experiment repeatedly with fixed hyperparameters and varying backbones. We train each network for 50 epochs, with a batch size of 64, using the Adam optimizer with a learning rate of 2e−4. The edge deviation ρ is set to 32, and the sequence length T to 25.

Homography Generation Algorithm
In this experiment, we evaluate the homography generation algorithm. We fix the backbone to a ResNet-34 and train it for 100 epochs, with a batch size of 256, using the Adam optimizer with a learning rate of 1e−3. Initially, we fix the sequence length T to 25 and train on different edge deviations ρ ∈ {32, 48, 64}. Next, we fix the edge deviation to ρ = 48 and train on different sequence lengths T ∈ {1, 25, 50}, where a sequence length of 1 corresponds to a static pair of images.

Backbone Search
The results are listed in Tab. 1. The deep methods generally outperform the classical methods on the test set. There is a tendency for models with more parameters to perform better. On the ground truth set, this tendency vanishes: the differences in performance become independent of the number of parameters. Noticeably, many backbones still outperform the classical methods across all thresholds on the ground truth set, and low compute regime models also run quicker on a CPU than comparable classical methods. For example, we find that EfficientNet-B0 and RegNetY-400MF run at 36 Hz and 50 Hz on a CPU, respectively. Both outperform SURF & RANSAC in homography estimation, which runs at 20 Hz.

Homography Generation Algorithm
Given that ResNet-34 performs well on the ground truth set and executes fast on the GPU, we run the homography generation algorithm experiments with it. As can be seen in Fig. 4a, the effect of the edge deviation is negligible for inference. Fig. 4b shows the effect of the sequence length T on the inference performance. Notably, with T = 1, corresponding to static image pairs, the SURF & RANSAC homography estimation outperforms the ResNet-34. For the other sequence lengths, ResNet-34 outperforms the classical homography estimation. The CDF for the best performing combination of parameters, T = 25 and ρ = 48, is shown in Fig. 5. Our method generally outperforms SURF & RANSAC. The advantage of our method becomes most apparent for a CDF ≥ 0.5. Even the identity outperforms SURF & RANSAC for large MPDs. This aligns with the qualitative observation that motion is often overestimated by SURF & RANSAC, as shown in Fig. 6. An exemplary video is provided.

Conclusion
In this work we supervisedly learn homography estimation in dynamic surgical scenes. We train our method on a newly acquired, synthetically modified da Vinci surgery dataset and successfully cross the domain gap to videos of laparoscopic surgeries. To do so, we introduce extensive data augmentation and continuously generate synthetic camera motion through a novel homography generation algorithm.

CDF on Annotated Subset of Cholec80
In Sec. 5.1, we find that, despite the domain gap to the ground truth set, DNNs outperform classical methods, as indicated in Tab. 1. The homography estimation performance proves to be independent of the number of model parameters, which indicates an overfit to the test data. This independence of the number of parameters allows the backbone to be optimized for computational requirements. For example, a typical laparoscopic setup runs at 25 − 30 Hz; the classical method, at 20 Hz, would thus already introduce a bottleneck. EfficientNet-B0, at 36 Hz, and RegNetY-400MF, at 50 Hz, on the other hand, introduce no latency and could be integrated into systems without a GPU.
In Sec. 5.2, we find that increasing the edge deviation has no effect on the homography estimation, see Fig. 4a. This is because the motion in the ground truth set does not exceed the motion in the training set. In Fig. 4b, we further find that training DNNs on synthetically modified da Vinci surgery image sequences enables our method to isolate camera motion from object and tool motion, validating our method. Fig. 5 demonstrates that ResNet-34 generally outperforms SURF & RANSAC. This shows that generating camera motion synthetically through homographies, which approximates the surgical scene as a plane, does not pose an issue.
The object and tool motion invariant camera motion estimation allows one to extract a laparoscope holder's actions from videos of laparoscopic interventions, which enables the generation of image-action-pairs. In future work, we will generate image-action-pairs from laparoscopic datasets and apply IL to them. Describing camera motion (actions) by means of a homography is grounded in recent research on robotic control of laparoscopes (Huber et al. 2021). This work will therefore support the transition towards robotic automation approaches. It might further improve augmented reality and image mosaicing methods in dynamic surgical environments.

Figure 1 .
Figure 1. Da Vinci surgery and laparoscopic surgery datasets. Shown are relative sizes and the absolute number of frames. Da Vinci surgery datasets are often released at a low frame rate of 1 fps for segmentation tasks (a). Much more laparoscopic surgery data is available (b).
(a) Binary segmentation mask, obtained by thresholding the bilaterally filtered image. (b) Circular boundary detection and static landmarks (blue arrows).

Figure 2 .
Figure 2. Cholec80 dataset pre-processing, referring to Sec. 3.2.2. The black boundary circle is automatically detected (a). Landmarks are manually annotated and tracked over time (b).

Figure 3 .
Figure 3. Deep homography estimation training pipeline. Image pairs are sampled from the HFR da Vinci surgery dataset. The homography generation algorithm then adds synthetic camera motion to the augmented images, which is regressed by a backbone DNN.

Algorithm 1 :
Homography generation algorithm.
Randomly sample crop polygon P_c of desired shape in P_b;
while rollouts < maximum rollouts do
    Randomly sample G_4point, where Δu_i, Δv_i ∈ [−ρ, ρ] ∀i;
    Perspective transform border polygon: P_b, G^{-1} → P'_b;
    Compute intersection matrix DE9IM(P'_b, P_c);
    if P_c lies within P'_b then
        return G^{-1}, P_c;
    end
    Increment rollouts;
end
return 0, P_c;

Figure 4 .
Figure 4. Homography generation optimization, referring to Sec. 5.2. Shown is a ResNet-34 homography estimation for different homography generation configurations, and a SURF & RANSAC homography estimation for reference. The edge deviation ρ is varied in (a), and the sequence length T is varied in (b).

Figure 6 .
Figure 6. Classical homography estimation using a SURF feature detector with RANSAC outlier rejection, and the proposed deep homography estimation with a ResNet-34 backbone, referring to Sec. 5.2. Shown are blends of consecutive images from an exemplary Cholec80 sequence resampled to 5 fps (Twinanda et al. 2016). Decreasing the frame rate from the original 25 fps to 5 fps increases the motion between consecutive frames. (Top row) Homography estimation under predominantly camera motion; both methods perform well. (Bottom row) Homography estimation under predominantly object motion; especially in the zoomed images it can be seen that the classical method (d) misaligns the stationary parts of the image, whereas the proposed method (e) aligns the background well.

Table 1 .
Results referring to Sec. 5.1. All methods are tested on the da Vinci HFR test set, indicated by t^test_i, and the Cholec80 inference set, indicated by t^gt_i. Best and second best metrics are highlighted in bold. Improvements in precision t^gt_90,imp and compute time CPU_imp are given w.r.t. SURF & RANSAC.