Classification with a Deferral Option and Low-Trust Filtering for Automated Seizure Detection

Wearable technology is becoming available that will allow prolonged electroencephalography (EEG) monitoring in the home environment of patients with epilepsy. Neurologists analyse the EEG visually and annotate all seizures, which patients often under-report. Visual analysis of a 24-h EEG recording typically takes one to two hours. Reliable automated seizure detection algorithms will be crucial to reduce this analysis time. We investigated such algorithms on a dataset of behind-the-ear EEG measurements. Our first aim was to develop a methodology where part of the data is deferred to a human expert, who is assumed to perform perfectly, with the goal of obtaining an (almost) perfect detection sensitivity (DS). Prediction confidences are determined by temperature scaling of the classification model outputs and by trust scores. A DS of approximately 90% (99%) can be achieved when deferring around 10% (40%) of the data. Perfect DS can be achieved when deferring 50% of the data. Our second contribution demonstrates that a common modelling strategy, where predictions from several short EEG segments are combined to obtain a final prediction, can be improved by filtering out untrustworthy segments with low trust scores. The false detection rate shows a relative decrease between 21% and 43%, and the DS shows only a small increase or decrease.
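To make the confidence mechanism concrete, the following minimal sketch shows temperature scaling of classifier outputs. All names and values are illustrative assumptions; the actual model, the number of classes, and the procedure for fitting the temperature are described in the main article.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    """Divide logits by a scalar temperature T before the softmax.

    T > 1 softens the probabilities (lower confidence), T < 1 sharpens
    them. T is normally fitted on a validation set by minimising the
    negative log-likelihood; here it is simply given.
    """
    return softmax(np.asarray(logits) / T)

# Hypothetical two-class (non-seizure / seizure) logits for 3 segments.
logits = np.array([[2.0, -1.0], [0.2, 0.1], [-1.5, 3.0]])
probs = temperature_scale(logits, T=2.0)
confidence = probs.max(axis=1)  # per-segment prediction confidence
```

The maximum calibrated probability per segment then serves as the confidence used to decide which segments to defer.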

Figure S1. Average performance on all patients (no cross-validation) as a function of the percentage of 2-second segments that are filtered (% removed), for the first rule investigated. Performance is normalised with respect to the performance without filtering. Note that the FDR is labelled FP rate in the figure.

Figure S2. Average performance on all patients (no cross-validation) as a function of the percentage of 2-second segments that are filtered (% removed), for the second rule investigated. Performance is normalised with respect to the performance without filtering. Note that the FDR is labelled FP rate in the figure.

Classification with a Deferral Option: Extra Results
The results for the performance as a function of the fraction of deferred data with trust models trained on the CI labels are shown for the CI SVM in Figure S3 and for the FS SVM in Figure S4. The behaviour is similar to that of the trust models trained on the FS labels, which are shown in the main article.
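A trust score of the kind used here is commonly computed from nearest-neighbour distances (Jiang et al., 2018). A simplified sketch, without the density-based filtering of the full method and with purely illustrative toy data, is:

```python
import numpy as np

def trust_scores(X_train, y_train, X_test, y_pred):
    """Simplified trust score: for each test point, the ratio of the
    distance to the nearest training point of any *other* class to the
    distance to the nearest training point of the *predicted* class.
    Scores > 1 suggest the prediction agrees with the training data."""
    scores = []
    for x, c in zip(X_test, y_pred):
        d = np.linalg.norm(X_train - x, axis=1)
        d_pred = d[y_train == c].min()    # nearest sample of predicted class
        d_other = d[y_train != c].min()   # nearest sample of any other class
        scores.append(d_other / max(d_pred, 1e-12))
    return np.array(scores)

# Toy data: class 0 near the origin, class 1 far away (illustrative only).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.0, 0.5]])
```

Predicting class 0 for the test point yields a score well above 1, while predicting class 1 yields a score below 1, flagging the prediction as untrustworthy.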
The number and average length of the deferred segments as functions of the fraction of deferred data are plotted for the FS SVM in Figure S5.
In Figure S6 we plot the FDR as a function of the fraction of the data that is deferred to a human annotator, for different values of p_low. In contrast to our main approach, the segments that contain seizure flags are not automatically the first to be deferred. If a seizure flag lies in a deferred segment for at least one second, we assume that the flag is completely checked by the human annotator, even if part of it falls in segments handled by the algorithm. We observe that the optimal p_low is 5%. This is in agreement with our conclusion for the optimal p_low for the detection sensitivities, as discussed in the main article.
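The confidence-based deferral itself amounts to handing the lowest-confidence fraction of segments to the annotator. A minimal sketch, with a hypothetical function name and toy confidences (the segment definition and confidence measures are those of the main article):

```python
import numpy as np

def defer_lowest_confidence(confidences, fraction):
    """Return a boolean mask marking the `fraction` of segments with the
    lowest prediction confidence; these are handed to the human
    annotator, the remainder are scored by the algorithm."""
    confidences = np.asarray(confidences)
    n_defer = int(round(fraction * len(confidences)))
    order = np.argsort(confidences)  # lowest confidence first
    mask = np.zeros(len(confidences), dtype=bool)
    mask[order[:n_defer]] = True
    return mask

# Toy per-segment confidences; deferring 40% hands over the two
# least confident segments.
mask = defer_lowest_confidence([0.9, 0.2, 0.6, 0.95, 0.4], fraction=0.4)
```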
In the main article, we deferred the same percentage of segments for each patient. We also investigated a strategy where a different percentage of segments is deferred per patient. The numerical threshold at which to defer a segment is the average of the optimal thresholds of the two validation sets. The comparison between these two strategies for the CI SVM with SVM confidences is shown in Figure S7. The strategy where the same percentage of segments is deferred is clearly superior. Similar curves are obtained for other models (FS SVM) and confidence measures (trust models trained on the CI or FS labels).

Figure S3. Average (a) detection sensitivity; (b) FDR/24h; (c) PPV; (d) F1-score as a function of the fraction of the data that is deferred to a human annotator, for the CI SVM (p_low = 5%). The standard deviation of the performance is shown as a shaded area, with the upper values capped at one. Segments are deferred using the SVM confidences (SVM) or trust scores (trust) from a trust model trained on the CI labels. The first point with fraction deferred > 0 is the performance when all segments that contain a seizure flag are deferred. The inset in the FDR panel shows that around 1% of the EEG data is contained in segments that contain seizure flags.

Figure S4. Average (a) detection sensitivity; (b) FDR/24h; (c) PPV; (d) F1-score as a function of the fraction of the data that is deferred to a human annotator, for the FS SVM (p_low = 5%). The standard deviation of the performance is shown as a shaded area, with the upper values capped at one. Segments are deferred using the SVM confidences (SVM) or trust scores (trust) from a trust model trained on the CI labels. The first point with fraction deferred > 0 is the performance when all segments that contain a seizure flag are deferred. The inset in the FDR panel shows that around 4.5% of the EEG data is contained in segments that contain seizure flags.

Figure S7. Average (a) detection sensitivity; (b) FDR/24h; (c) PPV; (d) F1-score as a function of the fraction of the data that is deferred to a human annotator, for the CI SVM (p_low = 5%). The standard deviation of the performance is shown as a shaded area, with the upper values capped at one. Segments are deferred using the SVM confidences. The first point with fraction deferred > 0 is the performance when all segments that contain a seizure flag are deferred. The inset in the FDR panel shows that around 1% of the EEG data is contained in segments that contain seizure flags. In one strategy we defer the same percentage of segments per patient, referred to as patient-independent (PI). In the other strategy we defer a different percentage per patient, referred to as patient-specific (PS). The PI strategy is the one from the main article.
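The two deferral strategies compared here can be sketched as follows. This is an illustrative implementation only: the function names, toy confidences, and thresholds are assumptions, and only the structure of the two strategies matches the text.

```python
import numpy as np

def defer_patient_independent(conf_per_patient, fraction):
    """PI strategy: defer the same fraction of lowest-confidence
    segments for every patient."""
    masks = {}
    for pid, conf in conf_per_patient.items():
        conf = np.asarray(conf)
        n_defer = int(round(fraction * len(conf)))
        order = np.argsort(conf)  # lowest confidence first
        mask = np.zeros(len(conf), dtype=bool)
        mask[order[:n_defer]] = True
        masks[pid] = mask
    return masks

def defer_patient_specific(conf_per_patient, thresholds):
    """PS strategy: defer every segment whose confidence falls below a
    per-patient numerical threshold (e.g. the average of the optimal
    thresholds of the two validation sets)."""
    return {pid: np.asarray(conf) < thresholds[pid]
            for pid, conf in conf_per_patient.items()}

# Hypothetical confidences for two patients.
conf = {"patient_A": [0.9, 0.3], "patient_B": [0.5, 0.8, 0.1]}
pi_masks = defer_patient_independent(conf, fraction=0.5)
ps_masks = defer_patient_specific(conf, {"patient_A": 0.5, "patient_B": 0.4})
```

The PS thresholds are fixed per patient, so the deferred fraction can differ between patients, which is exactly the difference between the two curves in Figure S7.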

Low-Trust Filtering: Extra Results
Classifying with a deferral option, with and without first performing LTF on the model, is shown in Figure S8 for the CI SVM and in Figure S9 for the FS SVM. Although the performance at 0% deferral is better with LTF, there is no real advantage at larger deferral percentages. This is most likely because the seizures additionally detected through LTF lie in segments with low confidence, since those are the segments that are expected to benefit from LTF; these segments are among the first to be deferred. The percentage that needs to be deferred to obtain an FDR of 0 is similar, and even slightly lower without LTF for the CI SVM. Although one generally expects the reverse behaviour, as can be seen for the FS SVM, the fact that false positives are clustered, combined with our deferral strategy, can lead to this behaviour. Figure S10 shows a visualisation of a new seizure detection after performing LTF.

We also investigated a strategy of LTF where a different percentage of lowest-trust segments is filtered per patient. The numerical threshold at which to filter is determined from the average of the optimal thresholds of the two validation folds. We call the strategy with the same percentage filtered per patient patient-independent (PI) and the strategy with a different percentage per patient patient-specific (PS). There is no substantial difference between the strategies, as can be seen for the CI SVM with LTF from trust scores from trust models trained on the FS labels (Table 1) or the CI labels (Table 2), and from the SVM confidences (Table 3).

Figure S9. Average (a) detection sensitivity; (b) FDR/24h; (c) PPV; (d) F1-score as a function of the fraction of the data that is deferred to a human annotator, for the FS SVM (p_low = 5%). The standard deviation of the performance is shown as a shaded area, with the upper values capped at one. Segments are deferred using trust scores (trust) from a trust model trained on the CI labels. The first point with fraction deferred > 0 is the performance when all segments that contain a seizure flag are deferred. Low-trust filtering (LTF) is either performed or not.
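The LTF step itself, filtering out the lowest-trust 2-second segments before combining the per-segment predictions, can be sketched as follows. Majority voting is used purely for illustration; the actual combination rule, and the names and toy values below, are not from the article.

```python
import numpy as np

def combine_with_ltf(segment_preds, trust, filter_fraction):
    """Combine per-segment binary predictions into one decision after
    low-trust filtering: drop the `filter_fraction` of segments with
    the lowest trust scores, then take a majority vote over the rest.
    (Majority vote is a stand-in for the real combination rule.)"""
    preds = np.asarray(segment_preds)
    trust = np.asarray(trust)
    n_drop = int(round(filter_fraction * len(preds)))
    keep = np.argsort(trust)[n_drop:]  # keep the most trusted segments
    return int(preds[keep].mean() >= 0.5)

# Toy example: three seizure votes, but two come from low-trust
# segments; filtering 40% flips the combined decision.
preds = [1, 1, 1, 0, 0]
trust = [0.1, 0.2, 0.9, 0.8, 0.7]
with_ltf = combine_with_ltf(preds, trust, filter_fraction=0.4)
without_ltf = combine_with_ltf(preds, trust, filter_fraction=0.0)
```

In this toy case the unfiltered vote fires a detection while the filtered vote does not, which is the mechanism by which LTF reduces the FDR.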