Towards a safe and efficient clinical implementation of machine learning in radiation oncology by exploring model interpretability, explainability and data-model dependency

The interest in machine learning (ML) has grown tremendously in recent years, partly due to the performance leap that occurred with new techniques of deep learning, convolutional neural networks for images, increased computational power, and wider availability of large datasets. Most fields of medicine follow that popular trend and, notably, radiation oncology is one of those at the forefront, with an already long tradition of using digital images and fully computerized workflows. ML models are driven by data, and in contrast with many statistical or physical models, they can be very large and complex, with countless generic parameters. This inevitably raises two questions, namely, the tight dependence between the models and the datasets that feed them, and the interpretability of the models, which decreases as their complexity grows. Any problems in the data used to train a model will later be reflected in its performance. This, together with the low interpretability of ML models, makes their implementation into the clinical workflow particularly difficult. Building tools for risk assessment and quality assurance of ML models must therefore involve two main points: interpretability and data-model dependency. After a joint introduction of both radiation oncology and ML, this paper reviews the main risks and current solutions when applying the latter to workflows in the former. Risks associated with data and models, as well as their interaction, are detailed. Next, the core concepts of interpretability, explainability, and data-model dependency are formally defined and illustrated with examples. Afterwards, a broad discussion goes through key applications of ML in workflows of radiation oncology as well as vendors’ perspectives for the clinical implementation of ML.


Introduction
Radiation oncology is a medical field that heavily relies on information technology and computational methods. Even though the goal of radiation therapy can be stated as simply as irradiating the tumor while minimizing the dose to the healthy tissue, numerous and complex calculations are needed to achieve such a goal. From the image reconstruction and analysis steps to locate the tumor and organs, down to the plan optimization process to find [...]
In this review, we describe in detail key aspects of interpretability, explainability and data-model dependency in ML/DL, and discuss how they can be applied to increase the reliability and safety of ML/DL applications in the field of radiation oncology. Section 2 starts by reviewing the possible risks associated with ML/DL models, and provides illustrative examples in the medical field. Section 3 introduces general considerations and technical foundations about interpretability, explainability and data-model dependency in ML. These topics have been studied for years in fundamental ML research, but they are only beginning to enter the vocabulary of clinical researchers and practitioners. We believe it is essential to bring this knowledge closer to the clinical environment, in order to provide the radiation oncology community with a well-structured background to develop reliable and safe ML models. Section 4 walks the reader through the radiation oncology workflow and digs into key applications of ML, specifically discussing issues related to interpretability, explainability and data-model dependency. Section 5 wraps up this manuscript with final conclusions.

Risks associated with the use of ML for medical applications
The first step towards a safe clinical implementation of ML models is to become aware of the different risk factors associated with this technology, which is the goal of this section. As ML techniques are essentially data-driven, the main risks associated with their use stem either from the data itself or from the model. Data issues appear when the data used to train the ML algorithm does not reflect the ground truth of the problem at hand, whereas model issues are due to incorrect performance of the model itself. In the following, we identify the main issues in these two categories and provide illustrative examples in the medical field.

Data
In computer science, the acronym GIGO stands for 'Garbage In, Garbage Out', and it refers to the fact that when a system is fed with low-quality data, the output will be deficient likewise. In ML specifically, GIGO can have dramatic consequences as it affects the training of the model. In medical applications of ML, GIGO can affect the patient's outcome and it is one of the main factors to take into account when aiming at their safe clinical implementation. GIGO has two main roots: insufficient data in quantity and inappropriate data in quality (figure 1).
More specifically, most ML applications attempt to learn an unknown phenomenon $y = \varphi(x)$ in a supervised way, that is, where inputs are mapped to some desired output, with a flexible model $y = f_\theta(x)$ having parameters $\theta$. A finite dataset of input-output pairs $(x_i, y_i)_{1 \le i \le N}$ is sampled from a population (figures 1(a) and (b)). In this sampling and learning process, insufficient data problems arise when the dataset size $N$ is too low, whereas inappropriate data problems are related to the sampling, measurement, and annotation in the pairs $(x_i, y_i)$ (figures 1(c)-(f)).
Figure 1. Some data-related pitfalls of supervised learning, exemplified with a binary classification problem. Panel (a) formalizes the problem and how the model maps the inputs (features or images) to the outputs (class labels green and orange). Panel (b) shows an ideal dataset with enough data globally (high $N$) and in each class. Panel (c) illustrates insufficient data, when the number of total examples $N$ is too low (for all classes). Panels (d) to (f) illustrate cases of inappropriate data: (d) Class imbalance, when class populations are unequal and minor classes might not be given enough importance in the performance figures. (e) Low-quality or corrupted inputs $x$, e.g. blurred, noisy, or artifacted images, represented by a lighter color and gray dots in the figure. (f) Annotation errors (mistakes in class labels $y$). To some extent, class imbalance can be seen as a particular case of insufficient data, when one of the classes has a low $N$ with respect to the other(s).
2.1.1. Insufficient data
Insufficient data often result from the difficulty of collecting and annotating data in the medical field, due to cost, ethical issues, or expert availability. A dataset that is too small is generally unable to reflect all variations that can exist in a (patient) population. The size of the data to be collected typically must grow with the complexity of the task to accomplish. A complicated task usually involves many features or criteria to make a decision. The input dimensionality (e.g. just a few biomarkers, versus images with millions of voxels) and the output dimensionality (e.g. the number of classes or diseases to be distinguished) are typically faithful indicators of complexity. In computer vision, for classification of natural images, rules of thumb state that up to 1000 instances per class can be necessary, and the performance increases logarithmically with the dataset size (Sun et al 2017). In the medical field, the lower availability of data (Willemink et al 2020) is compensated by the greater regularity in images, with simple backgrounds, similar anatomies and orientations in the foreground. For instance, in dose prediction for radiotherapy, models like U-Net are efficient at learning from relatively small datasets (e.g. around 50-100 patients), thanks to a densely connected network architecture [...]. Other applications, such as image segmentation (e.g. of organs (Nikolov et al 2018) and target volumes (Cardenas et al 2021)), image synthesis (e.g. generation of synthetic CTs from MR images (Maspero et al 2018)), or image registration, have also demonstrated good performance when trained with databases on the order of one hundred patients or even fewer (Sokooti et al 2017). Nevertheless, building a well-curated and up-to-date database of a few dozen or a few hundred patients remains a challenge for most medical institutions, and it is often the result of several years of work. For instance, Grossberg et al (2018) presented the head and neck squamous cell carcinoma collection, comprising data from 215 patients collected during 10 years of treatment (from 2003 to 2013).

Inappropriate data
Inappropriate data covers a wide range of possible problems. In collecting input-output pairs $(x_i, y_i)$, they can concern the sampling of $x_i$ in the population, the measurement of $x_i$, or the annotation $y_i$. Often, medical databases can suffer from several of these issues. Therefore, good data curation algorithms, together with interpretable/explainable ML and the exploration of data-model dependency, can help to properly identify and fix each issue (see section 3).

Data sampling in the population: domain coverage and class imbalance
To be effective and to generalize to any individual from the population, the collected data must be representative of it, that is, it has to reflect all relevant variations in that population (i.e. domain coverage). In classification tasks, for example, not all variabilities may be represented within a single class, or one or several classes might be underrepresented with respect to others in the database used to train the ML model (i.e. minority classes). Often, the technical term used to refer to this situation in ML is 'class imbalance' (Johnson and Khoshgoftaar 2019). This results in wrong or less accurate predictions for those underrepresented classes. In fact, the ML model will focus mainly on the majority class during learning, and in extreme cases, may ignore the minority class altogether. Class imbalance can also be seen as a particular case of insufficient data (section 2.1.1), where the number of samples in the minority class(es) ($N_m$) is much lower than that of the dominating class(es) ($N_d$), i.e. $N_m \ll N_d$ (figure 1). Notice, however, that class imbalance can occur even for models trained with databases containing a large total $N$, as long as the ratio between classes remains unbalanced. This is the reason why we have decided to include class imbalance in the 'inappropriate data' category.
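The way class imbalance can hide behind a good overall performance figure can be illustrated with a minimal Python sketch. The 2%/98% split and the trivial majority-class predictor are illustrative assumptions, not data from any of the cited studies: the overall accuracy looks excellent although the minority class is never detected, whereas the balanced accuracy (mean of the per-class recalls) exposes the problem.

```python
# Illustrative toy dataset (assumption): 2% minority (label 1), 98% majority (label 0).
labels = [1] * 20 + [0] * 980

# A degenerate "model" that always predicts the majority class.
predictions = [0] * len(labels)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, cls):
    """Fraction of the samples of class `cls` that are predicted as `cls`."""
    members = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in members) / len(members)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so that every class weighs the same."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)

overall = accuracy(labels, predictions)            # dominated by the majority class
minority_recall = recall(labels, predictions, 1)   # minority class is never found
balanced = balanced_accuracy(labels, predictions)  # no better than chance

print(overall, minority_recall, balanced)
```

Reporting per-class figures next to the global one is a simple way to make sure that minority classes are given enough importance when a model is evaluated.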
In the medical field, the minority class can be represented by patient groups (e.g. with positive/negative diagnosis, rare diseases, patients under/over a certain age, gender, ethnicity, etc.), but imbalance can also occur at the pixel level (e.g. 2% of pixels of class A and 98% of pixels of class B).
At the patient group level, a common example of imbalanced datasets are those for skin cancer, which consist predominantly of healthy samples with only a small percentage of malignant ones (Mikolajczyk and Grochowski 2018, Emara et al 2019, Zunair and Ben Hamza 2020). Another example is how gender imbalance between male and female patients in the training database can lead to biased ML models. For instance, a recent study analyzed the effect of gender imbalance when training ML models to diagnose various thoracic diseases (Larrazabal et al 2020). A consistent decrease in performance was observed when using male patients for training and female patients for testing (and vice versa).
Regarding the pixel level, the most straightforward example is the detection or segmentation of small lesions or organs from medical images (Bria et al 2020). A good illustrative case is the segmentation of organs for head and neck cancer patients, where the ratio between small and big organ volumes can reach a factor of 100 (e.g. optic structures versus parotids or oral cavity). For instance, a difference of up to 20% in the Dice coefficient can be found between the smallest organs (e.g. optic nerves and chiasm) and the bigger ones (Tong et al 2018).
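The sensitivity of the Dice coefficient to organ size can be sketched in a few lines of Python. The Dice coefficient between a reference set A and a predicted set B of voxels is 2|A∩B| / (|A| + |B|). The voxel counts below (20 versus 2000 voxels, 4 missed voxels) are hypothetical and only serve to show that the same absolute error penalizes a small structure far more than a large one.

```python
def dice(reference, predicted):
    """Dice coefficient between two sets of voxel indices."""
    if not reference and not predicted:
        return 1.0  # two empty segmentations agree perfectly
    overlap = len(reference & predicted)
    return 2.0 * overlap / (len(reference) + len(predicted))

# Hypothetical structures, represented as sets of voxel indices.
small_reference = set(range(20))     # e.g. a small optic structure
large_reference = set(range(2000))   # e.g. a structure 100 times larger

# Both predictions miss the same number of voxels.
missed = 4
small_predicted = set(range(missed, 20))
large_predicted = set(range(missed, 2000))

small_dice = dice(small_reference, small_predicted)
large_dice = dice(large_reference, large_predicted)

print(round(small_dice, 3), round(large_dice, 3))
```

In this toy setting the small structure loses more than ten Dice points for an error that is almost invisible on the large one, which is one reason why small organs tend to show lower scores.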

Data measurement: low quality or corrupted records
As soon as population sampling issues are sorted out, another caveat concerns the quality of the records in that sample. For example, in an application that involves medical images, those can be more or less noisy, blurry, or subject to artifacts (Dodge and Karam 2016). Concepts like image definition, (optical) resolution, contrast, or signal-to-noise ratio are important here and affect ML performance even more than they affect human observers, who can more naturally disregard artifacts and compensate for noise or blur. This is really the classical meaning of 'garbage in, garbage out' in signal processing: corrupted data leads to poor performance. Typical examples of noise and artifacts in medical images include CT artifacts due to metal implants (Kalender et al 1987, Barrett and Keat 2004), ring and scatter noise in Cone Beam CT images (Zhu et al 2009), or artifacts due to patient motion (Zaitsev et al 2015). In extreme cases, even slight perturbations can have dramatic effects and can be exploited to defeat or 'attack' the model with so-called 'adversarial examples' (Szegedy et al 2013, Finlayson et al 2019). For instance, adding adversarial noise to an image of a skin mole, classified by the model as benign, can suddenly make the model change the output to malignant (Finlayson et al 2019).
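The mechanism behind adversarial examples can be sketched with a toy linear classifier in Python. The weights, bias, input and perturbation size below are hypothetical values chosen for illustration, and the attack shown (a small step along the sign of the gradient of the score) is only one simple strategy, not necessarily the one used in the cited studies. For a linear score the gradient with respect to the input is the weight vector itself.

```python
# Hypothetical linear classifier: score > 0 means class 1 ("malignant").
weights = [0.5, -0.25, 0.75]
bias = -0.1

def score(x):
    """Linear decision score w.x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(x):
    """Class 1 if the score is positive, class 0 ("benign") otherwise."""
    return 1 if score(x) > 0 else 0

def sign(value):
    """Return -1, 0 or 1 depending on the sign of `value`."""
    return (value > 0) - (value < 0)

# Hypothetical input, classified as benign by the model.
x = [0.2, 0.4, 0.1]

# Move each feature by at most epsilon in the direction that raises the score.
epsilon = 0.05
x_adv = [xi + epsilon * sign(w) for xi, w in zip(x, weights)]

original_class = predict(x)
adversarial_class = predict(x_adv)
largest_change = max(abs(a - b) for a, b in zip(x_adv, x))

print(original_class, adversarial_class, largest_change)
```

Every feature changes by no more than epsilon, yet the predicted class flips, because the small changes all push the score in the same direction.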
For noise, blur, and low contrast, improving the image acquisition device or tuning its parameters are straightforward recommendations. Data curation to avoid badly corrupted records or the presence of confounding artifacts can also improve performance. Often, this comes at the price of lower robustness and generalization capability, since ML models are left totally unaware of these outliers and pathological cases at training time, although they might still show up when the ML model is queried. Some unwanted artifacts in images can also turn into confounders or spurious revealers, like the presence of a plaster cast in radiological images when it comes to spotting broken bones, or image tags that correlate with patient, disease, or treatment categories that should be predicted from the image content, not from such side information (Zech et al 2018, Badgeley et al 2019). Another type of low-quality record includes the cases for which data is uninformative or not informative enough, that is, the records do not convey all the necessary information to solve the problem at hand. For instance, an image with a small field of view that does not cover (or not entirely) the region of interest for a diagnosis or segmentation model would be considered uninformative. Another example is when the necessary information is spread over several sources and the model has access to only one or a few of them. For instance, ML models for segmentation of tumor volumes are often provided with only one image (e.g. CT), while in clinical practice the physician gathers information from several sources to perform the segmentation (e.g. PET, MR, endoscopy images or meta-data like age, patient's physical condition, other diseases, etc) (Moe et al 2021, Ye et al 2021).
2.1.2.3. Data annotation: low quality annotation, label noise, or inter-observer variability
In the collected data pairs $(x_i, y_i)$, $y_i$ is responsible for the supervision of the training, that is, to associate the correct output to any input record $x_i$. The quality of this annotation or label is thus of paramount importance (Frenay and Verleysen 2014, Karimi et al 2020).
The most straightforward example of low-quality annotations is the presence of inaccuracies induced by human errors when labeling medical images used for training a ML model. For instance, Yu et al (2020) recently studied the effect of using inaccurate contours when training an automatic segmentation ML model for the mandible. They showed a decrease in the Dice coefficient between 5% and 15% when the ratio of inaccurate contours increased from 40% to 100%. Another recent study investigated the effect of using erroneous labels when training a ML model for skin cancer classification (Hekler et al 2020), reporting a 10% decrease in accuracy when using the imperfect labels versus the perfect ground truth.
Another major data quality issue in the radiation oncology field is data heterogeneity or variability. Overall, these variabilities can be grouped into two categories: (1) lateral variability and (2) longitudinal variability. Lateral variability describes the difference in data distributions for a given time frame. Some examples include the interobserver variability in radiotherapy treatment planning (Nelms et al 2012, Berry et al 2016), the variability in delineation of tumor and organ volumes across different physicians (Apolle et al 2019, van der Veen et al 2020), or the differences between clinical practices among institutions (Eriguchi et al 2013, Gershkevitsh et al 2014). In contrast, longitudinal variability describes the difference in data distributions over time, such as the evolution of treatment techniques (Shang et al 2015), the introduction of new delineation guidelines (Brouwer et al 2015, Grégoire et al 2018) or fractionation protocols (Dearnaley et al 2017, Parodi 2018). Lateral and longitudinal variability are often entangled within retrospective databases containing patients treated with radiotherapy by different physicians, institutions, and at different time points. Although the individual effect of each source of variability is hard to quantify, a recent study has demonstrated that the use of homogeneous data increases the accuracy and the robustness of ML models (Barragán-Montero et al 2021b). The study compared two ML models for radiotherapy dose prediction for esophageal cancer. The first model was trained with a variable database (i.e. retrospective patients, different time frames, planning protocols, treating physicians), while the second was trained with a homogeneous one (i.e. same time frame, same treatment protocol, same physician). The second model was able to reduce the mean absolute error of the predicted dose distribution.
Yet another important issue is the presence of annotation bias. General examples of bias in the medical domain include over-diagnosis of certain diseases (Blumenthal-Barby and Krieger 2015), or bias induced by gender, race or socioeconomic factors (Bach et al 1999, Schulman et al 1999, Lievens and Grau 2012, Forrest et al 2013, Obermeyer et al 2019). For instance, Bach et al (1999) reported significant racial differences in the treatment of lung cancer. They observed that black patients were less likely to receive surgical treatment than white patients, which entailed a decrease of 8% for the five-year survival rate of this population. Often, one of the most important sources of this kind of bias is the socioeconomic level of the patient, which is also well known to affect the treatment chosen and delivered for cancer patients (Ou et al 2008, Lievens and Grau 2012, Forrest et al 2013). Last but not least, variability and biases can co-exist in many scenarios. For instance, in lateral variability, medical experts can disagree persistently about the annotation of some data instances. Across consistent groups of experts, this can be seen as biases, whereas for ML models these discrepancies are seen as a variability around a consensus that might not be agreed upon yet. The framework of supervised learning, with functional models $\hat{y} = f(x)$, can only produce a single output $\hat{y}$ for a given input $x$. If several outputs nevertheless need to be produced, then new explanatory inputs must be identified and appended to $x$. Alternatively, one can also train an individual model for each possible output $\hat{y}_l$, as if several ground truths were possible for a given $x$. For instance, a recent study about radiotherapy dose prediction for prostate cancer patients illustrated the differences in treatment planning practices between different doctors and institutions, and generated specific ML models for each clinical practice (Kandalan et al 2020).

Model and learning frameworks
Most current ML methods extend and upscale supervised learning techniques developed by statisticians over the past 100 years (Friedman et al 2001). Supervised learning for ML algorithms does not substantially differ from linear or logistic regression models. In all cases, they find a function $y = f_\theta(x)$ that models the phenomenon under study $y = \varphi(x)$. Model fitting amounts to minimizing the discrepancy between the ground truth $y$, as measured or annotated, and $\hat{y}$ as yielded by the model. ML tries to identify the relationships that map the features in $x$ to the outputs $y$. In the following, we present several limitations related to this learning framework, which should be carefully taken into account when implementing ML models in the clinical environment.
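As a minimal illustration of model fitting as discrepancy minimization, the following Python sketch fits the simplest possible model, a straight line, by least squares. The data points, the underlying relationship y = 2x + 1 and the noise values are hypothetical; larger ML models follow the same principle with many more parameters and an iterative optimizer instead of a closed-form solution.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b (minimizes the mean squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mean_squared_error(xs, ys, a, b):
    """Discrepancy between the annotated outputs and the model outputs."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical noisy observations of the unknown phenomenon y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
noise = [0.1, -0.1, 0.0, 0.1, -0.1]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]

a, b = fit_line(xs, ys)
fitted_error = mean_squared_error(xs, ys, a, b)
true_error = mean_squared_error(xs, ys, 2.0, 1.0)

print(a, b, fitted_error, true_error)
```

Note that the fitted parameters reach a lower training error than the true ones: the model adapts to the noise in the sample, which is a small-scale view of why the quality and quantity of the training data matter so much.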

Non-causal correlations and hidden confounders
When trying to find the relationships that map the features in $x$ to the outputs $y$, the optimal solution is typically the one that finds strong dependencies between the considered features (e.g. patient's smoking condition) and outcomes (e.g. probability of lung cancer). However, the weakness of supervised learning, and most ML frameworks in general, is that it cannot infer causality out of the input-output dependencies, which can be either causal and relevant or spurious and confounding in the interpretation of the model. This represents an important risk when it comes to medical applications (Castro et al 2020). For instance, a recent study found that a convolutional neural network (CNN), trained to process x-ray images to predict pneumonia, was using the hospital information to make predictions, often disregarding the areas of the image with radiological findings relevant to the underlying pathology (Zech et al 2018). Specifically, the CNN was trained with databases from multiple hospitals, where the prevalence of pneumonia was very different. The hospital information was retrieved from a hospital-specific token, located in the corner of the image, and other image features indicative of the radiograph's origin (figure 2). This information was strongly correlated with the prevalence of pneumonia in the considered dataset, without any causality, thus acting as a hidden confounder and leading to the so-called 'shortcut learning' (Geirhos et al 2020). One can find many other examples of confounders and spurious correlations in the literature of ML models for medical applications. For instance, another study reported that an artificial neural network, trained to estimate the probability of death from pneumonia in the emergency room, labeled asthmatic patients as having a low risk of death, because in the training data this cohort was seeking care faster than non-asthmatic patients (Cooper et al 2005). Yet another recent study found that colon cancer screening or abnormal breast findings were highly correlated to the risk of having a stroke, with no clinical justification (Mullainathan and Obermeyer 2017).

Model complexity: size, nonlinearity, and opacity
Beyond the inability to identify relevant causality, the interpretability of ML models can be further impeded by their sheer size and complexity. The advantage of state-of-the-art ML models (e.g. CNNs, GANs, ...) over classical linear models is their increased capability to find a function that approximates the problem under study ($y = f_\theta(x)$). This is often done by drawing on nonlinear relationships between variables (e.g. patient characteristics) and outcomes (e.g. mortality probability). Finding the final function can be accomplished by either directly estimating the parameters of a nonlinear function of fixed complexity (e.g. an artificial neural network) or estimating the complexity and shape of a nonlinear function (e.g. non-parametric algorithms like gradient boosting) (Friedman et al 2001). In all cases, the consequence of nonlinearity is an increased number of parameters required to build that function $f_\theta(x)$. A modern ML model can have between a few thousand and several million trainable parameters. For instance, Nguyen et al (2019a) compared different ML models for predicting the radiotherapy dose for head and neck cancer patients, reporting between 3 and 40 million trainable parameters for the considered models. The bigger the number of parameters, the less tractable the model becomes, thus reducing the interpretability of the provided function and turning it into a black-box. Notice that the same issue happens for big linear models, too. Promoting sparsity, that is, the parsimonious use of the available features and variables to reduce the number of effective (non-zero) parameters (Rish and Grabarnik 2014, Oswal 2019, Vinga 2021), can mitigate this issue of size and interpretability. For large black-box models, identifying hidden confounders and non-causal correlations becomes very difficult, which certainly increases the risk when using them for medical applications. This lack of interpretability has been recently highlighted as one of the most important issues to be addressed in the medical domain before ML algorithms can be widely accepted in the clinic (Luo et al 2019, Reyes et al 2020).
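One common way to promote sparsity, shown here as a minimal Python sketch and not necessarily the approach of the works cited above, is the soft-thresholding operator, which is the shrinkage step associated with an L1 (lasso) penalty. The weight values and the threshold are hypothetical.

```python
def soft_threshold(weight, lam):
    """Shrink a weight towards zero by `lam`; weights with |w| <= lam become exactly zero."""
    if weight > lam:
        return weight - lam
    if weight < -lam:
        return weight + lam
    return 0.0

# Hypothetical weights of a linear model with six input features.
weights = [0.9, -0.05, 0.02, -0.6, 0.08, 0.0]
lam = 0.1  # hypothetical strength of the sparsity penalty

sparse_weights = [soft_threshold(w, lam) for w in weights]

# Number of effective (non-zero) parameters before and after shrinkage.
n_active_before = sum(w != 0.0 for w in weights)
n_active_after = sum(w != 0.0 for w in sparse_weights)

print(sparse_weights, n_active_before, n_active_after)
```

Only the features with the largest weights survive, so the resulting model relies on fewer inputs and is easier to inspect, at the cost of some shrinkage of the remaining weights.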
2.2.3. Task-specialized learning, static models, and low generalization
Supervised learning is often cast within a simplified framework that ignores time, where the whole dataset is supposed to be known at once and engraved in marble for eternity. Any change entails retraining from scratch. In other words, most ML models cannot learn incrementally, interactively, or in real time. They are trained with data from past experience and they become fixed and static models as soon as training ends. This represents an important limitation when it comes to their application in the ever-changing medical field: technologies improve (Shang et al 2015), medical protocols evolve (Grégoire et al 2018, Parodi 2018), and the distribution of patient populations changes over time (Chai and Jamal 2012). In this fast-moving world, static AI models quickly become irrelevant. Therefore, it is imperative to shift towards models and frameworks that can quickly adapt to new settings or changing distributions over time. The framework of supervised learning is also essentially specific to a task and exclusively driven by performance at that task. This means that a model trained for a particular application offers no real guarantee to be good at other similar tasks, and the learnt skills are hard to reuse and/or generalize. For instance, specific ML models are currently trained to predict the radiotherapy dose for each cancer location (e.g. head and neck (Nguyen et [...] observed for other applications, such as diagnosis or organ segmentation models. In order to be more efficient and increase the generalization capabilities, future ML in the medical field would require stronger models, with an increased capability to reuse the learning skills. This paradigm shift has been coined as 'weak versus strong AI'.
Figure 2. [...] showing the relevant regions considered by the CNN to make the prediction. The model in this study was trained to predict pneumonia from x-ray images. By looking at the CAMs, they found out that the model was looking at the corner of the images, and in particular, at the hospital-specific metal token (a hidden confounder) to make the prediction.
The low generalization capability of current ML models is widely debated in the literature. In the medical domain, many publications state that, for a successful clinical implementation, ML models should be able to generalize to new data, that is, keep performing well enough on records coming from different hospitals, images from different scanners and vendors, different imaging and treatment protocols, different patient populations, data changes over time, etc. A large number of studies have been published focusing on the question of generalization. For instance, Liang et al (2020) illustrated the problem of generalization with a ML model trained to convert CBCT into synthetic CT images. The authors trained the model on CBCT images acquired from one vendor's scanners for head and neck cancer patients, and they quantified the decrease of performance when applying the model to images from another vendor's scanners and from different locations (e.g. prostate, pancreatic, and cervical cancer). In Feng et al (2020), the generalization issue was illustrated with a model trained to segment thoracic organs. The model could not generalize to the authors' local dataset, which was acquired with an abdominal compression technique, whereas the training set was acquired with free breathing. The subtle shift of thoracic organs due to the abdominal compression caused significantly worse performance on the local dataset. Another well-known example is the study by Zech et al (2018), already discussed in section 2.2.1 (figure 2). The ML model was not able to generalize to radiographs from other hospitals because its learning had been biased by a hidden confounder (i.e. the hospital-specific metallic token).
Generalization is a very abstract term, and the examples above show that poor generalization can be frequent. Recently D'Amour et al (2020) introduced an umbrella term to cover all the seemingly different failures to generalize in current ML: 'underspecification'. It refers to the typical inability of the ML pipeline (training, validation and testing) to ensure that the model has seen and encoded all the relevant variabilities of the underlying system or problem. Eche et al (2021) discuss how this concept echoes in the medical field, from the perspective of radiologists. They relate underspecification to the aforementioned antagonism of 'weak versus strong AI'. They also distinguish narrow and broad generalization. Narrow generalization corresponds to the case that is considered by design in most validation frameworks: test or deployment data are supposed to be independent of, and identically distributed (i.i.d.) with, the data in the training and validation sets. Independence guarantees that the new data is unseen, while the identity of the underlying distribution ensures consistent predictability. In contrast, broad generalization aims at maintaining predictability if the deployment data are independent but possibly differently distributed. The deployment data distribution can then have other or slightly shifted variabilities than in training and validation. For this reason, broad generalization is closely tied to what is known as (distribution) domain shift or drift. If generalization problems arise, we can refer to our two-fold categories in this section: data and model issues. A model cannot generalize properly if the training data and the actual data at deployment time are not i.i.d., that is, the former is not representative of the latter (see section 2.1), or if the model has not learned correctly, due to hidden confounders, overfitting to (noisy) training data, etc. Broad generalization to non-i.i.d. datasets is a much more ambitious goal and it aims at strong AI, closer to natural intelligence, where general knowledge is acquired and re-used across analogous problems and tasks. Although strongly desirable, broad generalization is controversial. In Futoma et al (2020) the authors discuss how seeking broad generalizability can be detrimental to the clinical applicability of some ML models, and they provide some illustrative examples. Imagine, for instance, a ML model with an excellent performance for diagnosis of a certain disease in hospital A, properly generalizing to the entire patient population in that hospital. The model might not work with equal performance for hospital B, since the patient population might differ (domain shift and out-of-domain samples). However, trying to change the model to increase the performance for hospital B might come at the cost of lowering the performance for hospital A, in the same way as when individual human experts get replaced with a single all-rounder. For current ML models there is a trade-off between performance and generalization, which must be carefully considered for clinical applications. In this case, building a new (specific) model for hospital B would be more appropriate than using a general model with lower performance. Futoma et al claim that we should stop demanding broad generalization and focus on understanding how, when, and why a ML system works.
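The difference between narrow and broad generalization, and the hospital A versus hospital B scenario, can be sketched with a toy simulation in Python. Everything below is an illustrative assumption: a single Gaussian feature per patient, a one-parameter threshold classifier, and a constant offset of 1.5 that mimics a domain shift at hospital B (e.g. a different scanner calibration).

```python
import random

def sample(rng, n, mean0, mean1):
    """Draw n samples per class from unit-variance Gaussians as (feature, label) pairs."""
    data = [(rng.gauss(mean0, 1.0), 0) for _ in range(n)]
    data += [(rng.gauss(mean1, 1.0), 1) for _ in range(n)]
    return data

def class_mean(data, label):
    """Mean feature value of the samples carrying the given label."""
    values = [x for x, y in data if y == label]
    return sum(values) / len(values)

def fit_threshold(data):
    """'Train' a one-parameter classifier: the midpoint between the two class means."""
    return (class_mean(data, 0) + class_mean(data, 1)) / 2.0

def accuracy(data, threshold):
    """Fraction of samples correctly classified by the rule 'class 1 if x > threshold'."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

rng = random.Random(0)   # fixed seed for reproducibility
shift = 1.5              # hypothetical domain shift at hospital B

train_a = sample(rng, 2000, 0.0, 2.0)                  # hospital A, training
test_a = sample(rng, 2000, 0.0, 2.0)                   # hospital A, i.i.d. test
train_b = sample(rng, 2000, 0.0 + shift, 2.0 + shift)  # hospital B, training
test_b = sample(rng, 2000, 0.0 + shift, 2.0 + shift)   # hospital B, shifted test

threshold_a = fit_threshold(train_a)
acc_narrow = accuracy(test_a, threshold_a)   # same distribution as training
acc_broad = accuracy(test_b, threshold_a)    # model A applied to hospital B

threshold_b = fit_threshold(train_b)
acc_specific = accuracy(test_b, threshold_b) # model trained for hospital B

print(acc_narrow, acc_broad, acc_specific)
```

In this toy setting the model trained at hospital A keeps its accuracy on i.i.d. test data but degrades clearly at hospital B, while a model fitted specifically on hospital B data recovers the original performance level, in line with the argument for site-specific models.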

Interpretability, explainability and data-model dependency
The previous section introduced the different risk factors of ML models for medical applications, clearly distinguishing two categories: data and model issues. However, in practice, data and model issues are often entangled, and identifying the actual risks for a given medical application is not straightforward. In order to properly identify and fix each risk factor, we must implement strategies that enable us to interpret and/or explain the behavior of ML models, as well as to explore the data and how the model performance depends on it. More importantly, this entanglement between data and model issues makes the possible range of solutions a non-bijective problem, i.e. a certain technique can be the solution to several of the issues described in section 2 and, vice versa, a certain issue can be fixed (or mitigated) by different techniques. For instance, providing explanations about the model behavior may reveal non-causal correlations involving confounders; but they can also be revealed by exploring the performance of the model in different datasets or related tasks. Figure 3 presents a schematic view of the concepts described in this section, in order to guide the reader to understand how these techniques connect and serve as solutions to the risks presented in section 2, ensuring a safe and efficient clinical implementation of ML. Section 3.1 will cover general concepts and key techniques for interpretability and explainability. These techniques can be used to inspect if a ML model has learnt the underlying problem correctly, thus helping to identify data issues, hidden confounders, etc. Section 3.2 will cover key concepts related to the data and the learning process. On the one hand, targeting directly the data distribution to avoid insufficient and low-quality data will ensure that the ML model is encoding and learning the problem correctly. 
This includes data curation to detect and fix possible data issues, data augmentation to ensure a sufficient domain coverage, and techniques to efficiently incorporate (expert) prior knowledge about the domain. On the other hand, analyzing how the model reacts to different and external datasets (i.e. test data augmentation or stress testing), and estimating its uncertainty, can serve to further quantify the performance and generalization capacity. Lastly, a full section is dedicated to describe and discuss different learning frameworks proposed in the ML community to achieve robust and efficient learning, becoming one step closer to strong AI models.

Interpretability and explainability
Although the terms interpretability and explainability are often used interchangeably (Reyes et al 2020, Huff et al 2021), it is important to stress the difference between the transparency of the model to the end user (i.e. interpretability), and the techniques used to provide insights about the inner workings of black-box models (i.e. explainability). In this section, we provide basic background knowledge about interpretability and explainability, so that the reader can make a conscious choice when aiming at the clinical implementation of ML methods. Please note that this is not an exhaustive review of all existing methods for interpretable and explainable ML, but rather an introductory section to these topics for the medical community. For extensive technical reviews we refer to Doshi-Velez and Kim (2017) and Arrieta et al (2020).

Interpretability
Interpretability is a property of models (and sometimes decisions) to be understandable by their users (Guidotti et al 2019, Arrieta et al 2020). Although the questions about interpretability have been around for a few decades already (Kodratoff 1994, Adadi and Berrada 2018), the vocabulary and its conceptualization were not so clear. Until 2015-2016, interpretability was identified in the ML literature by several different terms (interpretability, understandability, comprehensibility, etc) (Bibal and Frénay 2016). Furthermore, the problems of providing understandable, trustworthy, or justifiable models were confounded. With the growth in use of ML and, in particular, DL, in our society, the ML literature had to focus on interpretability.
In fact, interpretability is a concept that is hard to define because of its subjective nature (Bibal and Frénay 2016). For example, a model can be interpretable for a ML expert, but not for a lay person. In particular, a model that includes and manipulates information that a physician can easily understand can, on the contrary, be difficult to understand for a radiotherapy technician or a dosimetrist. Objectively quantifying interpretability is hard and has mostly been done in the ML literature through the complexity of models, excluding the content of these models. For instance, the bigger a decision tree is (i.e. the more nodes it has), the less interpretable it gets. Similarly, the more non-zero coefficients a linear model has (i.e. the less sparse it is), the less interpretable it is. Some models, especially those of a highly nonlinear nature like neural networks (see section 2.2.2), are assumed to be black boxes in practically all cases, as they are always structurally complex, even if they manipulate understandable information.
Although controversial (Rudin 2019), most researchers rely on the hypothesis that the more complex the model is, the better accuracy it has. For instance, if the underlying relationship between features and outcome is nonlinear, nonlinear models will likely achieve better accuracy than linear models. Similarly, shallow ML models are often outperformed by deep models (Liang et al 2019a, Chauhan et al 2019). Hence, what we trade for better accuracy is a higher complexity, and thus worse interpretability of ML models ( tool. The model achieved a high accuracy while being rather transparent, since the subpopulations defined by the leaf nodes of decision trees could easily be interpreted by human experts. Another example is the use of Generalized Additive Models, which create nonlinear transformations of individual variables, later combining them into a generalized linear model. The contribution of each variable can be interpreted from the individual graphs representing the nonlinear transformations (Caruana et al 2015). Yet another example is the recent work of Luna et al, who created a further improved decision tree by exploiting the mathematical connection between individual partitions and gradient boosting. The resulting decision trees were smaller and, as such, more accurate (Luna et al 2019). Despite the promising results obtained by these algorithms, whether they can obtain similar performance on more complicated medical problems remains to be seen.
The complexity of the model is only one of the multiple factors that are involved in the concept of interpretability (Guidotti et al 2019). Indeed, this feature does not suffice, as mathematically complex models can be made understandable through their representation. For instance, what makes decision trees interpretable is not the mathematical complexity behind those trees, but the fact that a tree representation is easy to follow by humans. After the complexity of models, the second factor is therefore the possible representations of this model. Third, as previously mentioned, the expertise of the user also plays a major role. The interpretability of decision trees and their useful representation can be low for someone who has never seen any decision tree, while it can be high for a ML expert.
Finally, the time provided to grasp the model is also a factor of interpretability. With an infinite amount of time, all models can be understood. What makes complex models hard to grasp is that they have to be understood in a short period of time. Therefore, the shorter this period of time is, the more difficult it is to interpret the model. This means that in a clinical environment, where the schedules are very tight, for a model to be interpretable, it must largely be less complex than in other contexts with milder time constraints.
Another way to see the aforementioned factors (e.g. complexity, representation, and time) is that if one of them is low, the others have to compensate. For instance, if the period of time to grasp is very short (e.g. in a case of medical emergency), then (1) the intrinsic complexity of the model must be low, and/or (2) the representation of the model must make it easy to grasp, and/or (3) the users (in this example, the emergency caretakers) must be trained to be experts in those models. Note that the concept of explainability (i.e. the ability to explain the inner workings of the model) is also determined by the same factors.

Explainability
When a model is not interpretable (i.e. it is a black box), but its scrutiny is still important or necessary (e.g. by law, to enable a safe clinical implementation or simply to increase trust of the medical practitioners), another property is considered: its explainability (Guidotti et al 2019, Arrieta et al 2020). Explainability is the capacity of a model to be explained, even if not totally interpretable. The question 'is the model understandable by itself?' (figure 4) is therefore the first to be answered, so that explanation methods are not used unnecessarily on a model that is already interpretable. If the answer is negative, there are different approaches to provide explanations, depending on the accessibility of the inner workings of the model (model-specific versus model-agnostic explanations), as well as on the nature of what should be explained (local versus global explanations).

Model-specific versus model-agnostic explanations
If the elements of the inner workings of the model are accessible, this information can be used to provide explanations about the model behavior. In these cases, the way the models are built can provide clues about the model decisions. These explanations are model-specific as they cannot be used, as they are, to explain a completely different model. Notice that having access to these elements of explanation differs from interpretability in that these elements do not fully explain the model. They are just characteristics of the models that can be exploited to gain insights about their inner workings. These clues may not be enough for gaining the trust of users or, in certain cases, for the law, but they are a first step that makes black boxes a bit more transparent. Two examples of model-specific explanations, detailed just below, are the feature importance provided by the out-of-bag error in bagging methods like random forests or boosted decision trees, and saliency maps when there is access to the gradients in artificial or convolutional neural networks (CNNs) (Simonyan et al 2013).
Random forests (Breiman 2001) use different subsets of instances when training the different decision trees in the forest. For each decision tree, the subset of instances that are not used to train the tree (i.e. that are out of the bag) can be used to compute a certain error called the out-of-bag error. The feature importance in the forest is then provided by the effect of perturbing the feature values on the out-of-bag error. If the out-of-bag error changes when perturbing the feature values, this means that the feature is important. For instance, a recent study used the out-of-bag error for highlighting the most important features of a ML model applied to detect lung cancer from CT radiomics and/or semantic features (Bashir et al 2019).
If the gradients of a model are accessible, they can be used to explain the model. For instance, when predicting an image class, CNNs back-propagate the decision on the class to the pixels through the gradients. Looking at the gradients when back-propagating has the effect of providing, for each pixel, the importance of the pixel on the prediction. The resulting image, where pixels are highlighted with respect to their contribution to the prediction, is called a saliency map (Simonyan et al 2013). Other gradient-based explanation techniques have been developed since then, like Grad-CAM (Gradient Class Activation Maps) and all its variants (Selvaraju et al 2017). Gradient-based techniques have been extensively used in medical applications to explain the performance of ML models (Singh et al 2020, Huff et al 2021). A popular example is the study by Zech et al (2018), already mentioned in section 2.2.2, where a CNN was trained to predict pneumonia from x-ray images (figure 2). By using class activation maps (CAM) (Zhou et al 2016), they discovered that the CNN was not looking at relevant areas for the disease in the x-ray images. Other examples include the study of Diamant et al (2019), where a CNN was trained to predict treatment outcome of patients with head and neck cancer, and Grad-CAMs were used to visualize the areas of the CTs that were found to be relevant for the prediction. Yet another example is the study by Liang et al (2019a), who trained a CNN to predict pneumonitis as a side effect from thoracic radiotherapy, and used Grad-CAM to locate the regions of the dose distribution that were relevant to the prediction.
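The gradient-based idea can be sketched on a toy example. The snippet below is only an illustration under strong assumptions: a logistic model stands in for a CNN, its weights and the flattened 2x2 'image' are hypothetical, and the gradient is written analytically instead of being obtained by back-propagation through a deep network.

```python
import math

def predict(pixels, weights, bias):
    """Toy differentiable classifier: logistic regression on flattened pixels."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def saliency(pixels, weights, bias):
    """Absolute gradient of the class score with respect to each pixel."""
    p = predict(pixels, weights, bias)
    # For a logistic output: d p / d pixel_k = p * (1 - p) * w_k
    return [abs(p * (1.0 - p) * w) for w in weights]

# Hypothetical flattened 2x2 image and hypothetical trained parameters.
image = [0.2, 0.9, 0.4, 0.1]
weights = [0.0, 3.0, -1.0, 0.5]
bias = -1.0

sal = saliency(image, weights, bias)
# Index of the pixel whose small change would affect the prediction the most.
most_salient = max(range(len(sal)), key=lambda k: sal[k])
```

Reshaping `sal` to the image grid and displaying it as a heat map would give the saliency map. Methods such as Grad-CAM work with gradients at intermediate feature maps rather than at the input pixels, which this sketch does not attempt to reproduce.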
Another idea is to test whether activations, in a chosen layer, relate to predefined concepts by defining Concept Activation Vectors (CAV) (Kim et al 2018). The idea is similar to saliency maps, except that it is the sensitivity of the activations with regard to predefined concepts that is investigated, instead of a sensitivity with regard to the input (e.g. the pixels). This strategy is sometimes called explanations through semantics (Reyes et al 2020), since it allows us to explain the features learned by the model to the users in terms of human-understandable concepts. Concept Vectors have not yet been used in many medical applications, but a good illustrative example is the study from Graziani et al (2020). They applied CAV and an extended version of it, Regression Concept Vectors, to provide explanations for CNNs trained to diagnose breast cancer from histopathological Whole Slide Imaging and retinopathy of prematurity from retinal photographs. They used concepts such as the area or the contrast of the image to describe the visual aspect of the learned features.
In some cases, the black box does not provide any information about its inner workings. This can be, for instance, because the model is property of a company that does not want to provide access to the inside of its black box. In such a case, generic methods for explaining black boxes (also called model-agnostic methods) are used. These agnostic methods work on analyzing the decisions made by the black box when particular inputs are provided.
Agnostic feature importance highlights the input features that seem to be the most important ones when making a decision (Fisher et al 2019). One particularly well-known technique of agnostic feature importance is SHapley Additive exPlanations (SHAP) (Lundberg and Lee 2017). Recently, SHAP has been used to provide explanations of a model trained to predict locoregional relapse for oropharyngeal cancers (Giraud et al 2020), to interpret a model trained to predict 10-year overall survival of breast cancer patients (Jansen et al 2020), or yet to produce heat maps that visualize the areas of melanoma images that are most indicative of the disease (Shorfuzzaman 2021).
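As an illustration of agnostic feature importance, the sketch below estimates importance by shuffling one input column at a time and measuring the increase in error, using only queries to the model. This is a permutation-based estimate, not SHAP, and the `black_box` function and the synthetic data are hypothetical.

```python
import random

def black_box(row):
    """Hypothetical opaque model: only the first two features matter."""
    return 1 if 2.0 * row[0] - 1.0 * row[1] > 0.5 else 0

def error_rate(model, X, y):
    return sum(model(r) != t for r, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean increase in error when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = error_rate(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [r[j] for r in X]
            rng.shuffle(column)
            # Rebuild the dataset with feature j replaced by its shuffled values.
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, column)]
            increase += error_rate(model, X_perm, y) - baseline
        importances.append(increase / n_repeats)
    return importances

rng = random.Random(42)
X = [[rng.random() for _ in range(3)] for _ in range(300)]
y = [black_box(r) for r in X]  # labels that the black box reproduces exactly
importances = permutation_importance(black_box, X, y)
```

In this toy setting the third feature is never used by the model, so shuffling it leaves the error unchanged, while the first feature, which has the largest coefficient, receives the largest importance.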
Notice that 'model-agnostic' can have two different meanings in the literature. The first one, presented here, considers that the explanation is model-agnostic because no assumption is made about the inner workings of the black box (Guidotti et al 2019, Molnar 2019). The second meaning of 'model-agnostic' is that the explanation technique can be applied to a broad range of different models (Arrieta et al 2020, Das and Rad 2020). As a result of this distinction, saliency maps are not model-agnostic under the first meaning (because the inner workings are considered through the gradients), but they are under the second (because saliency maps can be developed for all differentiable models).

Local versus global explanations
When a local explanation is required, the objective is to provide an explanation that is faithful to the behavior of a black box for a particular decision, and for the decisions on very similar input data. Notice that the categories model-specific/agnostic and global/local are complementary to each other. For instance, the flagship method among model-agnostic local explanation methods is Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al 2016). The idea of LIME is to learn an interpretable model (e.g. a linear model) based on instances that are obtained by perturbing the feature values of the instance for which the decision needs to be explained (figure 5). By perturbing the target instance, a neighborhood around this instance is created and the black box is queried for this neighborhood. The interpretable model is then trained to reproduce the decisions of the black box for the instances in this neighborhood, hence the local aspect of the explanation. Many variants of LIME have been developed, for instance, by making the perturbations in such a way that the neighborhood is realistic (e.g. randomly perturbing pixels of face images will not provide another face image; a smarter perturbation technique is needed to obtain that (Ivanovs et al 2021)). Applications of LIME in the medical field remain rare, but an illustrative example is the study by Palatnik de Sousa et al (2019), who generated explanations on how a CNN detects tumor tissue for lymph node metastasis in patches extracted from histology whole slide images. Another example is the study by Jansen et al (2020), who also used LIME to interpret a model trained to predict 10-year overall survival of breast cancer patients.
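The perturb, query and fit steps described above can be sketched as follows. This is a simplified illustration of the idea and not the reference LIME implementation: the `black_box` function, the Gaussian perturbations, the exponential proximity kernel and all parameter values are assumptions chosen for the example.

```python
import math
import random

def black_box(x):
    """Hypothetical nonlinear model returning a risk score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - x[1] ** 2)))

def solve(A, b):
    """Solve the linear system A w = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lime_explain(model, instance, n_samples=500, scale=0.3, width=0.5, seed=0):
    """Fit a locally weighted linear surrogate around `instance`."""
    rng = random.Random(seed)
    d = len(instance)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [v + rng.gauss(0.0, scale) for v in instance]   # perturbed neighbor
        dist2 = sum((a - b) ** 2 for a, b in zip(z, instance))
        rows.append([1.0] + [a - b for a, b in zip(z, instance)])  # intercept + offsets
        targets.append(model(z))                            # query the black box
        weights.append(math.exp(-dist2 / width ** 2))       # closer samples weigh more
    # Weighted least squares through the normal equations (X^T W X) w = X^T W y.
    A = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(d + 1)]
         for i in range(d + 1)]
    b = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
         for i in range(d + 1)]
    coef = solve(A, b)
    return coef[0], coef[1:]

intercept, local_weights = lime_explain(black_box, [0.0, 1.0])
```

The signs and magnitudes of `local_weights` describe how each feature pushes the prediction up or down in the neighborhood of the explained instance only; a different instance would generally yield different weights.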
Regarding model-specific local explanations, the attention mechanism is a good example. Attention-based neural networks are models that contain one or several layers designed to focus on the relevant elements of the input for a particular prediction.
In the case of a global explanation, like agnostic (Gevrey et al 2003, Fisher et al 2019) or specific (Breiman 2001) global feature importance explanations, the entire inner workings of the black box are approximated. For instance, a neural network can be co-learned with a decision tree to (i) produce a better decision tree thanks to the neural network and (ii) obtain an interpretable representation of the neural network via the decision tree (Nanfack et al 2021). Another example is the neural decision tree technique proposed in Yang et al (2018), where any setting of the weights corresponds to a specific decision tree. Notice that a global explanation can be obtained by combining several local explanations that are performed on sufficiently different input instances (Setzu et al 2020). However, the issue is that combining many interpretable models can make the whole combination uninterpretable (e.g. the combination of decision trees in a random forest), which does not solve the problem of explaining the black box.

New trends and limitations
Today, many conferences, workshops and special issues in journals focus on interpretability and explainability. This interest leads to an ever growing literature on the subject. In particular, one hot topic, in addition to the post-hoc methods like LIME, is the subject of disentangled neural networks (Chen et al 2020b). The idea behind neural network disentanglement is to combine the performance of neural networks with the need for interpretability and explanations. In disentangled neural networks, while the network is optimized to solve the problem, the neurons and filters are also constrained to correspond to concepts that are easily identifiable by humans. In the end, when the network is trained and makes a prediction, the activation of the neurons provides important clues on the concepts that have been used to make the decision. Medical applications of disentangled neural networks are rare, since it is a rather new field. But a good example is the work from Chartsias et al (2019), who explored a factorisation to decompose the input into spatial anatomical and imaging factors. Their model was applied to analyzing cardiovascular MR and CT images. Another example is the study from Meng et al (2021), who applied disentangled representations to fetal ultrasound images.

Figure 5. Workflow illustrating the use of LIME (Local Interpretable Model-agnostic Explanations) (Ribeiro et al 2016). The idea of LIME is to learn an interpretable model (e.g. a linear model) to explain individual predictions. In the example, a black box model receives a set of variables for a new patient (i.e. age, smoker, ...) and classifies the patient as having lung cancer. The LIME model then provides the user with information (i.e. explanations) about the features that most contributed to the prediction. 'Age' and 'Sex' did not contribute at all, 'Smoker' and 'Weight-loss' were against it, while 'PET-SUV', 'Histology', and 'Coughing' contributed to the positive lung cancer classification.
Another hot topic concerns the limitation of attention as an explanation (Jain and Wallace 2019, Wiegreffe and Pinter 2019). While the debate converges towards the idea that attention may not be an explanation, solutions have been developed to address the issue. In particular, effective attention has been found to be the part of attention that can be considered as an explanation (Brunner et al 2019). The idea would therefore be to decompose attention weights into two parts and to use the effective attention part to explain the model.
In general, an important point for discussion is the accuracy of the explanations. For the cases where the approximation of the black box by the explanation is correct, the explanation gives truthful information about how different variables interact to result in a prediction. However, for those cases where the approximation is not correct, algorithms designed to provide explanations about the original black-box model are not a faithful representation of the original model (Jacovi and Goldberg 2020). As such, they provide a false and possibly dangerous sense of confidence. Unfortunately, it is not possible to know beforehand whether the approximation made by the explanation is accurate.
Some authors are also critical of the kind of explanation that is under study. Most, if not all, explanation techniques suppose that an explanation should only be faithful to the model (i.e. accurately reflecting its reasoning) (Jacovi and Goldberg 2020). However, another important aspect of explanations is their plausibility (i.e. how convincing it is to humans) (Riedl 2019). Indeed, one could accept to lose a reasonable amount of faithfulness to make the explanation plausible and, thus, useful, for the user.
Finally, besides the degree of faithfulness and plausibility, the explanation may not be lawful enough (Bibal et al 2021). Indeed, the strength and the type of the explanation can also be constrained by the law. For instance, a feature importance method can have a reasonable level of faithfulness and plausibility, but can fail as an explanation with respect to the law.

Data-model dependency
As a consequence of the intrinsic data-driven nature of ML algorithms, many of the risks associated with their use are related to the data itself and how it is processed inside the model (see section 2). Thus, in addition to understanding the behavior of ML models (section 3.1), acting on the data and analyzing how the model performance depends on it is key to enable a safe and efficient clinical implementation. In the following, we present several lines of action that can help to identify and reduce the risks of failure for ML models in the medical context, as well as to ensure an efficient implementation and use.

Data curation and data augmentation
The most straightforward techniques to ensure sufficient quality and quantity for the data, before training the ML model, are data curation and data augmentation. First, data curation can help detect any errors in the labels or identify missing and incomplete records, among other issues. Second, data augmentation can increase the variability in the training set, thus helping better represent the patient population under study (see section 2).
Although most of the data curation process is currently done with very simple methods (e.g. scripts for data visualization, dictionaries for correct labeling (Mayo et al 2016, Schuler et al 2019), etc), some groups have recently started to explore the use of ML models for data curation and label cleaning specifically. For instance, Yang et al (2020b) used a 3D Non-local Network with Voting to standardize anatomical nomenclature in radiotherapy treatments. Another interesting approach is the 'label cleaning network' or CleanNet, introduced by Lee et al (2018), although the latter has only been applied to natural images. A further approach is the one presented by Dakka et al (2021), who trained multiple ML architectures on the data to be cleansed, with several cross-validation sets. The ML models are applied back to the same training (uncleansed) dataset to infer the labels, and the samples that cannot be consistently classified correctly are considered as poor-quality data. They called the method 'untrainable data cleansing', and illustrated its successful performance in several medical classification problems. Other groups have concentrated efforts in developing crowd-powered algorithms for large-scale medical image annotation (Heim et al 2018). In addition to the data cleaning, preprocessing methods can be used to increase the consistency of the data. For instance, for medical images, it is important to pay attention to things such as the voxel size, the image size, range of the image voxel values, registration between multimodal images, etc. Typical pre-processing techniques are image resampling, cropping and (histogram) normalization. For a comprehensive review of data curation tools and open-access platforms we refer elsewhere (Willemink et al 2020). Regarding data augmentation, it works particularly well when dealing with images as input data. 
Two types of image data augmentation techniques exist: basic image manipulations and DL approaches (Shorten and Khoshgoftaar 2019). Basic image manipulation techniques consist of geometric image transformations, such as image flipping, translations, random cropping and rotations, and photometric image transformations, like the addition of noise, mixing images and random erasing. Beyond those more basic approaches, adversarial training and neural style transfer are ML-based strategies that can be used for data augmentation. These techniques use neural networks to add transformations to the original data. In the case of adversarial training, two networks compete against each other: the first network (generator) generates synthetic images (the augmented data), while the second network (discriminator) tries to discriminate between real and synthetic images. Thus, the final transformations to generate the augmented data are those that are able to fool the discriminator network, leading to synthetic images that look truly real and have the same characteristics as the original set. In neural style transfer, the transformations are predefined (e.g. night to day) and a single network is used to turn the original data into the new style (Ma et al 2019, Gawlikowski et al 2021). For a complete review of data augmentation techniques we refer to the survey in Shorten and Khoshgoftaar (2019). Data augmentation is nowadays used in most medical imaging applications to increase the number of training samples and improve generalization (Nalepa et al 2019, Chlap et al 2021). For instance, Meyer et al (2021) used a data augmentation approach based on Gaussian Mixture Models to increase the variability of a given dataset of MR images in terms of intensities and contrast. This helped to increase the generalization of ML models trained for segmentation of MR images from different scanners. 
In a similar study, the authors used adversarial training (GANs) to generate synthetic data to overcome generalization issues across different MR manufacturers. Another example is the study by Zhang et al (2020c), who applied a series of stacked transformations to each image when training the ML model. The idea was to simulate the expected domain shift for a specific medical imaging modality with extensive data augmentation on the source domain, thus improving the generalization to the shifted domains. They applied their model to segment different organs in MR and ultrasound images, showing promising results.
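The basic image manipulations mentioned above (flipping, noise addition and random erasing) can be sketched in a few lines. The example is a toy illustration on a hypothetical 4x4 image stored as nested lists; the probabilities, noise level and patch size are arbitrary assumptions, and whether a given transformation is anatomically meaningful has to be judged for each clinical application.

```python
import random

def flip_horizontal(img):
    """Geometric transformation: mirror each row."""
    return [row[::-1] for row in img]

def add_gaussian_noise(img, sigma, rng):
    """Photometric transformation: add zero-mean Gaussian noise to every pixel."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in img]

def random_erase(img, size, rng, fill=0.0):
    """Photometric transformation: blank a random size x size square patch."""
    h, w = len(img), len(img[0])
    top, left = rng.randrange(h - size + 1), rng.randrange(w - size + 1)
    out = [row[:] for row in img]
    for i in range(top, top + size):
        for j in range(left, left + size):
            out[i][j] = fill
    return out

def augment(img, rng):
    """Apply a random chain of basic manipulations to one image."""
    out = img
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    out = add_gaussian_noise(out, sigma=0.01, rng=rng)
    return random_erase(out, size=2, rng=rng)

rng = random.Random(0)
image = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]  # toy 4x4 image
augmented_set = [augment(image, rng) for _ in range(5)]
```

Each call to `augment` returns a new image and leaves the original untouched, so the augmented copies can simply be appended to the training set.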
Although data augmentation is typically used to increase the training dataset, the same techniques can also be applied during the testing phase, in order to inspect the robustness and generalization of the ML model to a well-varied data distribution. This is known as test-time data augmentation (Nalepa et al 2019, Moshkov et al 2020). For instance, Wang et al (2019b) investigated how test-time augmentation can improve the performance of a ML model for brain tumor segmentation. They augmented the image by 3D rotation, flipping, scaling, and adding random noise. After using test-time augmentation, their results appeared to be more spatially consistent. Recently, D'Amour et al (2020) proposed a well-controlled framework to analyze the generalization capacity of ML models with the so-called 'stress-testing'. The idea is to apply customized tests designed to reproduce the challenges that the model will encounter when deployed in the actual (clinical) world. In particular, two of the proposed tests (i.e. shifted performance and contrastive evaluation) aim to test the model with instances from a shifted domain. This can easily be done with test-time data augmentation, by changing the resolution, contrast, or noise level of the images. Although the concept of stress testing is rather new, the medical community is being encouraged to apply it before clinically implementing ML models (Eche et al 2021). For instance, Young et al (2021) applied stress-testing for ML models trained to diagnose skin lesions. They found inconsistent predictions on images captured repeatedly in the same setting or subjected to simple transformations (e.g. rotation).
In addition, test-time data augmentation can be used as a means to quantify the uncertainty associated with the prediction (see section 3.2.3) of the ML model (Ayhan and Berens 2018, Gawlikowski et al 2021). For instance, in the previous example, Wang et al (2019b) used test-time data augmentation to generate uncertainty maps for the segmented brain volumes.
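A minimal sketch of test-time augmentation used for uncertainty estimation is given below. It assumes a hypothetical `toy_segmenter` acting on a 1D signal and uses a single augmentation (flipping); in practice, more transformations and a real segmentation model would be used, and the predictions must always be mapped back to the original frame before they are combined.

```python
import statistics

def flip(signal):
    return signal[::-1]

def toy_segmenter(signal):
    """Hypothetical model: per-pixel foreground probability, slightly asymmetric."""
    n = len(signal)
    return [min(1.0, max(0.0, v + 0.1 * k / (n - 1))) for k, v in enumerate(signal)]

def tta_predict(model, signal):
    """Average predictions over augmented copies, mapped back to the input frame."""
    preds = [
        model(signal),               # identity
        flip(model(flip(signal))),   # flip, predict, then undo the flip
    ]
    mean = [statistics.mean(p) for p in zip(*preds)]
    spread = [statistics.pstdev(p) for p in zip(*preds)]  # per-pixel disagreement
    return mean, spread

signal = [0.1, 0.2, 0.8, 0.9, 0.3]
mean_map, uncertainty_map = tta_predict(toy_segmenter, signal)
```

Pixels where the augmented predictions disagree receive a larger value in `uncertainty_map`, which is the quantity that would be displayed as an uncertainty map.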

Prior and domain-specific knowledge
The learning capability of ML models critically depends on the information conveyed by the data used to train them. Beyond this obvious statement, which has been discussed in section 2.1, we can also guide our ML models with additional, highly relevant information to improve learning efficiency. Incorporating prior and domain-specific knowledge into ML models can help achieve this goal and yield more robust models. There are several ways to incorporate this knowledge into an ML model (Muralidhar et al 2018a, Deng et al 2020, Dash et al 2022) and here we present three common approaches: input data, loss function and hand-crafted features.

Input data
Sometimes, we attempt to train the model with incomplete information. For instance, medical images are typically associated with more information than what they depict. Certain anatomical features might result from specific diseases or medical procedures (e.g. surgical removal of the tumor), while leaving cues in the image that are too subtle for the model to pick up. Similarly, a given radiotherapy dose distribution is the result of physician and patient choices regarding secondary effects, treatment protocols, and so on, and providing this information directly as a side channel would ease learning. Training the ML model with the bare images, without including this prior and domain-specific information, can result in poor performance. A common strategy to include this prior and/or domain knowledge is to modify the input itself. This includes changing the size and/or the format of the input: adding more input channels for CNN models, mixing images and text data as input, etc. When adding more input channels but keeping the same data type (e.g. stacking extra images such as MR or PET on top of CT), no significant changes need to be made to the architecture of the model. However, when using heterogeneous data types (e.g. images, text, scalars, ...) several options are possible as to where to merge these sources in the network data path. 
We refer here to the early fusion, joint fusion and late fusion strategies (figure 6). In early fusion, the different input modalities are joined before being fed into a single model. This fusion is done through concatenation and/or pooling, among other strategies. Joint (or intermediate) fusion consists in joining the features learned by the first layers of the network with the other input modalities, before feeding this joint data into a final model. Finally, late fusion refers to combining the outputs of multiple models to make a decision.
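The three strategies can be contrasted with a minimal sketch. Everything below is an illustrative assumption rather than a method from the cited studies: the toy inputs (a flattened four-pixel "image" and two clinical scalars), the fixed weights, the single dense layers standing in for real networks, and the plain average used for late fusion.

```python
import math

def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def early_fusion(image, clinical, model):
    # Raw modalities are concatenated before entering a single model.
    fused_input = image + clinical
    return sigmoid(dense(fused_input, model["w"], model["b"])[0])

def joint_fusion(image, clinical, encoder, head):
    # Features learned from the image are joined with the other modality,
    # then the joint vector is fed into a final model.
    image_features = relu(dense(image, encoder["w"], encoder["b"]))
    return sigmoid(dense(image_features + clinical, head["w"], head["b"])[0])

def late_fusion(image, clinical, image_model, clinical_model):
    # One model per modality; only their outputs are combined (here: averaged).
    p_image = sigmoid(dense(image, image_model["w"], image_model["b"])[0])
    p_clinical = sigmoid(dense(clinical, clinical_model["w"], clinical_model["b"])[0])
    return 0.5 * (p_image + p_clinical)

# Hypothetical toy inputs and weights, chosen only to make the sketch runnable.
image = [0.2, 0.8, 0.5, 0.1]
clinical = [1.0, 0.3]
single_model = {"w": [[0.1] * 6], "b": [0.0]}
encoder = {"w": [[0.5, -0.5, 0.5, -0.5], [0.1, 0.1, 0.1, 0.1]], "b": [0.0, 0.0]}
head = {"w": [[1.0, 1.0, 0.2, 0.2]], "b": [0.0]}
image_model = {"w": [[0.1] * 4], "b": [0.0]}
clinical_model = {"w": [[0.1] * 2], "b": [0.0]}

p_early = early_fusion(image, clinical, single_model)
p_joint = joint_fusion(image, clinical, encoder, head)
p_late = late_fusion(image, clinical, image_model, clinical_model)
```

The only difference between the three functions is the point in the data path where the modalities meet: at the input, after an image encoder, or at the output.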
Examples of incorporating domain knowledge into the input data are many. For instance, a study looking into volumetric dose calculation using DL investigated the use of 3D voxel-based distance from source, central beamline distance, radiological depth, and volume density, as entire volumetric inputs (Kontaxis et al 2020). Other photon and proton dose calculation studies investigated feeding a first-order prior of the dose calculation as input into the model. These studies demonstrate that, by including these additional domain knowledge-focused inputs, the models outperform those using only more basic input data.

Loss function
In supervised learning, for some input $x_i$, the loss function $L(y_i, \hat{y}_i)$ measures the mismatch or error between the desired output $y_i$ and the actual output $\hat{y}_i = f_\theta(x_i)$ of the model with its current parametrization $\theta$. Optimal parameters are found by minimizing the loss over all pairs $(x_i, y_i)$.
Incorporating domain knowledge in the loss function aims at steering the model to prioritize error minimization for the most relevant data instances (patients), areas (in images), or metrics. Typically, it is done by adding to the loss function penalty terms that encourage outputs with properties imposed by the domain knowledge (output regularization). Commonly used losses in ML, like the mean squared error (MSE) and cross-entropy (CE), are general, domain-agnostic losses that can be applied to many regression and classification problems, respectively. When paired with the proper activation functions in the output layer, their gradients are well behaved, which makes the optimization process converge efficiently. However, these generic losses are unable to minimize errors in any targeted manner. In contrast, domain-adapted losses can achieve substantially superior performance for ML applications (Muralidhar et al 2018b). This was found to be especially impactful in situations where data is limited and of poor quality, a scenario that is often encountered in the medical field. However, due to the well-behaved gradients of most domain-agnostic losses, it is still preferable to use a combination of the two kinds of loss. Highly specific domain-adapted losses will likely have a poorly behaved gradient, and, thus, a well-behaved general loss will be the main driver at the beginning of the optimization. The domain-adapted loss can then fine-tune the model further once it gets close to the minimum.
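A minimal sketch of such a combination is given below. The text does not prescribe how the two losses are mixed, so the linear warm-up schedule, the region-weighted penalty and all names (`region_penalty`, `domain_weight`, `relevance`) are hypothetical choices made for illustration only.

```python
def mse(y_true, y_pred):
    """Generic, domain-agnostic loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def region_penalty(y_true, y_pred, relevance):
    """Hypothetical domain-adapted term: squared error weighted by a
    per-voxel clinical relevance in [0, 1] (e.g. 1 inside the target)."""
    total_weight = sum(relevance)
    if total_weight == 0:
        return 0.0
    weighted = sum(r * (t - p) ** 2 for t, p, r in zip(y_true, y_pred, relevance))
    return weighted / total_weight

def domain_weight(epoch, warmup_epochs=10, max_weight=1.0):
    """Assumed schedule: the domain term is ramped up linearly, so the
    well-behaved generic loss drives the early optimization."""
    return max_weight * min(1.0, epoch / warmup_epochs)

def combined_loss(y_true, y_pred, relevance, epoch):
    generic = mse(y_true, y_pred)
    domain = region_penalty(y_true, y_pred, relevance)
    return generic + domain_weight(epoch) * domain

# Toy example: the only error sits in the single clinically relevant voxel.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
relevance = [0.0, 0.0, 0.0, 1.0]
loss_start = combined_loss(y_true, y_pred, relevance, epoch=0)
loss_late = combined_loss(y_true, y_pred, relevance, epoch=10)
```

At epoch 0 only the MSE contributes; once the warm-up is over, the same prediction error is penalized more heavily because it falls in the relevant region.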
Early works on including domain knowledge in the loss function date from the mid-nineties (Fu 1995, Dash et al 2022). The penalty terms were based on regularizing embeddings, which are low-dimensional representations of the input variables. The complexity of the embeddings was penalized with first-order logic (Rocktäschel et al 2014). In traditional ML models, prior knowledge can also be integrated into the loss function to guide the feature selection process. For instance, Guan et al (2020) developed a know-guided random forest to incorporate prior knowledge from multiple domains in biomarker discovery. The authors added a penalty coefficient to the Gini index. In current DL models, integrating domain knowledge in the loss function is an active field of research. For instance, a recent study investigated the use of both human-designed and learned domain-adapted losses in dose prediction for radiation therapy of prostate cancer with CNNs (Nguyen et al 2020). They included a differentiable approximation of the dose volume histogram in the loss function, which improved the prediction accuracy, particularly for dose-volume metrics. Furthermore, they investigated the inclusion of a learned domain-adapted loss in the form of an adversarial (ADV) loss. Also for a dose prediction task with CNNs, in this case for breast cancer patients, Bai et al (2021) proposed a dynamically scaled variant of the classical MSE loss, with a scaling factor that decreases in low-dose regions. This 'sharp-loss', as they coined it, aimed at solving the data imbalance issue of dose prediction problems where the region of clinical concern accounts for only a small part of the whole image. Another interesting approach is the focal loss proposed by Lin et al (2017), which enables the DL model to automatically focus on the most important training examples, by down-weighting well-classified examples and relying on a defined prior probability for the relevant classes, which helps to overcome data imbalance issues.
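For reference, the binary focal loss of Lin et al (2017) can be written as $FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$, where $p_t$ is the probability the model assigns to the true class. The sketch below is a plain-Python illustration of that formula, not the implementation used in any of the studies cited here; the toy probabilities are assumptions.

```python
import math

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss.

    y_true: list of 0/1 labels; p_pred: predicted probability of class 1.
    gamma down-weights well-classified examples, alpha balances the classes.
    """
    losses = []
    for y, p in zip(y_true, p_pred):
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        p_t = min(max(p_t, eps), 1.0)           # avoid log(0)
        losses.append(-alpha_t * (1.0 - p_t) ** gamma * math.log(p_t))
    return sum(losses) / len(losses)

# An easy positive (p = 0.9) versus a hard positive (p = 0.1).
easy = focal_loss([1], [0.9])
hard = focal_loss([1], [0.1])

# With gamma = 0 the modulating factor vanishes and only the
# alpha-weighted cross-entropy remains.
easy_ce = focal_loss([1], [0.9], gamma=0.0)
hard_ce = focal_loss([1], [0.1], gamma=0.0)
```

Comparing `hard / easy` with `hard_ce / easy_ce` shows how the factor $(1 - p_t)^\gamma$ shifts the relative weight of the loss towards hard examples.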
Recently, Bird et al (2021) developed a DL model to generate synthetic CT for MR-only radiotherapy, and they used a focal loss function to enhance performance in the hard-to-predict bone region. Similar to the focal loss concept, He et al (2020) designed a domain-adapted loss for renal artery segmentation, which sampled the loss region dynamically according to the intra-image segmentation quality, so that hard-to-segment regions, such as edges, surfaces, ends, etc, are emphasized and their segmentation quality is enhanced. Instead of focusing on specific regions, other studies have explored the incorporation of anatomical priors as output regularization terms in the loss function. For instance, a star shape prior was encoded as a new loss term to improve the segmentation of skin lesions from their surrounding healthy skin (Mirikharaji and Hamarneh 2018). The model penalized non-star-shaped segments and guaranteed a global structure in the final segmentation, thus achieving superior results in the ISBI 2017 challenge for skin segmentation. Similar approaches of incorporating anatomical priors as output regularization terms in the loss function can be found for the segmentation of other structures such as the liver (Zheng et al).

Another interesting approach is to constrain the loss function to fit observed data or to yield predictions that approximately satisfy a given set of physical rules. This has been coined physics-informed ML, and it is becoming increasingly popular. Although it is still not widely applied in the medical domain, some groups are exploring this approach. For instance, Kissas et al (2020) applied physics-informed neural networks to predict arterial blood pressure from non-invasive 4D flow MRI data. They used insights from computational fluid dynamics to ensure that the ML model yields physically consistent predictions.
In addition to improved and more efficient learning, physics-informed ML models have been claimed to have increased interpretability (Rudin et al 2021).
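The idea of a physics-informed loss can be sketched on a toy problem. The example below is an assumption made purely for illustration and is unrelated to the fluid-dynamics models of the cited work: the "physics" is the decay law $du/dt + k u = 0$, its residual is evaluated with central finite differences on the model predictions, and it is added to a data-fit term with a hypothetical weight.

```python
import math

def ode_residual(u, dt, k):
    """Residual of du/dt + k*u = 0 at interior grid points,
    using central differences on a uniform time grid."""
    return [(u[i + 1] - u[i - 1]) / (2 * dt) + k * u[i] for i in range(1, len(u) - 1)]

def physics_informed_loss(u_pred, observed, dt, k, weight=1.0):
    """Data misfit at the few observed indices plus a weighted physics penalty.

    observed: dict mapping grid index -> measured value.
    """
    data_term = sum((u_pred[i] - v) ** 2 for i, v in observed.items()) / len(observed)
    residual = ode_residual(u_pred, dt, k)
    physics_term = sum(r * r for r in residual) / len(residual)
    return data_term + weight * physics_term

k, dt = 2.0, 0.01
times = [i * dt for i in range(101)]
observed = {0: 1.0, 50: math.exp(-k * times[50])}   # only two measurements

consistent = [math.exp(-k * t) for t in times]      # obeys the decay law
inconsistent = [1.0 for _ in times]                 # fits the first point only

loss_consistent = physics_informed_loss(consistent, observed, dt, k)
loss_inconsistent = physics_informed_loss(inconsistent, observed, dt, k)
```

The constant prediction matches the first measurement exactly, yet it is heavily penalized because it violates the assumed physical rule everywhere on the grid.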

Handcrafted features (a.k.a. feature engineering)
Beyond the loss function, another way to better guide the model through the learning process, and to interpret it correctly, is to include domain-specific knowledge in the feature selection process. Classical (shallow) ML models rely on humans to define specific features to extract from the data in order to guide the learning process (i.e. handcrafted features or feature engineering). In contrast, modern (deep) ML models (i.e. DL) rely on learning generic, parameterized features, turning feature engineering into an entirely automatic learning process for the model. This has been one of the reasons for the success of DL, since training a model can be done end to end without any human intervention. Moreover, the performance of classical ML models was limited by the adequacy of manually picked features, whereas DL models are assumed to have an improved performance thanks to the many degrees of freedom provided by generic trainable features. However, the automatic feature extraction of modern DL models can sometimes be a double-edged sword. Indeed, a DL model can easily extract thousands of features and, unlike handcrafted ones, these features are very hard for humans to interpret and to relate to relevant concepts in medical applications. Another pitfall of blind feature learning is that, due to the low control over many generic features, there is an increased risk of getting confounding features that are efficient but spurious, irrelevant, or poorly interpretable (see section 2.2). Thus, incorporating prior and domain knowledge into the feature selection process can help improve the performance of ML models and also their interpretability. Although using handcrafted features might seem a step back in the evolution of ML, a few studies are starting to follow this trend for medical applications (Welch et al 2020).
For instance, radiomics (Lambin et al 2012) is a typical use of ML in medical imaging and oncology relying on handcrafted features. Radiomics assumes that images convey useful but not necessarily visible information for medical tasks like prognosis or therapeutic response prediction (Guiot et al 2022, Walls et al 2022). Feature extraction and selection are then supposed to reveal this information, sometimes called a radiomic signature, gathering a limited number of task-relevant features, while also allowing for automation. After segmentation of the volume of interest, typically a tumor, several types of features can be extracted from it. Geometric features include size measurements (diameters, volumetry, etc) and shape descriptors (sphericity, compactness, etc). Image intensity is characterized by histogram features, like energy, entropy, mean, variance, kurtosis, and other similar statistics, which are sometimes specific to imaging modalities like SUVs in PET (Leijenaar et al 2015, Orlhac et al 2021, Jiménez Londoño et al 2022). These first-order intensity features are complemented by second-order features that characterize textures in the images, i.e. the local relationships between nearby image voxels. Those features originate from tools like Haralick's gray-level co-occurrence matrix (GLCM) (Haralick et al 1973), the gray-level run-length matrix (GLRLM) (Tang 1998, Tustison and Gee 2011), the gray-level size zone matrix (GLSZM) (Thibault et al 2013), the gray-level dependence matrix (GLDM) (Sun and Wee 1982), and the neighborhood gray tone difference matrix (NGTDM) (Amadasun and King 1989). Yet other, higher-order texture characterizations can come from image decompositions in Fourier/Gabor or wavelet/fractal spaces. All these image-related radiomic features can obviously be combined with features of various other origins, like genomics (Lu et al 2021), histology, clinical scores or indicators, etc.
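To make the distinction between first-order and second-order features concrete, the sketch below computes a few histogram statistics and a symmetric, normalized GLCM for one pixel offset. It is a simplified illustration under stated assumptions: the image is a small 2D array already discretized into integer gray levels, and the exact feature definitions of dedicated radiomics toolkits may differ in normalization and binning.

```python
import math
from collections import Counter

def first_order_features(values):
    """Histogram statistics of a list of discretized intensities."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    probabilities = [count / n for count in Counter(values).values()]
    entropy = -sum(p * math.log2(p) for p in probabilities)
    energy = sum(v * v for v in values)
    fourth_moment = sum((v - mean) ** 4 for v in values) / n
    kurtosis = fourth_moment / variance ** 2 if variance > 0 else float("nan")
    return {"mean": mean, "variance": variance, "entropy": entropy,
            "energy": energy, "kurtosis": kurtosis}

def glcm(image, levels, offset=(0, 1)):
    """Symmetric, normalized gray-level co-occurrence matrix for one offset."""
    d_row, d_col = offset
    rows, cols = len(image), len(image[0])
    matrix = [[0.0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                matrix[i][j] += 1      # count the pair ...
                matrix[j][i] += 1      # ... and its mirror (symmetric GLCM)
    total = sum(sum(row) for row in matrix)
    return [[v / total for v in row] for row in matrix]

def glcm_contrast(p):
    """Texture contrast: large when neighboring gray levels differ strongly."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))

# Toy 4x4 image with four gray levels (0..3).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
p = glcm(image, levels=4)
contrast = glcm_contrast(p)
stats = first_order_features([v for row in image for v in row])
```

The first-order statistics ignore where the intensities are located, whereas the GLCM depends on which gray levels are neighbors, which is what makes it a texture descriptor.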
Since it slightly predates the popularization of DL in medicine, radiomics has historically relied on a classical ML pipeline, starting with handcrafted image preprocessing and feature extraction, followed by optional feature selection and traditional models for classification or regression. However, the field might evolve towards more end-to-end DL models. Another study developed what they called Expert Augmented Machine Learning (EAML), a methodology to automatically acquire problem-specific priors and incorporate them into the ML model (Gennatas et al 2020). These approaches have been shown to learn more efficiently, increase the interpretability of the ML model by using concepts that medical experts are familiar with, improve the generalization of the model (including out-of-sample distributions), and facilitate the detection of hidden confounders (Gennatas et al 2020).

Uncertainty quantification
Another key aspect to ensure a safe clinical implementation of ML models is to be able to quantify their risk of failure. This can be done by estimating the uncertainty associated with the prediction that the ML model yields for a given input sample. A prediction with a high uncertainty is then a way for the ML model to tell us 'I am not confident about the answer' or even, in extreme cases, 'I don't know the answer'. Uncertainty quantification tools can thus alert clinicians when the confidence of the ML model in its output is too low and let them take over to complete the task. Implementing such QA tools is crucial to gain clinicians' trust in ML technology, since it helps identify the limitations of ML models and avoid the risks associated with uncertain predictions (Begoli et al 2019, Kompa et al 2021). Several reasons can make an ML prediction uncertain, but given the data-driven nature of ML, many of them are related to the quantity and quality of the data used for training, as well as to the characteristics of the new input sample. In this context, uncertainty is typically categorized into two types: aleatoric and epistemic uncertainty (Anon 2009, Hüllermeier and Waegeman 2021). Aleatoric uncertainty measures the uncertainty inherent to the data (e.g. noisy, inaccurate, or low-quality records and labels, see section 2.1). It cannot be reduced even if more data is collected. However, increasing the quality of inputs (both training data and new unseen samples) would lead to a reduction. Epistemic uncertainty, on the other hand, represents the lack of knowledge of the model itself and is often referred to as model uncertainty. Epistemic uncertainty can stem from data sampling problems (e.g. the training data does not represent the population under study well, or the new input sample is outside the intended population distribution), or from issues related to the model structure (e.g. the model does not interpolate/extrapolate well enough).
Thus, epistemic uncertainty can be reduced by either collecting more data to better sample the problem or by using more appropriate architectures with improved learning abilities (Gal 2016). The two uncertainty types are often combined into the so-called predictive uncertainty (Gal and Ghahramani 2016). For simple models, such as linear regression, the standard error of parameter estimates is directly available and it can be used to compute a confidence interval (typically 95%), which is a classical way to estimate the predictive uncertainty. Unfortunately, for more complex models, with a large number of parameters and nonlinear relationships, such as modern deep neural networks, estimating the predictive uncertainty is not straightforward.
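For the simple linear case, the standard error and the confidence interval follow from closed-form expressions, as the minimal sketch below shows for the slope of a one-variable regression. Two assumptions are made for brevity: the toy data are invented, and the interval uses the normal quantile 1.96, which is only an approximation for small samples, where a Student-t quantile with n - 2 degrees of freedom would be the usual choice.

```python
import math

def fit_with_confidence_interval(xs, ys, z=1.96):
    """Least-squares line with the standard error and an approximate
    95% confidence interval of the slope."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    residual_variance = sse / (n - 2)            # unbiased noise estimate
    slope_se = math.sqrt(residual_variance / sxx)
    interval = (slope - z * slope_se, slope + z * slope_se)
    return slope, intercept, slope_se, interval

# Invented toy data with a weak trend and visible noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 0.0, 1.0]
slope, intercept, slope_se, interval = fit_with_confidence_interval(xs, ys)
```

No comparable closed form exists for a deep network, which is what motivates the approximate methods discussed next.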
Uncertainty quantification for ML/DL is a very active research field, and many different strategies have been proposed in recent years (Gawlikowski et al 2021). One of the traditional approaches is to model uncertainty in a probabilistic way, within a Bayesian framework. Instead of having models that process single point estimates, the idea is to replace them with probability distributions that indicate which values are more likely to happen (Beck and Katafygiotis 1998). In addition to Bayesian methods, another popular and rather simple approach for uncertainty quantification is the use of ensemble methods. We provide a general description of these methods, together with illustrative examples of their application in the medical field. For a detailed description and a full overview of the current state of the art in uncertainty quantification methods we refer to Gawlikowski et al (2021).

Bayesian methods
Inspired by Bayesian theory, Bayesian DL aims to change conventional DL architectures so that each model weight is described by a probability distribution, starting from a prior, instead of a single value (figure 7).
In this way, the model can easily generate an estimation of the uncertainty, since it will produce a (posterior) probability distribution over the output for a given input sample. The challenge in Bayesian DL architectures is that the inference of the model posterior distribution becomes intractable, due to the high computational complexity required to estimate the weight distributions. This is especially true for complex models with a large number of parameters, such as modern deep neural networks. This is the reason why the research community has focused on developing approximate versions of the full Bayesian framework. One of the most popular approaches is to use Monte Carlo Dropout (MCDO) as a Bayesian approximation (Gal and Ghahramani 2016). Dropout is a mechanism initially designed to avoid overfitting during training (Srivastava et al 2014), and it consists in switching off (i.e. dropping) a random fraction of neurons in the network (figure 8). When a neuron is turned off, it is hidden from the network and its output is zero. In MCDO, the neurons that are dropped are sampled from a Bernoulli distribution. Typically, dropout is applied only during training, but when using MCDO as an approximation of Bayesian inference, dropout is also used at testing time. As a consequence, when several (T) predictions are obtained with active MCDO, the T predictions will differ from each other, since they stem from slightly different models, with different sets of neurons that are turned on or off. By performing a sufficient number (T) of predictions, one obtains an approximation of the (posterior) probability distribution of the output. This sample of T predictions is used to compute the mean and standard deviation, the former being equivalent to a pointwise prediction and the latter being a surrogate for the predictive uncertainty.
In addition to the sample standard deviation, mutual information and predictive entropy are other metrics that can be extracted from the T predictions and are commonly used as a surrogate of the predictive uncertainty (Gawlikowski et al 2021).
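The procedure can be sketched with a tiny two-layer classifier evaluated T times with dropout kept active. The network size, the fixed weights, the dropout rate and the inverted-dropout rescaling are all illustrative assumptions; in practice one would simply keep the dropout layers of an existing DL framework switched on at inference time. Mutual information is computed here as the entropy of the mean prediction minus the mean entropy of the individual predictions.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def entropy(probs, eps=1e-12):
    return -sum(p * math.log(p + eps) for p in probs)

def forward_with_dropout(x, w1, w2, drop_rate, rng):
    """One stochastic forward pass: hidden units are dropped with a
    Bernoulli mask, also at test time."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    keep = 1.0 - drop_rate
    hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return softmax(logits)

def mc_dropout_predict(x, w1, w2, drop_rate=0.5, T=100, seed=0):
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, w1, w2, drop_rate, rng) for _ in range(T)]
    n_classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / T for c in range(n_classes)]
    std = [math.sqrt(sum((s[c] - mean[c]) ** 2 for s in samples) / T)
           for c in range(n_classes)]
    predictive_entropy = entropy(mean)
    expected_entropy = sum(entropy(s) for s in samples) / T
    mutual_information = predictive_entropy - expected_entropy
    return mean, std, predictive_entropy, mutual_information

# Hypothetical fixed weights: 2 inputs, 4 hidden units, 2 classes.
w1 = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1], [0.9, 0.4]]
w2 = [[1.0, -1.0, 0.5, 0.2], [-0.5, 1.0, 0.3, -0.4]]
x = [1.0, 2.0]

mean, std, pred_entropy, mutual_info = mc_dropout_predict(x, w1, w2)
```

Setting `drop_rate=0.0` recovers a deterministic network: all T passes coincide, so the standard deviation and the mutual information collapse to zero.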
The advantage of MCDO is that, as soon as dropout layers are included in the architecture of the network, the implementation and computational efforts to obtain the uncertainty are minimal. On the one hand, the architecture of conventional DL models does not need to be modified to apply MCDO at inference time. On the other hand, despite having to perform T predictions, with current DL models inferring within a few seconds, the uncertainty estimation is rather quick (figure 10(a)).
MCDO has become a popular tool to quantify the predictive uncertainty of ML models for medical applications. For instance, Mobiny et al (2019) used MCDO to build a risk-aware ML model to detect skin lesions. The model asked for clinician input when the uncertainty of the prediction was too high, and thus, the clinician-ML workflow reached a much higher accuracy than the (non risk-aware) ML model alone. The same group recently published another study (Mobiny et al 2021) where they used a generalized version of Dropout, DropConnect (Wan et al 2013), to quantify the uncertainty in a CNN trained to segment different organs in abdominal 3D CT scans. They used the mutual information to estimate the epistemic uncertainty, since they were interested in knowing the regions of the data space where the model was uncertain. Also for a segmentation task, in prostate cancer patients, MCDO was used to estimate and visualize the 95% upper and lower confidence bounds for each prediction, which informed the physicians of areas that might require correction. MCDO has also been used for regression tasks, such as to generate an uncertainty map when predicting the dose for radiotherapy in prostate or head and neck patients (Vanginderdeuren et al 2021) (figure 9). As a last example, Nair et al provided an interesting comparison of different uncertainty measures derived from MCDO (predictive variance, MC sample variance, predictive entropy, and mutual information) for segmenting lesions in brain MR images. They illustrate how the different metrics do not highlight the same regions.
Note that recently, several groups have started to go beyond the MCDO approximation and use an approach closer to the full Bayesian framework. In LaBonte et al (2020) and McClure et al (2019), the authors compared MCDO to a CNN where the weights were sampled from a distribution (Blundell et al 2015). In this case, the models learn the parameters of the distributions instead of the weight values. They showed that such models produce better results and more interpretable uncertainty maps, since aleatoric and epistemic uncertainties can be decomposed (Depeweg et al 2018), as presented also for an ischemic stroke lesion segmentation model (Kwon et al 2020).

Ensemble methods
Ensemble methods deploy concurrent models that solve the same problem and compute a prediction based on the individual predictions of the ensemble members (e.g. average, majority voting, etc) (figure 10(b)). Initially, they were developed to improve the performance of ML models, with stronger generalization and stability. They rely on the hypothesis that a group of decision makers tends to provide better decisions than a single one, since they complement each other's weaknesses (Schapire 1989, Sagi and Rokach 2018). Having multiple predictions for the same problem allows ensemble methods to represent the model uncertainty on a prediction in a rather simple way: by evaluating the variation among the individual predictions (e.g. with the standard deviation). Ensemble learning was used successfully in Wickstrom et al (2021) to detect myocardial infarction in echocardiograms by identifying relevant time steps. The drawback of ensemble methods is that they have a higher upfront cost, since multiple models need to be trained individually. However, uncertainty generation at inference time can be as fast as with MCDO. To some extent, MCDO is an ensemble method where all models are subnetworks of a complete neural network.
A popular ensemble learning algorithm is bagging (Bootstrap AGGregatING). Bagging uses random subsets of the training data (sampled with replacement) to build multiple models and averages their results. Apart from the computational cost, ensemble methods involve little technical complexity, and that has motivated their use in different medical applications, often in comparison with Bayesian methods. For instance, the aforementioned examples for dose prediction in radiation therapy (Vanginderdeuren et al 2021) compared bagging against MCDO.
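Bagging and the resulting uncertainty estimate can be sketched in a few lines. The example is deliberately simplistic and rests on assumptions: the ensemble members are one-variable linear fits rather than neural networks, the noisy data are invented, and the spread of the member predictions is summarized with the standard deviation.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares line; falls back to a constant if all x are equal."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def bagging_predict(xs, ys, x_new, n_models=50, seed=0):
    """Train each member on a bootstrap resample (with replacement) and
    return the ensemble mean and the spread of the member predictions."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        indices = [rng.randrange(n) for _ in range(n)]
        slope, intercept = fit_line([xs[i] for i in indices],
                                    [ys[i] for i in indices])
        predictions.append(slope * x_new + intercept)
    mean = sum(predictions) / n_models
    std = math.sqrt(sum((p - mean) ** 2 for p in predictions) / n_models)
    return mean, std

# Invented noisy training data on the interval [0, 1].
xs = [i / 10 for i in range(11)]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05]
ys = [2.0 * x + e for x, e in zip(xs, noise)]

mean_inside, std_inside = bagging_predict(xs, ys, x_new=0.5)
mean_outside, std_outside = bagging_predict(xs, ys, x_new=5.0)
```

The members agree closely inside the training range and diverge when asked to extrapolate far from it, so the spread behaves as a simple indicator of model uncertainty.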

Beyond conventional supervised learning
This manuscript has been entirely focused on supervised learning, which is the most used learning framework so far in medical applications. As previously introduced, supervised learning relies on the availability of a dataset that contains input-output (x, y) pairs, where y is in charge of supervising the model training. In other words, supervised learning requires a set of examples x for which the desired answers y, also called labels or annotations, are known. This entails a strong dependency of the model performance on the quantity and quality of the labels y (see section 2.1). This section presents different learning frameworks that can help reduce this dependency, allowing the model to perform well even with few or low-quality labeled data samples. In addition, we also discuss how some of these learning frameworks can help to overcome the static and task-specific nature of current ML models, improving their generalization capacity (see section 2.2).

Unsupervised learning
A way to reduce the dependency of model performance on the availability of large sets of high-quality labeled data is to shift towards learning frameworks with less supervision. Unsupervised learning deals with data x without output values y and it aims at exploring the features and patterns in the distribution of data in x, such as clusters, modes, and outliers (Bengio et al 2013). It is sometimes known as self-organization, since the learning process is blind and cannot rely on unambiguous supervision. Some techniques of unsupervised learning can help reduce the problems of insufficient data due to the cost of manual annotations, as well as those of inappropriate data due to the quality of the annotations. For instance, cluster labels obtained with unsupervised learning can be adopted as class labels in further supervised learning (Peikari et al 2018). The use of unsupervised learning is still less widespread than that of supervised learning, but many groups are starting to explore fully unsupervised or semi-supervised techniques (i.e. when only a part of the training data contains known outputs) in the medical domain (Raza and Singh 2021). Examples of unsupervised and semi-supervised learning include clustering to identify patterns across patients suffering from Alzheimer's disease (Alashwal et al 2019), or medical image analysis like in Gu et al (2020), where the authors incorporate the local structure of unlabeled data into their random forest algorithm. Examples specific to the radiotherapy domain include the use of unsupervised learning to correct cone beam CT scans for artifacts (Dong et al 2021), or to learn radiomic features that predict treatment response and overall survival of lung cancer patients, among others (Raza and Singh 2021).
Recently, a new variant of unsupervised learning, namely self-supervised learning, has been gaining attention (Lan et al 2019, Taleb et al 2020, Hatamizadeh et al 2021, Jing and Tian 2021). This framework uses unlabeled data but exploits labels that come almost for free, which are intrinsically present in the data and can be extracted from its structure to solve pretext tasks. An example of a pretext task could be rearranging image patches like the pieces of a jigsaw (figure 11). Self-supervision works in two steps, the first aiming at obtaining the supervisory outputs (y) by solving a pretext task, whereas the second uses them to solve the actual task of interest.
Self-supervised algorithms are only starting to be used in medical applications, but a good illustrative example of their potential is the work of Chen et al (2019), who used self-supervision for image classification of 2D fetal ultrasound images, organ localization on abdominal CT images, and segmentation on brain MR images (downstream tasks). Their strategy consisted in modifying the spatial distribution of the images, and training a network to restore the original version in order to learn the contextual information (pretext task).
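The "labels that come almost for free" can be illustrated with the jigsaw pretext task mentioned above. The sketch below only builds one training pair for such a task, under simplifying assumptions: a 2D image stored as nested lists, a regular grid of patches, and the permutation itself used as the supervisory output. The network that would learn to undo the shuffling is omitted, and the function names are hypothetical.

```python
import random

def split_into_patches(image, grid):
    """Cut a 2D image (list of rows) into grid x grid patches, row by row."""
    patch_h, patch_w = len(image) // grid, len(image[0]) // grid
    patches = []
    for g_row in range(grid):
        for g_col in range(grid):
            rows = image[g_row * patch_h:(g_row + 1) * patch_h]
            patches.append([row[g_col * patch_w:(g_col + 1) * patch_w] for row in rows])
    return patches

def make_jigsaw_sample(image, grid=2, rng=None):
    """Return (shuffled patches, permutation). The permutation is the label,
    obtained from the data itself without any manual annotation."""
    rng = rng or random.Random()
    patches = split_into_patches(image, grid)
    permutation = list(range(len(patches)))
    rng.shuffle(permutation)
    shuffled = [patches[i] for i in permutation]
    return shuffled, permutation

def restore_order(shuffled, permutation):
    """What a model solving the pretext task must learn to do."""
    restored = [None] * len(shuffled)
    for position, original_index in enumerate(permutation):
        restored[original_index] = shuffled[position]
    return restored

# Toy 4x4 "image" with pixel values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, grid=2)
shuffled, permutation = make_jigsaw_sample(image, grid=2, rng=random.Random(0))
restored = restore_order(shuffled, permutation)
```

Any number of such pairs can be generated from unlabeled images, which is what makes the pretext task cheap compared with manual annotation.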

Reinforcement learning
Together with supervised and unsupervised learning, reinforcement learning is often considered as the third main learning paradigm. In reinforcement learning, the algorithm simulates an agent that interacts with its environment to perform a certain task over time. During training, the agent takes successive actions to change state and eventually reach a final one, like victory or defeat in a game. After each action towards a new state, the environment can either reward or punish the agent, which then has to predict as well as possible the longer-term consequences of future actions in a trial-and-error fashion. The difficulty of policy making in reinforcement learning is that immediate rewards are not necessarily correlated with later gains. Hence, feedback only partly guides the agent, which learns to act based on either past experiences (exploitation) or new choices (exploration).
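These notions (states, actions, delayed reward, exploration versus exploitation) can be made concrete with tabular Q-learning and an epsilon-greedy policy. This is a generic textbook sketch and not the algorithm of any study cited in this section; the five-state corridor environment, the reward of 1 at the goal and all hyperparameters are assumptions.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)   # 0 = move left, 1 = move right

def step(state, action):
    """Toy corridor: the only reward is obtained on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def choose_action(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)                        # exploration
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])  # exploitation

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 100:
            action = choose_action(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # The target accounts for discounted future value, not only the
            # immediate reward, which is zero almost everywhere here.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state, steps = next_state, steps + 1
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
```

Although moving right yields no immediate reward in most states, the learned values propagate the final reward backwards, so the greedy policy ends up heading to the goal from every state.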
Reinforcement learning has been used in medical imaging to devise and generate specific treatment plans for cancer patients treated with radiation therapy as well as for other diseases (Watts et al 2020). For instance, the study by Zhang et al (2020b) describes a planning bot based on reinforcement learning to systematically address complex dose tradeoffs and achieve high plan quality for stereotactic body radiation of pancreas cancer patients. The authors defined planning actions to represent steps that human planners would commonly implement to address different planning needs, and they derived a reward function based on the physician-assigned constraints, as one would do in clinical practice. In addition, the authors claimed that the training phase of the bot was tractable and reproducible and that the acquired knowledge was considered to be interpretable by humans. This example shows that, in order to define the environment and actions in reinforcement learning algorithms, significant prior and domain-specific knowledge is needed. In exchange, the advantage of reinforcement learning is that it can help humans to explore new actions (e.g. new planning strategies, new treatments) that have not been previously investigated in clinical practice. This is the case of the study by Moreau et al (2021), who explored new radiotherapy dose fractionation schemes based on a tumor growth model. Other applications include image segmentation (Li and Xia 2020, Winkel et al 2020) or reconstruction (Shen et al 2018).

Figure 11. Self-supervised learning workflow with an example of a pretext task where the input is an image whose patches have been mixed up. The aim of this pretext task is to reconstruct the initial image, in the hope that the encoder extracts useful features from the data (inspired by Taleb et al 2020). The knowledge acquired by the trained network on the pretext task is later used to carry out the main, original task.

Active learning
Beyond shifting towards strategies requiring less supervision, another approach to reduce the label workload is active learning (Abdar et al 2021, Budd et al 2021). This learning framework builds upon supervised learning, but starts with a small set of labeled data and later selects the best data to be annotated next for optimal model performance (figure 12). The selection is based on an estimation of the informativeness of each unlabeled data sample. The chosen candidates are labeled by an expert and subsequently added to the training set. Then, the model can be retrained from scratch or fine-tuned using the newly labeled data. In short, active learning is a type of iterative supervised learning where the model requests the most relevant data for optimal performance. As informativeness is not a metric in itself, multiple methods exist to select the samples to be labeled. Most of them are based on uncertainty quantification strategies (section 3.2.3), sometimes combined with other quantities such as representativeness (Huang et al 2014, Du et al 2017). Representativeness is used to select instances that are the most emblematic of the unlabeled dataset and thus contribute to a better coverage of the (patient) data distribution under study. Using only uncertainty as the selection metric can lead to situations where out-of-distribution instances are selected because of their high uncertainty, and these can worsen the model performance once they are included in the training.
In their medical image segmentation framework MedAL, detailed in Smailagic et al (2018), the authors use as metric a combination of an uncertainty measure and the distance between feature descriptors. In Sourati et al (2018), the Fisher information is used to ensure diversity among queried samples.
Once the metric is chosen, unlabeled data can be ranked accordingly. The first active learning algorithms selected the most informative sample or subset and submitted it to human experts for labeling. In Kirsch et al (2019), the authors argue that labeling a whole batch at once is more efficient, as it reduces the frequency of expert intervention. Other methods such as CEAL (Cost-Effective Active Learning) consider that, while human labeling is kept for informative data, the samples about which the network is most certain should be labeled automatically by the model itself.
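One selection step of such a loop can be sketched as follows. The sketch makes several assumptions: informativeness is reduced to the predictive entropy alone (no representativeness or diversity term), the batch goes to the expert in a single query, and a CEAL-like rule pseudo-labels the remaining samples whose entropy falls below a hypothetical threshold. The predicted probabilities are invented.

```python
import math

def predictive_entropy(probs, eps=1e-12):
    return -sum(p * math.log(p + eps) for p in probs)

def select_samples(unlabeled_probs, batch_size, confident_entropy=0.1):
    """Rank unlabeled samples by uncertainty.

    Returns the indices to send to the expert (most uncertain first) and a
    dict {index: pseudo-label} for samples the model is very certain about.
    """
    entropies = [predictive_entropy(p) for p in unlabeled_probs]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    to_expert = ranked[:batch_size]
    pseudo_labels = {
        i: max(range(len(unlabeled_probs[i])), key=lambda c: unlabeled_probs[i][c])
        for i in ranked[batch_size:]
        if entropies[i] < confident_entropy
    }
    return to_expert, pseudo_labels

# Invented class probabilities predicted for four unlabeled samples.
unlabeled_probs = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3], [0.01, 0.99]]
to_expert, pseudo_labels = select_samples(unlabeled_probs, batch_size=1)
```

In a full loop, the expert-labeled and pseudo-labeled samples would be added to the training set, the model retrained or fine-tuned, and the ranking recomputed.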

Transfer learning
Transfer learning (Pan and Yang 2010) reuses part of the architecture and parameter values of a model trained with given data for a certain task (source domain and task), and tunes the model to be applied to different data or a different task (target domain and task). Notice that transfer learning is a high-level, abstract framework that can be applied to any model, regardless of the learning paradigm (i.e. supervised, unsupervised or reinforcement learning). The advantages of transfer learning are twofold. On the one hand, one can solve the target task with very little data (figure 13). On the other hand, learning from little data enables the quick generation of new models that work for different tasks, as well as the efficient update of models that are no longer valid due to a change of the data distribution over time. As a consequence, transfer learning is an excellent technique to overcome to some extent the static and task-specific nature of current ML models, improving the generalization to the same domain (i.e. i.i.d. data) or different domains (i.e. shifted distributions) (section 2.2). The particular use of transfer learning techniques to adapt models to different domains is also known as 'domain adaptation' (Wang and Deng 2018, Guan and Liu 2022). Often, the term multi-task learning is also used, when the goal is to learn multiple tasks (Caruana 1998, Zhang and …). Examples of the use of transfer learning in the radiotherapy field are many. For instance, a radiotherapy dose prediction study reported several planning styles for prostate cancer patients treated with VMAT and demonstrated that, through the use of transfer learning, the models were capable of adapting from one planning style to a new target style. Transfer learning significantly reduced errors for clinical dose metrics on target datasets with limited training data for the target domain, as few as 16 patients (Kandalan et al 2020).
Another study, already discussed in section 2.2, focused on CBCT to CT image conversion for prostate, pancreatic, and cervical cancer patients. They found that the models were not generalizable across different image scanners, due to different characteristics and parameters in the scanners themselves. Significant improvement in the model performance was observed when using transfer learning to adapt to the target data distribution from a different machine (Liang et al 2020). Yet another example is the recent work from Mashayekhi et al (2021), who developed, through the use of transfer learning, a site-agnostic radiotherapy dose distribution prediction ML model. The model can leverage data from any treatment site (e.g. prostate, head and neck) and it only requires a brief fine-tuning with a small dataset to be applied to a new site.
The examples above used labeled data from the target domain. When the labels are not present in the target domain dataset, the problem becomes unsupervised transfer learning, better known as unsupervised domain adaptation (Wilson and Cook 2020, Kouw and Loog 2021). For instance, Perone et al (2019) explored unsupervised domain adaptation for segmentation of MR images. Similarly, Kamnitsas et al (2017) used unsupervised domain adaptation for brain lesion segmentation. Another good example is the study by Brion et al (2021), where the model used unsupervised domain adaptation to leverage a large database of annotated pelvic CTs (source domain) to segment CBCT images (target domain). The target domain database contained CBCT scans that were not annotated. This is extremely useful for actual clinical practice in radiotherapy, where manual segmentation is done on CT images while CBCT scans are typically left unlabeled, since they are used chiefly for repositioning or for visual inspection of the anatomy.
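To make the supervised fine-tuning setting discussed earlier in this section more concrete, the toy sketch below freezes a "source" feature layer and retrains only the final linear head on four target samples. Everything in it (the scalar model, the weights, the target function, the learning rate) is a hypothetical illustration, not the setup of any cited study.

```python
def features(x, frozen_weights):
    """Frozen 'source' layer: ReLU of fixed linear maps of a scalar input."""
    return [max(0.0, w * x + b) for w, b in frozen_weights]

def fine_tune_head(samples, frozen_weights, head, lr=0.05, epochs=500):
    """Fit only the linear head on the target data with plain gradient descent."""
    head = list(head)
    for _ in range(epochs):
        grad = [0.0] * len(head)
        for x, y in samples:
            f = features(x, frozen_weights)
            err = sum(h * fi for h, fi in zip(head, f)) - y
            for i, fi in enumerate(f):
                grad[i] += 2.0 * err * fi / len(samples)
        head = [h - lr * g for h, g in zip(head, grad)]
    return head

# Source model (assumed): features relu(x) and relu(-x), head [1, -1], i.e. y = x.
frozen = [(1.0, 0.0), (-1.0, 0.0)]
source_head = [1.0, -1.0]

# Target task (assumed): y = 2|x|, with only four labeled samples available.
target_samples = [(-2.0, 4.0), (-1.0, 2.0), (1.0, 2.0), (2.0, 4.0)]
target_head = fine_tune_head(target_samples, frozen, source_head)
```

Because the frozen features already encode a useful representation, the small target dataset is enough to recover the new head (both coefficients converge to 2). With real networks, which layers to freeze and for how long is a design choice that depends on how far the target domain is from the source.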

Other trends
Figure 13 (caption fragment). [...] Learning, where knowledge is transferred from another network performing a similar task. The advantage is that the required size of Dataset 2 can be reduced significantly.
With the fast evolution of ML, more and more higher-level learning frameworks, like transfer learning or active learning, get formalized and investigated. They sometimes combine existing learning paradigms (e.g. supervised learning), frameworks (e.g. active learning) and strategies (e.g. prior knowledge incorporation) to solve a specific problem. A good example is the popular few-shot learning regime, where the aim is to learn from a very limited number of samples with specific learning techniques. Humans are very good at recognizing new classes (e.g. a book), even when only one or a few examples of that class have been shown to us. Sometimes, we can even distinguish objects from classes that we have never seen, based on our prior knowledge and the (dis)similarity to other known classes (i.e. zero-shot learning (Lampert et al 2009)). Few-shot learning and its extreme variant, one-shot learning (Fei-Fei et al 2006, Koch et al 2015, Vinyals et al 2016), try to mimic this human learning feature by integrating prior knowledge into ML models. Few-shot learning is often referred to as a type of meta-learning, a concept that defines algorithms that 'learn to learn', i.e. algorithms that are able to learn from multiple tasks and extrapolate the acquired knowledge to carry out new tasks (Seita D 2017, Finn et al 2017). [...], where the goal is to find a solution given only a few state-action pairs. Several strategies have been proposed to efficiently include prior knowledge into ML models; some of them have already been described in detail in section 3.2.2. For a complete review of all possible strategies to incorporate prior knowledge in the context of few-shot learning we refer to a recent survey by Wang et al (2021), who identified three main categories: 1) data, using prior knowledge to augment the data from few to many samples (e.g. 
data augmentation or transfer learning); 2) model, using prior knowledge to reduce the size of the optimization search space; and 3) algorithm, using prior knowledge to alter the search strategy to learn efficiently from few samples. Examples of recent applications of few-shot learning in the medical domain include the study by Medela et al (2019), who reduced the need for labeled data in the diagnosis of histopathological images. They used a popular few-shot learning model, namely Siamese networks (Koch et al 2015), which distinguish the different classes by ranking the similarity between input images. Other examples include the use of few-shot learning for deformable image registration and motion tracking in 4DCTs (Fechter and Baltas 2020, Chi et al 2022). In addition to few-shot and one-shot learning, zero-shot learning studies are also becoming popular (Palatucci et al 2009, Socher et al 2013, Changpinyo et al 2016, Wang et al 2019c), where the aim is to build an ML model that is able to generalize to totally unseen domains. Zero-shot learning can be considered as an extreme subfield of transfer learning. Techniques to solve zero-shot learning problems range from simple ones such as data augmentation (Xu et al 2016) to more sophisticated ones (Wang et al 2019c). Although the concept of zero-shot learning is not much investigated yet in the medical domain, a recent example of its application is the study by Paul et al (2021), which presented a zero-shot learning algorithm to diagnose chest x-ray images.
Besides few- to zero-shot learning regimes, other recent or trendy concepts are worth mentioning. One of them is continuous learning (Parisi et al 2019, Lee 2020, Pianykh et al 2020), where the goal is to build ML models that are not static, meaning that they can adapt to a slowly changing data distribution over time and to their ever-changing environments. Continuous learning can serve to prevent catastrophic forgetting, which occurs when ML models forget the data previously seen during training, leading to overall reduced performance. Another example is the Self-Net described in Mandivarapu et al (2020), which uses an autoencoder to learn a set of low-dimensional representations of the weights learned for different tasks. An example of continuous learning in the medical domain is the study by Kiyasseh et al (2021), whose model learned to deal with cardiac signals across diseases, time, modalities, and institutions. As the models become more and more used in the clinical setting, developing stable continuous learning methods will become essential for the long-term viability of the models. Notice that continuous learning can also be considered as a meta-learning framework where the ML model learns to learn over time and environmental changes.
Other emerging learning frameworks include automatic machine learning (AutoML) and federated learning. AutoML tries to build ML methods that automatically configure themselves, including data preprocessing, network architecture selection, training, and post-processing for any new task (Hutter et al 2019). The idea behind AutoML is to automate the trial-and-error process that data scientists and practitioners typically carry out manually to find the optimal pre-processing steps and hyperparameters of the ML architecture. A recent example of AutoML in the medical field is the increasingly popular nnU-Net, an AutoML model for segmenting organs from any kind of medical image (Isensee et al 2021).
Federated learning, also known as distributed learning (Boyd 2010), allows ML models to be trained with datasets of several origins (e.g. hospitals or clinics) without pooling them. As it can maintain patient data confidentiality, federated learning raises much interest in the medical domain (Chang et al 2018, Sheller et al 2019). Instead of bringing all data to a central repository to train an ML model, distributed learning brings the model to the data. This approach facilitates cooperation through coalitions in which each member retains control and responsibility over its own data, including accountability for privacy and consent of the data owners (i.e. patients). Federated learning can also help ML models to better generalize, since they are exposed to training data from different hospitals, better encoding the variability of the problem.
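The idea of bringing the model to the data can be illustrated with a toy sketch. The text above does not specify an aggregation rule; the sketch assumes a weighted average of locally trained parameters (often called federated averaging), a one-parameter model, and hypothetical site names and data.

```python
def local_update(weight, data, lr=0.1, steps=20):
    """Gradient descent on one site's private data for the model y = w * x."""
    w = weight
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, sites):
    """One round: every site trains locally; only the weights leave the site
    and are averaged, weighted by the local number of samples."""
    total = sum(len(data) for data in sites.values())
    return sum(local_update(global_w, data) * len(data) for data in sites.values()) / total

# Hypothetical private datasets held by two hospitals (never pooled).
sites = {
    "hospital_a": [(1.0, 2.0), (2.0, 4.0)],
    "hospital_b": [(1.0, 2.0), (3.0, 6.0), (0.5, 1.0)],
}

w = 0.0
for _ in range(3):
    w = federated_round(w, sites)
```

Only the scalar weight travels between the sites and the coordinator, never the patient data. Real deployments add further layers that this sketch ignores, such as secure communication and handling of sites with very different data distributions.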
To conclude this section, figure 14 summarizes some of the issues that have been discussed above, as well as some of the possible solutions (strategies, tools, and frameworks) to mitigate them. As can be seen, there is no one-to-one mapping between issues and solutions, and practitioners often need some experience to identify the best associations.

Discussion: clinical implementation of ML in radiation oncology-the big picture
The previous sections have provided the reader with a general background about the risks and limitations associated with the use of ML in the clinical environment (section 2), and the different techniques that are being investigated by the research community to better identify and overcome those issues (section 3). This section discusses the specific application of ML techniques into the radiation oncology workflow and the implications this has for the clinical practice of this field. First, we start by walking the reader through the radiotherapy workflow, and discuss in detail key tasks that are undergoing a paradigm shift with the introduction of ML. Second, as clinical software is most of the time provided by industrial companies or vendors and implemented in close collaboration with them, we discuss the vendors' approach and point of view regarding the clinical use of ML.

Considerations on the radiation therapy workflow
The typical workflow of radiation oncology can be summarized as the sequence of tasks presented in figure 15. The inclusion of ML in the workflow aims at reducing human intervention, automating the tasks, standardizing clinical practice, and improving the overall treatment quality. As previously introduced, the gap in expertise and resources between institutions is sometimes quite large, representing one of the greatest inequalities and challenges in health care (Lievens et al 2020). Incorporating ML in the radiotherapy workflow can help to homogenize and improve clinical practice.
Historically, classical ML and image processing techniques (active contours, watersheds, multi-atlas registration, etc.) have long been used in an attempt to automate tedious, manual, and time-consuming tasks in the radiotherapy workflow. However, they often still required manual intervention and lacked some form of intelligence and memory. The disruptive change occurred with the advent of modern DL models, i.e. CNNs and image-to-image architectures like U-Net (Ronneberger et al 2015). Although much less interpretable than the aforementioned classical methods, DL models are now the state of the art.
Figure 14. Tentative mapping between common issues that are encountered in the implementation and deployment of machine learning and some possible solutions to overcome them. Since machine learning relies on data-driven models, both the issues and solutions can be seen from the angles of data, modeling, or learning.
Cancer diagnosis and treatment choice is the first step in the presented workflow, and involves the analysis of different types of data: medical records, patient's symptoms, raw images, histopathological data, genomic data, etc. Processing these large amounts of heterogeneous data is becoming a challenge for humans and, thus, the inclusion of intelligent systems for decision support might be of great help. Diagnosis is one of the earliest applications of ML in oncology, and the first studies date from the mid-1990s and early 2000s (Bertsimas and Wiberg 2020), when traditional ML models were used to analyze gene expression profiles and detect cancer biomarkers, or to analyze images to detect features indicating the presence of cancer (Wolberg et al 1995). Two of the first cancer types on which the research community focused were skin and breast cancer. Today, though, a wide range of cancer types and locations benefit from the use of ML as a decision support tool (Bertsimas and Wiberg 2020, Iqbal et al 2021, Kleppe et al 2021). In addition to diagnosis, numerous studies focus on predicting radiation toxicity and possible side effects, in order to help the physician select the best treatment protocol (Isaksson et al 2020, Tran et al 2021). While the earliest applications for diagnosis and treatment choice focused on one type of data (e.g. genomics or images), current ML models have the potential to process several types of data simultaneously by fusing the information at different parts of the model architecture (see section 3.2.2), therefore making a better informed diagnosis. Progressively, ML models for diagnosis and treatment choice start to be applied in clinical routine (Benjamens et al 2020, Savage 2020); some claim that the ML model rivals or even outperforms human experts (Esteva et al 2017). However, there are still very few ML applications developed in research environments that have made it to the clinic, due to poor generalization or the inability to guarantee the correctness of the answer. To overcome those issues, several solutions have been proposed in this manuscript, which are in line with the recent literature on ML applied to diagnosis. 
For instance, Kleppe et al (2021) advocate the evaluation of the ML model in external cohorts, which could also be approximated with extensive data augmentation techniques when external cohorts are not available, as presented in section 3.2.1. Uncertainty quantification (see section 3.2.3) is another key recommendation for diagnosis and decision-making ML models (Begoli et al 2019), and it can be combined with techniques for explainability (see section 3.1) to ensure that there is no learning bias when building the models. Lastly, reinforcement learning algorithms can help to explore new and personalized treatments in a well-controlled framework.
After the treatment protocol has been selected, the second step is to image the patient with a CT scanner, in a controlled setting, simulating the treatment position and with proper immobilization devices to avoid motion. Eventually, other images needed for the treatment can also be acquired (e.g. MR, PET, etc.), if they were not already taken in the diagnosis step. Given the excellent performance of modern deep CNNs to analyze and deal with images, many applications have been developed related to this imaging step (Shan et al 2020). For instance, ML is used in image reconstruction (Ahishakiye et al 2021), to increase the quality of the image by removing artifacts (Xie et al 2018, Dong et al 2021), or to register the different acquired images (Fu et al 2020, Haskins et al 2020). Particularly for image registration, the interest has rapidly increased in recent years, with numerous publications investigating some of the most advanced ML techniques, such as one-shot learning, unsupervised learning (Balakrishnan et al 2018), or reinforcement learning (Hu et al 2021a), among others. In short, ML models for image registration try either to learn feature maps for the input moving and fixed images, or to learn new image representations for the original fixed and moving images (e.g. transform the original images to be better suited for registration) (Fu et al 2020). The use of unsupervised techniques is very helpful for ML registration models, because it removes the need for ground truth deformation fields, which are costly to generate. Another direction to improve future ML models for registration is to boost their performance by incorporating prior knowledge (see section 3.2.2). For instance, prior information related to the expected type of deformation, the spatial relationship between anatomical structures, and the topology and morphology of anatomical structures could be added to allow the ML model to perform better (Fu et al 2020).
Another popular task, related both to the imaging step and to the treatment delivery, is the conversion or generation of synthetic images. Since the attenuation coefficients in the CT image are needed to perform treatment planning and dose calculation, techniques relying on other images, such as adaptive therapy based on CBCTs or MR-only radiotherapy, largely benefit from the use of ML to generate a synthetic CT. Image synthesis is thus considered the third most popular clinical application. Numerous examples of image synthesis have been given throughout the manuscript, such as the use of GANs to convert MR to CT (Maspero et al 2018, Kazemifar et al 2019) or CBCT to CT (Liang et al 2019b). A common concern in this field is the generalization of the ML model to different scanners and acquisition protocols (see section 2.2), and much effort has been put into addressing this issue with different techniques, such as transfer learning (Liang et al 2020) or data augmentation, among others. Besides generalization, a future research line could be the development of interpretability and explainability techniques for image synthesis. However, this is not straightforward since, in contrast to classification and segmentation tasks, the network will not focus on specific parts of the images but rather on the full image to be converted. In this case, CAV could be of help, in order to provide the user with the most relevant concepts for the transformation, for verification purposes (Lucieri et al 2020, Kim et al 2018). In contrast, the risk of failure could be easily assessed with uncertainty quantification tools as described in section 3.2.3.
Once the images are acquired, the following step is to contour or segment the relevant volumes needed for treatment planning. In particular, the segmentation of most organs from CT images is nowadays considered a largely solved problem, with the latest works reporting an accuracy similar to human experts' performance. For instance, Nikolov et al (2018) achieved a Dice coefficient over 90% for most organs in the head and neck region. Motivated by these results, many research groups and clinical teams have already attempted a clinical implementation of ML-based automatic segmentation, using either in-house or commercial solutions (van der Veen et al 2019, Brouwer et al 2020, Cha et al 2021). In fact, automatic contouring is today the most used ML application in the clinic and, therefore, we will discuss it in detail in the following paragraphs. In a survey from 2020, Brouwer et al reported that 26% of the respondents were already using ML-based contouring in their clinics (most of them with commercial software, 76%), and nearly 20% were preparing for its implementation. However, despite this large adoption in the clinic, the current ML methods for automatic segmentation still lack QA tools to assess their interpretability and risk of failure. This is today compensated in quite a rudimentary way: the QA of the ML contours is performed by visual inspection by a medical expert, who edits the contour in the regions where the ML model has failed. Although the time and magnitude of the edits are much smaller than for fully manual segmentation (for instance, about 33% shorter for head and neck contours) (van der Veen et al 2019), the process still requires the systematic presence of a medical doctor for QA. When generating the contours offline, before the treatment starts, this can be manageable. 
But in adaptive radiotherapy workflows, where new contours have to be generated while the patient is on the couch, requiring the presence of a physician for every treatment fraction is truly a big limitation. Hence, it is imperative that clinical ML models start to integrate QA tools similar to those presented in section 3, in order to ensure their efficient and safe usage. Applying interpretability and explainability techniques during the training and validation phase, in particular those for visualization of the relevant regions contributing to the prediction (e.g. CAM, Grad-CAM and variants), can help debug the model faster and ensure it works correctly. In contrast, during routine clinical use, especially in adaptive settings, interpretability and explainability techniques might not be the best QA tools due to the tight time constraints. As introduced in section 3.1, interpretability is a complex concept, involving many factors, and both time and user expertise play important roles. Unless very intuitive explanations can be provided for a fast evaluation by the medical staff, when time is crucial (i.e. adaptive treatment procedures), uncertainty quantification tools might be a better option. For instance, one could implement a flagging system based on the level of uncertainty associated with the prediction. When uncertainty is low, the treatment can be performed right away with the ML contours, without the need for edits by the medical doctors. When uncertainty is high, the doctors are asked to verify (and edit) the ML contours offline and, eventually, they are provided with explanations that support the ML answer. Such a workflow can save much time for the medical staff and, most importantly, it relieves users of constant QA. Moreover, the manual offline edits can later serve to improve the ML model if an active learning framework is deployed (section 3.2.4).
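A flagging system of the kind described above could take the following minimal form. The structure names, the scalar uncertainty scores, and the threshold are hypothetical; how to reduce a voxel-wise uncertainty map to one score per structure, and where to set the threshold, would need clinical validation.

```python
from dataclasses import dataclass

@dataclass
class ContourDecision:
    structure: str
    uncertainty: float
    action: str  # "use_online" or "review_offline"

def triage_contours(uncertainties, threshold):
    """Flag each automatic contour according to its scalar uncertainty score."""
    decisions = []
    for structure, score in uncertainties.items():
        action = "use_online" if score <= threshold else "review_offline"
        decisions.append(ContourDecision(structure, score, action))
    return decisions

# Hypothetical per-structure uncertainty scores for one treatment fraction.
scores = {"bladder": 0.04, "rectum": 0.31, "prostate": 0.12}
decisions = triage_contours(scores, threshold=0.15)
review_queue = [d.structure for d in decisions if d.action == "review_offline"]
```

Contours in the review queue are the natural candidates for expert editing, and the edited versions could later feed an active learning loop.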
It is important to stress that, as in any classification task, the vector of class (organ) probabilities that the model yields for each voxel (i.e. the softmax output) is not a measure of uncertainty, but just a pointwise estimate of a class probability (Gal 2016). Indeed, this probability is often misinterpreted as an uncertainty, which can be misleading and risky. Voxels classified with a high probability can still carry a high uncertainty, especially for cases that are far from the training set (Gal 2016). Instead, techniques such as MCDO or other Bayesian approaches, as well as ensemble methods can be used to estimate the uncertainty and an associated confidence interval (figure 16).
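The difference between a single softmax output and an uncertainty estimate can be illustrated with a small sketch. It assumes that several stochastic forward passes (e.g. MCDO samples or ensemble members) are already available for one voxel; the probability values are made up for illustration.

```python
import math
from statistics import mean, pstdev

def summarize_passes(passes):
    """Summarize T stochastic forward passes for one voxel.

    passes: list of class-probability vectors, one per pass.
    Returns the mean probabilities, the predictive entropy of that mean,
    and the standard deviation of the winning class across passes.
    """
    n_classes = len(passes[0])
    mean_probs = [mean([p[c] for p in passes]) for c in range(n_classes)]
    entropy = -sum(q * math.log(q) for q in mean_probs if q > 0.0)
    winner = max(range(n_classes), key=mean_probs.__getitem__)
    spread = pstdev([p[winner] for p in passes])
    return mean_probs, entropy, spread

# Voxel 1: the passes agree with each other.
stable = [[0.95, 0.05], [0.94, 0.06], [0.96, 0.04]]
# Voxel 2: the first pass alone looks very confident (0.99), yet the passes disagree.
unstable = [[0.99, 0.01], [0.30, 0.70], [0.97, 0.03]]

_, entropy_stable, spread_stable = summarize_passes(stable)
_, entropy_unstable, spread_unstable = summarize_passes(unstable)
```

For the second voxel, a single pass would report a probability of 0.99, while the spread across passes reveals that the prediction is unstable. This is the situation in which reading the softmax output as an uncertainty would be misleading.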
Concerning the segmentation models for target volumes, there is still much room for improvement to have robust and accurate models. In contrast to the segmentation of organs, which can work rather well by just using the anatomical information in the images, the segmentation of target volumes involves many other variables. For instance, information from several imaging modalities is often used by the physicians to draw the clinical target volumes (e.g. MR, PET, endoscopy, etc.), together with indicators or reports from clinical examinations. In order to reach human-level performance, ML models for target segmentation need to integrate this information and domain knowledge, using the techniques presented in section 3.2.3. In addition, interpretability and explainability tools can be of much more importance here than in the case of organ segmentation, since QA cannot be done with a simple visual check due to the large number of variables involved. Apart from visual explanations like CAM and variants, text-based explanations relying on CAV (section 3.1.2) could serve as a QA tool for the provided contours.
Another strategy that can help to obtain efficient segmentation ML models that bring real added value to the clinic is the use of techniques requiring less supervision (see section 3.2.4). This is especially important for image modalities used in adaptive settings (e.g. Cone Beam CT or MR), since retrospective databases of contours on these images are typically unavailable (i.e. the contours are done on the CT but not on the CBCT or MR). In this case, one could apply techniques such as unsupervised domain adaptation (UDA) (Ganin and Lempitsky 2015, Kamnitsas et al 2017, Brion et al 2021). UDA is a sort of unsupervised transfer learning strategy, where the modality for which the labels are available is considered the source domain (e.g. CT), and the modality without labels is the target domain (e.g. CBCT or MR). In all cases, if done properly, the introduction of ML segmentation in the clinic will definitely bring an improvement and standardization of the practice. Instead of having paper guidelines ([...]) [...] from one center to another, reducing the inter-observer variability.
After volume segmentation, the next labor-intensive step in the workflow is treatment planning and the optimization of the dose distribution. Although current TPSs heavily rely on inverse problem solving and iterative computerized optimization, the definition of the objectives to attain and the constraints to fulfill is often difficult and requires trade-offs that are not easy to formalize mathematically. Once again, ML can learn from past examples of such trade-offs and generalize to new patient cases.
One of the first approaches for automatic treatment planning with ML used Random Forest algorithms in combination with multiple atlases (Contextual Atlas Regression Forests) to predict the dose distribution for a new patient based on the information in the atlas (McIntosh and Purdie 2016, McIntosh et al 2017). Almost in parallel, several groups started to explore DL image-to-image networks (like U-Net or GANs) to predict the dose for a new patient anatomy using the CT and organs as input (Fan et al 2019, Kearney et al 2020). Needless to say, none of these approaches is very interpretable, but the one based on multi-atlas Random Forests can implicitly report atlas distances representing the most similar patients from the training set. Recently, this approach has been implemented clinically and analyzed prospectively (McIntosh et al 2021). The authors reported that these distance metrics could indeed be used to flag lower-quality generated dose distributions and the potential need for human verification, increasing the interpretability and usability of the method. For dose prediction methods based on U-Net or GAN architectures, there is no intrinsic attribute that could provide similar information. However, recent studies have explored the use of MCDO and ensemble methods (section 3.2.3) to quantify the uncertainty associated with the predicted dose ([...], Vanginderdeuren et al 2021). As previously introduced, this uncertainty estimation can be used in a similar way to flag the poor performance of the model, as well as in active learning workflows to further improve the model. These studies reported the correlation coefficient between the estimated uncertainty (using the standard deviation, see section 3.2.3) and the actual prediction error (difference between ground truth and predicted dose). 
However, there is still room for improvement in order to achieve accurate metrics for uncertainty quantification in dose prediction, since the reported correlation coefficients were sometimes very low ([...], Vanginderdeuren et al 2021).
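The evaluation described above, i.e. correlating the estimated uncertainty with the actual prediction error, can be sketched as follows. The sketch assumes that the uncertainty is the voxel-wise standard deviation over several predictions (e.g. MCDO samples or ensemble members) and that the error is the absolute difference between mean prediction and ground truth; the dose values are invented and the cited studies may differ in the details.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def uncertainty_error_correlation(dose_samples, true_dose):
    """Correlate voxel-wise uncertainty (std over samples) with absolute error.

    dose_samples: list of predicted dose vectors, one per stochastic pass.
    true_dose:    ground-truth dose vector of the same length.
    """
    t = len(dose_samples)
    n_vox = len(true_dose)
    mean_dose = [sum(s[v] for s in dose_samples) / t for v in range(n_vox)]
    std = [
        math.sqrt(sum((s[v] - mean_dose[v]) ** 2 for s in dose_samples) / t)
        for v in range(n_vox)
    ]
    abs_err = [abs(m - d) for m, d in zip(mean_dose, true_dose)]
    return pearson(std, abs_err)

# Hypothetical doses (Gy) in four voxels, predicted by three stochastic passes.
samples = [
    [60.0, 30.0, 10.0, 2.0],
    [61.0, 34.0, 10.5, 2.0],
    [59.0, 26.0, 9.5, 2.0],
]
truth = [60.5, 33.0, 10.2, 2.0]
r = uncertainty_error_correlation(samples, truth)
```

In this constructed example the correlation is close to 1 because the largest spread coincides with the largest error. A low coefficient on real data would indicate that the uncertainty estimate is a poor proxy for the error and should not be used for flagging without further calibration.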
In addition to risk assessment tools, two other lines of research are worth mentioning in the race for efficient and clinically meaningful ML models for dose prediction. Similar to ML models for segmentation, when properly implemented, ML-based dose prediction can bring significant improvement for clinical practice. For instance, since ML models can produce a prediction in a few seconds, one can predict dose distributions for different treatment modalities (e.g. proton therapy versus conventional radiotherapy), in order to refer the patient to the most suitable treatment (Guerreiro et al 2021). This allows for huge time savings and efficient resource usage.
To exploit the predicted dose distribution and to generate the final treatment plan, several options are possible. The most popular one is to use the predicted dose as a voxel-wise objective in the TPS, alone or in combination with dose-volume metrics. This optimization process is often called dose-mimicking, and translates the synthetic predicted dose into a physically deliverable dose (McIntosh et al 2017, Babier et al 2020). The process is the same as in regular treatment planning, using algorithms similar to gradient descent for optimization and analytical or Monte Carlo methods for dose calculation. However, some groups have pushed the use of DL even further, trying to predict the treatment plan (i.e. the machine parameters or fluence maps) from the predicted dose distribution (Wang et al 2020b). Although these research studies are excellent to explore the potential of DL models in the radiotherapy workflow, we should be extremely cautious when it comes to clinical implementation. Indeed, a distinction should be made between soft computing (e.g. ML models) and scientific computing (e.g. physics-based models, analytical models): the former does not provide any strong guarantee of consistently good performance, whereas the latter does. DL models are excellent methods to be applied when we want to be fast, automatic, and free of any human intervention, like in segmentation or treatment planning. However, when fast and automatic scientific computing models already exist for a given task (e.g. optimization or dose calculation) and soft computing does not bring any significant improvement in performance, scientific computing and physical models should be encouraged. More recently, physics-informed ML models (Raissi et al 2019) (section 3.2.2) have been proposed as a possible sweet spot between these two options. 
These ML models have the particularity of being constrained with physical rules, and could help to extract the best from soft- and scientific-computing methods, while also increasing their interpretability (Rudin et al 2021).
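The dose-mimicking step mentioned above, where the predicted dose serves as a voxel-wise objective, can be illustrated with a toy problem. The sketch assumes that the dose is a linear function of non-negative beamlet weights through a small, made-up influence matrix, and uses projected gradient descent; a clinical TPS relies on far more elaborate objectives, constraints, and dose engines.

```python
def dose(influence, x):
    """Dose per voxel: influence matrix (voxels x beamlets) times beamlet weights."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in influence]

def mimic_dose(influence, target, lr=0.1, iters=2000):
    """Projected gradient descent on sum_v (dose_v - target_v)^2 subject to x >= 0."""
    x = [0.0] * len(influence[0])
    for _ in range(iters):
        # The residual is computed once per iteration, so all weights are
        # updated with the gradient evaluated at the same point.
        residual = [d - t for d, t in zip(dose(influence, x), target)]
        for j in range(len(x)):
            grad_j = 2.0 * sum(influence[v][j] * residual[v] for v in range(len(target)))
            x[j] = max(0.0, x[j] - lr * grad_j)  # keep beamlet weights non-negative
    return x

# Hypothetical influence matrix: 3 voxels, 2 beamlets.
influence = [
    [1.0, 0.2],
    [0.3, 1.0],
    [0.1, 0.1],
]
# Hypothetical predicted dose per voxel, used as the voxel-wise objective.
predicted = [2.2, 1.6, 0.3]
weights = mimic_dose(influence, predicted)
```

In this example the predicted dose happens to be exactly deliverable, so the recovered weights reproduce it. When the prediction is not physically achievable, the same procedure returns the deliverable dose closest to it in the least-squares sense, which is the purpose of the mimicking step.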
Once the treatment is ready for delivery, the next step is to perform QA tests to ensure that the treatment is delivered as planned. ML applications in radiotherapy QA started to become popular around 2016, and many relevant studies have been published since then ([...], Kalet et al 2020). Some examples include the study by Li and Chan (2017), who developed a model to predict the performance of a Linac over time; the study by Osman et al (2020), who trained a model with log files to predict the positional deviations of the multi-leaf collimator leaves; or the study by Valdes et al (2016b), who designed an ML model to predict passing rates for IMRT QA. Although the use of ML in radiotherapy QA might be very beneficial for the medical physics team, further automating and improving the QA process, the models developed so far have several limitations. Kalet et al (2020) claim that data quality and model generalization are among the main limitations. As discussed in section 2, low-quality and insufficient data might lead to biased performance of the ML model. In order to overcome this issue, [...] advocates for multi-institutional validation of the developed ML models. In this context, federated learning might help to gather data from several institutions while preserving privacy and security. During the delivery of the treatment, several of the previously discussed tasks come again into play. For instance, in adaptive or image-guided radiotherapy, we use daily images to monitor the treatment and eventually adapt it to the new anatomy. In this context, ML models for image synthesis or conversion become useful when the monitoring image (e.g. CBCT) needs to be converted into a CT. Similarly, ML models for image registration, automatic segmentation, and treatment planning are useful to generate the adapted plan in a fast and automatic manner. Besides these applications, another task that can benefit from ML and has not been discussed so far is motion management. 
For instance, Lin et al (2019) developed an ML model to predict tumor motion by combining features coming from images and Electronic Health Records.
After the treatment has been delivered, the final step is to follow up the progression of the disease and the possible treatment complications. For this purpose, the patient has regular consultations every few months, where the patient's condition is analyzed and images are acquired if needed. Treatment outcome prediction can be of help at two time frames: 1) at the beginning of the treatment, to aid the treatment choice; and 2) at the end of the treatment, to predict the locoregional control and survival probabilities for a given patient. For instance, recent studies used ML to predict the treatment response for bladder cancer (Cha et [...]). In their topical review, Isaksson et al (2020) claim that, since outcome prediction models play an important role in treatment choice, critical efforts are required to improve their transparency, making them accessible to the clinical staff, who have little or no specific background in ML. Interpretability and explainability techniques such as the ones presented in section 3.1 could definitely help to reach this goal. Recently, [...] has published a review about popular applications in outcome prediction, discussing in detail the balance between interpretability and accuracy, and providing techniques to find the optimal settings for their safe clinical implementation.
To finish, we would like to bring up our point of view on how the clinical workflow will change with the introduction of ML. Although the implementation of the techniques for interpretability and risk assessment presented here (i.e. data curation, uncertainty quantification, domain knowledge, …) will reduce the need for human QA, the latter will remain very important. Thus, the work of physicians, medical physicists and dosimetrists will evolve from performing manual tasks to supervising ML models (Korreman et al 2021). Moreover, the medical staff will play a crucial role in data collection and curation to build ML models. The already multidisciplinary nature of this field will become even more important, since it will be the key to achieving comprehensive ML models that efficiently incorporate relevant domain knowledge. Note that the need for interpretable and safe ML models is also starting to be discussed in legal environments and regulatory institutions both in Europe and America (Anon 2021, Bibal et al 2021). A famous example is the General Data Protection Regulation (GDPR), which specifically constrains the use of black box models in certain cases.

The vendors' perspective
As previously introduced, a large majority of clinically implemented ML software comes from industrial companies. Thus, the vendors play a crucial role in an efficient and safe deployment, since they are responsible for the released models. In the following, we go through the different phases of the clinical implementation of ML models from the vendor's perspective.

Model development and commissioning
Developing an ML model includes many steps: data collection and curation, model training, model configuration, and validation (figure 17). As vendors are responsible for the released ML models for their entire life-cycle, all these steps need to be managed and documented by them, not least for the regulatory processes.
In particular, the data included in model development need to be accessible to the vendor for future support, model upgrades and regression testing. Often, the data collected, whether from public sources, from clinics, or from both, need to be curated to align with the selected guidelines and protocols and to fit the purpose of model development (section 3.2.1). Ideally, vendors and clinics should agree and align on the interpretation of guidelines and protocols as part of the data curation. Metadata for the datasets also need to be documented, such as versioning, data sources, data creator, protocol, and more. Vendors should strive to use datasets from multiple sources in model development to increase model robustness, for instance by using data from multiple continents.
Moreover, it is critical to keep track of the training, validation, and test datasets, as well as data augmentation tools (section 3.2.1) and hyperparameters, in order to be able to re-train or further develop the model. The training, including infrastructure and computational resources, should be handled by the vendor.
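As a minimal sketch of how such bookkeeping could look, the following Python snippet records the datasets, augmentations, and hyperparameters of one training run and derives a fingerprint from them, so that a later re-training can be checked against the same configuration. All class and field names (DatasetRecord, TrainingManifest, fingerprint) are hypothetical and chosen for illustration only; they do not correspond to any vendor's actual tooling, and the sketch assumes that hyperparameter values are JSON-serializable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """Metadata for one dataset version (illustrative field names)."""
    name: str
    version: str
    source: str      # e.g. a public repository or a contributing clinic
    creator: str
    protocol: str    # guideline or protocol the data were curated against
    split: str       # "train", "validation", or "test"


@dataclass(frozen=True)
class TrainingManifest:
    """Everything needed to identify one training run (hypothetical)."""
    model_name: str
    datasets: Tuple[DatasetRecord, ...]
    augmentations: Tuple[str, ...]
    hyperparameters: Dict[str, Any] = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Serialize with sorted keys so that the hash does not depend on
        # the order in which the hyperparameters were written down.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two manifests with the same content yield the same fingerprint, whereas any change in a dataset version, an augmentation, or a hyperparameter yields a different one, which is one simple way to detect that a re-trained model no longer matches the documented configuration.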
After the training process, the model needs to be properly validated (figure 17) on independent, representative, and diversified data to make sure the model is fit for purpose, and to identify the use cases and the limitations of the model (model scope). The resulting validation report can include a model data sheet specifying the training and validation details, as well as the intended use and limitations of the model (figure 18). Such data sheets should always accompany the released ML model when distributed to clinics, which will allow the clinical users to apply the model to relevant cases and reduce the risk of misuse.
Figure 17. Model development process.
Figure 18. Example of data sheet for a released ML model for automatic planning of radiotherapy treatments for prostate cancer patients. The data sheet contains relevant information about the ML model, including the general overview and scope, as well as information about the training and validation phases. This data sheet should be provided by the vendor together with the model.
When a clinic goes live with a released ML model, it needs to commission the model on its local data and use case. For instance, the commissioning of a validated DL segmentation model involves evaluating the model output on image sets and structure sets from the clinic, taking the intended use of the model into account. The commissioning resembles the validation process, and it may involve configuration of settings affecting the postprocessing of the model output to align the commissioned model with the clinical use case, scope, and specific treatment protocol (figure 19). Notice that the released model itself, e.g. the optimized neural network parameters, is not affected by such a model configuration. The vendor should support the clinic with the commissioning process. After that, the model is locked and no settings affecting the output of the model can be changed. Although active and continuous learning workflows (section 3.2.4) are very attractive, implementing them after commissioning is rather complex, due to the risks associated with model changes (Liu et al 2020, Vokinger et al 2021). Thus, they are better suited to be applied during training, when changes in the model are still possible. If the model is no longer valid because of changes in the data distribution over time (section 2.1), re-training or model re-configuration could be performed, which would trigger a new commissioning.
The commissioning results, which are specific to the clinic, must be stored for future reference and should ideally be shared with the vendor.
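One possible way to notice a change in the data distribution is to compare a scalar feature of recent cases with the same feature stored at commissioning time. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions) with NumPy. The choice of this statistic, the function names, and the threshold of 0.2 are assumptions made for illustration; the text above does not prescribe any particular drift test, and a clinically meaningful threshold would have to be established during commissioning.

```python
import numpy as np


def ks_statistic(reference, recent) -> float:
    """Largest absolute difference between the two empirical CDFs."""
    ref = np.sort(np.asarray(reference, dtype=float))
    rec = np.sort(np.asarray(recent, dtype=float))
    if ref.size == 0 or rec.size == 0:
        raise ValueError("both samples must be non-empty")
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([ref, rec])
    cdf_ref = np.searchsorted(ref, grid, side="right") / ref.size
    cdf_rec = np.searchsorted(rec, grid, side="right") / rec.size
    return float(np.max(np.abs(cdf_ref - cdf_rec)))


def drift_suspected(reference, recent, threshold: float = 0.2) -> bool:
    """Flag a possible distribution change (threshold is arbitrary here)."""
    return ks_statistic(reference, recent) > threshold
```

A flag raised by such a check would not by itself invalidate the model; it would only indicate that the commissioning results may need to be revisited.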

Using AI in clinical practice: implementation, model life-cycle and sharing
When going live with an AI model, it is important to monitor the performance of the model in terms of usage, results, adjustments, post-processing and approval times, and problematic cases. The vendor should develop tools for automating such monitoring and QA, to enable a safe and transparent clinical implementation. For instance, the ML-generated segmentations can be stored separately and compared to the approved segmentations, allowing the models to be monitored over time in terms of the manual adjustments needed.
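The comparison between generated and approved segmentations could, for example, be quantified with the Dice similarity coefficient, a common overlap metric for binary masks. The following sketch assumes that both segmentations are available as arrays of the same shape; the metric, the function names, and the review threshold of 0.8 are illustrative assumptions rather than a method prescribed here.

```python
import numpy as np


def dice_coefficient(generated, approved) -> float:
    """Dice overlap between two binary masks of identical shape."""
    gen = np.asarray(generated, dtype=bool)
    app = np.asarray(approved, dtype=bool)
    if gen.shape != app.shape:
        raise ValueError("masks must have the same shape")
    total = int(gen.sum()) + int(app.sum())
    if total == 0:
        # Both masks empty: treated as perfect agreement by convention.
        return 1.0
    overlap = int(np.logical_and(gen, app).sum())
    return 2.0 * overlap / total


def cases_needing_review(dice_by_case, threshold: float = 0.8):
    """Return the case identifiers whose Dice score is below the threshold."""
    return sorted(case for case, score in dice_by_case.items() if score < threshold)
```

Storing the Dice score of every approved case over time gives a simple trend of how much manual adjustment the model output requires; a sustained decrease could then prompt a closer look at the problematic cases.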
ML models are suitable for sharing as they can be designed not to contain any personal data. We believe clinics will be open to sharing their knowledge with other clinics through ML models that have been trained and validated on their data, and vendors can provide tools to do that. For clinical purposes, an ML model can be shared if it has been validated and there is a model data sheet specifying its intended use and limitations (figure 18). Model sharing should be centrally organized rather than bilateral to ensure quality, transparency, model distribution monitoring, and version handling. Also, if a clinically deployed model is deficient, traceability is important so that all affected clinics can be notified. Such centralization of models, combined with centralization of outcome data and other relevant input, may lead to consensus on how certain treatments should be conducted.

Conclusion
Thanks to impressive results in tasks that were previously reserved for human intelligence, like visual object recognition in natural images, ML has become very fashionable and has raised much interest in all sorts of applications. Medicine has not escaped that ubiquitous trend and, in particular, specializations that heavily rely on medical imaging, like radiation oncology, try to fully exploit the possibilities of ML models. The sharp turn in that direction leads to a road full of promises but also paved with many pitfalls and poor visibility ahead. In order to address this issue, a twofold approach has been proposed in this review. On the one hand, interpretability and explainability are meant to make ML more trustworthy and its users more confident. On the other hand, exploring the tight relationship between data and model performance can help us to achieve more efficient learning, as well as to develop tools for risk assessment and QA. This review has explored some of the most recent developments in interpretable and explainable ML, presented different concepts around the data-model dependency issue, and investigated in the literature how they are starting to be applied in medicine and radiation oncology in particular. In the short term, interpretability is expected to be a topic of growing interest in interdisciplinary conferences and workshops, like 'UNSURE' (Uncertainty for Safe Utilization of Machine Learning in Medical Imaging) (Sudre et al 2021) or 'iMIMIC' (Interpretability of Machine Intelligence in Medical Image Computing) (Reyes et al 2021) in MICCAI. In the medium term, interpretability and explainability of ML and AI in general are likely to be developed on their legal side by law- and policy-makers, as well as regulatory institutions. For example, Europe has already formed expert groups to discuss and issue recommendations on 'responsible AI'. These initiatives could follow a similar path as the GDPR or be integrated into it.
Finally, in the longer term, more futuristic developments of ML and AI are aimed at streamlining the interface between human intelligence and its artificial counterpart, most probably by using natural language and other familiar means of communication.
Figure 19. Commissioning and go-live for a released ML model.