Efficient Bayesian Inference of General Gaussian Models on Large Phylogenetic Trees

Phylogenetic comparative methods correct for shared evolutionary history among a set of non-independent organisms by modeling sample traits as arising from a diffusion process along on the branches of a possibly unknown history. To incorporate such uncertainty, we present a scalable Bayesian inference framework under a general Gaussian trait evolution model that exploits Hamiltonian Monte Carlo (HMC). HMC enables efficient sampling of the constrained model parameters and takes advantage of the tree structure for fast likelihood and gradient computations, yielding algorithmic complexity linear in the number of observations. This approach encompasses a wide family of stochastic processes, including the general Ornstein-Uhlenbeck (OU) process, with possible missing data and measurement errors. We implement inference tools for a biologically relevant subset of all these models into the BEAST phylogenetic software package and develop model comparison through marginal likelihood estimation. We apply our approach to study the morphological evolution in the superfamilly of Musteloidea (including weasels and allies) as well as the heritability of HIV virulence. This second problem furnishes a new measure of evolutionary heritability that demonstrates its utility through a targeted simulation study.

Phylogenetic comparative methods correct for shared evolutionary history among a set of non-independent organisms by modeling sample traits as arising from a diffusion process along on the branches of a possibly unknown history. To incorporate such uncertainty, we present a scalable Bayesian inference framework under a general Gaussian trait evolution model that exploits Hamiltonian Monte Carlo (HMC). HMC enables efficient sampling of the constrained model parameters and takes advantage of the tree structure for fast likelihood and gradient computations, yielding algorithmic complexity linear in the number of observations. This approach encompasses a wide family of stochastic processes, including the general Ornstein-Uhlenbeck (OU) process, with possible missing data and measurement errors. We implement inference tools for a biologically relevant subset of all these models into the BEAST phylogenetic software package and develop model comparison through marginal likelihood estimation. We apply our approach to study the morphological evolution in the superfamilly of Musteloidea (including weasels and allies) as well as the heritability of HIV virulence. This second problem furnishes a new measure of evolutionary heritability that demonstrates its utility through a targeted simulation study.
1.2. Model. PCMs posit a continuous-valued stochastic process running on the branches of a phylogenetic tree that gives rise to trait values, possibly measured with noise, at the tree tips for observed organisms.
Phylogenetic Tree. The phylogenetic tree represents the evolutionary relationship among the organisms studied. We assume, without loss of generality, that the tree is calibrated in time, so that branch lengths represent actual time. For organisms evolving rapidly, such as viruses, the observations at the n tips of the tree are not necessarily contemporaneous (see Figure 1, left). We denote by m the total number of internal and external nodes in the tree (m = 2n − 1 if the tree is binary).
Stochastic Process on the Tree. We assume that a continuous trait evolves over time according to a stochastic process. When a speciation event occurs in the tree, the process is split into two conditionally independent processes with the same distribution (see Figure 1). Only the values of the process at the tips of the tree are potentially observed. This model enforces a phylogenetic correlation structure on the observations, as shown below. 2004 2006 2008 FIG 1. Realisation of a univariate BM process (right) with variance σ 2 = 0.05 and mean root value µ = 0 on a timed phylogenetic tree (left). Observations span 3 years (2006 -2008, tips Z 1 to Z 5 ). t12 is the time of shared evolution between tips 1 and 2. 1 is the length of the branch ending at node 1.
Brownian Motion (BM). The simplest stochastic process (Cavalli-Sforza and Edwards, 1967;Felsenstein, 1985) assumes that a multivariate trait Z t of dimension p evolves in time t following a Brownian motion B t with variance R, such that dZ t = R 1/2 dB t for all t ≥ 0 and at the tree root Z r ∼ N(µ, Γ). Under this process, the covariance between trait k at node i and trait l at node j (1 ≤ k, l ≤ p and 1 ≤ i, j ≤ m) is the product of (i) the covariance R kl between traits k and l and (ii) the shared evolutionary time t ij between species i and j, i.e. the time from the root to their most recent common ancestor (see Figure 1), plus the contribution of the root itself: Cov[Z i k ; Z j l ] = t ij R kl + Γ kl (see e.g. Clavel, Escarguel and Merceron, 2015).
Ornstein-Uhlenbeck (OU). The OU process was proposed as a model for traits evolving under stabilizing selection (Hansen, 1997) and has become widely used across evolutionary biology (see e.g. Cooper et al., 2016, and references therein). The OU process generalizes BM by adding a deterministic call-back term to a given value β that is interpreted as the optimal value of the trait of a species in a given environment: dZ t = −A(Z t − β)dt + R 1/2 dB t for all t ≥ 0. Matrix A is the "selection strength" that is constrained to have positive real parts of its eigenvalues and controls the dynamics of the pull toward the optimum. The covariance between two multivariate traits at two nodes can be explicitly formulated (Bartoszek et al., 2012;Clavel, Escarguel and Merceron, 2015), and, compared to the BM where the variance increases linearly in time, it is bounded by the stationary variance V of the process.
Observation Model and Individual Variation. The processes described above are meant to capture the evolutionary dynamics of traits across organisms. However, various other sources of variation may contribute to the observed data, such as measurement error, independent environmental variation, or intra-specific variation (see e.g. Hadfield and Nakagawa, 2010, for a review). As outlined in the next section, we include these biological phenomena in our model as an extra layer, that links the realization of the process at the tips Z i to the actual measurements Y i through a Gaussian observation model.
1.3. Scope of the article. In this work, we propose a general and efficient Bayesian framework to rigorously analyse this broad class of evolutionary models.
State of the Art. Since their introduction in the seminal article of Felsenstein (1985), PCMs have undergone extensive development, resulting in increasingly realistic models. We limit references here to those that specifically relate to the inference problem in a general setting and refer to e.g. Harmon (2019) for a recent and more comprehensive overview of these models. Clavel, Escarguel and Merceron (2015) provide a comprehensive maximum-likelihood framework to fit a wide range of models, using explicit estimators that can in some cases be computationally prohibitive. Pybus et al. (2012), Freckleton (2012),  describe and implement likelihood computation algorithms that are linear in the number of observations in a framework similar to the one described here. This algorithm is exploited in Mitov,  to conduct maximum-likelihood inference. From a Bayesian perspective, Pybus et al. (2012) and Cybis et al. (2015) (Lemey et al., 2010). Other more general models have been developed, including: non Gaussian models (using e.g. Levy processes, see Landis, Schraiber and Liang 2013;Duchen et al. 2017, or general Fokker -Plank equations, see Boucher et al. 2018); models where the trait explicitly impacts the tree (with BiSSE and related methods, see Maddison, Midford and Otto 2007;Fitzjohn, Maddison and Otto 2009;Fitzjohn 2010;Goldberg, Lancaster and Ree 2011;Fitzjohn 2012); models where species interact with each other (through mutualism or competition, see Nuismer and Harmon 2015;Manceau, Lambert and Morlon 2016;Drury et al. 2016; Bartoszek et al. 2017;Drury et al. 2018; Aristide and Morlon 2019); or models in high-dimensional trait settings (using pseudo or penalized likelihood, see Goolsby 2016; Clavel, Aristide and Morlon 2019). These models are however outside of the scope of the present work, that focuses on the general Gaussian model as presented in the next section (see Definition 1).
Outline. We complement recent advances in computing the likelihood under a general class of Gaussian models with an algorithm to analytically evaluate its gradient with respect to any of the parameters in linear time in the number of observations. This general class includes the OU process, as well as measurement error and missing data. We exploit this algorithm to develop an efficient Bayesian inference framework that relies on the use of an HMC sampler and allows for model selection through marginal likelihood estimation. We implement this framework in the BEAST phylogenetic software  for a sub-class of models, namely the OU process with a diagonal selection strength matrix A, that are of particular interest for the biological problems we study here. In Section 2, we present the general likelihood and new gradient computation algorithm (with details in Appendices A and B). In Section 3, we develop the Bayesian statistical inference framework (with details in Appendix C). Finally, in Section 4, we illustrate the method on two recently published biological datasets as well as on simulations.

Efficient Gradient Computations.
In this section, we show how the likelihood and its gradient with respect to all the parameters in a general trait evolutionary model can be computed in linear time in the number of tips of a rooted phylogenetic tree.
2.1. Statistical Model. Conditioning on a tree T , we define the following general Gaussian model of trait evolution: DEFINITION 1 (General Gaussian Model of Trait Evolution). Let T be a rooted binary phylogenetic tree with n tips, m = 2n − 1 internal and external nodes. At each node j, 1 ≤ j ≤ m, define a latent variable Z j , and for each observation i, 1 ≤ i ≤ n, a measure Y i , both of dimension p. The general Gaussian model of trait evolution on T is then defined in a hierarchical way as follows: where pa(j) and pa(i) denote, respectively, the unique parent node of node j or latent tip associated with observation i; µ and Γ are the expectation and variance of the root variable Z r ; for any node j, q j , d j and Σ j are, respectively, the actualization, drift, and variance associated with the branch going from pa(j) to j; and for any observation i, S i is the variance associated with it. We further denote by Y = (Y 1 , · · · , Y n ) T and Z = (Z 1 , · · · , Z m ) T the matrices of observed and latent trait variables, and by X = (Y T , Z T ) T the complete dataset. Furthermore, we assume that Γ, Σ j , and S i are positive definite for any node j and observation i.
This allows entertaining a highly generic framework that encompasses various evolutionary scenarios, as underlined in the following three paragraphs.
Stochastic Process Propagation. Equation (2.2) describes the stochastic process that governs the evolution of the latent trait. It is similar to the G LInv model described in , or the generic formulation used in Bastide et al. (2018). Both the BM and OU models can be cast into this framework, by setting: We refer to  for more details concerning other models that can be described within this framework, including processes with shifts or jumps.
Observation Model. Equation (2.3) describes the observation model. The variance term S i can have multiple significations, from a simple measurement error, to a "meta-analysis" effect (see Hadfield and Nakagawa, 2010 for a review on observation errors), or an "intra-specific" variance (see e.g. Goolsby, Bruggeman and Ané, 2017). In the simple case where the same observation error is assumed for all measures i, 1 ≤ i ≤ n, this term reduces to S i = S. Note that, from a methodological point of view, Equation (2.3) can be considered as a particular case of Equation (2.2).
Phylogenetic Factor and Repetitions. We note that one may posit that Z i is of lower dimension q < p and link to Y i through a linear combination via a latent factor model (Tolkoff et al., 2018). It would also be straightforward to include several measurements associated with a single tip. For clarity, we omit these details in the main text, and refer to Appendix A for the derivations in this general framework.
2.2. Likelihood Computations. The general model of Definition 1 is Gaussian, and it is hence possible to write out the marginal distribution p ( Y | θ ) of the measures Y given the parameters θ for some specific models (see e.g. Clavel, Escarguel and Merceron, 2015 for such formulations in the multivariate OU case). However, this computation generally requires the inversion of a tree-induced variance matrix, of dimension n × n. This is inefficient (worse than quadratic in n, see e.g. Raz, 2003), and is ill-suited for handling large phylogenetic trees that now frequently confront practitioners (see e.g. Jetz et al., 2012;Blanquart et al., 2017) or when the tree itself is random (Pybus et al., 2012). To alleviate this issue, it is possible to write an efficient pruning-style algorithm that is linear in the number n of organisms.
Pruning-Style Algorithm. Felsenstein (1973) introduced the pruning algorithm into phylogenetics to compute the likelihood of a simple BM. It draws from classical Gaussian conditional propagation ideas such as the Kalman filter or other "forward-backward" algorithms (see e.g. Rabiner, 1989 for a review). Variants of this algorithm have been flowering in the literature, sometimes under different names. A non-exhaustive list of references for the BM case includes Hadfield and Nakagawa (2010), Pybus et al. (2012), Fitzjohn (2012) (Gaussian Elimination Method), Freckleton (2012), Lartillot (2014) (Phylogenetic Kalman Filter) and Cybis et al. (2015). Goolsby (2016) and Hassler et al. (2019) have recently proposed adaptations to handle missing data. Finally, an extension of the algorithm to the general case as presented in Definition 1 (with missing data) was proposed by Bastide et al. (2018), and also more recently by . Implementation in BEAST. All these methods allow computing the likelihood in linear time in n. As is frequently the case for PCMs, many have been implemented in various independent software packages, each likely with a specific use in mind (with the notable exception of the R Core Team, 2019 package PCMbase, see . While this explains at least partly the numerous references to the algorithm, it limits the use of several implementations beyond the specific models they consider. Bayesian Evolutionary Analysis by Sampling Trees (BEAST, Suchard et al., 2018) is a widely used, well established and versatile phylogenetic software package. It encompasses a great variety of molecular sequence modeling tools, making it possible to conduct a coherent joint inference of both the timed phylogenetic tree and of the properties of the stochastic process, without the need to resort to a two-step analysis as is usually the case in previous methodologies (see Section 3.1 for more details). We implemented the general algorithm presented in Bastide et al. (2018) in this unified framework (with improvements, see next paragraph), allowing for its seamless integration with the realistic analyses permitted by the software.
Efficiency and Numerical Robustness. When sampling the parameters in a wide region of the space, as is typically done in a Bayesian analysis (see Section 3), numerical robustness is particularly important, as small divergences due to possibly ill-conditioned matrices can accumulate over the tree traversal, and lead to diverging results. We tackled this issue using two independent developments. First, we reduced the number of operations actually performed during the traversal of the tree thanks to a careful analysis of the iteration steps, making the algorithm both more efficient and more robust. Second, we increased the numerical robustness by using a dedicated linear algebra library (the Efficient Java Matrix library, Abeles, 2016) to conduct the computations. Combined with the use of a Moore-Penrose pseudo inverse, this made our handling of the singular or near-singular matrices induced by the presence of missing data (see Bastide et al., 2018, Hassler et al., 2019 more numerically stable. These developments are presented in detail in Appendix A.1. 2.3. Gradient Computation. When performing statistical inference, either in a maximum likelihood or a Bayesian framework, having access to the gradient of the likelihood at relatively cheap computational cost facilitates faster and more accurate algorithms (see Section 3). In this section, we present a novel algorithm to compute the gradient of the likelihood with respect to any parameter in the general setting presented in Definition 1. The algorithm relies on two main ingredients: (1) as in Fisher et al. (2019), we express the derivative of the likelihood as the conditional expectation of a given function of the latent traits Z j at the internal nodes j, conditional on the observed traits Y; and (2) we use a pre-order algorithm inspired from the "downward" phase in Bastide et al. (2018) to compute this expectation in a linear time in n.
Gradient as a Conditional Expectation. We rely here on Fisher's identity (Cappé, Moulines and Rydén, 2005) that links the gradient of the log-likelihood log p ( Y | θ ) of the observed variables to the conditional expectation of the completed log-likelihood log p ( Z, Y | θ ): PROPOSITION 1 (Fisher's Identity; Cappé, Moulines and Rydén, 2005). Under broad assumptions, that are verified for Gaussian densities, the following identity holds (Equation 10.12 in Cappé, Moulines and Rydén, 2005): where X s represents any subset taken from the complete data.
Applying this identity, we obtain the gradient of the likelihood with respect to any parameter θ j = (q j , d j , Σ j ) or θ i = S i of the model: PROPOSITION 2 (Gradient with respect to Branch Parameters). Under the general model of Definition 1, for any observation or node k (1 ≤ k ≤ n + m), the following identity holds: with vech the symmetric vectorization operation (Magnus and Neudecker, 1986); and where n k , M j , Q −1 k and V k are parameters, representing the expectations and variances of two Gaussian densities (see Equations A.15 and A.23 in Appendix A.2), that can be computed in one pre-order traversal of the tree.
PROOF OF PROPOSITION 2. Let k be an observation or node index with associated trait variable X k (1 ≤ k ≤ n + m). As in Fisher et al. (2019), we decompose the observations Y as Y = (Y k , Y k ), where Y k denotes the observations that are "below" node k, i.e. that have k as an ancestor, and Y k denotes the observations that are "above" node k, i.e. that do not have k as an ancestor. The tree conditional structure then induces the decomposition: where only the middle term depends on parameters θ k associated with the branch ending at node k. Applying Proposition 1 with X s = X k , we get that: Using the pre-order formulas presented in Appendix A.2.1 (Equation A.15), we can see that X k Y k is normally distributed, with expectation n k and precision matrix Q k . Applying standard derivation formulas to a log Gaussian density (see e.g. Magnus and Neudecker, 1986), we have: .23), we know that X k Y is normally distributed, with expectation M k and variance V k . Equation (2.7) is then obtained by taking the conditional expectation of the above expression (2.8).
Chain Rule. In Equation (2.7), we express the gradient of the likelihood with respect to any branch parameter using only quantities that can be computed in two traversals of the tree, one post-order and one pre-order. This provides the basis for an algorithm to compute the gradient of the likelihood with respect to any parameter with a linear complexity in n. Indeed, one only needs to apply the derivation chain rule to: (1) obtain the gradient of the branch parameters n k and Q k with respect to the natural parameters of the process at hand; and (2) obtain the gradient of the likelihood with respect to parameters shared between several branches. We tackle task (1) using the pre-order formulas in Appendix B. Task (2) is a straightforward application of the chain rule over all the levels of the hierarchical model: Complexity. Appendix A implies that all the moments n k , M j , Q −1 k and V k (1 ≤ k ≤ n + m) appearing in Proposition 2 can be computed in linear time in the number of observations. Since formulas (2.7) and (2.9) only involve linear algebra operations in a space of the dimension of the parameters, the total complexity remains linear in n. In addition, we note that the sum in Equation (2.9) does not need to follow the tree order, as all the quantities are pre-computed, and hence can be parallelized easily, reducing the actual computation time.
3. Statistical Inference. In the previous section, we showed how both the likelihood and its gradient with respect to all the parameters of the models can be efficiently computed simultaneously. These quantities are the cornerstone of many statistical analyses, and allow for a wide range of analyses, from maximum likelihood to model selection (for examples in trait evolution, see e.g. Clavel, Aristide and Morlon, 2019). Here, taking advantage of the comprehensive Bayesian inference framework made available through the BEAST phylogenetics software package, we propose a new Bayesian approach that relies on the use of an efficient HMC sampler to perform both posterior inference and marginal likelihood estimation.
3.1. Bayesian Phylogenetics and the Total Evidence Approach. The likelihood and gradient algorithms presented below work conditionally on a phylogenetic tree between sampled species being known without error. However, the tree is generally a summary statistic resulting from a complex statistical analysis, and is usually inferred from molecular sequences, which are the actual observed data. Many methods in the literature follow a two-step procedure and first infer the tree from sequence data to then proceed with analysing the continuous traits, assuming that the phylogenetic tree is known and fixed (see e.g. Harmon, 2019 for a recent review of such methods). This approach suffers from two major drawbacks. First, it ignores the uncertainty in the reconstruction of the tree, which, given the difficulty of the task, can be substantial, and bias the subsequent analyses (see Felsenstein, 2004 for a review, and Bastide, 2017, Section 5.1, for an example in a specific case). Second, this approach does not allow for the complete use of the data available, as it ignores continuous traits for the tree reconstruction. Although, when present, sequence information tends to dominate over trait information , it is not always available for all sampled organisms. This is particularly true for ancient fossils, that might bear some continuous trait data, but, because of the rapid degradation of DNA molecules, can not be sequenced (Leonardi et al., 2016). We refer to Section 4.3 for an example of such a dataset, where the continuous trait constitutes the only source of information available to reconstruct the phylogenetic tree.
To overcome these limitations, we use a total evidence approach (Ronquist et al., 2012), that can analyse sequence and trait data jointly, using all the information available in a Bayesian analysis. Denote by S the sequence data (that might not be available for all the sampled species), and by φ all the parameters associated with the model of sequence evolution and the dating clock model, (see e.g. Felsenstein, 2004 for a review of such models). The goal of Bayesian phylogenetics is then to learn about the posterior: p ( θ, T , φ | Y, S ) . One crucial assumption that we make is that, conditionally on the phylogenetic tree T , the evolution of continuous traits and the sequences are independent, such that: The term p ( S | T , φ ) p (T , φ) has been the focus of an extensive literature, and benefits from efficients methods readily available in BEAST . Thanks to the tools presented in the previous section, we focus here on p ( Y | θ, T ) p (θ) that deals with the study of the distribution of continuous traits among the population of species.
3.2. Hamiltonian Monte Carlo. HMC is a powerful MCMC sampling technique, that exploits the geometrical properties of the density to be sampled through the use of Hamiltonian dynamics (Neal, 2011;Betancourt, 2017). It associates to a vector of parameters of interest θ, viewed as the position of a particle in a d-dimensional space, an auxiliary independent vector p of "momentum", that is typically chosen to be Gaussian: p ∼ N(0 d , I d ). The log joint distribution of the parameter (θ, p) then represents the "total energy" H(θ, p) = U(θ) + K(p) of the particle, with U(θ) = − log p ( θ | Y, T ) the "potential energy" set to be equal to the posterior density of interest, and K(p) = p T p/2 the "kinetic energy". Total energy is then invariant to the Hamiltonian dynamics: The HMC sampling scheme exploits this property using proposals that approximately follow these dynamics, as discretized by an appropriate numerical scheme such as the leapfrog.
This allows for a proposal that can have a small correlation with the current state, while still having a high probability of acceptation (Neal, 2011). Such an HMC sampler has already proven very successful in a phylogenetics context (Fisher et al., 2019;Ji et al., 2019). Our efficient and general algorithm for likelihood and gradient computation, presented in Section 2, makes it now applicable to the wide variety of models covered by Definition 1.
3.3. Confronting Constrained Natural Parameters. Some of the parameters of the models, such as the variance of a BM or an OU, live in constrained spaces with a non-trivial structure, that need to be sampled adequately. One standard way to deal with this structure (see e.g. Stan, 2017) is to map the constrained natural parameters θ to a vector of independent, unconstrained parameters η through a smooth transformation f . The density in the new, unconstrained space is then linked to the density in the constrained space by a simple multiplication with the determinant of the Jacobian matrix of the transformation (see e.g. Le Gall, 2006), such that: This formula allows us to easily update the constrained parameters θ from movements in the unconstrained space of η. We present the transformations and the associated priors used here in detail in Appendix C. In particular, we show that sampling the space of correlation matrices amounts to sampling vectors in the half-euclidean sphere, which provides an original and simple representation of the classical LKJ transformation (Lewandowski, Kurowicka and Joe, 2009).

Model Selection and Marginal Likelihood Estimation.
We described above a general framework to efficiently infer the parameters of a wide class of evolutionary models. When analysing a dataset, a question that naturally arises is the choice of the most suited model to interpret the evolutionary patterns in a specific problem.
Bayesian Model Selection. In a Bayesian setting, one natural way to compare a collection of models (M m ) 1≤m≤K is to compute their marginal likelihoods p ( Y | M m ) (see e.g. Oaks et al. 2019 for an introduction in a phylogenetic context). The marginal likelihood, that integrates all the parameters against the prior, takes the model complexity into account by design, "penalizing" complex models, that otherwise mechanically have a higher likelihood. Marginal likelihoods allow computing Bayes factors, which have a natural comparison scale (Jeffreys, 1935;Kass and Raftery, 1995). However, because of the need to integrate over the potentially very large space of parameters, this quantity is typically hard to compute, and approximations are required.
Generalized Stepping-Stone Sampling (GSS). In a phylogenetic context, the GSS approach  has been successfully used to approximate marginal likelihoods Fourment et al., 2019). For a given model M, it relies on the construction and sampling of a path between the unnormalized posterior and a "working" prior distribution p 0 ( θ | M ): When β = 1, the path likelihood is proportional to the classical posterior sampled in a standard MCMC analysis, while when β = 0, it reduces to the working prior p 0 ( θ | M ). This working prior is chosen to match the empirical moments from a sample of the posterior distribution, ensuring a less vague distribution that is closer to the posterior, hence inducing a more accurate approximation for a reduced computational effort than the vanilla steppingstone sampling procedure (Xie et al., 2011;Fan et al., 2011). As in Baele, , we adopt a kernel density estimator (KDE) for each parameter, using a normal kernel, that is log-transformed for positive parameters (see e.g. Jones, Nguyen and McLachlan, 2018).
Sampling the path with HMC. The GSS estimation implies sampling from the path likelihood q β (θ) for a sequence of β. Xie et al. (2011) and  show that choosing the path parameter as evenly spaced quantiles of a Beta distribution with shape 0.3 and scale 1.0, which allows for sampling more intensely regions where β is small, and hence where the path likelihood is changing the most rapidly, yields the best performance. This sampling is usually done through a standard MCMC procedure. Here, we use the efficient HMC approach presented in Section 3.2, which implies taking the gradient of the log path likelihood (3.2): This gradient involves a term proportional to the posterior that we already dealt with in the HMC inference, and the working distribution, that, as a product of independent KDE estimations, is straightforward to compute. This makes it possible to use the efficient HMC sampling scheme in the GSS marginal likelihood estimation framework already implemented and well established in BEAST Fourment et al., 2019).

Applications and Simulations.
4.1. Assumptions and Practical Implementation. We showcase the usefulness of our inference framework using two recently published datasets, one in ecology, and one in virology. This led us to implement a subset from all the models made accessible by the method. Specifically, we limit the evolutionary model to the BM and the OU with diagonal strength of selection A, and a shared residual variance for all the measures, possibly scaled by the tip heights (S i = S or S i = t i S for any observation i, 1 ≤ i ≤ n).

Phylogenetic
Heritability. The concept of phylogenetic heritability has been defined in the field of PCMs to study the relative importance of the evolution and observation models in the total measured variance at the tip of the tree (Lynch, 1991;Housworth, Martins and Lynch, 2004). It is linked to the notion of phylogenetic signal (Pagel, 1999), and its use has recently received considerable attention in studying infection traits in the field of virology (Alizon et al., 2010;Leventhal and Bonhoeffer, 2016;Mitov and Stadler, 2018). We introduce here a general definition of the phylogenetic heritability that extends this notion to our general framework. It relies on the expectation of the population variances computed at the latent tip level (for the Z tip = (Z j ) 1≤j≤n ), and at the observation level (for the Y = (Y i ) 1≤i≤n ): where the expectation and variance are taken following the process of evolution and observation defined in 1 with parameters θ. These quantities have closed-form expressions for all the models considered here, see e.g. Clavel, Escarguel and Merceron (2015) for the general OU case. The "heritability matrix" H can then be defined as: In the case of a standard univariate trait on an ultrametric tree with only one observation per tip, this formula coincides with the classical definition found in the literature (see e.g. Mitov and Stadler, 2018).
Population versus Empirical Variance. In equation (4.3), we use the population variance, instead of the empirical one used for instance in Blanquart et al. (2017), Hassler et al. (2019). We argue in Appendix D that, when the process is not a vanilla BM, the population variance is more appropriate, as the empirical variance might be impaired by confounding inter-group effects if the tips are expected to have different means under the trait evolution model, which is for instance the case for an OU model on a non-ultrametric tree.

Morphological Evolution in the Musteloidea Superfamily.
We illustrate the total evidence approach to study the evolution of some morphological features in the Musteloidea superfamilly (including weasels and allies).

Dataset and Analyses.
Dataset. We reanalyze the dataset published by Schnitzler et al. (2017), containing 81 taxa, including 4 fossils, and 2 outgroup species. Aligned sequence data for all 77 extant taxa are available (containing 22 nuclear and 5 mitochondrial genes). Three morphological traits are measured on 65 species, including fossils, with missing data for some taxa (see Figure 2). They are carnivorian ecometric traits, defined as meaningful ratios of osteological measurements. Note that our Bayesian framework can readily handle this heterogeneous dataset, and jointly analyse sequence and continuous traits with missing data on both.
Questions. We aim to address the following two questions. First, does total evidence (see Section 3.1) allow for better placement of the fossils? Given that we can take trait data into account while performing phylogenetic inference, we might expect that this extra information leads to better fossil placement estimates compared with Schnitzler et al. 2017, who only considered sequence data and use monophyly constraints. Second, which model of trait evolution is most suited to explain the observed trait distribution, and does this conclusion change when we include or set aside fossil data, as suggested by Schnitzler et al. (2017)?
Sequence Evolution and Dating. We use the same sequence evolution model as Schnitzler et al. (2017) for all 27 partitions of the dataset, with estimated base frequencies and site rate heterogeneity modeled using a discretised gamma distribution with 6 rate categories; an uncorrelated relaxed clock with an underlying gamma distribution; and an exponential growth coalescent tree prior. Following Schnitzler et al. (2017), we constrain 5 clades to be monophyletic, and each fossil is a priori assigned to one of these clades, except for Trocharion albanense, which remains unconstrained (see Figure 2). We assume a normal prior on the time of the most recent common ancestor for each of those clades, with means as in Schnitzler et al. (2017), and standard deviation 1. We assume a uniform prior on fossil dates, with maximum ranges taken from Law, Slater and Mehta (2018) (Table S5), except for Teruelictis riparius, for which dates were extracted from the Paleobiology Database, relying on Salesa et al. (2013). Note that these assumptions differ slightly from Schnitzler et al. (2017), who provide insufficient information to reproduce their exact pipeline.  Total Evidence Phylogenetic Inference. We conduct phylogenetic inference using 3 different data integration scenarios: no model of trait evolution (i.e. the continuous traits are not used); a BM model; and an OU with diagonal selection strength. The latter two combine both sequence and trait evolution. We run each analysis for 100 million iterations, sample every 1000 steps, and discard the first 10% as burn-in. The maximum clade credibility (MCC) tree is used to represent the evolutionary history.
Fossil Placement Analysis. We assess the uncertainty of fossil placement using a method introduced by Klopfstein and Spasojevic (2019). Given a sample of trees from the posterior, as the backbone tree is well resolved (see below), the method amounts to computing, for each branch of the MCC tree, the frequency a given fossil attaches to that branch. We measure frequency vector concentration using entropy; a fossil that is well resolved will be distributed over a small number of branches with high frequency, and hence has a low entropy.
Model Comparison. As in Schnitzler et al. (2017), we also conduct several model comparisons, conditioning on a tree fixed to the MCC tree from one of the previous analyses, with or without fossil species. On each tree, we test several hypotheses about the nature of trait evolution, by comparing (log) marginal likelihood estimates for the various models. Models tested in this section are the BM and OU models, but also the "trend" model, that is a BM with an added homogeneous deterministic drift (Hansen and Martins, 1996;Gill et al., 2016). See Supplementary Figure S3 for the list of all hypotheses tested.
Analysis and Representation of the Results. We use BEAST, TreeAnnotator and Tracer to conduct the analyses Rambaut et al., 2018). Trees are imported into R and plotted using treeio (Wang et al., 2019), tidytree and ggtree (Yu et al., 2017(Yu et al., , 2018 (2017), we estimate a wellresolved backbone tree, with uncertainty mostly at the genus level, particularly in the Mephitidae and Procyonidae families (see Figure 2). Including trait information does not dramatically change the inferred relationships between extant species, confirming that molecular data are generally more informative than trait data .
Fossil Placement. Taking into account trait information reduces entropy scores for each of the fossils, with the OU model having the lowest entropy for 3/4 of the fossils (see Table 1 and Supplementary Figure S2). Continuous trait measurements bear different amounts of information for each fossil, which leads to different entropy score behavior. The fossil Pannonictis is evenly distributed over all the branches of the clade (Mustelinae) where it is assigned. It has a high entropy that does not decrease much when traits are taken into account. This result is not surprising given that the traits vary little among all the species of this clade (see Figure 2), and hence yield little information with respect to fossil placement. Fossils Sivaonyx beyi and Teruelictis riparius are both assigned to the same clade (Lutrinae). Figure 2 illustrates that their R1 trait is relatively lower compared to other members of the clade. Taking this trait into account is thus informative, and entropy decreases, with the fossil estimated to lie at the root of the tree (see Supplementary Figure S2). Finally, the species Trocharion albanense is not assigned to any clade. Taking traits into account concentrates this fossil as a sister lineage either to Ailurus fulgens, or to the whole Mephitidae clade, which, in the assumed time range, have the most similar traits. Model Comparisons. We find that the favored model, for all tested trees, with or without fossils, is a simple BM for the first trait (R1), and an OU with diagonal selection strength but full correlation for the two other traits (R2 and R3), with R1 evolving independently from R2 and R3 (see Supplementary Figure S3). The parameter estimates are consistent with those from Schnitzler et al. (2017) (see Supplementary Figure S4). Schnitzler et al. (2017) fitted the three traits independently and used a simple penalized likelihood approach (using the Akaike Information Criterion, Akaike 1974) to demonstrate that a "trend" model is favored to the simple BM model for R1 when fossils are included. In contrast, our method is robust to the addition of these fossils that, given the missing data, amounts to the addition of two data points in the analysis (see Figure 2). The selected model has a log Bayes factor of at least 1.5 compared to the second best fitting model in all the scenarios, providing "substantial evidence" (Kass and Raftery, 1995) against the simple BM model.

4.4.
Virulence Heritability in Human Immunodeficiency Viruses (HIV). New challenges for PCMs have recently emerged in infectious disease research, more specifically on the extent to which virulence is a heritable trait in HIV. Here, we employ our new modeling framework to perform a fine-grained analysis to gain insight into this problem.

Dataset and Analyses.
Dataset. We revisit the most comprehensive dataset on HIV-1 heritability published in Blanquart et al. (2017) and further analysed in Hassler et al. (2019). We focus on subtype B and the measurements available for male subjects who have sex with men (MSM), which comprises a dataset of 1171 viral samples. Two traits associated with HIV virulence (Alizon et al., 2010;Blanquart et al., 2017) are measured for each sample: (i) the "gold standard viral load" (GSVL) that is a standardized measure of the viral load, taken on a single sample between 6 and 24 months after the infection and before the initiation of antiretroviral therapy; and (ii) the CD4 cell count slope decline (see Figure 3). Questions. Disease progression varies greatly among patients. Similar to other rapidly evolving human pathogens, it is challenging to determine to what extent this variance is due to the host or virulence of the viral genotype. The pioneering application of PCMs by Alizon et al. (2010) to estimate the heritability of HIV virulence using set-point viral load (spVL), has stimulated the generation of comprehensive data sets (Blanquart et al., 2017), but also led to a discussion about the underlying models (Mitov and Stadler, 2018;Bertels et al., 2018). Depending on the method and datasets used, the heritability of the spVL has been quite controversial (Leventhal and Bonhoeffer, 2016), with estimates ranging from about 50% (Alizon et al., 2010), to around 30% (Vrancken et al., 2015) and to as low as about 6% (Hodcroft et al., 2014). We explore here the fit of several models of trait evolution and individual variation (Equations 2.2 and 2.3) to study their impact on heritability estimation and other parameters of interest. As the virus-host interactions are a major source of trait variation, we expect the individual variation layer to be particularly important in these models.
Trait Evolution Models. We use three different evolution models for the two traits on the tree: a multivariate BM, a multivariate OU (with diagonal selection strength), and a mixed multivariate "OU-BM" model, that has an OU model on the GSVL, and a BM model on the CD4 slope, the two still being correlated. This last model illustrates the flexibility of our framework in model specification. It is motivated by the results presented in Blanquart et al. (2017), and by the data distribution (see Figure 3), with the CD4 slope being much more spread out than the GSVL (with respective quartile coefficients of dispersion of −0.56 and 0.09).
Individual Variation Models. One major driver of diversity for the two traits is the interaction of the virus with its host, that is independent from the viral phylogeny. This individual variation is captured through our observation model layer. We take this variation to be either identically distributed, or scaled by the tip heights, with or without trait correlation. Scaling the independent noise by the tip heights is an empirical model inspired from Pagel's λ model (Pagel, 1999) and is well suited for an environmental contribution that increases linearly with sampling time (Leventhal and Bonhoeffer, 2016).
Model Comparison and Model Fit. As in the previous example and using the same computational tools, we estimate (log) marginal likelihoods for all the models under study. Model Selection. The model favored according to the log marginal likelihood estimation is the OU model on both traits, with an independent, scaled observation matrix (see Figure 4). Compared to the OU-BM model with the same error structure, it has a log Bayes factor support of 3.14, indicating a strong support for the more complex OU model (Kass and Raftery, 1995). In general, the scaled error models appear to be much better supported than the non-scaled ones. This might raise some concerns on the ability of the MLE model selection procedure to select for the best suited model in this setting. We address both concerns with a simulation study in the next section. 4.5. Exploration of Model Selection in a Heritability Estimation Context. In the previous section, we estimated the heritability to be lower than 50%. This means that the phylogenetic model (2.2) accounts for at most half of the total variation observed in the dataset, while individual variation (2.3) takes up the remaining part. In the absence of strong phylogenetic signal in the trait, it might be challenging to uncover the true underlying trait evolution model. Using a simulation scheme that is inspired by the empirical dataset, we explore the limits of the MLE model selection procedure in this setting.

Setting.
Base Evolutionary Scenario. We used the same fixed HIV tree as in the previous section, normalized so that it had a maximum root to tip height of one. We then simulated a bivariate trait according to a multivariate correlated OU-BM model. By analogy, the two traits are named GSVL and CD4, and the parameters of the process were taken to be similar to the ones inferred in the previous section. For the "GSVL" trait, we took a half-life of 0.3% of the tree height, a stationary variance σ 2 /(2α) of 0.1, and an optimal value (equal to the root conditional value) of 1. For the "CD4" trait, we took a selection strength of 0 (BM), a variance of 0.01, and a root conditional value of −0.1. The correlation between the two traits was set to −0.9. Independent Variations. On top of this evolutionary model, we added independent individual variations at each tip. Under the base scenario, the variance of this extra noise was taken to be equal to the variance of the process (0.1 on the GSVL and 0.01 on the CD4). This reflects a heritability of about 50% (computed to be of 49.4% for the GSVL and 45.9% for the CD4). We then multiplied this noise variance by a factor f varying between 0.1 and 3, leading to a maximal heritability of 0.91 and 0.89; and a minimal heritability of 0.25 and 0.22 for both traits, respectively. The estimates for the empirical data set imply a scenario where this factor f is high.
Simulation and Inference. We simulated the OU-BM process using the R package Phylo-geneticEM (Bastide, Mariadassou and Robin, 2017;Bastide et al., 2018). Each scenario was repeated 50 times. On each of these datasets, we performed an analysis similar to the previous section, fitting three evolution models (the BM, the true OU-BM and the OU), with independent identically distributed individual variations at each tip.
Questions. We analysed the results to address the two following questions: (i) does the MLE model selection procedure recover the true generative model? and (ii) to what extent is heritability correctly estimated? 4.5.2. Results.
Model Selection. The proportion of each model being selected over the 50 repetitions is presented in Table 2. When the individual variation variance is low or equal to the evolutionary variance, the correct OU-BM model is selected in all or most of the cases. When this noise increases however, the proportion drops considerably, with the correct model being selected less than 75% of the cases when f = 3.  TABLE 2 Proportion of times each model is selected over the 50 replicates. When the independent noise is small compared to the phylogenetic signal, the right model (OU-BM, bold) is almost always selected. When the independent noise become overwhelming, the phylogenetic signal is lost, and the model selection is less efficient. The HIV example explored in Section 4.4 falls under the latter category, with a high noise (or low heritability), which might explain the somewhat unexpected results provided by the model selection in the HIV example.
Estimation of the Heritability. In Figure 5, we show the normalized estimated heritability when we use the correct OU-BM model. When the noise level factor f increases, the variance of the estimates over the 50 repetitions increases substantially, with higher levels of noise leading to an over-estimation of the heritability for both traits. Note that the estimate is more variable for the trait under the OU model (GSVL). This is consistent with the fact that it relies on the estimate of the strength of selection, which is difficult to infer (see Supplementary Figure S6). The empirical coverage level of the 95% HPD credible interval remains however relatively high, never dropping below 80% (see Supplementary Figure S7).
Selection Strength. When the OU model is incorrectly selected over the OU-BM (which can happen almost one fourth of the time under high levels of noise), then the selection strength on the CD4 trait, that is simulated without selection, is estimated to be generally higher than the selection strength of the GSVL (see Supplementary Figure S6). Discussion and Caveat. When the noise level is high, we observe three main artifacts in the estimation: (i) the OU model is often wrongly selected over the OU-BM; (ii) the heritability tends to be over-estimated; (iii) when the OU model is selected, the selection strength for the selection-free trait is estimated to be larger than for the trait under selection. These three artifacts are precisely the points that raised questions from a biological point of view in the previous section. This simulation study therefore illustrates that complex models of trait evolution such as the OU, that have recently been advocated in the context of heritability studies (Mitov and Stadler, 2018;Bertels et al., 2018), should be treated with caution when applied to a dataset that is burdened by high levels of noise.
5. Discussion. Motivated by the need to accommodate OU processes, we have presented an efficient inference procedure for a broad class of trait evolution models in a Bayesian inference framework. Using two empirical examples and a simulation study, we have demonstrated its applicability to answer a large spectrum of biological questions, in fields ranging from paleontology to virology.
At the core of the inference procedure, the likelihood and now gradient computation algorithms are linear in the number of observations, making them efficient on large trees, and applicable to a broad class of Gaussian evolutionary processes, including but not limited to the popular OU model. This algorithm however has worse than quadratic complexity in the number of latent traits propagated on the tree. However, using techniques such as phylogenetic factor analysis (Tolkoff et al., 2018), this latent dimension can be reduced to a manageable size. In this work, we used the efficient gradient computation algorithm in a Bayesian context. Note that this gradient could be more broadly exploited in other settings, such as in a maximum likelihood inference.
All the formulas are written here for the general model of Definition 1. Specific formulas are however only implemented for a sub-set of the possible models, namely OU modes with diagonal selection strength, with a constant or time-scaled noise, and with only one observation for each tip. Extensions to more general models will be required in order to study specific datasets and answer relevant biological questions. Although the main core mechanism remains unchanged, some derivations may still be needed to propagate the gradient in such complex models. For instance, dealing with the OU with a general selection strength implies taking the derivative of a matrix exponential with respect to a matrix, which is a notoriously difficult problem, and may require some approximations (Al-Mohy and Higham, 2010). However, in the quest for ever more complex models, particular care should be taken concerning practical and theoretical identifiability issues, as illustrated by our simulation study.
6. Data and Scripts. All the scripts and data used in this manuscript are publicly available as a GitHub repository: https://github.com/pbastide/HMC_OU.

APPENDIX A: POST AND PRE-ORDER ALGORITHMS
In this appendix, we show how all the quantities used in the main text can be computed in only two traversals of the tree: one post-order (from the tips to the root) for likelihood computations, and one pre-order (from the root to the tips) for gradient computations.
A.1. Post-Order Algorithm For Likelihood Computation. The post-order propagation formulas with missing data have been presented in Bastide et al. (2018) and . We start by writing a slightly modified version of the formulas found in the first reference, before showing how they can be made more efficient and robust by reducing the number of operations per propagation iteration.
A.1.1. Gaussian Propagation Formulas. Here, we re-write the propagation formulas found in Bastide et al. (2018) (Appendix 2.2), using the notation conventions found in Hassler et al. (2019), which provide a similar framework, but limited to Brownian diffusions. Model. Using the notation of Definition 1, we re-write Equations (2.2) and (2.3) as one generic propagation step: where pa(k) denotes the unique parent of node k, or the unique latent tip associated with an observation. This generic step encompasses both Equations, by taking: Note that here, we assume that there can be several measurements Y i associated to the same tip Z pa(i) , so that we have N ≥ n observations. Also note that, in all the derivations below, we do not assume that the actualisation matrices a k are invertible. This allows to encompass the phylogenetic factor model in this framework (with non-square actualisation matrices that represent loading matrices).
Pseudo-Gaussian Propagation. As in Hassler et al. (2019), we define the "pseudo-Gaussian" density of mean µ and precision P of dimension p as the function: logφ (·; µ, P) : wheredet (P) is the product of all non-zero singular values of P. Note that when P is positive definite, then this coincides with the standard Gaussian distribution. The algorithm presented in Bastide et al. (2018) shows that for any index k, the density of Y k (i.e. all the measurements that have k as an ancestor) conditionally on X k is proportional to a pseudo-Gaussian: with the following propagation formulas: where, given a node k, C(k) is the set of all the direct children of k, and P k is defined in the next paragraph.
Computation of P k . To compute P k , we distinguish between observations and tree nodes: where ∆ k is a diagonal matrix, with a 0 if the trait is missing, and a 1 otherwise. Note that we use the Moore-Penrose pseudo inverse (·) − in Equation (A.7), which can be computed using a singular value decomposition of the matrix. In Equation (A.8), the regular inverse can be used, as Σ k is assumed to be positive definite for any node k. Root and Likelihood. Once at the root r, we get the likelihood of the observed data given the root trait: log p Y r X r , θ = log p ( Y | X r , θ ) . An extra integration on the root trait, using Equation (2.1), gives the likelihood as log p ( Y | θ ) = log r tot , with: Note that, if the root is fixed (Γ = 0 pp ), then this expression simplifies to: Algorithmic Complexity. In the generic case, the complexity of this algorithm is O(N p 3 + mp 3 ). However, the computations on the observations (Equation A.7) could easily be parallelized, reducing the computational overhead. In addition, if the observation model is parametrized in term of the precision matrix R i (as is standard in Bayesian analyses), then Equation (A.7) can be replaced with: P i = ∆ i R i ∆ i , reducing the complexity to O (N p 2 + mp 3 ). Finally, if we assume that the observation variance is diagonal, then the complexity reduces to O(N p + mp 3 ).
REMARK 1. A necessary condition for guaranteeing the correctness of this algorithm is that P k m k = l∈C(k) a T l P l (m l − b l ) (see Equations A.4 and A.5), which is proven by the Lemma below. LEMMA 1.
PROOF. It is sufficient to prove that the equation P k x = l∈C(k) a T l P l (m l − b l ) has a solution. By Theorem 1 in Ben-Israel (1969), we only need to prove that P This implies v T a T l P l a l v = 0 for all l ∈ C(k). For each child l of k, l ∈ C(k), there exists an orthogonal matrix U l such that P l = U T l diag(λ k are positive eigenvalues of P l . Since v T a T l P l a l v = 0, we derive that the first k coordinates of U l a l v are 0. Hence, v T a T l P l = 0. Therefore, v T a T l P l (m l − b l ) = 0 for every l ∈ C(k), which completes the proof.
REMARK 2. In Bastide et al. (2018) and Hassler et al. (2019), the authors use a "lowdimensional" generalised inverse to deal with missing values. The generalized inverse M ∼ of a matrix M is defined as follow: (1) find the indices i ∈ I such that M ii is not infinite nor zero; (2) invert the sub-matrix M I with only rows and columns of M that are in I; (3) for any i in I set M ∼ ii = (M −1 I ) ii ; (4) set all other coefficients of M ∼ to zero or infinite (with non-standard rule 1/∞ = 0, and 1/0 = ∞). Given the special structure of the matrix to invert in Equation (A.7), the "infinite variance" values amount to marking the missing values, and the low-dimensional invert has the same result as the Moore-Penrose invert. However, it is easy to see that MM ∼ M = M so that this low-dimensional inverse is not a pseudo-inverse. Moreover, it is unclear that Lemma 1 holds for this inverse. For consistency and numerical robustness (see below), we use here the more standard Moore-Penrose pseudo-inverse.
A.1.2. A More Robust Propagation. In this section, we make the post-order traversal more efficient and numerically robust by reducing the number of operations needed at each step and using more numerically stable matrix algebra tools.
Reducing computations. Here, we show that we can reduce the number of operations at each step of the propagation, by eliminating matching terms over successive steps. Let j be any node that is not the root, with unique parent pa(j). Then, in the computation of log r pa(j) (Equation A.6), keeping only the terms that depend on j or pa(j), we get: where F −j and F −j are quantities that do not depend directly on j. From Equation (A.8), we get: so that, using Lemma 2 (see below), this expression then simplifies to: Changing the initialization for observations to (A.13) we can then replace the propagation Equation (A.6) by the following: (A.14) Note that this propagation formula is more efficient on several accounts. First, it implies less computations, as it requires only one determinant computation, instead of the number of children of k plus one. Second, by analytically reducing the matching quantities between iterations (rank and determinant), we prevent numerical errors from propagating up the tree, and hence obtain a more numerically robust algorithm.
LEMMA 2. For any node j, we have: rank(P j ) = rank(P j ).
Robust Linear Algebra. We mentioned above that we replaced the "low-dimensional" inverse used in previous studies by the standard Moore-Penrose inverse. In addition to simplifying the formula, the use of this standard pseudo-inverse allows us to use efficient and well established tools to perform the computations (Abeles, 2016). It also protects us from ill-conditioned variance matrices Σ j which, depending on the values taken by the branch lengths j , and, in the case of an OU process, on the values of the selection strength A, can become close to singular matrices (see Equations 2.4 and 2.5).
A.2. Pre-Order Algorithm For Gradient Computation. In this section, we show how to compute the moments appearing in Proposition 2 in one extra pre-order traversal of the tree. These formulas extend the developments found in Fisher et al. (2019) from the simple BM case to the general case of Definition 1.
A.2.1. Conditional Moments n k and Q k . Let k be any index. We write the distribution of X k Y k as a pseudo-Gaussian with mean n k and precision Q k : If k = r is the root of the tree, then as, by definition, all the observations have the root as an ancestor, p X r Y r , θ = p ( X r | θ ) and is given by Equation (2.1) from Definition 1. Otherwise, denote by p = pa(k) the unique parent of k, and by S(k) = C(pa(k)) \ {k} all the siblings of k (i.e. the direct children of pa(k) different from k). Using the graphical independence structure, we get: The first term in the integral is a Gaussian density, that we know from the propagation Equation (A.1) along the tree. To obtain the second term, we note that the observations not descending from node k are exactly the observations not descending from parent node p, plus the observations that do descend from sibling nodes S(k): Hence, using the graphical independence structure: We know the first term of the product from the recursion. The second term is similar to the quantities we have to deal with in the post-order traversal. We get that it is proportional to a Gaussian density, with precision P −k and mean m −k that are such that: where P k is computed during the post-order, and defined in Equations (A.7) and (A.8). Applying standard Gaussian combination rules, we hence get: with: . Finally, from the integral of Equation (A.16), we get: Note, as a sanity check, that we indeed recover the formulas of Fisher et al. (2019) in the case of a BM with no drift. Note also that, as c k is assumed to be positive definite for all node k, Q k is also positive definite, and the regular inverse can be used in the formula above.
REMARK 3. In the special case where the root is fixed (Γ = 0 pp ), for any children node of the root k (such that pa(k) = r), the propagation formulas (A.21) simplify to: Conditional Moments M k and V k . As in Fisher et al. (2019), we compute the full conditional moments of (A.23) p X k Y, θ =φ X k ; M k , V − k as a combination of the moments of p Y k X k , θ and p X k Y k , θ computed in the previous two sections in the post and pre-order traversals of the tree (Equations A.4 -A.6 and A.21). We distinguish between three cases, depending on the possible missing values.
Latent or Unobserved trait. We assume here that X k is either a latent trait (i.e. a trait for a tree node, N ≤ k ≤ N + m) or a measurement that is completely missing. Using Bayes rule, we write: so that: Completely Observed Trait. We assume here that X k = Y k is a completely observed measurement (1 ≤ k ≤ N ). Then p X k Y, θ = p Y k Y k , θ , so that: Partially Observed Trait. We assume here that X k = Y k is a partially observed measurement, such that where Y k o and Y k m are the vectors of observed and missing data at measurement k, with dimension p k o and p k m (p k o + p k m = p); and Π k,o and Π k,m are permutations of dimensions p × p k o and p × p k m that trace the observed and missing indices to their right places. Using Gaussian conditioning, we get:

APPENDIX B: GRADIENTS AND CHAIN RULES FORMULAS
In this appendix, we show how derivation chain rules can be used in combination with Proposition 2 to compute the gradient of the likelihood with respect to any of the natural parameters of a BM or an OU, as defined in Section 1.2.
B.1. Gradients with respect to Generic Model Parameters. In this section, we use the general model of Definition 1, with generic propagation Equation (A.1), and exploit the pre-order formulas (A.21) to get the derivative of the pre-order moments Q −1 k and n k that appear in Equation (2.7) with respect to the generic propagation parameters a k , b k and c k , for any index k, 1 ≤ k ≤ N + m. In the rest of this appendix, M, S and X are, respectively, an arbitrary test matrix, symmetric test matrix and test vector of adequate dimensions, on which the derivatives are applied.
B.1.2. Missing Data. In Equation (B.3), we take the derivative with respect to the symmetrically vectorized version of the variance matrix c k in order to account for only actual and unique parameters. However, when there is missing data, the variance terms associated to the missing dimensions are not relevant anymore, and must be excluded from the parameters. To account for this, we introduce the "symmetric with missing values" vectorization operator vechm k that, for any node k, maps the matrix c k to a vector with only observed dimensions: where, if p k o is the number of dimensions that are observed in at least one of the descendants of k (i.e. the dimensions d such that [P k ] dd = 0), D h,k is the [p k o (p k o + 1)/2] × [p(p + 1)/2] matrix that maps the symmetric matrix to its components matching with the observed dimension. Using this notation, Equation (B.3) becomes: where the derivative is taken according to the actual parameters only. B.1.3. Derivative with respect to the Inverse. Note that, for convenience, we show the formulas for the derivative of Q −1 k , but going back to Q k is straightforward, using the chain rule, and the following derivation formula (Magnus and Neudecker, 1986): B.2. Gradients with respect to Natural Parameters. The last step in the chained derivative is to link the generic parameters a k , b k and c k to the natural parameters of the process in use. Note that, until this step, the equations as well as the implementation are very general, and valid for any model that can be cast in the framework of Definition 1. In this section, we show how to link these generic formulas to actual models, that can be used in a phylogenetic analysis. One of the strengths of this framework is however to be quite easily extendable: for any new model of interest, one only has to express the generic parameters a k , b k and c k and their derivatives in terms of the natural parameters of the model in order to use the general machinery described here.
B.2.1. Brownian Motion with Drift. Using Equations (2.4), for any node j that is not an observation nor the root, we get the derivative of q j , d j and Σ j with respect to the variance and drift parameters R and ν of the BM: Ornstein-Uhlenbeck. Using Equations (2.5), for any node j that is not an observation nor the root, we get the derivative of q j , d j and Σ j with respect to the optimal value and variance parameters β and R of the OU: where we use the eigen-decomposition of the attenuation matrix A = PΛP −1 , and: Note that, when any λ l goes to zero, F j (Λ) ll goes to j , and the derivative in Formulas (B.10) on this dimension converges to BM Formulas (B.8).
To get the derivative with respect to the selection strength A, we make the extra assumption that it is diagonal, so that the actualisation terms q j are also diagonal, and we get: , with: Note that, when any λ l goes to zero, G j (Λ) ll goes to 2 j .
B.2.3. Simple Error Model. For a simple error model, with no dimension jump and the same error variance matrix for all the observations, we get: (B.14) S i = S, and the derivatives are straightforward to obtain.

APPENDIX C: CONSTRAINED PARAMETERS
C.1. The LKJ Transformation by Sampling Spheres. The LKJ (Lewandowski, Kurowicka and Joe, 2009) transformation is a popular tool to handle variance matrices in a Bayesian analysis. It relies on the decomposition of the variance matrix R = D σ CD σ as the product of the diagonal matrix of standard deviations D σ and the correlation matrix C. This allows for sampling both parameters independently, avoiding scaling effects that can occur when sampling the variance matrix directly, using for instance a Wishart distribution (Barnard, McCulloch and Meng, 2000). The LKJ transformation and distribution act on the constrained space of correlation matrices C p , defined as the space of squared matrices of dimension p, that are symmetric positive definite with diagonal values equal to one.
C.1.1. The LKJ Transformation. We show here that sampling in the space of correlation amounts to sampling vectors in the half euclidean positive sphere.
Cholesky Representation. As in Stan (2017), we use the Cholesky decomposition of the correlation matrix C = W T W, with W upper triangular. To ensure identifiability, we further assume that all the diagonal coefficients of W are positive. Instead of sampling the space C p of correlation matrices, we subsequently sample the space C ch p of Cholesky matrices, defined as the space of real upper triangular matrices with positive diagonal values such that W T W ∈ C p . This transformation has two advantages. First, as we will see below, this space has actually a relatively simple structure. Second, having the Cholesky decomposition of the correlation matrix is useful for all subsequent operations involving this matrix, such as taking the inverse. In practice, in the implementation we will never compute the actual matrix C, inducing better numerical performances.
Structure of the Cholesky Space. The following proposition shows that sampling the Cholesky space C ch p amounts to sampling p vectors of dimensions 1 to p in the half-euclidean sphere.
PROPOSITION 3. Let C ch p be the space of Cholesky matrices of correlation matrices, and, for any k, 1 ≤ k ≤ p, S pos k the half Euclidean sphere defined by S pos k = {x ∈ R k | x k > 0 and k i=1 x 2 i = 1}. Then C ch p is diffeomorphic to the Cartesian product of these half spheres: PROOF. Let W ∈ C ch p , and 1 ≤ k ≤ p a dimension. As W is upper-triangular, only the first k coefficients of its k th column vector are non zero. Denote by W k ∈ R k the vector with these non-zero coefficients. Then, from the definition of C ch p , we get that W kk > 0, and W T k W k = 1, i.e. that W k is in the half sphere S pos k . The function that to W associates the vectors (W 1 , . . . , W p ) is then a diffeomorphism between C ch p and × p k=1 S pos k .
Sampling the Half Euclidean Sphere. Sampling the half Euclidean sphere S pos k is then relatively standard. We start by mapping it to the underlying Euclidean ball B k−1 , then go to the infinite norm ball B ∞ k−1 , before reaching the unconstrained space R k−1 . As, for any x ∈ S pos k , x k = 1 − k i=1 x 2 i , the first step of going from S pos k to B k−1 is straightforward. The last step of mapping B ∞ k−1 to R k−1 is also standard, using a "Fisher Z" transformation, or area hyperbolic tangent. To cover the missing step, we use the following transformation L : B ∞ k−1 → B k−1 , adapted from Lewandowski, Kurowicka and Joe (2009), and defined for any x ∈ B ∞ k−1 and 1 ≤ i ≤ k by: with the convention that the product over the empty set is equal to one. Note that we use these three transformations to be consistent with the classical LKJ transformation (see below), but one could craft other ways to map the half sphere to the unconstrained space. For instance, the standard transformation that to a vector x ∈ B k−1 associates the vector y = (1 − x ) −1/2 · x in R k−1 could also be used instead.
The LKJ Transformation. The LKJ transformation, as defined in Lewandowski, Kurowicka and Joe (2009) and detailed e.g. in Stan (2017), is then just equivalent to the joint transformation of all the column-vectors W k of the Cholesky transformation W (1 ≤ k ≤ p) from the Euclidean half sphere S pos k to the unconstrained space R k−1 .
C.1.2. The LKJ Distribution. The LKJ distribution with parameter η is defined in Lewandowski, Kurowicka and Joe (2009) as a distribution over the space of correlation matrices C p , with density proportional to their determinant: with c p (η) defined below in Equation (C.8). Note that when η = 1, it represents the uniform distribution over correlation matrices. Without any expert information, that is the default uninformative prior we use in our analyses. It can be rewritten as an equivalent distribution over the Cholesky space C ch p (Stan, 2017): The Spherical Beta Distribution. Lewandowski, Kurowicka and Joe (2009) define the following elliptically contoured distribution over the Euclidean ball (see Lemma 7 in the aforementioned paper): DEFINITION 2 (Spherical Beta Density). For any positive integer k and positive real β > 0, the spherical beta distribution is defined by the following density, for any x ∈ S pos k : with: As in the previous section and using Proposition 3, the LKJ distribution can be recovered by jointly applying this spherical beta distribution to all the column vectors of the Cholesky matrix with adequate parameters: It is easy to check that the induced constants are indeed the same: where the last expression matches with the definition of the LKJ distribution (see Lewandowski, Kurowicka and Joe, 2009, Section 3.3).
C.2. Other Parameters of the Models. We detail below the transformations and priors used on the other parameters of the model. C.2.1. Variance Parameters. In addition to the correlation matrix, from the decomposition above we also need to sample the diagonal variance term. These are just constrained to be positive, so we use a standard log transformation on them, and a vague half-Student prior (with default degree of freedom 1 and scale 2.5, for a normalized tree).
C.2.2. Mean and Optimal Values Vectors. The mean and optimal values vectors are unconstrained in the general case, so no transformation is needed, and we use a default normal prior on them (with default expectation 0 and standard deviation 5).

C.2.3. Diagonal Selection Strength.
When the selection strength A is diagonal, all its diagonal terms (α 1 , . . . , α p ) are constrained to be positive. We hence use a standard log transformation on them. For the prior, we use a vague half-normal, with standard deviation set so that, under the prior distribution on a normalized tree of unit height, the phylogenetic half-life t 1/2 = log(2)/α k (Hansen, 1997) is larger than 5% of the tree height 95% of the time (i.e. sd = log(2)/(5/100)/q half-normal 95 ≈ 7.07).

APPENDIX D: THE HERITABILITY STATISTICS
In Section 4.2, we introduced a new heritability statistic based on the population variance, instead of the empirical variance as done in previously published studies (Hassler et al., 2019). We argue here that, when the evolutionary process is not a vanilla BM process (as the one they consider in Hassler et al. 2019), using the population variance is more appropriate, and can avoid some bias in the analysis. ) and population (this paper) approach (right) computed on for a BM model on a simple tree, with a BM variance of σ 2 = 0.01, an observation variance of s 2 = 1.0, and a shift δ in the process value affecting the two lower tips (left), so that the expectation of the tips in black is 0, and the expectation of the tips in grey is δ. As the observation error is much higher than the process variance, the heritability is low, as found by both method when δ = 0. When δ increases, the "inter-group" variance between grey and black tips blurs the empirical variance, leading to an inflated empirical heritability, that converges to 1. The population heritability is robust to this model change, as it uses the process parameters directly.
The main point of the argument is that, under any process that is not a vanilla BM (e.g. a BM with shifts or drift, or an OU), all the tips do not have the same expected trait values, and hence the empirical mean is not a good estimate of the population mean.
To see this, we study the simple example of a BM on a four taxon tree with one shift δ affecting half of the species (see Figure S1, left). The tree is taken to be ultrametric, with total height h = 1.0, and all branches of length 0.5. To simplify the analysis, we consider the case of a univariate process, with variance σ 2 = 0.01, ancestral mean µ = 0, and a uniform observation process with variance s 2 = 1.0. Note that, as the observation variance is much larger than the process variance, we expect the heritability to be low. In that case, using the definition and notations found in Hassler et al. (2019), the "empirical" heritability is given by: where the extra term δ 2 /4 comes from the difference of expectations at the tips (see Hassler et al., 2019, Formula (5) in Supplementary Section 2). From this formula, we can see that the empirical heritability will converge to 1.0 when δ becomes big enough, whatever the values of σ 2 and s 2 (see Figure S1, right). In contrast, the "population" heritability defined here (see Equation 4.3) reduces, in this simple case, to: which does not depend on the value of the shift δ.
Beyond this toy example, where the effects of the inter-group and intra-group variances is clearly marked, we would like to point out that the empirical variance is going to be biased as soon as all the tips are not in the same "group", i.e. as soon as they do not have the same expected values under the model. This happens in most of the models, such as the BM with drift (Gill et al., 2016), or the OU on a non-ultrametric tree (Clavel, Escarguel and Merceron, 2015). Using our population heritability might be a first step towards addressing this issue.  , on MCC trees obtained with the models used during the complete evidence approach (line type). Fitted models are either the BM, the "trend" model (i.e. a BM with drift) or the OU. The three traits are either all independent, all correlated, or R2 and R3 correlated, but R1 independent (colors of the points). The model where R1 follows a BM and is independent from R2 and R3 following an OU, with no error, seems to be favored in every setting. Including measurement error do not seem to improve the fit in this setting (data not shown). The "trend" model, that is favored for R1 when fossils are included in Schnitzler et al. (2017) comes second in this analysis, with a log marginal likelihood difference of 1.97, i.e. a log Bayes factor of 1.97, which, according to the guidelines found in Kass and Raftery (1995), can be considered as "substantial evidence" against the trend model.  Comparison between the best model fit on a tree with or without fossils. The model is a BM on R1, and an OU on R2 and R3, with correlations only between R2 and R3. Black points are the median, and red bars the 95% HPD. Including the fossils do not seem to have a dramatic effect on the estimates.

E.2. Virulence Heritability in Human Immunodeficiency Viruses (HIV).
Ancestral Credibility HPD interval overlap on both models for all parameters. Using the OU-BM model instead of the OU leads to a slightly higher estimate for the heritability of GSVL, and a lower estimate for the heritability of CD4. Consistently, the sampling variance of CD4 under the OU-BM model is estimated to be higher. This behavior is expected, as the OU allows for a higher disparity between trait measurements at the tips, reducing the need for the residual sampling variance. Black points are the median, and red bars the 95% HPD. For the half-life, outliers were omitted from the plot (representing, respectively, 0.75% and 0.61% of the sampled data points. Pagel's λ Analysis. Pagel's λ (Pagel, 1999) has been proposed as an estimator of phylogenetic heritability (Alizon et al., 2010;Vrancken et al., 2015). Given a BM model of evolution on a tree (with no observation model layer), the λ model is obtained by scaling all the internal branch lengths by a factor λ, and adjusting the length of all external branches so that all the tips remain at the same height (i.e. j (λ) = λ j for all internal node j, and i (λ) = i + (1 − λ)t pa(i) = λ i + (1 − λ)t i ) for all tips i). For a univariate trait, this model has been shown to be equivalent to a BM with a scaled error observation model, with the λ parameter equal to the heritability (Leventhal and Bonhoeffer, 2016). For a multivariate model, the same λ parameter applies to all the traits through the rescaling of the underlying tree. The model is then equivalent to a BM with an observation error that is equal to the variance parameter of the process, up to the tree re-scaling, and a unique scaling parameter µ = (1 + λ) −1 . This model enforces that all the traits share the same heritability λ.
Pagel's λ Results. The independent Pagel's λ model is associated with the lowest support, with the second weakest model (multivariate Brownian motion with independent, non scaled errors) having a relative log Bayes factor of 3. In contrast, the multivariate Pagel's λ model yields the strongest support, with a log Bayes factor of 1 compared to the second best fitted model, that has 3 additional parameters. Under this model, the common heritability is estimated to be 0.13 (HPD 95%: [0.05, 0.22]).
Note on the Phylogenetic Correlation Parameter. Figure S5 (last panel) shows that the estimated phylogenetic correlation variable between the two traits is very high in absolute value, with, for the OU-BM model, a median of −0.92 (95% HPD [−1, −0.76]). This result might look surprising at first sight, and in opposition with the empirical (non-phylogenetic) correlation of −0.23 observed between the variables. This is however a mechanical result of the model, as the tip variance is the sum of the phylogenetic variance and the observation variance (see Definition 1). When the later forced to be diagonal, any observed correlation must be phylogenetic. But, when the heritability is low, the observation variance has a stronger relative impact on the total variance than the phylogenetic variance, so that the phylogenetic correlation must be high in order to have a small impact on the total observed variance. This effect partly goes away when the observation error is allowed to be correlated, as the phylogenetic variance is then estimated to −0.71 (95% HPD [−1, −0.22]) for the same OU-BM model. In both cases however, the mean total correlation at the tips given by the model is stable, of, respectively, −0.27 and −0.28 for the independent and correlated observation models, which is consistent with the empirical correlation. Boxplot of the normalized estimates of the parameters of the process when a BM, an OU-BM (true model) of an OU is used for the inference and various levels of noise. The normalized estimates should converge to 0 (vertical line). Aside from some outliers, the ancestral state (optimal value) is well estimated, for any noise level and irrespective of the model used for the inference. The phylogenetic half-life tend to be under estimated when the noise becomes higher. That is consistent with the fact that a higher noise means less phylogenetically related measures, which can be obtained by higher selection strengths. For the CD4, the true model is a BM, so that the true half-life is formally equal to infinity. We can see that when an OU is inferred on this dimension, it can get a very short time-scale for high level of noise, indicating falsely a strong strength of selection. When the noise get higher, its parameter, the sampling variance, get better estimated. The heritability tends to be over-estimated with high levels of noise.