Multivariate Singular Spectrum Analysis by Robust Diagonalwise Low-Rank Approximation

Multivariate Singular Spectrum Analysis (MSSA) is a powerful and widely used nonparametric method for multivariate time series, which allows the analysis of complex temporal data from diverse fields such as finance, healthcare, ecology, and engineering. However, MSSA lacks robustness against outliers because it relies on the singular value decomposition, which is very sensitive to the presence of anomalous values. MSSA can then give biased results and lead to erroneous conclusions. In this paper a new MSSA method is proposed, named RObust Diagonalwise Estimation of SSA (RODESSA), which is robust against the presence of cellwise and casewise outliers. In particular, the decomposition step of MSSA is replaced by a new robust low-rank approximation of the trajectory matrix that takes its special structure into account. A fast algorithm is constructed, and it is proved that each iteration step decreases the objective function. In order to visualize different types of outliers, a new graphical display is introduced, called an enhanced time series plot. An extensive Monte Carlo simulation study is performed to compare RODESSA with competing approaches in the literature. A real data example about temperature analysis in passenger railway vehicles demonstrates the practical utility of the proposed approach.


Introduction
Time series analysis plays a crucial role in understanding and predicting the behavior of sequential data across a wide range of disciplines. It has proven to be an invaluable tool in fields such as finance, healthcare, ecology, engineering, and more. Singular spectrum analysis (SSA) has emerged as a powerful nonparametric tool for extracting valuable insights from time-dependent data. A comprehensive overview of SSA can be found in the books Golyandina et al. (2001), Golyandina and Zhigljavsky (2013), and Golyandina et al. (2018). Numerous examples showcasing the success of SSA can be found in the literature. Multivariate singular spectrum analysis, referred to as MSSA (Broomhead and King, 1986), enables the simultaneous analysis and interpretation of multiple time series by exploiting the dependencies between variables.
A crucial step of SSA is the low-rank approximation of the so-called trajectory matrix, which will be described in the next section. The most often used tool for this is the singular value decomposition (SVD), which is however sensitive to the presence of outliers in the data. Several authors have proposed outlier-robust SSA methods by replacing the SVD by more robust versions. However, none of these low-rank approximations took the special diagonal structure of the trajectory matrix into account. The main contribution of our work is a new robust low-rank approximation method tailored to this situation.
The paper is organized as follows. Section 2 briefly surveys existing work and introduces the new approach, called RObust Diagonalwise Estimation of SSA (RODESSA), including its algorithm, implementation, and forecasting method. It is proved that each step of the algorithm reduces the objective function. Section 3 proposes an enhanced time series plot in which two types of outliers are represented by colors, in order to assist with outlier detection. In Section 4, the performance of RODESSA is assessed by an extensive Monte Carlo simulation study. Section 5 presents a real data example regarding temperature analysis in passenger railway vehicles. Section 6 concludes the paper.
The entire matrix X is thus a stacked Hankel matrix. The notations T_SSA and T_MSSA stand for the univariate and multivariate embedding operators that map the series x^(j) and x to the corresponding trajectory matrices.
2. Decomposition. This step performs the SVD of the trajectory matrix X, yielding
$$X = \sum_{r=1}^{d} \beta_r\, u_r v_r^T = X_1 + \cdots + X_d, \qquad (2)$$
where d = rank(X), and u_1, ..., u_d and v_1, ..., v_d are the left and right singular vectors. The ordered singular values are β_1 ⩾ ... ⩾ β_d > 0, and the matrices X_r = β_r u_r v_r^T all have rank 1. The triple (β_r, u_r, v_r) is called the r-th eigentriple of the matrix X.
3. Grouping. The grouping step corresponds to splitting the terms of (2) into several disjoint groups and summing the matrices within each group. In this paper we focus on the case where two main groups are created, and we write
$$X = \widehat{X}_q + R, \qquad (3)$$
where the estimate of the signal X̂_q is a sum like (2) but only over the first q eigentriples, whereas the residual matrix R = X − X̂_q is associated with the noise. We can see X̂_q as a low-rank approximation of the trajectory matrix.
4. Reconstruction. In this step, the fitted matrix X̂_q is transformed back to the form of the input object X in (1). In each submatrix X̂^(j)_q we compute average entries as follows. Denote the anti-diagonal associated with time point i of series j as
$$A^{(j)}_i = \{(\ell, k) : \ell + k - 1 = i\}, \qquad (4)$$
which contains n_i cells. The reconstructed value x̂^(j)_i is the average of the entries of X̂^(j)_q over A^(j)_i, and the reconstructed multivariate time series is then given by X̂ = [X̂^(1), ..., X̂^(p)].
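The four steps above can be sketched for a single series as follows. This is a minimal nonrobust illustration; the function names are ours, not the paper's.

```python
# Minimal univariate SSA: embedding, SVD decomposition, grouping of the
# first q eigentriples, and reconstruction by diagonal averaging.
import numpy as np

def embed(x, L):
    """Build the L x K Hankel trajectory matrix of a series of length N."""
    N = len(x)
    K = N - L + 1
    return np.column_stack([x[k:k + L] for k in range(K)])

def diagonal_average(X):
    """Map an L x K matrix back to a length-(L+K-1) series by averaging
    each anti-diagonal (the reconstruction step)."""
    L, K = X.shape
    out = np.zeros(L + K - 1)
    counts = np.zeros(L + K - 1)
    for l in range(L):
        for k in range(K):
            out[l + k] += X[l, k]
            counts[l + k] += 1
    return out / counts

def ssa_reconstruct(x, L, q):
    """Rank-q SSA approximation of the series x."""
    X = embed(x, L)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xq = (U[:, :q] * s[:q]) @ Vt[:q]      # grouping: first q eigentriples
    return diagonal_average(Xq)

i = np.arange(100)
x = np.cos(2 * np.pi * i / 10)            # a pure harmonic has rank 2
xhat = ssa_reconstruct(x, L=50, q=2)
print(np.max(np.abs(x - xhat)))           # essentially zero for clean data
```

A noise-free harmonic yields a rank-2 trajectory matrix, so the rank-2 reconstruction recovers it up to floating-point error.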

Outliers in multivariate time series
Multivariate time series may contain outliers, that is, observations that deviate from the expected patterns or trends. They can be caused by a variety of factors such as measurement errors, data entry mistakes, sensor malfunctions, or rare and unexpected events. Outliers can bias statistical measures, affect parameter estimation, and lead to inaccurate forecasting, resulting in erroneous conclusions and flawed decision-making. As in the multivariate setting (Alqallaf et al., 2009; Raymaekers and Rousseeuw, 2023), we distinguish between a cellwise outlier, which is an outlying value x^(j)_i of a single series at a single time point, and a casewise outlier, where the entire case x_i = (x^(1)_i, ..., x^(p)_i)^T at time point i is outlying. Figure 1 illustrates both types and how they propagate along anti-diagonals of the trajectory matrix.

Existing robust SSA methods
Classical MSSA, and its univariate version SSA, are sensitive to outliers because they are based on the SVD, which is highly susceptible to outlying values. In order to remedy this, there has been research into the construction of outlier-robust SSA methods. In general, the purpose of robust methods is to limit the effect of outliers on the results, after which the outliers can be detected by their residuals from the robust fit, see e.g. Rousseeuw and Leroy (1987) and Maronna et al. (2019).
When constructing a more robust version of SSA one needs to replace the classical SVD by something less sensitive to outliers. One approach is to start from an outlier-robust principal component analysis (PCA) method. A PCA fit of rank q is of the type
$$\widehat{X} = 1_n \widehat{\mu}^T + U V^T, \qquad (5)$$
where μ̂ is the estimated center, and the matrix of scores U as well as the loadings matrix V have q columns. Many PCA methods have an option to set μ̂ = 0, and then (5) yields an approximation of rank q to X as in (3). Afterward one can carry out the reconstruction step of SSA. De Klerk (2015) applied the robust PCA method ROBPCA (Hubert et al., 2005) to the trajectory matrix, and then centered the original data X as well as X̂ by subtracting 1_n μ̂^T. In this way he obtained a robust centered SSA method.
A limitation of this approach is that ROBPCA and most other robust PCA methods are built to withstand outlying rows of the data matrix, but not outlying cells. But we have seen in the bottom part of Figure 1 that a single outlying value in a time series can affect many cells of the trajectory matrix X, especially when the window length L is high. Therefore a relatively small number of outliers in the time series can affect over half the rows of X, which ROBPCA might not withstand. In such situations a cellwise robust PCA method like MacroPCA (Hubert et al., 2019) could be used instead.
In the nonrobust setting, expression (3) can equivalently be seen as a problem of low-rank approximation of the trajectory matrix X by the matrix X̂_{L2} which minimizes
$$\|X - S\|_F^2 = \sum_{\ell=1}^{L} \sum_{k=1}^{K} R_{\ell k}^2 \qquad (6)$$
over all L × K matrices S of rank q. Writing the unknown S as a product S = U V^T, where U is L × q and V is K × q, the residuals are R_{ℓk} := X_{ℓk} − Σ_{r=1}^q u_{ℓr} v_{kr}. We then put X̂_{L2} := Û_{L2} V̂_{L2}^T. The solution of the optimization (6) is easily obtained from the SVD X = U D V^T, where D is the diagonal matrix of singular values. Restricting this to the q leading singular values β_1 ⩾ ... ⩾ β_q > 0 we obtain X̂ = U_q D_q V_q^T. We can then absorb the singular values into the factors by putting Û_{L2} = U_q D_q and V̂_{L2} = V_q. But the quadratic loss function in (6) makes this a least squares fit, which is very sensitive to outliers.
To remedy this, De la Torre and Black (2003) proposed a more robust low-rank approximation by replacing (6) by the minimization of the loss function
$$\sum_{\ell=1}^{L} \sum_{k=1}^{K} \rho\!\left(\frac{R_{\ell k}}{\sigma}\right), \qquad (7)$$
where σ is a fixed scale estimate. The function ρ must be continuous, even, non-decreasing for positive arguments, and satisfy ρ(0) = 0. De la Torre and Black (2003) used the Geman-McClure function ρ(t) = t²/(t² + 1), which goes to 1 as t → ∞. Therefore an outlying t has much less effect on ρ(t) than with the ρ(t) = t² in (6). They constructed an iterative algorithm for Û V̂^T. For σ they took the median absolute deviation of the residuals from an initial estimate Û_0 V̂_0^T. Note that here and in the sequel the estimation target is not the pair (U, V), because that is not uniquely defined. Indeed, if we take a nonsingular q × q matrix A we see that (U A, V (A^{−1})^T) yields the same product U A A^{−1} V^T = U V^T as (U, V), and hence the same objective (7). The actual estimation target is the product U V^T. Chen and Sacchi (2015) applied this general low-rank approximation method to the decomposition step of SSA. They replaced the function ρ by Tukey's biweight function with tuning constant c = 4.685, and used the method to filter seismic noise.
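The boundedness of these losses can be illustrated as follows. The Geman-McClure formula is the standard textbook form; scaling Tukey's biweight to the range [0, 1] is our assumption about the paper's normalization.

```python
# Two bounded loss functions: the Geman-McClure rho used by De la Torre
# and Black, and Tukey's biweight scaled to [0, 1].
import numpy as np

def rho_geman_mcclure(t):
    """rho(t) = t^2 / (t^2 + 1): increases to 1 as |t| -> infinity."""
    return t**2 / (t**2 + 1.0)

def rho_biweight(t, c=4.685):
    """Tukey's biweight, scaled so it rises from 0 to a plateau of 1."""
    t = np.minimum(np.abs(t), c)
    return 1.0 - (1.0 - (t / c) ** 2) ** 3

# Both losses are bounded, so a gross outlier contributes at most 1,
# unlike the quadratic loss t**2, which grows without bound.
print(rho_geman_mcclure(100.0), rho_biweight(100.0))
```

This boundedness is exactly why a far outlying residual has a limited effect on the fit, whereas under the quadratic loss it would dominate the objective.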
Tukey's biweight function with tuning constant c > 0 is given by
$$\rho_c(t) = \begin{cases} 1 - \left(1 - (t/c)^2\right)^3 & \text{if } |t| \leqslant c \\ 1 & \text{if } |t| > c, \end{cases} \qquad (8)$$
and another popular choice is Huber's function
$$\rho_b(t) = \begin{cases} t^2 & \text{if } |t| \leqslant b \\ 2b|t| - b^2 & \text{if } |t| > b, \end{cases} \qquad (9)$$
with b = 1.345. Cheng et al. (2015) performed a robust SSA analysis by applying a different robust low-rank approximation method due to Candès et al. (2011).

The RODESSA objective function
The existing robust low-rank approximation methods described above are less sensitive to outliers than the classical SVD. But none of them are tailored to the (possibly stacked) Hankel structure of the trajectory matrix X. As we saw in the bottom part of Figure 1, an outlier in one of the time series corresponds to an entire diagonal of the corresponding trajectory matrix. In the least squares low-rank approximation setting, algorithms have been developed that incorporate the Hankel structure of X (Markovsky, 2008), but no such approach exists in the robust setting yet.
To fill this gap we propose a new robust method for MSSA, named RObust Diagonalwise Estimation of SSA (RODESSA), which explicitly takes into account the way outliers and their residuals occur in the stacked Hankel structure of the trajectory matrix. It is meaningful to talk about anomalous diagonals rather than anomalous rows of the trajectory matrix X. Therefore we propose to approximate the trajectory matrix X by X̂ = U V^T obtained by minimizing the objective
$$L_{\rho_1,\rho_2}(U, V) = \sum_{i=1}^{N} p\, n_i\, \widehat{\sigma}_2^2\; \rho_2\!\left(\frac{r_i}{p\, n_i\, \widehat{\sigma}_2^2}\right). \qquad (10)$$
Here p is again the number of univariate time series and n_i is the length of the diagonal corresponding to x^(j)_i as in (4). The predicted cell x̂^(j)_{ia} is the entry of U V^T in position a of that diagonal. The fixed scales σ̂²_{1,j} standardize the squared norms of the diagonal residuals of series j, given by
$$r^{(j)}_i = \sum_{a=1}^{n_i} \left(x^{(j)}_i - \widehat{x}^{(j)}_{ia}\right)^2. \qquad (11)$$
The overall scale σ̂²_2 standardizes the quantities
$$r_i = \sum_{j=1}^{p} n_i\, \widehat{\sigma}^2_{1,j}\; \rho_1\!\left(\frac{r^{(j)}_i}{n_i\, \widehat{\sigma}^2_{1,j}}\right) \qquad (12)$$
based on all p coordinates. The functions ρ_1 and ρ_2 are defined for nonnegative arguments and must be continuous and non-decreasing. In our implementation both are of the form ρ(t) = ρ_c(√t), where ρ_c is Tukey's biweight (8). [For ρ_1(t) = ρ_2(t) = |t|, (10) would reduce to (6).] The goal of the normalization by n_i σ̂²_{1,j} inside ρ_1 in (10) is to give the r^(j)_i similar average sizes. Indeed, if the x^(j)_i − x̂^(j)_{ia} were independent normal random variables with variance σ²_{1,j}, then r^(j)_i would follow a Gamma distribution with parameters n_i/2 and 2σ²_{1,j}, and thus have mean equal to n_i σ²_{1,j}. Therefore the average of the r^(j)_i / (n_i σ̂²_{1,j}) would be close to 1. A similar argument applies to the normalization by p n_i σ̂²_2 inside ρ_2. The functions ρ_1 and ρ_2 reduce the effect of cellwise and casewise outliers on the final estimates, because large values of r^(j)_i and/or r_i contribute less to the objective function. We can say that r^(j)_i measures how prone the i-th value x^(j)_i of time series j is to be a cellwise outlier. Analogously, r_i reflects how prone the multivariate case x_i = (x^(1)_i, ..., x^(p)_i)^T is to be a casewise outlier. Note that in the computation of r_i the effect of cellwise outliers is tempered by the presence of ρ_1, to avoid that a single cellwise outlier would always result in a large casewise r_i.

The IRLS algorithm
We now address the optimization problem (10). The scale estimates σ̂_{1,j} and σ̂_2 are constants, whose computation will be described in Section 2.6. Because L_{ρ1,ρ2} is continuously differentiable, its solution must satisfy the first-order necessary conditions for optimality. They are obtained by setting the gradients of L_{ρ1,ρ2} with respect to u_1, ..., u_L and v_1, ..., v_K to zero, yielding
$$V^T W_\ell\, (X_\ell - V u_\ell) = 0 \;\; (\ell = 1, \dots, L), \qquad U^T W^k\, (X^k - U v_k) = 0 \;\; (k = 1, \dots, K), \qquad (13)$$
where X_1, ..., X_L and X^1, ..., X^K are the rows and columns of X. Here W_ℓ is a K × K diagonal matrix, whose diagonal entries are equal to the ℓth row of the L × K weight matrix
$$W = W_c \odot W_r, \qquad (14)$$
where the Hadamard product ⊙ multiplies matrices entry by entry. Analogously, W^k is an L × L diagonal matrix, whose diagonal entries are the kth column of the matrix W. The matrix W_c in (14) is given by W_c = [W_c^(1) : ... : W_c^(p)], with each W_c^(j) = T_SSA(w^(j)_{c,1}, ..., w^(j)_{c,N}) containing the cellwise weights
$$w^{(j)}_{c,i} = \rho_1'\!\left(\frac{r^{(j)}_i}{n_i\, \widehat{\sigma}^2_{1,j}}\right), \qquad (15)$$
in which ρ_1′ is the derivative of ρ_1. The matrix W_r is given by W_r = [W_r^(1) : ... : W_r^(p)], with each W_r^(j) = T_SSA(w_{r,1}, ..., w_{r,N}) containing the same casewise weights
$$w_{r,i} = \rho_2'\!\left(\frac{r_i}{p\, n_i\, \widehat{\sigma}^2_2}\right). \qquad (16)$$
Outlying cells x^(j)_i should get a small cellwise weight w^(j)_{c,i} and outlying cases x_i should get a small casewise weight w_{r,i}.
The system (13) is nonlinear because the weight matrices depend on the estimate, and the estimate depends on the weight matrices. In such a situation one can resort to an iteratively reweighted least squares (IRLS) algorithm. Note that for a fixed weight matrix W, the system (13) coincides with the first-order necessary conditions of the weighted least squares problem of minimizing
$$L_W(U, V) = \sum_{\ell=1}^{L} \sum_{k=1}^{K} W_{\ell k}\, \left(X_{\ell k} - \widehat{X}_{\ell k}\right)^2, \qquad (17)$$
where X̂_{ℓk} = Σ_{r=1}^q u_{ℓr} v_{kr}. The optimization of (17) can be performed by alternating least squares (Gabriel, 1978). This minimizes (17) with respect to u_1, ..., u_L while v_1, ..., v_K and the weights are fixed, and alternates this with minimizing (17) with respect to v_1, ..., v_K while u_1, ..., u_L and the weights are fixed.
The overall algorithm starts from initial estimates U_0, V_0 of U, V that will be described in Section 2.6. New matrices U_{t+1}, V_{t+1} are obtained from U_t, V_t by the weighted least squares updates
$$u_\ell^{t+1} = (V_t^T W_\ell^t V_t)^{-1} V_t^T W_\ell^t X_\ell \;\; (\ell = 1, \dots, L), \qquad v_k^{t+1} = (U_{t+1}^T W_t^k U_{t+1})^{-1} U_{t+1}^T W_t^k X^k \;\; (k = 1, \dots, K). \qquad (18)$$
Then the weight matrix is updated to W_{t+1} as in (14). The iterative process continues until convergence, as summarized in Algorithm 1.
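The alternation can be sketched as follows. This is a generic weighted alternating least squares step for a fixed weight matrix W, not the full RODESSA update with its diagonalwise weights; with unit weights it reduces to plain alternating least squares.

```python
# One IRLS-style alternation for weighted low-rank approximation: with
# the weights fixed, each row of U and each row of V solves a small
# weighted least squares problem.
import numpy as np

def irls_step(X, U, V, W):
    """Update U given V, then V given U, for a fixed L x K weight matrix W."""
    L, K = X.shape
    for l in range(L):
        D = np.diag(W[l, :])                     # weights of row l
        U[l] = np.linalg.solve(V.T @ D @ V, V.T @ D @ X[l, :])
    for k in range(K):
        D = np.diag(W[:, k])                     # weights of column k
        V[k] = np.linalg.solve(U.T @ D @ U, U.T @ D @ X[:, k])
    return U, V

rng = np.random.default_rng(0)
L, K, q = 20, 30, 2
X = rng.normal(size=(L, q)) @ rng.normal(size=(q, K))   # exact rank 2
U = rng.normal(size=(L, q))
V = rng.normal(size=(K, q))
W = np.ones((L, K))                 # unit weights: plain alternating LS
for _ in range(50):
    U, V = irls_step(X, U, V, W)
print(np.linalg.norm(X - U @ V.T))  # converges toward zero
```

In the actual algorithm the weight matrix would be recomputed between alternations from the current residuals, which is what makes the scheme iteratively *reweighted*.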
Algorithm 1 IRLS algorithm
1: Compute U_0, V_0, σ̂_{1,j} and σ̂_2 according to Section 2.6 ▶ Initialization
2: Set t = 0
3: Compute W_0 as in (14)
4: repeat
5: Update U_{t+1} by weighted least squares as in (18) ▶ U-step
6: Update V_{t+1} by weighted least squares as in (18) ▶ V-step
7: Compute W_{t+1} as in (14) ▶ Weight matrix update
8: Set t = t + 1
9: until convergence

Proposition 1. Each iteration of Algorithm 1 decreases the objective function, that is,
$$L_{\rho_1,\rho_2}(U_{t+1}, V_{t+1}) \leqslant L_{\rho_1,\rho_2}(U_t, V_t).$$
The proof is given in Section A.1 of the Supplementary Material. Since the objective function is decreasing and has a lower bound of zero, it must converge. Note that Proposition 1 is not restricted to the functions ρ_1 and ρ_2 used in RODESSA, which are of the type ρ(t) = ρ_c(√t) where ρ_c is Tukey's biweight (8). All that is needed is that the function ρ(t) is differentiable and concave.
For this purpose ρ_c could be replaced by Huber's ρ_b of (9), or the function ρ_{b,c} used in the wrapping transform (Raymaekers and Rousseeuw, 2021).

Matters of implementation
This section describes several implementation specifics: the initialization strategy to select U_0 and V_0 in the IRLS algorithm, the scale estimates σ̂_{1,j} and σ̂_2, and how to select the loss function tuning constants, the rank q, and the window length L.
Scale estimates. Given initial estimates U_0 and V_0 (see below), the scale estimates σ̂²_{1,j} and σ̂²_2 are computed as M-scales of the quantities r^(j)_i from (11) and the r_i of (12), all with respect to the fit U_0 V_0^T. A scale M-estimator of a univariate sample (z_1, ..., z_n) is the solution σ̂ of the equation
$$\frac{1}{n} \sum_{i=1}^{n} \rho\!\left(\frac{z_i}{\sigma}\right) = \delta, \qquad (19)$$
where 0 < δ < 1. In our implementation we chose ρ to be Tukey's biweight ρ_c of (8). We set δ = 0.5, which ensures a 50% breakdown value, and c = 1.548, which is the solution of E[ρ_c(z)] = 0.5 for z ∼ N(0, 1), to obtain consistency at the normal model.
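The scale M-estimator just described (Tukey biweight ρ, δ = 0.5, c = 1.548) can be computed with the standard fixed-point iteration σ²_new = σ² · mean(ρ(z/σ))/δ; the MAD-based starting value and the iteration count below are our choices, not the paper's.

```python
# M-estimator of scale with Tukey's biweight, 50% breakdown, and
# consistency at the normal model (c = 1.548).
import numpy as np

def rho_biweight(t, c):
    """Tukey's biweight rho, scaled so that it increases from 0 to 1."""
    t = np.minimum(np.abs(t), c)
    return 1.0 - (1.0 - (t / c) ** 2) ** 3

def m_scale(z, c=1.548, delta=0.5, n_iter=100):
    """Solve mean(rho_c(z / sigma)) = delta for sigma by fixed-point
    iteration: sigma^2 <- sigma^2 * mean(rho_c(z / sigma)) / delta."""
    sigma = np.median(np.abs(z)) / 0.6745     # MAD-based starting value
    for _ in range(n_iter):
        sigma = sigma * np.sqrt(np.mean(rho_biweight(z / sigma, c)) / delta)
    return sigma

rng = np.random.default_rng(1)
z = rng.normal(size=100_000)
s_clean = m_scale(z)                      # close to 1 by the choice of c
z_out = np.where(rng.random(z.shape) < 0.2, 10.0, z)
s_out = m_scale(z_out)                    # stays bounded despite ~20% outliers
print(s_clean, s_out)
```

For comparison, the classical standard deviation of the contaminated sample would be about √(0.8 + 0.2·10²) ≈ 4.6, whereas the M-scale remains close to the scale of the clean majority.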
Initialization. Since the objective function L_{ρ1,ρ2} of (10) is not convex when ρ_1 and ρ_2 are biweight functions, U_0 and V_0 should be carefully selected to prevent the IRLS algorithm from ending up in a nonrobust solution. We compute several candidate initial fits, and select the best one among them.
One candidate fit is given by the first q terms of the standard nonrobust SVD as in (6), yielding the rank-q matrix X̂_1. The second candidate X̂_2 is the fit obtained from (7) with ρ(t) = |t|, and the third candidate X̂_3 is the fit of Candès et al. (2011). For each of these candidate solutions X̂_k we apply the scale M-estimator (19) to all the values r^(j)_i given by (11). Then we select the X̂_k with the lowest M-scale.
Tuning constants. The functions ρ_1 and ρ_2 in (10) use the biweight formula (8), via ρ_1(t) = ρ_{c1}(√t) and ρ_2(t) = ρ_{c2}(√t). Now we need to choose the tuning constants c_1 and c_2. Following Aeberhard et al. (2021) we set these tuning constants using a measure of downweighting at the reference model. The idea is to select tuning constants such that the average of the weights w^(j)_{c,i} of (15) matches a target value δ_c and the average of the weights w_{r,i} of (16) matches a target value δ_r. For this computation the weights are standardized to range from 0 to 1. The target values determine how much the r^(j)_i and r_i are downweighted on average, and by default we set δ_c = δ_r = 0.9. The corresponding values of c_1 and c_2 are obtained by Monte Carlo simulation. This computation pretends that the data are clean, with i.i.d. errors following the standard normal distribution, and that the fitted U V^T equals the true value. We can then simulate the distribution of all the weights w^(j)_{c,i} and w_{r,i} for the given values of N, p, and L, and choose c_1 and c_2 accordingly.
Selecting the rank q. In classical MSSA, a popular way to select the rank q is to make a plot of the unexplained variance. This is the expression ||X − X̂_r||²_F as in (6), where X̂_r is the best approximation of X of rank r. The unexplained variance is monotone decreasing in r, and one wants to select a value q where the curve has an 'elbow'. For the RODESSA method we inspect the analogous curve of the objective function (10) at the rank-r fit X̂_r, that is, we plot this objective as a function of r.

Selecting the window length. The choice of the window length L in MSSA is a complex issue that has been addressed by Hassani and Mahmoudvand (2013) and Golyandina et al. (2018). In general, L should be selected to benefit either separability of the signal and the noise, or forecasting accuracy. Following Golyandina et al. (2018) we use L ≃ pN/(p + 1) for the analysis of a small number p of time series, and L ≃ N/2 otherwise.

Forecasting with RODESSA
The MSSA model assumes a linear recurrent relation (Golyandina et al., 2018). This implies that an observation can be predicted by a linear combination of the previous L − 1 observations. In classical MSSA the coefficients of this linear combination are derived from the unique SVD fit of rank q to the trajectory matrix X, see e.g. Danilov (1997). For RODESSA the rank-q fit X̂_q can similarly be decomposed by the exact SVD, yielding the L × q matrix Ũ = [ũ_1, ..., ũ_q] of left singular vectors and the K × q matrix Ṽ = [ṽ_1, ..., ṽ_q] of right singular vectors. Then the coefficient vector â = (â_{L−1}, ..., â_1)^T is obtained as
$$\widehat{a} = \frac{\sum_{r=1}^{q} \tilde{u}_{r,L}\, (\tilde{u}_{r,1}, \dots, \tilde{u}_{r,L-1})^T}{1 - \sum_{r=1}^{q} \tilde{u}_{r,L}^2},$$
where ũ_{r,1}, ..., ũ_{r,L} are the L components of ũ_r. Let (x̂^(j)_i)_{i=1}^N be the reconstructed multivariate time series associated with the rank-q approximate trajectory matrix X̂_q. Then the h-step ahead recurrent MSSA forecasts x̂^(j)_{N+1}, ..., x̂^(j)_{N+h} for j = 1, ..., p are given by
$$\widehat{x}^{(j)}_{N+s} = \sum_{m=1}^{L-1} \widehat{a}_m\, \widehat{x}^{(j)}_{N+s-m}, \qquad s = 1, \dots, h. \qquad (20)$$

Outlier detection

RODESSA implicitly provides information about outliers in multivariate time series. The cellwise weight w^(j)_{c,i} given by (15) reflects how much faith the algorithm has in the reliability of x^(j)_i, the value of the j-th time series at time i. The casewise weight w_{r,i} given by (16) does the same for the entire case x_i = (x^(1)_i, ..., x^(p)_i)^T at time i. A small weight means that the corresponding cell or case was deemed less trustworthy, and was only allowed to have a small effect on the fit.
We propose a new graphical representation, called the enhanced time series plot, which facilitates outlier detection by visualizing these weights in a single plot. Since the cellwise weights w^(j)_{c,i} are a monotone function of the cellwise squared norms r^(j)_i of (11), this can be seen as an extension of the cellmap for multivariate data (Rousseeuw and Van den Bossche, 2018; Hubert et al., 2019) to time series.
We illustrate the enhanced time series plot on the publicly available Electricity Load Diagrams 2011-2014 dataset (Trindade, 2015).It contains the electricity consumption of 370 clients from January 2011 to December 2014.We plot the data of 4 clients from November 27th to December 3rd 2014, with consumption observed every 2 hours.For the purpose of illustration we inserted some outliers.
Figure 2 shows its enhanced time series plot. The solid black lines correspond to the reconstructed multivariate time series. The original cells x^(j)_i, connected by yellow solid lines, are represented by circles filled with a color. Those with cellwise weight w^(j)_{c,i} close to 1 are filled with white. Cells with a low weight and a positive residual are shown in increasingly intense red, and those with a negative residual in increasingly intense blue. Moreover, x^(j)_i is flagged as a cellwise outlier when w^(j)_{c,i} < q_{c,α}, where q_{c,α} is the α-quantile of the simulated distribution of cellwise weights at the reference model, described in Section 2.6. Cells flagged as cellwise outliers are shown as solid squares, in red when the residual is positive and in blue when it is negative.
At the top of the plot we see circles reflecting the casewise weights w_{r,i}, with weight 1 shown with a white interior and lower weights in increasingly darker grey. A case x_i is flagged as a casewise outlier if w_{r,i} is below the α-quantile q_{r,α} of the simulated distribution of casewise weights at the reference model. Casewise outliers are indicated by a black solid circle and a vertical dashed grey line. The last one in the plot would not have been flagged in the cellwise way.
In this example we also want to plot the forecasts, which are not part of the default enhanced time series plot. The forecasts are shown as green triangles connected by green solid lines. The true values (which the algorithm did not know about) were added as black crosses connected by yellow solid lines, illustrating the forecasting performance.
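The recurrent forecasting rule of Section 2 can be sketched for a noise-free univariate series as follows; this is a simplified nonrobust version using the plain SVD, with function names of our choosing.

```python
# Recurrent SSA forecasting: the recurrence coefficients are built from
# the L-th components of the leading left singular vectors, and each
# forecast is a linear combination of the previous L-1 values.
import numpy as np

def embed(x, L):
    """L x K Hankel trajectory matrix of the series x."""
    K = len(x) - L + 1
    return np.column_stack([x[k:k + L] for k in range(K)])

def ssa_forecast(x, L, q, h):
    """h-step ahead recurrent forecasts from a rank-q SSA fit."""
    X = embed(x, L)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uq = U[:, :q]
    nu2 = np.sum(Uq[-1, :] ** 2)                  # verticality coefficient
    a = (Uq[:-1, :] @ Uq[-1, :]) / (1.0 - nu2)    # recurrence coefficients
    xs = list(x)
    for _ in range(h):
        xs.append(float(np.dot(a, xs[-(L - 1):])))  # apply the recurrence
    return np.array(xs[-h:])

i = np.arange(100)
x = np.cos(2 * np.pi * i / 10)
fc = ssa_forecast(x, L=50, q=2, h=5)
true = np.cos(2 * np.pi * np.arange(100, 105) / 10)
print(np.max(np.abs(fc - true)))    # tiny for a noise-free harmonic
```

Because a noise-free harmonic satisfies the linear recurrent relation exactly, the recurrent forecasts reproduce it up to floating-point error.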

Simulation study
In this section, the performance of the proposed RODESSA method in the presence of cellwise and casewise outliers is assessed through a Monte Carlo simulation study. The data generation process is inspired by the simulated example in Section 3.3 of Golyandina et al. (2018). Specifically, the multivariate time series is generated through a signal plus noise model in which series j has the harmonic signal A^(j) cos(2πi/10 + C^(j)) plus Gaussian noise with standard deviation σ. Table 1 lists the parameters A^(j) and C^(j) for scenarios 1, 2 and 3.
In each scenario we consider cellwise and casewise contamination settings, with the fraction of outliers ε equal to 0.1 and 0.2. In the cellwise contamination setting, outliers are introduced by adding the number γσ to εpN randomly chosen values x^(j)_i. We let γ range over γ = 0, 1, ..., 8, so γ = 0 corresponds to the uncontaminated setting. In the casewise contamination setting, we add γσ to all values of εN randomly chosen cases x_i. Our proposed RODESSA method is compared with several competing approaches. We run the classical version of MSSA (labeled CMSSA) and several robust versions described in Section 2.3: the method that performs the decomposition step by (7) with ρ(t) = |t| as in Rodrigues et al. (2018) (labeled RLM), the method of Cheng et al. (2015) (labeled CHENG), and that of Chen and Sacchi (2015) (labeled CS).
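The two contamination mechanisms just described can be sketched as follows; the helper names and the way the positions are sampled are ours.

```python
# Cellwise contamination shifts individual randomly chosen cells by
# gamma * sigma; casewise contamination shifts all p components of
# randomly chosen time points.
import numpy as np

def add_cellwise(X, eps, gamma, sigma, rng):
    """X is an N x p multivariate series; shift eps*N*p random cells."""
    Xc = X.copy()
    N, p = X.shape
    idx = rng.choice(N * p, size=int(round(eps * N * p)), replace=False)
    Xc.flat[idx] += gamma * sigma
    return Xc

def add_casewise(X, eps, gamma, sigma, rng):
    """Shift every component of eps*N randomly chosen time points."""
    Xc = X.copy()
    N, p = X.shape
    rows = rng.choice(N, size=int(round(eps * N)), replace=False)
    Xc[rows, :] += gamma * sigma
    return Xc

rng = np.random.default_rng(2)
X = rng.normal(size=(70, 3))
Xc = add_cellwise(X, eps=0.1, gamma=6, sigma=1.0, rng=rng)
Xr = add_casewise(X, eps=0.1, gamma=6, sigma=1.0, rng=rng)
print(np.sum(Xc != X))   # 21 contaminated cells = 0.1 * 70 * 3
```

Note that both settings change the same number of cells here, but the casewise cells are concentrated in a few time points, which is exactly the structure that the loss function ρ_2 is designed to exploit.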
RODESSA is implemented as described in Section 2 with q = 2, which corresponds to the true rank: for any angle φ the signal A cos(2πi/10 + φ) equals the linear combination A cos(φ) cos(2πi/10) − A sin(φ) sin(2πi/10) of the two functions cos(2πi/10) and sin(2πi/10). Figure 3 plots the objective function for increasing rank, which also indicates that most of the variability is explained by two components. We consider both L = 56 ≃ pN/(p + 1) and L = 35 = N/2, and choose the tuning constants by setting δ_c = δ_r = 0.90.
The competing approaches are implemented with the same values of q and L. To guarantee a fair comparison between RODESSA and CS, the tuning parameter in the CS method is chosen by means of the procedure in Section 2.6. For each combination of scenario, contamination setting, outlier fraction and magnitude, we compute a reconstruction error (RE) and a forecasting error (FE) by comparing the fitted values x̂^(j)_i with the true signal values; the corresponding formulas are given in (21). In the first formula the x̂^(j)_i are the reconstructions (i ⩽ N); in the second formula they are the forecasts (i > N) as defined in (20). The RE and FE are then averaged over the replications.
We report the results for the most challenging Scenario 3, with L = N/2. The supplementary material contains the simulation results for the other settings, which yield qualitatively similar conclusions.
For cellwise contamination, Figure 4 shows the average RE and FE as a function of the contamination magnitude γ for both outlier fractions. Without contamination (γ = 0), CMSSA is the best method in terms of average RE and FE, as expected in this situation. But the errors of RODESSA and CS are similarly small, whereas those of CHENG and RLM are larger.
When looking at increasing γ > 0 it appears that far outlying cells have an unbounded effect on CMSSA, a bounded effect on CHENG and RLM, and a small effect on RODESSA and CS, due to their bounded biweight ρ function. In that sense RODESSA and CS are similar. But RODESSA does better than CS, because its decomposition step takes the diagonal structure of the Hankel matrix into account. The performance difference is largest for the higher outlier fraction ε = 0.2. For casewise contamination the differences between the methods are larger, with RODESSA outperforming the other methods more strongly. This is due to the fact that RODESSA can share information about diagonals across the stacked structure. This helps because we saw in Figure 1 that casewise outliers affect several diagonals in the trajectory matrix simultaneously. Unlike the other methods, RODESSA combines information across such diagonals through the loss function ρ_2 in (10), which makes it more robust against casewise outliers.

A real data example: temperature analysis in passenger railway vehicles
In this section, a real data example from the railway industry illustrates the applicability and potential of RODESSA.
In recent years, railway transportation in Europe has emerged as a viable alternative to other modes of transport, leading to intense competition between operators to enhance passenger satisfaction (Kallas, 2011). A particular challenge in this context is ensuring optimal thermal comfort inside passenger rail coaches (Ye et al., 2004). To address this challenge, new European standards, such as UNI EN 14750 (British Standards Institution, 2006), have been developed. These standards focus on regulating air temperature, relative humidity, air speed, and overall comfort and air quality in passenger rail coaches, taking into account the diverse operating requirements of trains.
Consequently, railway companies are proactively installing sensors to gather and analyze data from on-board heating, ventilation, and air conditioning (HVAC) systems (Homod, 2013).HVAC systems play a crucial role in maintaining passenger thermal comfort, and their performance is being improved based on the insights obtained from the collected data.
We analyze operational data from HVAC systems installed on a passenger 6-coach train operating during the summer season (Lepore et al., 2022). The data are publicly available at https://github.com/unina-sfere/NN4MSP. Specifically, the inside temperature of each coach was recorded about every four minutes from 10:00 to 22:00, yielding N = 176 observations of a multivariate time series with p = 6 components. We applied RODESSA in its default form as described in Section 2.6, with L = 151 ≃ pN/(p + 1). We selected q = 7 based on the values of the objective function. This yields the enhanced time series plot shown in Figure 6.
The vertical dashed lines in Figure 6 indicate casewise outliers. It seems that the temperature inside the six coaches is significantly higher than expected in an almost periodic fashion. This behavior is particularly severe for the group of casewise outliers between 17:00 and 18:00. Discussions with domain experts revealed that those measurements were acquired during train stops at terminal stations. In this situation the coach doors were left open, and due to the summer season this increased the temperature inside the coaches. From this we can conclude that those measurements are not representative of the operating conditions of the train and are not useful to characterize temperature dynamics. Moreover, the reconstructed time series shows that the temperature inside the coaches does increase and decrease periodically, which is in accordance with the on/off control of the HVAC system. Note that not all casewise outliers are labeled as cellwise outliers. For instance, the casewise outlier shortly after 16:00 has no component that is flagged as a cellwise outlier, but the temperatures were relatively low in all six coaches simultaneously. The ability to detect such effects is a feature of the proposed method.
To assess the forecasting performance of RODESSA relative to the competing methods described in Section 4, we used subsequences of the multivariate time series.Specifically, starting from the first 120 observations, h-step ahead forecasts of all methods are computed for h = 5, 10, 20.
Similarly, forecasts are obtained by considering the first 121, 122, ... observations, and so on. For each subsequence, the h-step ahead forecasts are compared to the observed values of the time series by computing the median of their absolute differences, denoted as mFE. Note that here we do not use the forecasting error formula in (21), because the data we are predicting contain outliers as well.
Moreover, to assess the effect of the window length L on the forecasting performance, we compare the two lengths given in Section 2.6, namely 1/2 and 6/7 of the subsequence length N_sub. Figure 7 shows boxplots of mFE for h = 5, 10, 20 and L = N_sub/2 (top) as well as L = 6N_sub/7 (bottom).
The RODESSA method outperforms its competitors overall, which is in line with the simulation study.

Conclusions
Multivariate singular spectrum analysis (MSSA) is a technique for fitting vector time series and making forecasts. It starts by constructing a matrix with a special structure. It consists of several submatrices stacked side by side, one for each component of the multivariate time series. A cell of the time series, that is, the value of one of its coordinates at a given timepoint, corresponds to a diagonal in the corresponding submatrix. A case of the time series, that is, all of the coordinates at a given time point, corresponds to diagonals in all of the submatrices.
Sometimes cells are outlying, and sometimes entire cases. But the classical MSSA method can be strongly affected by outliers, because it is based on a singular value decomposition, which is a least squares fit. Several more robust methods have been proposed in the literature, based on versions of the SVD that are less sensitive to outliers. However, these versions do not take the diagonal structure into account, and even a small number of outliers can create a large number of outlying diagonal entries that affect many rows and columns of the matrix, thereby overwhelming the fit. To resolve this issue we propose the RODESSA method, which explicitly takes the diagonal structure into account when decomposing the matrix. This makes it more robust than its predecessors, as illustrated in the extensive simulation study reported in Section 4 and the Supplementary Material. Moreover, it loses little efficiency when there are no outliers.
The RODESSA method is performed by a fast algorithm based on iteratively reweighted least squares. In Section 2 we prove a proposition stating that each step of the algorithm decreases the objective function. We also propose a new graphical display, called the enhanced time series plot, which visualizes the weights of the cells as well as the cases, making outliers stand out. We have applied the RODESSA method to a real multivariate time series about temperatures in passenger railway vehicles. This illustrates its good forecasting performance, as well as the convenient outlier detection by the enhanced time series plot.
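To convey the flavor of such an iteratively reweighted least squares scheme, the sketch below alternates between downweighting large residuals and refitting a low-rank approximation. It is emphatically not the RODESSA algorithm, which assigns weights per diagonal of the trajectory matrix; here the weights are entrywise Huber weights, and robust_low_rank is a hypothetical helper introduced for illustration only.

```python
import numpy as np

def robust_low_rank(X, r, n_iter=50, c=1.345):
    """Rank-r approximation of X by a generic IRLS loop: compute Huber-type
    weights from the scaled residuals, blend the data with the current fit
    according to those weights, and take a truncated SVD. A sketch of the
    IRLS idea only, not RODESSA's diagonalwise weighting scheme."""
    A = X.copy()
    for _ in range(n_iter):
        R = X - A
        s = 1.4826 * np.median(np.abs(R - np.median(R))) + 1e-12  # robust scale
        u = np.abs(R) / s
        W = np.where(u <= c, 1.0, c / u)        # Huber weights in (0, 1]
        U, d, Vt = np.linalg.svd(W * X + (1 - W) * A, full_matrices=False)
        A = (U[:, :r] * d[:r]) @ Vt[:r]
    return A
```

Outlying entries receive small weights, so the truncated SVD is driven by the clean cells; on uncontaminated data all weights equal one and the loop reduces to the ordinary rank-r SVD fit.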
The first inequality derives from the concavity of ρ_1 and the fact that ρ_2 is nondecreasing. The second inequality is due to the concavity of ρ_2. Therefore L is a concave function.
We can also write the weighted least squares objective (17) as a function of f. We will denote it as L_W(f) := (vec(W))^T f. Here vec(·) turns a matrix into a column vector, in the same way as was done to obtain the column vector f. The next lemma makes a connection between L_W and the RODESSA objective L.
Proof. From Lemma 1 we know that L(f) is concave as a function of f, and it is also differentiable because ρ_1 and ρ_2 are. Therefore L(f) ⩽ L(g) + (∇L(g))^T (f − g), where the column vector ∇L(g) is the gradient of L in g. But ∇L(g) equals vec(W) by construction, so (∇L(g))^T (f − g) = L_W(f) − L_W(g), which yields L(f) ⩽ L(g) + L_W(f) − L_W(g). With this preparation we can prove Proposition 1.
Proof of Proposition 1. When we go from θ_t to θ_{t+1}, the least squares fits in steps 5 and 6 of Algorithm 1 ensure that L_{W_t}(f(θ_{t+1})) ⩽ L_{W_t}(f(θ_t)), so it follows from Lemma 2 that L(f(θ_{t+1})) ⩽ L(f(θ_t)).
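The descent chain underlying Proposition 1 can be written out compactly as follows. This is a sketch reconstructed from Lemmas 1 and 2, with the notation of Section 2.

```latex
% Concavity of L (Lemma 1) gives the majorization of Lemma 2:
%   L(f) \le L(g) + (\nabla L(g))^T (f - g) = L(g) + L_W(f) - L_W(g).
\begin{align*}
L(f(\theta_{t+1}))
  &\le L(f(\theta_t)) + L_{W_t}(f(\theta_{t+1})) - L_{W_t}(f(\theta_t))
     && \text{(Lemma 2 with } g = f(\theta_t)\text{)}\\
  &\le L(f(\theta_t))
     && \text{(steps 5--6: } L_{W_t}(f(\theta_{t+1})) \le L_{W_t}(f(\theta_t))\text{)}.
\end{align*}
```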

A.2 Additional simulation results
In this section we present all the simulation results according to the settings described in Section 4.
In the uncontaminated setting (γ = 0), Figure 8 shows examples of multivariate time series generated by scenarios 1, 2 and 3. Scenario 1 generates series in which the components have increasing amplitude, whereas Scenario 2 shifts their phase. Scenario 3 combines both effects. We first consider the window length L = N/2 = 35. Figure 9 shows the average RE and FE for each scenario (S1, S2 and S3) and outlier fraction ε as a function of γ, for data with cellwise contamination. Figure 10 does the same for casewise contamination. We conclude that all these simulation results are qualitatively similar to those reported in Section 4 of the paper.

Figure 1 illustrates this for a 3-variate time series.The purple triangle is a cellwise outlier which affects only the first univariate time series at time i = 3, whereas the orange squares indicate a casewise outlier at time point i = 5.The effect of both types of outliers on the trajectory matrix (1) is shown in the bottom part of the figure.The cellwise outlier corresponds to an antidiagonal in the leftmost Hankel matrix only, whereas the casewise outlier affects all three stacked Hankel matrices.
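The cell-to-antidiagonal correspondence is easy to verify numerically. The toy sketch below (illustration only, with arbitrary L and N) places a single cellwise outlier at time i = 3 and locates it in the Hankel matrix: since entry (i, j) equals x[i + j], every contaminated entry satisfies i + j = 3.

```python
import numpy as np

L, N = 4, 8
x = np.zeros(N)
x[3] = 100.0                          # one cellwise outlier at time i = 3
K = N - L + 1
H = np.column_stack([x[j:j + L] for j in range(K)])  # L x K Hankel matrix
rows, cols = np.nonzero(H == 100.0)
positions = sorted(zip(rows.tolist(), cols.tolist()))
print(positions)   # → [(0, 3), (1, 2), (2, 1), (3, 0)]
```

A single outlying cell thus contaminates min(L, K) entries of its Hankel block, which is why robust SVD variants that ignore this structure can be overwhelmed.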

Figure 1: The effect of a cellwise outlier (purple triangle) and a casewise outlier (orange squares) on the diagonals of the trajectory matrix X.
Rodrigues et al. (2018) also constructed a robust SSA from the low-rank approximation method of De la Torre and Black (2003), but replaced the function ρ by ρ(t) = |t| as in Croux et al. (2003), yielding an L_1 objective. An advantage of the latter ρ function is that σ can be moved out of (7), so one does not need to estimate σ in advance. In subsequent work, Rodrigues et al. (2020) carried out robust SSA based on the low-rank approximation of Zhang et al. (2013), which used the Huber function ρ_b.

Figure 2: Enhanced time series plot obtained by applying RODESSA to the Electricity Load Diagrams data.

Figure 3: Graph of the objective function for increasing values of the rank r.

Figure 5 compares the same methods in the presence of casewise outliers.

Figure 8: (No outliers.) Examples of randomly generated multivariate time series under scenarios 1, 2, and 3 of the simulation study.

Figures 11 and 12 show the corresponding results for window length L = 56 ≈ pN/(p + 1).
Figure 6: Enhanced time series plot from applying RODESSA to the multivariate time series of temperatures.