A Simple Lifelong Learning Approach

In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance on old tasks given new tasks. But striving to avoid forgetting sets the goal unnecessarily low. The goal of lifelong learning should be to use data to improve performance on both future tasks (forward transfer) and past tasks (backward transfer). In this paper, we show that a simple approach -- representation ensembling -- demonstrates both forward and backward transfer in a variety of simulated and benchmark data scenarios, including tabular, vision (CIFAR-100, 5-dataset, Split Mini-Imagenet, and Food1k), and speech (spoken digit), in contrast to various reference algorithms, which typically failed to transfer either forward or backward, or both. Moreover, our proposed approach can flexibly operate with or without a computational budget.

1 Introduction

Learning is a process by which an intelligent system improves performance on a given task by leveraging data [1]. In classical machine learning, the system is often optimized for a single task [2,3]. While it is relatively easy to simultaneously optimize for multiple tasks (multitask learning) [4], it has proven much more difficult to sequentially optimize for multiple tasks [5,6]. Specifically, classical machine learning systems, and natural extensions thereof, exhibit "catastrophic forgetting" when trained sequentially, meaning their performance on prior tasks drops precipitously upon training on new tasks [7,8]. However, learning could be lifelong, with agents continually building on past knowledge and experiences, improving on many tasks given data associated with any task. For example, in humans, learning a second language often improves performance in an individual's native language [9].
In the past 30 years, a number of sequential task learning algorithms have attempted to overcome catastrophic forgetting. These approaches naturally fall into one of two camps. In one camp, the algorithm has fixed resources, and so must reallocate resources (essentially compressing representations) in order to incorporate new knowledge [10][11][12][13][14]. Biologically, this corresponds to adulthood, where brains have a nearly fixed or decreasing number of cells and synapses. In the other camp, the algorithm adds (or builds) resources as new data arrive (essentially ensembling representations) [15][16][17][18]. Biologically, this corresponds to development, where brains grow by adding cells, synapses, etc.
Approaches from both camps demonstrate some degree of continual (or lifelong) learning [19]. In particular, they can sometimes learn new tasks faster due to prior learning on related tasks, while not catastrophically forgetting old tasks (see Appendix A.1 for a detailed discussion of the relevant algorithms). However, as we will show, many lifelong learning algorithms are unable to transfer knowledge forward (to future unseen tasks) and most of them do not transfer backward (to previously seen tasks). With high enough sample sizes, some of these algorithms are able to transfer forward or backward, but transfer is more important in low sample size regimes [17,20]. This inability to effectively transfer in low sample size regimes has been identified as one of the key obstacles limiting the capabilities of artificial intelligence [21,22].
In this paper, we propose a general approach for lifelong learning which can be used with many existing encoder models. Specifically, we focus our approach on ensembling deep networks (Simple Lifelong Learning Networks, SiLLy-N). Additionally, we demonstrate how the same approach can be generalized for lifelong learning based on ensembling decision forests (Simple Lifelong Learning Forests, SiLLy-F). We compare our proposed algorithm to a number of reference algorithms on an extensive suite of numerical experiments that span simulation, vision datasets including CIFAR-100, 5-dataset, Split Mini-Imagenet, and Food1k, as well as the spoken digit dataset. Figure 1 illustrates that our algorithm outperforms all the reference algorithms in terms of forward, backward, and overall transfer on the CIFAR 10x10 dataset. Ablation studies indicate the degree to which the amount of representation or storage capacity and replaying old task data impact the performance of our algorithms.

Figure 1: Performance summary on the CIFAR 10x10 benchmark dataset. Columns are different evaluation criteria (see Section 2 for definitions, and Section 6 for experimental details); each strip of colored dots corresponds to an algorithm (we introduce SiLLy-N here) and each dot represents a task. In all panels, older tasks have darker colors. Resource growing algorithms are marked with a '*'. EWC, O-EWC, SI, TAG and ER always perform worse than LwF, and hence we do not show them in the plot. SiLLy-N (red) outperforms all reference algorithms in terms of forward (second panel), backward (third panel), and overall transfer (first panel). Importantly, such better transfer is achieved at high overall accuracy (last panel). More datasets are evaluated in Figure 8.

2.1 The lifelong learning objective

Consider a lifelong or continual learning environment with T tasks. We consider task-aware lifelong learning, i.e., the tasks are known during both training and testing time. For simplicity, we assume every task t ∈ {1, …, T} has the same input space, i.e., X ⊂ R^D valued inputs with Y = {1, …, K_t} valued class labels. We assume the tasks arrive sequentially, but the n_t data samples within each task t, S_t = {(X_i, Y_i)}_{i=1}^{n_t}, are batched and sampled independently and identically (iid) from some fixed distribution D_t, with Σ_{t=1}^T n_t = n. A learner f ∈ F trains on S_t and chooses a hypothesis h_t ∈ H associated with task t by minimizing a particular risk, where F and H are the algorithm and the hypothesis space, respectively. In supervised learning settings, one can consider the following risk for a particular task t:

R_t(h_t) = E_{(X,Y)∼D_t}[ ℓ_t(h_t(X), Y) ],   (1)

where ℓ_t : Y × Y → [0, ∞) is a given loss function associated with task t and h_t = f(S_t). Note that the data S_t may contain data relevant to any number of tasks (potentially all the tasks) in the environment. One may take the expectation over the training set and consider the generalization error on task t of learner f trained on a dataset S:

E_t(f; S) = E_S[ R_t(f(S)) ].   (2)

After T tasks, the learner will have access to all T datasets, ∪_{t=1}^T S_t, instead of S_t only. The goal is to find a learner f ∈ F that chooses a set of T hypotheses {h_1, …, h_T} (one hypothesis for each task) such that the generalization error over all the tasks after observing all the data is minimized, that is:

f* = argmin_{f ∈ F} Σ_{t=1}^T E_t(f; ∪_{s=1}^T S_s).   (3)

2.2 Lifelong learning evaluation criteria
Others have previously introduced criteria to evaluate transfer, including forward and backward transfer [23][24][25][26]. Pearl [27] introduced the transfer benefit ratio, which builds directly off relative efficiency from classical statistics [28]. We define three notions of transfer building on relative efficiency.
Definition 1 (Transfer). The overall transfer of algorithm f for a given task t is:

Transfer_t(f) = log [ E_t(f; S_t) / E_t(f; ∪_{s=1}^T S_s) ].   (4)

We say that an algorithm f has transferred to task t from all the tasks up to T if and only if Transfer_t(f) > 0.
Forward transfer quantifies how much performance a learner transfers forward to future tasks, given prior tasks.
Definition 2 (Forward Transfer). The forward transfer of f for task t is:

Forward Transfer_t(f) = log [ E_t(f; S_t) / E_t(f; ∪_{s=1}^t S_s) ].   (5)

We say an algorithm (positively) forward transfers for task t if and only if Forward Transfer_t(f) > 0.
Backward transfer quantifies how much a learner transfers backward to previously observed tasks, in light of new tasks.

Definition 3 (Backward Transfer). The backward transfer of f for task t is:

Backward Transfer_t(f) = log [ E_t(f; ∪_{s=1}^t S_s) / E_t(f; ∪_{s=1}^T S_s) ].

We say an algorithm backward transfers to task t from all the future tasks up to T if and only if Backward Transfer_t(f) > 0. Note that Transfer can be decomposed into Forward Transfer and Backward Transfer:

Transfer_t(f) = log [ E_t(f; S_t) / E_t(f; ∪_{s=1}^t S_s) ] + log [ E_t(f; ∪_{s=1}^t S_s) / E_t(f; ∪_{s=1}^T S_s) ] = Forward Transfer_t(f) + Backward Transfer_t(f).

Another paper [26] concomitantly introduced transfer and forgetting (backward transfer). Their statistics are the same as ours, except that they do not use a log. We opted for a log to address numerical stability issues when comparing small numbers. Because log is a monotonic function, the ranking of algorithms is preserved (Appendix Figure 1 shows a version of Figure 1 and Figure 8 using Veniat's statistics, which is nearly visually identical). By virtue of introducing Forward Transfer here, we can identify the inherent trade-off between forward and backward transfer for a fixed amount of total transfer. Apart from the above statistics, we also report accuracy per task.
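The decomposition of overall transfer into forward and backward components can be checked numerically. Below is a minimal sketch (the generalization-error values are hypothetical and the function names are ours, not from any released codebase); the log-ratio form follows the definitions above:

```python
import math

def transfer(err_single, err_all):
    # Overall transfer (Definition 1): log-ratio of the single-task error
    # to the error after training on data from all T tasks.
    return math.log(err_single / err_all)

def forward_transfer(err_single, err_upto_t):
    # Forward transfer (Definition 2): improvement from tasks seen up to task t.
    return math.log(err_single / err_upto_t)

def backward_transfer(err_upto_t, err_all):
    # Backward transfer (Definition 3): improvement from tasks seen after task t.
    return math.log(err_upto_t / err_all)

# Hypothetical generalization errors for one task:
e_single, e_upto, e_all = 0.30, 0.24, 0.20
total = transfer(e_single, e_all)
fwd = forward_transfer(e_single, e_upto)
bwd = backward_transfer(e_upto, e_all)
assert math.isclose(total, fwd + bwd)  # Transfer = Forward + Backward
assert total > 0  # the algorithm transferred to this task
```

Because each statistic is a log of a ratio of errors, the decomposition holds exactly: the two inner terms telescope.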
Definition 4 (Accuracy). The accuracy of algorithm f on task t after observing all T datasets is:

Accuracy_t(f) = P[ h_t(X) = Y ], where (X, Y) ∼ D_t and h_t = f(∪_{s=1}^T S_s).

3 Representation ensembling algorithms

Shannon proposed that a learned hypothesis can be decomposed into three components: an encoder, a channel, and a decoder [29,30]. Figure 2 shows these three components as the building blocks of different learning schemas. The encoder, u : X → X̃, maps an X-valued input into an internal representation space X̃ [31,32]. The channel v : X̃ → ∆_Y maps the transformed data into a posterior distribution (or, more generally, a score). Finally, a decoder w : ∆_Y → Y produces a predicted label.
A canonical example of a single learner depicted in Figure 2A is a decision tree. Importantly, one can subsample the training data to learn different components of the tree [33][34][35]. For example, one can use a portion of the data to learn the tree structure (which is the encoder). Then, by pushing the remaining data (sometimes called the 'out-of-bag' data) through the tree, one can learn posteriors in each leaf node (which are the channel). The channel thus gives scores for each data point denoting the probability of that data point belonging to a specific class. Using separate sets of data to learn the encoder and the channel results in less bias in the estimated posteriors in the channels, as in 'honest trees' [33][34][35]. Finally, the decoder provides the predicted class label using the argmax over the posteriors from the channel.
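The honest-tree recipe above — tree structure from one subset, leaf posteriors from a held-out subset, argmax decode — can be sketched with a single-split "stump" on toy 1-D data (a hypothetical minimal example, not the paper's implementation):

```python
import random

random.seed(0)
# Toy 1-D data: class 0 clusters near 0, class 1 near 1 (hypothetical).
data = [(random.gauss(y, 0.3), y) for y in [0, 1] * 200]
struct, held_out = data[:200], data[200:]

# Encoder: learn the tree structure (here, a single-split stump) on one subset.
threshold = sum(x for x, _ in struct) / len(struct)

def encode(x):
    return int(x > threshold)  # leaf index

# Channel: estimate per-leaf class posteriors from the held-out subset.
counts = {0: [0, 0], 1: [0, 0]}
for x, y in held_out:
    counts[encode(x)][y] += 1
posteriors = {leaf: [c / max(sum(cs), 1) for c in cs] for leaf, cs in counts.items()}

# Decoder: argmax over the leaf posterior.
def predict(x):
    return max(range(2), key=lambda k: posteriors[encode(x)][k])

accuracy = sum(predict(x) == y for x, y in held_out) / len(held_out)
```

Because the leaf posteriors are estimated from data the encoder never saw, they are less biased than posteriors computed from the training subset itself.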
One can generalize the above decomposition by allowing for multiple encoders, as shown in Figure 2B. Given B different encoders, one can attach a single channel to each encoder, yielding B different channels. Doing so requires generalizing the definition of a decoder so that it operates on multiple channels. Such a decoder ensembles the decisions, because each channel provides a decision based on its own encoder. This is the learning paradigm behind bagging [36] and boosting [37]; indeed, decision forests are a canonical example of a decision function operating on an ensemble of B outputs [38].
Although the task specific structures in Figure 2B can provide useful decisions on their corresponding tasks, they cannot, in general, provide meaningful decisions on other tasks, because those tasks might have completely different class labels. Therefore, in the multi-head structure (Figure 2C), a single encoder is used to learn a joint representation from all the tasks, and a separate channel is learned for each task to obtain the score or class-conditional posteriors for that task, followed by a task specific decoder [10,11,13].
Modular approaches, such as ProgNN and LMC (Figure 2D), have both multiple encoders and decoders. Connections from past to future encoders enable forward transfer. However, they preclude backward transfer.
Our approach also uses multiple encoders and decoders (Figure 2E). Unlike modular approaches, we allow interaction among encoders through the channels, including both forward and backward interactions. The result is that the channels ensemble representations (learned by the encoders), rather than decisions (learned by the channels, as in Figure 2B). In our algorithms, we push all the data through each encoder, and each channel learns and ensembles across all encoders. When each encoder has learned complementary representations, the channels can leverage that information to improve over single task performance. This approach has applications in few-shot and multiple task scenarios, as well as lifelong learning. Figure 2E shows the general structure of our algorithm. As data from a new task arrives, the algorithm first builds a new encoder. Then, it builds the channel for this new task by pushing the new task data through all existing encoders. Thus the channel integrates information across all existing encoders using the new task data, thereby enabling forward transfer. At the same time, if it stores old task data (or can generate such data), it can push those data through the new encoder to update the channels from the old tasks, thereby enabling backward transfer. In either case, new test data are passed through all existing encoders and corresponding channels to make a prediction (see the appendix for a detailed description of this approach).
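The procedure described above can be sketched as a small class. This is our own illustrative skeleton (the names and interfaces are assumptions, not the released API): one encoder per task, one channel per task that ensembles over all existing encoders, optional replay to update old channels, and an argmax decoder.

```python
class RepresentationEnsemble:
    """Sketch of representation ensembling: one encoder per task, one channel
    per task ensembling over ALL encoders, one argmax decoder per task."""

    def __init__(self, make_encoder, make_channel):
        self.make_encoder, self.make_channel = make_encoder, make_channel
        self.encoders, self.channels = [], {}

    def add_task(self, t, X, y, replay=None):
        # Build a new encoder for task t, then a channel for task t that sees
        # the new data through every existing encoder (forward transfer).
        self.encoders.append(self.make_encoder(X, y))
        self.channels[t] = [self.make_channel(enc(X), y) for enc in self.encoders]
        # If old task data are stored (or can be generated), push them through
        # the newest encoder to update old channels (backward transfer).
        if replay:
            for s, (Xs, ys) in replay.items():
                self.channels[s].append(self.make_channel(self.encoders[-1](Xs), ys))

    def predict(self, t, x):
        # Average posteriors across all encoders' channels for task t; argmax.
        votes = [ch(enc(x)) for enc, ch in zip(self.encoders, self.channels[t])]
        avg = [sum(col) / len(votes) for col in zip(*votes)]
        return max(range(len(avg)), key=avg.__getitem__)
```

With B encoders, each task's channel averages B posterior estimates; when the encoders capture complementary structure, this average improves over any single-task posterior.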

Our representation ensembling algorithms
Simple Lifelong Learning Networks in resource growing mode

A Simple Lifelong Learning Network (SiLLy-N) ensembles deep networks. For each task, the encoder u_t in SiLLy-N is the "backbone" of a deep network (DN). Thus, each u_t maps an element of X to an element of R^d, where d is the number of neurons in the penultimate layer of the DN. The channels are learned by averaging over k-Nearest Neighbors (k-NN) [39] trained on the d-dimensional representations of X. Note that the channel is trained on the d-dimensional outputs from the encoders, which are much smaller than the original training data; hence, the k-NN channels are inexpensive storage-wise (shown later in Figure 3). Other algorithms could also be used to learn the channels, though we do not pursue them here. The decoder w_t outputs the argmax to produce a single prediction.
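A k-NN channel over encoder outputs can be sketched in a few lines (pure Python; Euclidean distance and uniform neighbor weights are our assumptions, and the real implementation details may differ):

```python
def knn_posterior(reps, labels, query, k=5, n_classes=2):
    # k-NN channel: average the one-hot labels of the k nearest stored
    # d-dimensional representations to obtain a posterior over classes.
    order = sorted(range(len(reps)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(reps[i], query)))
    post = [0.0] * n_classes
    for i in order[:k]:
        post[labels[i]] += 1 / k
    return post
```

One such channel would be fit per (task, encoder) pair; the decoder then takes the argmax of the encoder-averaged posterior. Only the d-dimensional representations and labels need to be stored, which is why the channels are cheap relative to the encoders.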

Simple Lifelong Learning Networks in resource constrained mode
The above resource growing approach is ideal when upcoming tasks become increasingly complex and no constraint is imposed by the available computation and storage budget. However, real-world scenarios often impose computational constraints. In the constant resource mode, we stop building new encoders once we have reached the computation and storage budget imposed by the user. As new tasks arrive, we only learn new channels associated with the new tasks using the old encoders. Note that this approach completely removes the need to save old task data after the budget is reached.
Hereafter, we will use the suffix '-M' after the algorithm name whenever we use the resource constrained operation of SiLLy-N, where M is the total number of encoders allowed by the budget.
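The budgeted variant only changes the growth rule; a sketch with our own toy interfaces (the trainer and channel-fitter passed in are hypothetical stand-ins):

```python
def add_task_constrained(model, t, X, y, M, train_encoder, fit_channel):
    # Constant-resource mode: grow a new encoder only while under the budget M;
    # afterwards, new tasks only get channels built over the frozen encoders.
    if len(model["encoders"]) < M:
        model["encoders"].append(train_encoder(X, y))
    model["channels"][t] = [fit_channel(enc, X, y) for enc in model["encoders"]]

model = {"encoders": [], "channels": {}}
for t in range(6):  # six tasks, budget of M = 3 encoders
    add_task_constrained(model, t, [t], [t], M=3,
                         train_encoder=lambda X, y: (lambda x: x),
                         fit_channel=lambda enc, X, y: None)
assert len(model["encoders"]) == 3     # growth stopped at the budget
assert len(model["channels"][5]) == 3  # later tasks reuse frozen encoders
```

After the budget is hit, only the (cheap) channels grow, so no old task data need to be stored.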

Additional realization of our approach using random forest as encoder

Simple Lifelong Learning Forest (SiLLy-F) ensembles decision trees or forests. For each task, the encoder u_t of SiLLy-F is the representation learned by a decision forest [38,40]. The channel then learns the class-conditional posteriors by populating the forest leaves with out-of-task samples, as in "honest trees" [33][34][35]. Each channel outputs the posteriors averaged across the collection of forests learned over different tasks.
The decoder w_t outputs the argmax to produce a single prediction.
Note that the amount of additional representation capacity added per task by SiLLy-F is a function of the amount and complexity of the data for the new task. Contrast this with SiLLy-N and other deep net based modular or representation ensembling approaches, which choose a priori how much additional representation to add, prior to seeing the new task data. Thus, the capacity, space complexity, and time complexity of SiLLy-F scale with the complexity and sample size of each task. In contrast, ProgNN, SiLLy-N (and others like them) add a fixed capacity for each task, even if the tasks have very different sample sizes and complexities.

4 A computational taxonomy of lifelong learners
The space complexity of a learner refers to the amount of memory needed to store the learner [41]. We also study the representation capacity of these algorithms. Capacity is defined as the size of the subset of hypotheses achievable by the learning algorithm [42].
We use the soft-O notation Õ to quantify complexity [43]. Letting n be the sample size and T be the number of tasks, we write that the capacity, space, or time complexity of a lifelong learning algorithm is f(n, T) = Õ(g(n, T)) when |f| is bounded above asymptotically by a function g of n and T up to a constant factor and polylogarithmic terms. To simplify the calculation, we make the following assumptions:
1. Each task has the same number of training samples.
2. Capacity grows linearly with the number of trainable parameters in the model.
3. The number of epochs is fixed for each task.
4. For the algorithms with dynamically expanding capacity, we assume the worst case scenario where an equal amount of capacity is added to the hypothesis with each additional task.

Assumption 3 enables us to write time complexity as a function of the sample size. Table 1 summarizes the capacity, space, and time complexity of several reference algorithms, as well as our SiLLy-N and SiLLy-F. For space and time complexity, the table shows results as a function of n and T, as well as the common scenario where sample size is fixed per task, and therefore proportional to the number of tasks, n ∝ T. For a detailed calculation of time complexity see Appendix A.3.
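The quadratic-versus-linear distinction for lateral connections, discussed below, can be illustrated with a simple counting argument. This sketch counts cross-column connection blocks rather than exact parameters, which is a simplification:

```python
def prognn_lateral_blocks(T):
    # ProgNN adds a lateral connection from every earlier column to each new
    # column, so cross-column connection blocks grow as T(T-1)/2, i.e., Õ(T^2).
    return T * (T - 1) // 2

def silly_n_blocks(T):
    # SiLLy-N's encoders are independent (no lateral connections), so the
    # number of per-task components grows only linearly, i.e., Õ(T).
    return T

assert prognn_lateral_blocks(10) == 45
assert silly_n_blocks(10) == 10
```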
Parametric lifelong learning methods have a representational capacity that is invariant to sample size and task number. Although the space complexity of some of these algorithms grows (because the number of constraints stored by the algorithms grows, or they continue to store more data), their capacity is fixed. Thus, given a sufficiently large number of tasks with increasing complexity, in general, eventually all parametric methods will catastrophically forget. EWC [10], Online EWC [13], SI [11], and LwF [12] are all examples of parametric lifelong learning algorithms. Our fixed resource algorithms are also parametric. For comparison, we implement another baseline algorithm, referred to as Total Replay, which is also parametric. Total Replay replays both old and current task data while learning a new task.

Table 1: Capacity, space, and time complexity of the representation learned by various lifelong learning algorithms. We show soft-O notation (Õ(•, •), defined in the main text) as a function of n = Σ_t n_t and T, as well as the common setting where n is proportional to T. The bottom three rows show algorithms whose space and time both grow quasilinearly with capacity growing. [Table body not recovered; columns are Capacity, Space, Time, and Examples, with SiLLy-F and IBP-WF among the listed examples.]

Semi-parametric algorithms' representational capacity grows more slowly than the sample size. For example, if T increases more slowly than n (e.g., T ∝ log n), then algorithms whose capacity is proportional to T are semi-parametric. ProgNN [16] is semi-parametric; nonetheless, its space complexity is Õ(T²) due to the lateral connections. Moreover, the time complexity of ProgNN also scales quadratically with n when n ∝ T. Thus, an algorithm that literally stores all the data it has ever seen, and retrains a fixed size network on all those data with the arrival of each new task, would have smaller space complexity and the same time complexity as ProgNN. DF-CNN [17] improves upon ProgNN by introducing a "knowledge base" with lateral connections to each new column, thereby avoiding all pairwise connections. Because these semi-parametric methods have a fixed representational capacity per task, they will either lack the representation capacity to perform well given sufficiently complex tasks, and/or will waste resources on very simple tasks.
SiLLy-N and SiLLy-F eliminate the lateral connections between columns of the network, thereby reducing space complexity to Õ(T). Moreover, as shown in Figure 3, the memory consumed by new channels is negligible compared to the memory required for storing the encoders. Similarly, the time required for updating channels is negligible compared with the time required for training a new encoder. Indian Buffet Process for Weight Factors (IBP-WF) is another non-parametric lifelong learning algorithm.
5 Providing intuition of simple lifelong learning through simulations

5.1 Forward and backward transfer in a simple environment

Consider a very simple two-task environment: Gaussian XOR and Gaussian Exclusive NOR (XNOR) (Figure 4A; see Appendix A.4 for details). The two tasks share the exact same discriminant boundaries: the coordinate axes. Thus, transferring from one task to the other merely requires learning a bit flip of the class labels. We sample a total of 750 samples from XOR, followed by another 750 samples from XNOR.
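A sampler for this two-task environment can be sketched as follows. The quadrant means and variances here are our own assumptions for illustration; see Appendix A.4 for the actual simulation parameters:

```python
import random

def gaussian_xor(n, flip=False, seed=0):
    """Sample n points from a Gaussian XOR; flip=True yields XNOR (same
    boundaries, class labels bit-flipped). Means/variances are assumptions."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        sx, sy = rng.choice([(-1, -1), (-1, 1), (1, -1), (1, 1)])
        X.append((rng.gauss(0.5 * sx, 0.25), rng.gauss(0.5 * sy, 0.25)))
        label = int(sx * sy < 0)  # XOR: opposite-sign quadrants are class 1
        y.append(1 - label if flip else label)
    return X, y
```

Because XNOR is obtained by flipping labels only, the two tasks share identical discriminant boundaries, which is what makes cross-task transfer possible here.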
SiLLy-N and a deep network (DN) achieve the same generalization error on XOR when training with XOR data (Figure 4Bi). But because DN does not account for a change in task, when XNOR data appear, DN performance on XOR deteriorates (it catastrophically forgets). In contrast, SiLLy-N continues to improve on XOR given XNOR data, demonstrating backward transfer. Now consider the generalization error on XNOR (Figure 4Bii). Both SiLLy-N and DN are at chance levels for XNOR when only XOR data are available. When XNOR data become available, DN must unlearn everything it learned from the XOR data, and thus its performance on XNOR starts out nearly maximally inaccurate and quickly improves. On the other hand, because SiLLy-N can leverage the encoder learned using the XOR data, upon getting any XNOR data it immediately performs quite well, and then continues to improve with further XNOR data, demonstrating forward transfer (Figure 4Biii). SiLLy-N demonstrates positive forward and backward transfer for all sample sizes, whereas DN fails to demonstrate either forward or backward transfer, and eventually catastrophically forgets the previous task.

5.2 Forward and backward transfer for adversarial tasks
In the context of lifelong learning, we informally define a task t to be adversarial with respect to task t′ if the true joint distribution of task t, without any domain adaptation, impedes performance on task t′. In other words, training data from task t can only add noise, rather than signal, for task t′. An adversarial task for Gaussian XOR is Gaussian XOR rotated by 45° (R-XOR) (Figure 4Aiii). Training on R-XOR therefore impedes the performance of SiLLy-N on XOR, and thus backward transfer becomes negative, demonstrating graceful forgetting [44] (Figure 4Ci).
To further investigate this relationship, we design a suite of R-XOR examples, generalizing R-XOR from only 45° to any rotation angle between 0° and 90°, sampling 100 points from XOR and another 100 from each R-XOR (Figure 4Cii). Note that we could not run the experiment for many Monte Carlo repetitions to obtain a smooth curve; hence we show a regressed curve fitted to the low-repetition noisy curve. As the angle increases from 0° to 45°, backward transfer degrades (Figure 4Ciii). Together, these experiments indicate that the amount of transfer can be a complicated function of (i) the difficulty of learning good representations for each task, (ii) the relationship between the two tasks, and (iii) the sample size of each.
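The R-XOR family is generated by rotating the XOR samples about the origin; a minimal sketch:

```python
import math

def rotate(X, degrees):
    # R-XOR: rotate each XOR sample about the origin. At 45 degrees the
    # rotated decision boundary is maximally misaligned with the original axes.
    th = math.radians(degrees)
    c, s = math.cos(th), math.sin(th)
    return [(c * x - s * y, s * x + c * y) for x, y in X]
```

By symmetry of the XOR boundaries, a 90° rotation reproduces the original task (up to a label flip), which is why transfer recovers as the angle continues past 45°.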
6 Benchmark data experiments

For benchmark data, we build SiLLy-N encoders using the network architecture described in [45]. We use the same network architecture for all the benchmarked models. For the following experiments, we consider two modalities of real data: vision and speech.

Reference algorithms
We compared our approaches to 15 reference lifelong learning methods.
We also compare two variants of exact replay (Total Replay and Partial Replay) using the code provided in [45]. Both Total and Partial Replay store all the data they have ever seen, but Total Replay replays all of it upon acquiring a new task, whereas Partial Replay replays N samples, randomly sampled from the entire corpus, whenever we acquire a new task with N samples. Additionally, we compare our approach with more constrained ways of replaying old task data, including Averaged Gradient Episodic Memory (A-GEM) [49], Experience Replay (ER) [50], and Task-based Accumulated Gradients (TAG) [51].
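The distinction between the two exact-replay baselines amounts to how much of the stored corpus is revisited per new task; a sketch (simplified to index lists, with our own function name):

```python
import random

def replay_batch(memory, new_data, mode="partial", seed=0):
    """Sketch of the exact-replay baselines: Total Replay revisits the whole
    stored corpus; Partial Replay draws only len(new_data) samples from it."""
    if mode == "total":
        return list(memory)
    k = min(len(new_data), len(memory))
    return random.Random(seed).sample(memory, k)
```

Both variants store everything ever seen; they differ only in the per-task replay cost, which is why Total Replay's time complexity grows with the full corpus size.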
For the baseline "None", the network was incrementally trained on all tasks in the standard way, always using only the data from the current task. The implementations of all the algorithms are adapted from open source code [17,52]; for implementation details, see Appendix A.2.

Exploring and explaining transfer capabilities via CIFAR 10x10 dataset
The CIFAR 100 challenge [53] consists of 50,000 training and 10,000 test samples, each a 32x32 RGB image of a common object from one of 100 possible classes, such as apples and bicycles. CIFAR 10x10 divides these data into 10 tasks, each with 10 classes [17] (see Appendix A.5 for details).
Resource growing experiments SiLLy-N and Model Zoo demonstrate positive forward and backward transfer for every task in CIFAR 10x10; in contrast, the other algorithms do not exhibit any positive backward transfer (Figure 1, first column). Moreover, SiLLy-N and Model Zoo retained their accuracy while improving transfer (Figure 1, last column). ProgNN had a similar degree of forward transfer, but zero backward transfer, and requires quadratic space and time in sample size, unlike SiLLy-N, which requires quasilinear space and time.
Ablation experiments Our proposed algorithms can improve performance on all tasks (past and future) by both growing additional resources and replaying data from past tasks. Below we perform two ablation experiments using CIFAR 10x10 to measure the relative contributions of resource growth and replay to the performance of our proposed algorithms.

Constrained resource experiment
In this experiment, we ablate the capability of SiLLy-N to grow additional resources after learning 4 encoders. We also reduce the number of channels and nodes at each encoder layer by four times to keep the total number of parameters similar to the other constant-resource algorithms. As shown in the top row of Figure 5, SiLLy-N-4 still shows positive forward and backward transfer with constant resources. However, the accuracy of SiLLy-N-4 is reduced compared to that of the resource growing SiLLy-N in Figure 1. Note that all the baseline algorithms have negative backward transfer. This experiment indicates that constant resource mode operation of SiLLy-N may be advantageous when we have many tasks to learn and a decent storage budget available. We elaborate on this point later with a large scale dataset (Food1k).

Figure 5 (bottom row): Replaying old task data impacts backward transfer while keeping forward transfer unchanged for SiLLy-N.

Controlled replay experiment
In this experiment, we train four different versions of SiLLy-N sequentially on the 10 tasks from CIFAR 10x10. The only difference between the versions is the amount of old task data replayed: 40%, 60%, 80%, and 100%, respectively. As is apparent from Figure 5 (bottom), replaying old task data has no effect on forward transfer, but replaying more data improves backward transfer as the number of tasks increases.

Experiment using pretrained encoders
In this experiment, we explore the effect of using pretrained encoders on the performance of SiLLy-N. For this experiment only, we use ResNet-50, as weights pretrained on the ImageNet dataset are publicly available through the Python package Keras [54]. We initialize the weights of the encoders with the pretrained weights and do not freeze any layer during training. As shown in Figure 6, pretraining the encoders results in better accuracy and forward transfer, but less backward transfer for SiLLy-N. However, as it is unclear how to use pretrained encoders with the other baseline approaches, for a fair comparison we do not use pretrained encoders for SiLLy-N in the other experiments.
Adversarial experiments Consider the same CIFAR 10x10 experiments above, but, for Tasks 2 through 9, randomly permute the class labels within each task, rendering each of those tasks adversarial with regard to the first task (because the labels are uninformative). Figure 7A indicates that SiLLy-N shows positive backward transfer even with such label shuffling (the other algorithms did not demonstrate positive backward transfer). Now, consider a Rotated CIFAR experiment, which uses only data from the first task, divided into two equally sized subsets (making two tasks), where the second subset is rotated by different amounts (Figure 7, right). Backward transfer of SiLLy-N is nearly invariant to rotation angle, whereas the other approaches are far more sensitive to it. Note that a zero rotation angle corresponds to the two tasks having identical distributions. The fact that other algorithms fail to transfer even in this setting suggests that they may never be able to positively backward transfer. See Appendix A.5 for additional experiments using CIFAR 10x10.

Further investigating transfer in additional datasets with more classes, tasks, and/or samples
Spoken Digit In this experiment, we used the Spoken Digit dataset [55]. As shown in Figure 8, first column, SiLLy-N shows positive backward and forward transfer between the spoken digit tasks, in contrast to the other methods, some of which show only forward transfer, others only backward transfer, with none showing both and some showing neither. See Appendix A.5 for details of the experiment.

FOOD1k 50X20 Dataset
In this experiment, we use Food1k, a large scale vision dataset consisting of 1000 food categories from Food2k [56]. FOOD1k 50X20 splits these data into 50 tasks with 20 classes each. For each class, we randomly sampled 60 samples for training the models and used the rest of the data for testing. Because Model Zoo performs the best among the reference resource growing models on the CIFAR experiments, and LwF is the best performing resource constrained algorithm, we use only these two as reference models for the large scale experiment to avoid heavy computational cost. As shown in Figure 8, second column, SiLLy-N performs the best among all the algorithms on this large dataset.
In lifelong learning, we are often primarily concerned with situations in which we have a small number of samples per task. If we have enough samples per task, the learning agent need not transfer knowledge from other tasks. However, below we also experiment with non-trivial lifelong learning settings where the number of samples per task is high.

Split Mini-Imagenet
In this experiment, we used the Mini-Imagenet dataset [51]. The dataset was split into 20 tasks with 5 classes each. Each task has 2400 training samples and 600 testing samples. As shown in Figure 8, we get positive forward and backward transfer for SiLLy-N. However, although the number of samples per task is lower than that of 5-dataset, it is still quite high. Hence, Model Zoo outperforms all the algorithms in this experiment.

5-dataset
In this experiment, we used 5-dataset [51]. It consists of 5 tasks from five different datasets: CIFAR-10 [53], MNIST, SVHN [57], notMNIST [58], and Fashion-MNIST [59]. All monochromatic images were converted to RGB format and then resized to 3 × 32 × 32. As shown in Appendix Table 3, the number of training samples per task in 5-dataset is relatively high compared to the low data regime typically considered in lifelong learning settings. However, as shown in Figure 8, fourth column, SiLLy-N shows less forgetting than most of the reference algorithms. On the other hand, Model Zoo shows comparatively better performance in this relatively high task data size setup. Recall that SiLLy-N is based on bagging, and Model Zoo is based on boosting. It is well known that boosting often outperforms bagging when sample sizes are large.
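The 5-dataset preprocessing step (monochrome to RGB, resize to 3 × 32 × 32) can be sketched as follows. Note that this toy version pads or crops rather than interpolating, which a real resizing pipeline would do:

```python
import numpy as np

def to_rgb32(img):
    """Sketch of the 5-dataset preprocessing: tile a monochrome image to three
    channels and pad/crop to 32x32 (a real pipeline would interpolate)."""
    img = np.asarray(img, dtype=np.float32)
    if img.ndim == 2:                    # H x W -> 3 x H x W
        img = np.stack([img] * 3, axis=0)
    out = np.zeros((3, 32, 32), dtype=np.float32)
    h, w = img.shape[1], img.shape[2]
    out[:, :min(h, 32), :min(w, 32)] = img[:, :32, :32]
    return out
```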

Constant Resource Mode Operation
The binary distinction we made above, that algorithms either build resources or reallocate them, is a false dichotomy, and biologically unnatural. In biological learning, systems develop from building resources to fixed resources as they grow from juveniles to adults. To explore this continuum, we experiment on the FOOD1k 50X20 dataset using the constant resource mode of SiLLy-N described in Section 3. We evaluate the performance of SiLLy-N for different encoder budgets. As shown in Figure 9, the performance of SiLLy-N saturates after 30 encoders, though with only 5 encoders it still demonstrates forward and backward transfer.
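The paper's exact budget-enforcement policy is defined in Section 3; purely as an illustration of the idea of an encoder budget, the sketch below assumes one simple policy, stop growing encoders once the budget is reached while still giving every task a channel over the current encoders. The class and function names here are ours, not the paper's.

```python
# Minimal sketch (our own construction, not the paper's code) of a
# constant-resource mode: the ensemble stops growing new encoders once
# a fixed budget is reached, but every task still receives a channel
# defined over all encoders currently in the ensemble.

class BudgetedEnsemble:
    def __init__(self, budget):
        self.budget = budget          # maximum number of encoders kept
        self.encoders = []            # one encoder per (early) task
        self.channels = {}            # task id -> per-encoder channels

    def add_task(self, task_id, train_encoder_fn):
        # Grow a new encoder only while under budget (assumed policy).
        if len(self.encoders) < self.budget:
            self.encoders.append(train_encoder_fn(task_id))
        # Every task always gets a channel over all current encoders.
        self.channels[task_id] = [f"task{task_id}-via-enc{i}"
                                  for i in range(len(self.encoders))]

ens = BudgetedEnsemble(budget=5)
for t in range(8):                    # 8 tasks, budget of 5 encoders
    ens.add_task(t, train_encoder_fn=lambda tid: f"encoder{tid}")

print(len(ens.encoders))              # capped at the budget: 5
print(len(ens.channels))              # but all 8 tasks have channels
```

The point of the sketch is only that capacity (encoders) and task-specific read-outs (channels) grow on different schedules, so the user can trade transfer for a smaller parameter footprint.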
7 SiLLy-F on tabular data
In this experiment, we evaluate SiLLy-F, an additional realization of our approach that uses random forests as encoders (described in Section 3). We flatten the CIFAR 10X10 data and use it as tabular data. We train the two other best-performing algorithms, SiLLy-N and Model Zoo, using three fully-connected hidden layers, each with 2000 nodes, as encoders. As shown in Figure 10, SiLLy-F performs best among all the approaches. This experiment shows that our approach can serve as a general framework for lifelong learning with other machine learning models as encoders.
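To make the "forest as encoder" idea concrete, the toy sketch below stands in for a random forest with median decision stumps (a real SiLLy-F would use actual forests, which slot into the same interface): the encoder maps a tabular sample to a tuple of leaf indices, and a channel turns encoder outputs into class posteriors estimated from held-out data. All names here are ours, not the paper's API.

```python
# Toy stand-in for a forest encoder on tabular data (our construction).
# Each "tree" is a median stump; the encoder output is the tuple of
# leaf indices; the channel stores class posteriors per leaf pattern,
# estimated from held-out (OOB) rows rather than the training rows.

def train_stump(rows, feat):
    vals = sorted(r[0][feat] for r in rows)
    return feat, vals[len(vals) // 2]        # split at the median value

def encode(stumps, x):
    return tuple(int(x[f] > t) for f, t in stumps)

def fit_channel(stumps, rows, n_classes):
    counts = {}
    for x, y in rows:
        counts.setdefault(encode(stumps, x), [0] * n_classes)[y] += 1
    return {k: [c / sum(v) for c in v] for k, v in counts.items()}

# Two 2-D classes: class 0 in the lower-left, class 1 in the upper-right.
train = [((-2.0, -1.5), 0), ((-1.0, -2.0), 0), ((1.0, 2.0), 1), ((2.0, 1.5), 1)]
oob = [((-1.5, -1.0), 0), ((1.5, 1.0), 1), ((2.5, 2.5), 1)]

stumps = [train_stump(train, f) for f in (0, 1)]   # one stump per feature
channel = fit_channel(stumps, oob, n_classes=2)

def predict(x):
    post = channel.get(encode(stumps, x), [0.5, 0.5])
    return max(range(2), key=lambda c: post[c])

print(predict((-3.0, -3.0)))   # 0
print(predict((3.0, 3.0)))     # 1
```

Because the channel only consumes the encoder's output, swapping the stump forest for a deep network (SiLLy-N) or a genuine random forest (SiLLy-F) changes nothing downstream, which is the generality the experiment above demonstrates.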

Discussion
We introduced representation ensembling as a simple approach for lifelong learning.
Two specific algorithms, SiLLy-N and SiLLy-F, achieve both forward and backward transfer by leveraging resources learned for other tasks without undue computational burden. In this paper, we have mainly focused on the task-aware setting, because it is simpler. Future work will extend our approach to the more challenging task-unaware setting. Our code, including code to reproduce the experiments in this manuscript, is available at http://proglearn.neurodata.io/.
ell for helpful discussions. This work is graciously supported by the Defense Advanced Research Projects Agency (DARPA) Lifelong Learning Machines program through contracts FA8650-18-2-7834 and HR0011-18-2-0025. Research was partially supported by funding from Microsoft Research and the Kavli Neuroscience Discovery Institute.
A.1 Literature review
Prior work illustrates that ensembling learners can yield large advantages across a wide range of applications. For example, in classical machine learning, ensembling trees leads to state-of-the-art random forest [38] and gradient boosted tree algorithms [63]. Similarly, ensembling networks shows promising results in various real-world applications [64,65]. The authors of [66] used a weighted ensemble of learners in a streaming setting with distribution shift. TrAdaBoost [67] boosts an ensemble of learners to enable transfer learning. In continual learning, many algorithms have built on these ideas by ensembling dependent representations. For example, Learn++ [68] boosts ensembles of weak learners learned over different data sequences in class-incremental lifelong learning settings [69]. Model Zoo [47] uses the same boosting approach in task-incremental lifelong learning scenarios.
Another group of algorithms, ProgNN [16] and DF-CNN [17], learns a new "column" of nodes and edges with each new task and ensembles the columns for inference (such approaches are now commonly called 'modular'). The primary difference between ProgNN and DF-CNN is that ProgNN has forward connections to the current column from all past columns. This creates the possibility of forward transfer while precluding backward transfer, since past columns are frozen. Moreover, the forward connections in ProgNN render it computationally inefficient for a large number of tasks. DF-CNN gets around this problem by learning a common knowledge base, thereby also creating the possibility of backward transfer.
Recently, many other modular approaches have been proposed that improve on ProgNN's capacity growth. These methods treat the capacity for each task as being composed of modules that can be shared across tasks and grown as necessary. For example, PackNet [70] starts with a fixed-capacity network and trains on additional tasks by freeing up a portion of the network's capacity via iterative pruning. Veniat [26] trains additional modules with each new task, and the old modules are only used selectively. Another paper [46] improved the memory efficiency of modular methods by adding new modules according to the complexity of each new task. The authors of [71] proposed a non-parametric factorization of the layer weights that promotes weight sharing between tasks. However, all of the modular methods described above lack backward transfer, because the old modules are not updated with the new tasks. Dynamically Expandable Representation (DER) [72] proposed an improvement over these modular approaches, in which model capacity is dynamically expanded and the model is fine-tuned by replaying a portion of the old task data along with the new task data; its authors report backward transfer between tasks in their experiments. Another strategy for building lifelong learning machines is to use total or partial replay [45,73,74]. Replay approaches keep the old data and replay them when faced with new tasks to mitigate catastrophic forgetting. However, as we illustrate, previously proposed replay algorithms do not demonstrate positive backward transfer in our experiments, though they often forget less than other approaches.
Our approach builds directly on previously proposed modular and replay approaches, with one key distinction: in our approach, representations are learned independently. Empirically, for low sample sizes, random forests (which learn independent trees) typically outperform gradient boosted trees (which learn dependent trees) [62,75,76]. Because our representation ensembling is analogous to a random forest, we expect learning independent representations to outperform learning dependent representations in these scenarios as well; this is shown empirically in Figures 1 and 8 of the main text. Independent representations also have computational advantages: they require only quasilinear time and space, and can be learned in parallel.
Algorithms 1, 2, 3, and 4 provide pseudocode for adding encoders, updating channels, and making predictions for any SiLLy-X algorithm. Whenever the learner gets access to data for a new task, we use Algorithm 1 to train a new encoder for that task. We split the data into two portions: one set is used to learn the encoder, and the other, held-out portion, called the out-of-bag (OOB) data, is returned by Algorithm 1 and used by Algorithm 2 to learn the channel for the corresponding task. Note that we push the OOB data through the in-task encoder and the whole dataset through the cross-task encoders to update the channel, i.e., to learn the posteriors according to the new encoder. We then use Algorithm 3 to replay the old task data through the new encoder and update the corresponding channels. Finally, to predict on a test sample, we use Algorithm 4: given the task identity, we use the corresponding channel to compute the average estimated posterior and predict the class label as the argmax of the estimated posteriors.
Algorithm 1: Add a new SiLLy-X encoder for a task. OOB = out-of-bag.
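The flow of Algorithms 1 through 4 can be sketched end to end in a few dozen lines. The sketch below is our own self-contained illustration, not the authors' implementation: "encoders" here are toy nearest-centroid maps rather than deep networks or forests, but the bookkeeping mirrors the text, split each task's data into an encoder-fitting part and an OOB part, fit the in-task channel on OOB data only, fit cross-task channels on full data, replay old task data through each new encoder, and predict by averaging posteriors.

```python
# Hedged sketch of Algorithms 1-4 (our simplification, not the paper's
# code). Encoders are toy 1-D nearest-centroid maps.

def train_encoder(data):
    """Algorithm 1 analogue: fit an encoder on half the task data and
    return it along with the held-out (OOB) half."""
    half = len(data) // 2
    fit, oob = data[:half], data[half:]
    groups = {}
    for x, y in fit:
        groups.setdefault(y, []).append(x)
    centroids = {y: sum(xs) / len(xs) for y, xs in groups.items()}
    encoder = lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))
    return encoder, oob

def fit_channel(encoder, data, n_classes):
    """Algorithm 2/3 analogue: class posteriors given encoder output."""
    counts = {}
    for x, y in data:
        counts.setdefault(encoder(x), [0] * n_classes)[y] += 1
    return {k: [c / sum(v) for c in v] for k, v in counts.items()}

class SiLLySketch:
    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.encoders, self.task_data, self.channels = [], [], []

    def add_task(self, data):
        enc, oob = train_encoder(data)               # Algorithm 1
        self.encoders.append(enc)
        self.task_data.append(data)
        # Algorithm 2: in-task encoder sees only OOB data; the
        # cross-task (old) encoders see the whole new dataset.
        ch = [fit_channel(e, data, self.n_classes)
              for e in self.encoders[:-1]]
        ch.append(fit_channel(enc, oob, self.n_classes))
        self.channels.append(ch)
        # Algorithm 3: replay old task data through the new encoder.
        for t in range(len(self.channels) - 1):
            self.channels[t].append(
                fit_channel(enc, self.task_data[t], self.n_classes))

    def predict(self, task_id, x):
        """Algorithm 4: average posteriors over all encoders, argmax."""
        post = [0.0] * self.n_classes
        for enc, ch in zip(self.encoders, self.channels[task_id]):
            p = ch.get(enc(x), [1 / self.n_classes] * self.n_classes)
            for c in range(self.n_classes):
                post[c] += p[c] / len(self.encoders)
        return max(range(self.n_classes), key=lambda c: post[c])

model = SiLLySketch(n_classes=2)
model.add_task([(-1.2, 0), (0.8, 1), (-0.8, 0), (1.2, 1)])   # task 0
model.add_task([(-2.2, 0), (1.8, 1), (-1.8, 0), (2.2, 1)])   # task 1
print(model.predict(0, -1.5), model.predict(0, 1.5))          # 0 1
```

After the second `add_task`, task 0's channel has gained an entry for the new encoder, which is exactly the mechanism enabling backward transfer in the schema of Figure 2E.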

A.2 Reference Algorithm Implementation Details
The same network architecture was used for all baseline deep learning methods. Following the work in [45], the 'base network architecture' consisted of five convolutional layers followed by two fully-connected layers, each containing 2000 nodes with ReLU non-linearities, and a softmax output layer. The convolutional layers had 16, 32, 64, 128, and 254 channels; they used batch-norm and a ReLU non-linearity, with a 3x3 kernel, a padding of 1, and a stride of 2 (except the first layer, which had a stride of 1). This architecture was used with a multi-headed output layer (i.e., a different output layer for each task) for all algorithms using a fixed-size network. For ProgNN and DF-CNN, the same architecture was used for each column introduced for each new task, and in our SiLLy-N this architecture was used for the transformers u_t (see above). Among the reference algorithms, EWC, O-EWC, LwF, SI, Total Replay, and Partial Replay results were produced using the repository https://github.com/GMvandeVen/progressive-learning-pytorch. For ProgNN and DF-CNN we used the code provided in https://github.com/Lifelong-ML/DF-CNN. For all other reference algorithms, we modified the code provided by the authors to match the deep net architecture described above and used the default hyperparameters provided in the code.
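The spatial dimensions implied by this base architecture can be sanity-checked with the standard convolution output-size formula. The snippet below is our own illustration (not code from either repository linked above), applied to a CIFAR-style 32 × 32 input.

```python
# Spatial size through the base network's conv stack, using the
# standard formula: out = floor((in + 2*pad - kernel) / stride) + 1.
# Kernel 3, padding 1, stride 2 everywhere except the first layer
# (stride 1), as described in the text above.

def conv_out(size, kernel=3, pad=1, stride=2):
    return (size + 2 * pad - kernel) // stride + 1

sizes = [32]                      # CIFAR-style 32x32 input
strides = [1, 2, 2, 2, 2]         # first conv has stride 1
for s in strides:
    sizes.append(conv_out(sizes[-1], stride=s))

print(sizes)                      # [32, 32, 16, 8, 4, 2]
```

So the feature map entering the fully-connected layers is 2 × 2 spatially, with the final conv layer's channel count.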

A.3 Training Time Complexity Analysis
Consider a lifelong learning environment with T tasks, each with n′ samples, i.e., n = n′T total training samples. For all algorithms with time complexity Õ(n), training time grows linearly with the number of training samples. We discuss the algorithms with nonlinear time complexity below.
EWC Suppose the time required to train the weights for each task in EWC is k_c n′, and each previously learned task adds an additional k_l n′ of training time through its regularization term, where k_c and k_l are constants. The time required to learn all T tasks can then be written as: ∑_{t=1}^{T} [k_c n′ + (t − 1) k_l n′] = k_c n + k_l n′ T(T − 1)/2 = Õ(nT).
Total Replay Suppose the time to train the model on n′ samples is k_c n′. Since task t replays all data seen so far, the time required to learn all T tasks can be written as: ∑_{t=1}^{T} k_c t n′ = k_c n′ T(T + 1)/2 = Õ(nT).
ProgNN Suppose the time required to train each column in ProgNN is k_c n′, and each lateral connection can be learned in time k_l n′. Since the column for task t receives lateral connections from all t − 1 previous columns, the time required to learn all T tasks can be written as: ∑_{t=1}^{T} [k_c n′ + (t − 1) k_l n′] = k_c n + k_l n′ T(T − 1)/2 = Õ(nT).
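These growth rates can be checked numerically. The snippet below uses abstract cost units with all constants and n′ set to 1 (our simplification, not measured runtimes), summing the per-task costs described in this section.

```python
# Numerical check of per-task training-cost sums, with k_c = k_l = 1
# and n' = 1 (abstract cost units, not wall-clock measurements).

def cost_linear(T):         # any O~(n) algorithm: constant cost per task
    return sum(1 for t in range(1, T + 1))

def cost_ewc_or_prognn(T):  # per-task cost grows with number of prior tasks
    return sum(1 + (t - 1) for t in range(1, T + 1))

def cost_total_replay(T):   # task t retrains on all t tasks' data
    return sum(t for t in range(1, T + 1))

for T in (10, 20, 40):
    print(T, cost_linear(T), cost_ewc_or_prognn(T), cost_total_replay(T))
# Doubling T doubles the linear cost but roughly quadruples the others.
```

The quadratic terms dominate quickly: at T = 40 the replay and lateral-connection costs are already more than 20 times the linear baseline.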

A.4 Simulated Results
In each simulation, we constructed an environment with two tasks. We sample 750 times from the first task, followed by 750 times from the second task; these 1,500 samples comprise the training data. We sample another 1,000 held-out samples to evaluate the algorithms. For SiLLy-N, we used a deep network (DN) architecture with two hidden layers, each with 10 nodes. For the SiLLy-N experiments, we performed 100 repetitions and report the results after smoothing with a moving average of window size 5.
Gaussian XOR Gaussian XOR is a two-class classification problem with equal class priors. Conditioned on being in class 0, a sample is drawn from a mixture of two Gaussians with means ±[0.5, 0.5]^T and variances proportional to the identity matrix. Conditioned on being in class 1, a sample is drawn from a mixture of two Gaussians with means ±[0.5, −0.5]^T and variances proportional to the identity matrix.
Gaussian XNOR has the same distribution as Gaussian XOR, but with the class labels flipped. Rotated XOR (R-XOR) rotates XOR by θ degrees.
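A sampler for the Gaussian XOR distribution above is straightforward to write. In the sketch below, the variance scale (0.1) is our own choice, since the text only requires variances proportional to the identity; everything else follows the stated definition.

```python
import random

# Sampler for the Gaussian XOR simulation described above: class 0 is
# a mixture of Gaussians at +-[0.5, 0.5], class 1 at +-[0.5, -0.5],
# with equal class priors. The variance scale sigma is our choice.

def sample_gaussian_xor(n, sigma=0.1, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randrange(2)              # equal class priors
        sign = rng.choice([-1.0, 1.0])    # which mixture component
        mean = (sign * 0.5, sign * (0.5 if y == 0 else -0.5))
        x = (rng.gauss(mean[0], sigma), rng.gauss(mean[1], sigma))
        data.append((x, y))
    return data

train = sample_gaussian_xor(750)
print(len(train))                          # 750, as in the simulations
```

XNOR is obtained by flipping the labels of such samples, and R-XOR by applying a 2-D rotation by θ to every x.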

CIFAR 10x10 Repeated Classes
We also considered the setting where each task is defined by a random sampling of 10 out of 100 classes with replacement. This environment is designed to demonstrate the effect of tasks with shared subtasks, a common property of real-world lifelong learning tasks. Supplementary Figure 2 shows transfer of SiLLy-F and SiLLy-N on Task 1.
Spoken Digit experiment In this experiment, we used the Spoken Digit dataset provided at https://github.com/Jakobovski/free-spoken-digit-dataset. The dataset contains audio recordings from six different speakers, with 50 recordings for each digit per speaker (3000 recordings in total). The experiment was set up with six tasks, where each task contains recordings from only one speaker. For each recording, a spectrogram was extracted using Hanning windows of duration 16 ms with an overlap of 4 ms between adjacent windows. The spectrograms were resized down to 28 × 28. The spectrograms extracted from eight random recordings of '5' for the six speakers are shown in Figure 4. For each Monte Carlo repetition of the experiment, the spectrograms extracted for each task were randomly divided into a 55% train and 45% test set. The experiment is summarized in Figure 5. Note that we could not run the experiment on the other five reference algorithms using the code provided by their authors.
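The spectrogram preprocessing above (16 ms Hanning windows with 4 ms overlap, hence a 12 ms hop) can be sketched with a small pure-Python short-time DFT. The 8 kHz sampling rate below matches the dataset's recordings, and the naive DFT is our simplification of whatever FFT routine the authors actually used.

```python
import math

# Minimal spectrogram sketch matching the preprocessing described
# above: Hann windows of 16 ms with 4 ms overlap (a 12 ms hop).
# The naive O(N^2) DFT is our simplification, for illustration only.

def spectrogram(signal, rate=8000, win_ms=16, overlap_ms=4):
    win = int(rate * win_ms / 1000)                  # samples per window
    hop = int(rate * (win_ms - overlap_ms) / 1000)   # samples per hop
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (win - 1))
            for i in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        seg = [s * h for s, h in zip(signal[start:start + win], hann)]
        # magnitude of the DFT, keeping only non-negative frequencies
        mags = []
        for k in range(win // 2 + 1):
            re = sum(s * math.cos(2 * math.pi * k * i / win)
                     for i, s in enumerate(seg))
            im = -sum(s * math.sin(2 * math.pi * k * i / win)
                      for i, s in enumerate(seg))
            mags.append(math.hypot(re, im))
        frames.append(mags)
    return frames    # time x frequency matrix

# 0.2 s of a 1 kHz tone sampled at 8 kHz
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(1600)]
spec = spectrogram(tone)
print(len(spec), len(spec[0]))   # 16 65
```

A 1 kHz tone at 8 kHz sampling lands exactly on DFT bin 16 of the 128-sample window (1000 × 128 / 8000), so the magnitude peak sits at that bin; in the actual pipeline, the resulting time-frequency matrix is then resized to 28 × 28.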

arXiv:2004.12908v19 [cs.AI] 11 Jun 2024. All our code and experiments are open source to facilitate reproducibility.
(Figure axis labels: SiLLy-N*, ProgNN*, Model Zoo*, LMC*, CoSCL*, DF-CNN*, LwF, Partial Replay, Total Replay, A-GEM, None.)

Figure 2: Schemas of composable hypotheses. A. Single-task learner. B. Ensembling decisions (as output by the channels) is a well-established practice, including random forests and gradient boosted trees. C. Learning a joint representation, or D. learning future representations that depend on past encoders, was previously used in lifelong learning scenarios, but encoders were not trained independently, as in E. Note that the new encoders in E interact with the previous encoders through the channel layer (red arrows), thereby enabling backward transfer. Likewise, the old encoders interact with the future encoders (black arrows), thereby enabling forward transfer.

Figure 3: Storage space as a function of the number of tasks in CIFAR 10X10. Memory consumed by SiLLy-N is dominated by the encoder size. The size of DF-CNN remains constant throughout.

Figure 4: SiLLy-N demonstrates forward and backward transfer. (A) 750 samples from: (Ai) Gaussian XOR, (Aii) XNOR, which has the same optimal discriminant boundary as XOR, and (Aiii) R-XOR, which has a discriminant boundary that is uninformative, and therefore adversarial, to XOR. (Bi) Generalization error for XOR and (Bii) XNOR. SiLLy-N outperforms DN on XOR when XNOR data are available, and on XNOR when XOR data are available. (Biii) Forward and backward transfer of SiLLy-N are positive for all sample sizes. (Ci) In an adversarial task setting, SiLLy-N gracefully forgets XOR, whereas DN catastrophically forgets and interferes. (Cii) Backward Transfer is maximally positive with respect to XOR when the optimal decision boundary of θ-XOR is similar to that of XOR (i.e., angles far from 45°), and negative otherwise. The dashed line shows the regression line fitted through the original points. (Ciii) Backward Transfer is a nonlinear function of the source training sample size (XOR sample size is fixed at 500).

Figure 5: Ablation experiments on SiLLy-N using CIFAR 10X10. Top row: SiLLy-N uniquely shows both positive forward and backward transfer while operating with the same number of parameters as the other constant-parameter baselines (SiLLy-N-4, EWC, Total Replay, Partial Replay, LwF).

Figure 6: Pretrained encoders on CIFAR 10X10. Using pretrained encoders results in better transfer and accuracy for SiLLy-N.

Figure 7: Extended CIFAR 10x10 experiments. A. Shuffling class labels within tasks two through nine, with 500 samples each, demonstrates that SiLLy-N can still achieve positive backward transfer, and that the other algorithms still fail to transfer. B. SiLLy-N is nearly invariant to rotations, whereas the other approaches are more sensitive to rotation.

Figure 8: Performance summary on vision and speech benchmark datasets with varying sample sizes per task; datasets further right have more samples per task. Split Mini-Imagenet and 5-dataset have very high sample sizes per task, which is a non-trivial lifelong learning setting. SiLLy-N performs best in the low sample size regime (left two columns), which is the regime of greatest interest in lifelong learning. See Appendix Figures 3 and 5 for extended results.

Figure 9: SiLLy-N in constant resource mode. Improvement in transfer between tasks becomes negligible beyond roughly 30 encoders. The user can choose to operate with a lower budget at the cost of less transfer.

Figure 10: Our proposed approach can be used with random forests as encoders on tabular data (SiLLy-F). SiLLy-F shows more positive forward and backward transfer while operating with fewer parameters than the other baseline approaches on tabular data.

Figure 1: Performance summary on vision and audition benchmark datasets using Veniat's [26] statistics. See Figure 1 for caption details. Note that the results here look nearly identical other than the y-axis labels.

Figure 2: SiLLy-N and SiLLy-F transfer knowledge effectively when tasks share common classes. Each task is a random selection of 10 out of the 100 CIFAR-100 classes. Both SiLLy-F and SiLLy-N demonstrate monotonically increasing transfer efficiency for up to 20 tasks.

Figure 3: Extended results on the different vision experiments. This plot contains algorithms not shown in Figure 8.
Figure 4: Spectrograms ("Fourier Transform Spectrogram of Number 5") extracted from eight different recordings of six speakers uttering the digit 'five'.
Figure 5: Summary of the spoken digit experiment.
The 45°-XOR is the maximally adversarial R-XOR. Thus, as the angle increases further, Backward Transfer increases back up to ≈ 0.18 at 90°, which has an identical discriminant boundary to XOR. Moreover, when θ is fixed at 25°, Backward Transfer increases at different rates for different sample sizes of the source task (Figure 4Ciii); as θ approaches 45°, Backward Transfer gradually decreases for SiLLy-N.
Algorithm 2: Add a new SiLLy-X channel for the current task (v_t ← u_t.update_channel(x_t, y_t, v_t) ▷ update the channel for task t using the old encoders).
Algorithm 3: Update the SiLLy-X channels for the previous tasks.

Algorithm 4: Predict a class label using SiLLy-X.
Table 1: Hyperparameters for SiLLy-N in the CIFAR 10X10, 5-dataset, Split Mini-Imagenet, and FOOD1k experiments. Note that we use the same hyperparameters for all these experiments.

Table 4: Hyperparameters for SiLLy-F in the spoken digit experiment.