Combining Ensemble Kalman Filter and Reservoir Computing to predict spatio-temporal chaotic systems from imperfect observations and models

Prediction of spatio-temporal chaotic systems is important in various fields, such as Numerical Weather Prediction (NWP). While data assimilation methods have long been applied in NWP, machine learning techniques, such as Reservoir Computing (RC), have recently been recognized as promising tools to predict spatio-temporal chaotic systems. However, the sensitivity of the skill of machine-learning-based prediction to the imperfectness of observations is unclear. In this study, we evaluate the skill of RC with noisy and sparsely distributed observations. We intensively compare the performances of RC and the Local Ensemble Transform Kalman Filter (LETKF) by applying them to the prediction of the Lorenz 96 system. Although RC can successfully predict the Lorenz 96 system if the system is perfectly observed, we find that RC is vulnerable to observation sparsity compared with LETKF. To overcome this limitation of RC, we propose to combine LETKF and RC. In our proposed method, the system is predicted by RC that has learned the analysis time series estimated by LETKF. Our proposed method can successfully predict the Lorenz 96 system using noisy and sparsely distributed observations. Most importantly, our method can predict better than LETKF when the process-based model is imperfect.


Introduction
In Numerical Weather Prediction (NWP), it is necessary to obtain an optimal estimate of the atmospheric state variables from observations and process-based models, both of which are imperfect. Observations of atmospheric states are sparse and noisy, and numerical models inevitably include biases. In addition, models used in NWP are known to be chaotic, which makes prediction substantially difficult. To accurately predict the future atmospheric state, it is important to develop methods to predict spatio-temporal chaotic dynamical systems from imperfect observations and models.
Traditionally, data assimilation methods have been widely used in geosciences and NWP systems. Data assimilation is a generic term for approaches that estimate the state from observations and model outputs based on their errors. The state estimated by data assimilation is used as the initial value, and the future state is predicted by the model alone. Data assimilation is currently adopted in operational NWP systems. Many data assimilation frameworks have been proposed, e.g. 4D variational methods (4D-VAR; Bannister, 2017), the Ensemble Kalman Filter (EnKF; Houtekamer & Zhang, 2016), and their derivatives, and they have been applied to many kinds of weather prediction tasks, such as the prediction of short-term rainfall events (e.g. Sawada et al., 2019; Yokota et al., 2018) and severe storms (e.g. Zhang et al., 2016). Although data assimilation can efficiently estimate unobservable state variables from noisy observations, the prediction skill is degraded if the model has large biases.
On the other hand, model-free prediction methods based on machine learning have been receiving much attention recently. Many previous studies have successfully applied machine learning to predict chaotic dynamics. Vlachas et al. (2018) successfully applied Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber, 1997) to predict the dynamics of the Lorenz 96 model, the Kuramoto-Sivashinsky equation, and a barotropic climate model, which is a simple atmospheric circulation model. Asanjan et al. (2018) showed that LSTM can accurately predict future precipitation by learning satellite observation data. Nguyen & Bae (2020) successfully applied LSTM to generate area-averaged precipitation predictions for hydrological forecasting.
In addition to LSTM, Reservoir Computing (RC), which was first introduced by Jaeger & Haas (2004), has been found to be suitable for predicting spatio-temporal chaotic systems. Pathak et al. (2017) successfully applied RC to predict the dynamics of the Lorenz equations and the Kuramoto-Sivashinsky equation. Lu et al. (2017) showed that RC can be used to estimate state variables from sparse observations if the whole system was perfectly observed as training data. Chattopadhyay et al. (2019) revealed that RC can predict the dynamics of the Lorenz 96 model more accurately than LSTM and Artificial Neural Networks. In addition to the accuracy, RC also has an advantage in computational cost.
RC can learn the dynamics by training a single matrix just once, while other neural networks have to train numerous parameters over many iterations. Thanks to this feature, the computational cost of training RC is much lower than that of LSTM or Artificial Neural Networks.
However, Vlachas et al. (2020) revealed that the prediction accuracy of RC is degraded when not all of the state variables can be observed. This can be a serious problem, since observation sparsity is common in geosciences and NWP systems. Brajard et al. (2020) pointed out this issue and successfully trained a Convolutional Neural Network with sparse observations by combining it with EnKF. However, their method needs to iterate the data assimilation and training, which is computationally expensive and likely infeasible for real-world problems. Dueben & Bauer (2018) mentioned that the spatio-temporal heterogeneity of observation data makes it difficult to train machine learning models, and they suggested using model output or reanalysis as training data. Weyn et al. (2019) successfully trained machine learning models using atmospheric reanalysis data.
We aim to propose a novel methodology to predict spatio-temporal chaotic systems from imperfect observations and models. First, we reveal the limitation of the stand-alone use of RC under realistic situations (i.e., imperfect observations and models). Then, we propose a new method that maximizes the potential of RC to predict chaotic systems from imperfect models and observations, while remaining computationally feasible. As Dueben & Bauer (2018) proposed, we make RC learn the analysis data series generated by a data assimilation method. Our new method can accurately predict the system from imperfect observations. Most importantly, we found that our proposed method is more robust to model biases than the stand-alone use of data assimilation methods.

Lorenz 96 model and OSSE
We used a low dimensional spatio-temporal chaotic model, the Lorenz 96 model (L96), to perform experiments with various parameter settings. L96 is a model introduced by Lorenz & Emanuel (1998) and has been commonly used in data assimilation studies (e.g. Kotsuki et al., 2017;Miyoshi, 2005;Penny, 2014;Raboudi et al., 2018). L96 is recognized as a good testbed for the operational NWP problems (Penny, 2014).
In this model, we consider a ring-structured, $N$-dimensional discrete state space $x_1, x_2, \dots, x_N$ (that is, $x_N$ is adjacent to $x_1$), and define the system dynamics as follows:

$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F \qquad (1)$$

where $F$ stands for the forcing parameter. The terms of this equation correspond to advection, damping, and forcing, respectively. It is known that various settings of the state dimension $N$, forcing $F$, and initial values result in chaotic solutions. The time width $\Delta t = 0.2$ corresponds to one day of real atmospheric motion in terms of Lyapunov time (Miyoshi, 2005).
As we use this conceptual model, we cannot obtain any observational data or "true" phenomena that correspond to the model. Instead, we adopted the Observing System Simulation Experiment (OSSE) framework. We first prepared a time series by integrating equation (1) and regarded it as the "true" dynamics (called the Nature Run). Observation data can be calculated from this time series by adding perturbations:

$$\mathbf{y}_t = \mathbf{H}\mathbf{x}_t + \boldsymbol{\varepsilon}_t \qquad (2)$$

where $\mathbf{y}_t \in \mathbb{R}^{h}$ is the observation value, $\mathbf{H}$ is the $h \times N$ observation matrix, and $\boldsymbol{\varepsilon}_t \in \mathbb{R}^{h}$ is the observational error, each element of which is independently and identically distributed as a Gaussian $N(0, \sigma^2)$ with observation error standard deviation $\sigma$.
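As a concrete illustration, the Nature Run and the OSSE observations can be generated with a few lines of NumPy. This is a minimal sketch: the shortened spin-up length and the choice of observed grid points are illustrative, not the paper's exact settings.

```python
import numpy as np

def l96_tendency(x, F=8.0):
    """Lorenz 96 tendency: dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt=0.005, F=8.0):
    """One 4th-order Runge-Kutta step of the L96 model."""
    k1 = l96_tendency(x, F)
    k2 = l96_tendency(x + 0.5 * dt * k1, F)
    k3 = l96_tendency(x + 0.5 * dt * k2, F)
    k4 = l96_tendency(x + dt * k3, F)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

def make_osse_obs(nature, obs_idx, sigma, rng):
    """Observe selected grid points of the Nature Run with Gaussian noise (eq. 2)."""
    return nature[:, obs_idx] + rng.normal(0.0, sigma, (nature.shape[0], len(obs_idx)))

# Nature Run: spin up onto the attractor, then record the trajectory
rng = np.random.default_rng(0)
x = rng.normal(size=8)
for _ in range(10000):            # shortened spin-up for illustration
    x = rk4_step(x)
nature = np.empty((1000, 8))
for t in range(1000):
    x = rk4_step(x)
    nature[t] = x
obs = make_osse_obs(nature, obs_idx=[0, 2, 4, 6], sigma=1.0, rng=rng)
```

Here `np.roll(x, -1)` plays the role of $x_{i+1}$ on the ring, so the periodic boundary is handled without index bookkeeping.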
In each experiment, the form of L96 used to generate the Nature Run is assumed to be unknown, and the model used to make the prediction can differ from that used for the Nature Run. The difference between the model used for the Nature Run and that used for prediction corresponds to the model bias in the context of NWP.

Local Ensemble Transform Kalman Filter
We used the Local Ensemble Transform Kalman Filter (LETKF; Hunt et al., 2007) as the data assimilation method in this study. LETKF is an ensemble-based data assimilation method that many previous studies have found applicable to NWP problems (Sawada et al., 2019; Yokota et al., 2018). LETKF is also used for operational NWP in some countries (e.g. Schraff et al., 2016).
LETKF and the family of ensemble Kalman filters have two steps: forecast and analysis. The forecast step makes the prediction from the analysis variables of the current time to the time when the next observation is obtained (this time width is called the "assimilation window"). Considering the stochastic error in the model, the system dynamics can be represented as follows (hereafter the subscript $t$ stands for the variable at time $t$, and the time width of one step corresponds to the assimilation window):

$$\mathbf{x}^f_t = \mathcal{M}(\mathbf{x}^a_{t-1}) + \boldsymbol{\eta}_t, \qquad \boldsymbol{\eta}_t \sim N(\mathbf{0}, \mathbf{Q}) \qquad (3)$$

where $\mathbf{x}^f_t \in \mathbb{R}^N$ is the forecast variables, $\mathbf{x}^a_{t-1} \in \mathbb{R}^N$ is the analysis variables, $\mathcal{M}: \mathbb{R}^N \to \mathbb{R}^N$ is the model dynamics operator, $\boldsymbol{\eta}_t \in \mathbb{R}^N$ is the stochastic error, and $N(\mathbf{0}, \mathbf{Q})$ means the Gaussian distribution with mean $\mathbf{0}$ and $N \times N$ covariance matrix $\mathbf{Q}$. Using the computed state vector $\mathbf{x}_t$, observation variables can be estimated as follows:

$$\mathbf{y}_t = \mathcal{H}(\mathbf{x}_t) + \boldsymbol{\varepsilon}_t \qquad (4)$$

where $\mathbf{y}_t \in \mathbb{R}^h$ is the estimated observation value, $\mathcal{H}: \mathbb{R}^N \to \mathbb{R}^h$ is the observation operator, and $\boldsymbol{\varepsilon}_t \in \mathbb{R}^h$ is the observation error drawn from $N(\mathbf{0}, \mathbf{R})$. Since the error in the model is assumed to follow the Gaussian distribution, the forecast state can also be considered a Gaussian random variable if $\mathcal{M}$ is linear. In this situation, the probability distribution of $\mathbf{x}^f_t$ (and also $\mathbf{x}^a_t$) can be parametrized by the mean $\bar{\mathbf{x}}^f_t$ ($\bar{\mathbf{x}}^a_t$) and the covariance matrix $\mathbf{P}^f_t$ ($\mathbf{P}^a_t$). Their temporal evolution can be calculated based on equation (3) as follows:

$$\bar{\mathbf{x}}^f_t = \mathbf{M}\bar{\mathbf{x}}^a_{t-1}, \qquad \mathbf{P}^f_t = \mathbf{M}\mathbf{P}^a_{t-1}\mathbf{M}^{\mathrm{T}} + \mathbf{Q} \qquad (5)$$

where $\mathbf{M}$ is the $N \times N$ matrix representation of $\mathcal{M}$. Hereafter the means of $\mathbf{x}^f_t$ and $\mathbf{x}^a_t$ are expressed without overlines for convenience.
Next, in the analysis step, this forecast state is updated using the actual observation $\mathbf{y}^o_t$. The analysis mean and covariance are generated as follows:

$$\mathbf{x}^a_t = \mathbf{x}^f_t + \mathbf{K}(\mathbf{y}^o_t - \mathbf{H}\mathbf{x}^f_t), \qquad \mathbf{K} = \mathbf{P}^f_t\mathbf{H}^{\mathrm{T}}(\mathbf{H}\mathbf{P}^f_t\mathbf{H}^{\mathrm{T}} + \mathbf{R})^{-1}, \qquad \mathbf{P}^a_t = (\mathbf{I} - \mathbf{K}\mathbf{H})\mathbf{P}^f_t \qquad (6)$$

where $\mathbf{H}$ is the linear (matrix) form of the observation operator of equation (4). This method is called the Kalman Filter. The Kalman Filter is a good approximation when the dynamics is linear. However, it is difficult to apply it to nonlinear and large problems.
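The standard Kalman Filter forecast and analysis steps described above can be sketched in NumPy as follows. This is a minimal illustration of the textbook update; variable names are illustrative.

```python
import numpy as np

def kf_forecast(xa, Pa, M, Q):
    """Forecast step: propagate the analysis mean and covariance with linear model M."""
    xf = M @ xa
    Pf = M @ Pa @ M.T + Q
    return xf, Pf

def kf_analysis(xf, Pf, yo, H, R):
    """Analysis step: update the forecast with observation yo via the Kalman gain K."""
    S = H @ Pf @ H.T + R                      # innovation covariance
    K = Pf @ H.T @ np.linalg.inv(S)           # Kalman gain
    xa = xf + K @ (yo - H @ xf)               # analysis mean
    Pa = (np.eye(len(xf)) - K @ H) @ Pf       # analysis covariance
    return xa, Pa
```

For example, with an identity observation operator and equal forecast and observation error covariances, the analysis mean lands halfway between the forecast and the observation, as expected from equation (6).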
If either the model operator $\mathcal{M}$ or the observation operator $\mathcal{H}$ is nonlinear, we cannot directly use this method. Moreover, if the state space dimension $N$ is high, it is difficult to keep the $N \times N$ covariance matrix in memory.
One of the methods that solve these problems is EnKF. EnKF uses an ensemble of state variables to represent the probability distribution. The forecast step of equation (5) then becomes as follows:

$$\mathbf{x}^f_{t,(i)} = \mathcal{M}(\mathbf{x}^a_{t-1,(i)}), \qquad \mathbf{P}^f_t \approx \frac{1}{m-1}\mathbf{X}^f_t(\mathbf{X}^f_t)^{\mathrm{T}} \qquad (7)$$

where $\mathbf{x}^f_{t,(i)}$ is the $i$th ensemble member of the forecast value at time $t$, $m$ is the number of ensemble members, and $\mathbf{X}^f_t$ is the matrix whose $i$th column is the deviation of the $i$th ensemble member from the ensemble mean.
The analysis step of EnKF has some variants, including LETKF. LETKF first determines the mean and covariance of the analysis ensemble, $\bar{\mathbf{x}}^a_t$ and $\mathbf{P}^a_t$, and then computes the analysis ensemble. As in the derivation of equation (6), we get $\bar{\mathbf{x}}^a_t$ and $\mathbf{P}^a_t$ from the forecast ensemble as follows:

$$\tilde{\mathbf{P}}^a_t = \left[(m-1)\mathbf{I} + (\mathbf{H}\mathbf{X}^f_t)^{\mathrm{T}}\mathbf{R}^{-1}\mathbf{H}\mathbf{X}^f_t\right]^{-1}, \quad \bar{\mathbf{w}}^a_t = \tilde{\mathbf{P}}^a_t(\mathbf{H}\mathbf{X}^f_t)^{\mathrm{T}}\mathbf{R}^{-1}(\mathbf{y}^o_t - \mathbf{H}\bar{\mathbf{x}}^f_t), \quad \bar{\mathbf{x}}^a_t = \bar{\mathbf{x}}^f_t + \mathbf{X}^f_t\bar{\mathbf{w}}^a_t \qquad (8)$$

where $\bar{\mathbf{w}}^a_t$ and $\tilde{\mathbf{P}}^a_t$ stand for the mean and covariance of the analysis ensemble calculated in the ensemble subspace. As in equation (7), we can consider the analysis covariance as the product of the analysis ensemble matrix:

$$\mathbf{P}^a_t \approx \frac{1}{m-1}\mathbf{X}^a_t(\mathbf{X}^a_t)^{\mathrm{T}} \qquad (9)$$

where $\mathbf{X}^a_t$ is the matrix whose $i$th column is the deviation of the $i$th ensemble member from the mean of the analysis ensemble. Therefore, decomposing $\tilde{\mathbf{P}}^a_t$ of equation (8) into a square root, we can get each analysis ensemble member at time $t$ as follows:

$$\mathbf{X}^a_t = \mathbf{X}^f_t\left[(m-1)\,\rho\,\tilde{\mathbf{P}}^a_t\right]^{1/2}, \qquad \mathbf{x}^a_{t,(i)} = \bar{\mathbf{x}}^a_t + \mathbf{X}^a_{t,(i)} \qquad (10)$$

where $\mathbf{X}^a_{t,(i)}$ is the $i$th column of $\mathbf{X}^a_t$ in the first equation. A covariance inflation parameter $\rho$ is multiplied to counteract the tendency of data assimilation to underestimate the uncertainty of the state estimate. See Hunt et al. (2007) for a more detailed derivation. Now, we can return to equation (7) and iterate the forecast and analysis steps.
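The ensemble-space analysis of equations (8)-(10) can be sketched as follows. This is a global (unlocalized) version following Hunt et al. (2007); the symmetric square root via an eigendecomposition is one common implementation choice, and variable names are illustrative.

```python
import numpy as np

def letkf_analysis(Xf, yo, H, R_inv, rho=1.05):
    """One LETKF analysis step (global, i.e. without localization).

    Xf:    (n, m) forecast ensemble (columns are members)
    yo:    (p,)   observation vector
    H:     (p, n) linear observation operator
    R_inv: (p, p) inverse observation error covariance
    rho:   multiplicative covariance inflation factor
    """
    n, m = Xf.shape
    xf_mean = Xf.mean(axis=1)
    Xf_dev = Xf - xf_mean[:, None]                 # forecast perturbations
    Yf = H @ Xf                                     # ensemble in observation space
    Yf_dev = Yf - Yf.mean(axis=1)[:, None]
    # analysis covariance in the m-dimensional ensemble subspace, with inflation
    C = Yf_dev.T @ R_inv
    Pa_tilde = np.linalg.inv((m - 1) * np.eye(m) / rho + C @ Yf_dev)
    # mean update weights (eq. 8)
    w_mean = Pa_tilde @ C @ (yo - H @ xf_mean)
    # symmetric square root of (m-1) * Pa_tilde for the perturbations (eq. 10)
    evals, evecs = np.linalg.eigh((m - 1) * Pa_tilde)
    W = evecs @ np.diag(np.sqrt(np.maximum(evals, 0.0))) @ evecs.T
    Wa = W + w_mean[:, None]                        # add mean weights to each column
    return xf_mean[:, None] + Xf_dev @ Wa           # (n, m) analysis ensemble
```

Because the symmetric square root keeps the vector of ones as an eigenvector, the analysis perturbations still sum to zero, so the ensemble mean is exactly the updated mean of equation (8).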
As in real applications, we consider the situation in which observations are not available during the prediction period. Predictions are made by the model alone, using the latest analysis state variables as the initial condition. This way of making predictions is called the "Extended Forecast", and we call this prediction "LETKF-Ext" in this study.

Reservoir Computing
We use Reservoir Computing (RC) as the machine learning framework. RC is a type of Recurrent Neural Network which has a single hidden layer called the reservoir. Figure 1 shows the architecture. As mentioned in Section 1, previous works have shown that RC can predict the dynamics of spatio-temporal chaotic systems.
The state of the reservoir layer at timestep $t$ is represented as a vector $\mathbf{r}_t \in \mathbb{R}^{D_r}$, which evolves given the input vector $\mathbf{u}_t \in \mathbb{R}^{D}$ as follows:

$$\mathbf{r}_{t+1} = \tanh(\mathbf{A}\mathbf{r}_t + \mathbf{W}_{in}\mathbf{u}_t) \qquad (11)$$

where $\mathbf{W}_{in}$ is the $D_r \times D$ input matrix which maps the input vector to the reservoir space, and $\mathbf{A}$ is the $D_r \times D_r$ adjacency matrix of the reservoir. The output is computed from the reservoir state as follows:

$$\mathbf{v}_t = \mathbf{W}_{out}\,f(\mathbf{r}_t) \qquad (12)$$

where $\mathbf{W}_{out}$ is the $D \times D_r$ output matrix which maps the reservoir state to the state space, and $f: \mathbb{R}^{D_r} \to \mathbb{R}^{D_r}$ is the operator for nonlinear transformation. The nonlinear transformation is essential for accurate prediction (Chattopadhyay et al., 2019). It is important that $\mathbf{W}_{in}$ and $\mathbf{A}$ are fixed and only $\mathbf{W}_{out}$ is trained. Therefore, the computational cost required to train RC is small, which is an outstanding advantage of RC compared to other neural network frameworks.
In the training phase, we set the switch in Figure 1 to the training configuration. Given a training data series $\{\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_T\}$, we can generate the reservoir state series $\{\mathbf{r}_2, \mathbf{r}_3, \dots, \mathbf{r}_{T+1}\}$ by equation (11). By using the training data and reservoir state series, we can determine the matrix $\mathbf{W}_{out}$ by ridge regression. We minimize the following squared error function with respect to $\mathbf{W}_{out}$:

$$E = \sum_{t} \left\|\mathbf{W}_{out}f(\mathbf{r}_t) - \mathbf{u}_t\right\|^2 + \beta\left\|\mathbf{W}_{out}\right\|^2 \qquad (13)$$

where $\|\mathbf{W}\|^2$ denotes the sum of the squared elements of $\mathbf{W}$ and $\beta$ is the ridge regression parameter (normally a small positive number). The optimal value can be determined analytically as follows:

$$\mathbf{W}_{out} = \mathbf{U}\mathbf{R}^{\mathrm{T}}\left(\mathbf{R}\mathbf{R}^{\mathrm{T}} + \beta\mathbf{I}\right)^{-1} \qquad (14)$$

where $\mathbf{I}$ is the $D_r \times D_r$ identity matrix and $\mathbf{R}$, $\mathbf{U}$ are the matrices whose $t$th columns are $f(\mathbf{r}_t)$ and $\mathbf{u}_t$, respectively.
Then, we can shift to the prediction phase. Before predicting with the network, we first need to "spin up" the reservoir state. The spin-up is performed by feeding the time series before the initial value, $\{\mathbf{u}_{-S}, \mathbf{u}_{-S+1}, \dots, \mathbf{u}_{-1}\}$, to the network and calculating the reservoir state right before the beginning of the prediction via equation (11). After that, the output layer is connected to the input layer, and the network becomes recursive. In this configuration, the output value of equation (12) is used as the next input value of equation (11). Once we give the initial value $\mathbf{u}_0$, the network iterates equations (11) and (12) spontaneously, and the prediction is produced.
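The training and closed-loop prediction described above can be sketched in NumPy as follows. This is a minimal sketch: the spectral radius, reservoir density, and ridge parameter are illustrative defaults, not the values of Table 1, and the even-index squaring is one common choice of the nonlinear transformation $f$.

```python
import numpy as np

def make_reservoir(d_in, d_res, spectral_radius=0.9, density=0.02, seed=0):
    """Random fixed input matrix W_in and sparse adjacency matrix A (eq. 11)."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (d_res, d_in))
    A = rng.uniform(-1.0, 1.0, (d_res, d_res)) * (rng.random((d_res, d_res)) < density)
    A *= spectral_radius / max(abs(np.linalg.eigvals(A)))   # rescale spectral radius
    return W_in, A

def f(r):
    """Nonlinear output transform: square the even-indexed components (one common choice)."""
    r2 = r.copy()
    r2[..., ::2] = r2[..., ::2] ** 2
    return r2

def run_reservoir(W_in, A, U):
    """Drive the reservoir with the input series U (T, d_in); return states (T, d_res)."""
    r = np.zeros(A.shape[0])
    R = np.empty((len(U), A.shape[0]))
    for t, u in enumerate(U):
        r = np.tanh(A @ r + W_in @ u)   # eq. (11)
        R[t] = r
    return R

def train_w_out(R, U_target, beta=1e-4):
    """Ridge regression (eq. 14), with time along rows instead of columns."""
    F = f(R)
    return U_target.T @ F @ np.linalg.inv(F.T @ F + beta * np.eye(F.shape[1]))

def predict(W_in, A, W_out, r, u, n_steps):
    """Closed-loop prediction: each output is fed back as the next input."""
    out = []
    for _ in range(n_steps):
        r = np.tanh(A @ r + W_in @ u)
        u = W_out @ f(r)                # eq. (12)
        out.append(u)
    return np.array(out)
```

Note that `run_reservoir` stores the state produced *after* consuming each input, so pairing `R[t]` with the next input realizes the same indexing as equations (11)-(12).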
In real applications, it is natural to assume that only the observation data can be used as the training data and the initial values for the RC prediction. In this paper, we call this type of prediction "RC-Obs".

Combination of RC and LETKF
As discussed so far, and as we will quantitatively show in Section 4, LETKF-Ext and RC-Obs have contrasting advantages and disadvantages. LETKF-Ext can accurately predict even if the observations are noisy and/or sparsely distributed, but its skill is degraded when the model is biased. The weakness of RC-Obs is that it can only learn directly from the observational data, which are inevitably sparse in real applications, and RC is vulnerable to this imperfectness. Therefore, the combination of LETKF and RC has the potential to push the limit of these two individual prediction methods and realize accurate and robust prediction. In our proposed method, we make RC learn the analysis time series generated by LETKF instead of directly learning the observation data. Since LETKF's analysis variables are available at all grid points, we can efficiently train RC in our proposed method. We call the prediction by this method "RC-Anl".
Our proposed combination method is expected to predict more accurately than RC-Obs, since the training data always exist at all grid points even if the observations are sparse. Moreover, especially if the model is substantially biased, the analysis time series generated by LETKF is more accurate than the model output itself, so RC-Anl is expected to predict more accurately than LETKF-Ext.
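The proposed RC-Anl procedure can be summarized schematically:

```text
RC-Anl (schematic)
1. Run LETKF over the training period with the (possibly biased) process-based
   model and the sparse, noisy observations  ->  full-grid analysis series {x_a}.
2. Train RC on {x_a} exactly as if it were a complete, full-grid training record.
3. At prediction time, spin up the reservoir with the latest analyses, then run
   RC in its recursive (closed-loop) configuration from the last analysis state.
```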

Experiment Design
To generate the Nature Run, L96 with $N = 8$ and $F = 8$ was used, and it was numerically integrated by the 4th-order Runge-Kutta method with time width $\Delta t = 0.005$. Before calculating the Nature Run, the L96 equation was integrated for 1440000 timesteps for spin-up. In the following experiments, the forcing term $F$ in the prediction model was changed to represent the model bias.
The settings for LETKF were based on Miyoshi & Yamane (2007). In equation (8), each row of the observation error covariance was divided by the localization weight $w$ calculated as follows:

$$w = \exp\left(-\frac{d^2}{2\sigma_{loc}^2}\right) \qquad (15)$$

where $d$ is the distance between each observation point and each analyzed grid point, and $\sigma_{loc}$ is the localization scale. The shape of equation (8) differs by the analyzed grid point, so each row of $\bar{\mathbf{w}}^a_t$ and $\tilde{\mathbf{P}}^a_t$ should be calculated separately. In equation (10), a "covariance inflation factor" $\rho$, which was set to 1.05 in our study, was multiplied to $\tilde{\mathbf{P}}^a_t$ in each iteration. The ensemble size was set to 20.
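A common choice for the localization weight, and the one we assume here, is a Gaussian function of the distance; on the L96 ring it can be sketched as follows (`sigma_loc` is a tunable localization scale, an illustrative name):

```python
import numpy as np

def loc_weight(d, sigma_loc):
    """Gaussian localization weight for an observation at distance d (assumed form)."""
    return np.exp(-d ** 2 / (2.0 * sigma_loc ** 2))

def ring_distance(i, j, n=8):
    """Shortest distance between grid points i and j on the n-point ring."""
    return min(abs(i - j), n - abs(i - j))
```

Dividing a row of the observation error covariance by this weight is equivalent to inflating the assumed error of distant observations, so they barely influence the local analysis.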
The configuration of RC used in this study was similar to Chattopadhyay et al. (2019), but slightly modified. Parameter settings used in the RC experiments are shown in Table 1. The nonlinear transformation function $f$ for the output layer in equation (12) is as follows:

$$\tilde{r}_i = \begin{cases} r_i^2 & (i \text{ is even}) \\ r_i & (i \text{ is odd}) \end{cases} \qquad (16)$$

where $r_i$ is the $i$th element of $\mathbf{r}$. In the prediction phase, we used the data for 100 timesteps before the prediction initial time for the reservoir spin-up.
We conducted numerical experiments to investigate the performance of RC-Obs, LETKF-Ext, and RC-Anl in predicting the L96 dynamics. First, we evaluated the performance of RC-Obs by comparing it with LETKF-Ext under perfect observations (all grid points observed with no error) and quantified the effect of observation imperfectness (i.e. observation error and spatio-temporal sparsity), to investigate the prediction skill of the stand-alone use of RC and LETKF. Second, we evaluated the performance of RC-Anl and LETKF-Ext as functions of the observation density and the model bias. The three prediction frameworks are summarized in Table 2.
In each experiment, we prepared 200000 timesteps of Nature Run. The first 100000 timesteps were used for the training of RC or for the spinning up of LETKF, and the rest of them were used for the evaluation of each method.
Every prediction was repeated 100 times to avoid the effect of the heterogeneity of the data. For the LETKF-Ext prediction, the analysis time series for all the evaluation data was generated first. Then, the analysis variables at every 1000th timestep were taken as the initial conditions, and 100 prediction runs in total were performed. For the RC-Obs prediction, the evaluation data were divided equally into 100 sets and the prediction was performed identically for each set. For the RC-Anl prediction, the analysis time series of the training data was used for training, and the prediction was performed using the same initial conditions as LETKF-Ext. Each prediction set of LETKF-Ext, RC-Obs, and RC-Anl corresponds to the same time range.
The prediction accuracy of each method was evaluated by taking the average of RMSE of 100 sets for each timestep.
We call this metric the mean RMSE (mRMSE), which can be represented as follows:

$$\mathrm{mRMSE}(k) = \frac{1}{100}\sum_{s=1}^{100}\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i^{(s)}(k) - x_i^{nature,(s)}(k)\right)^2} \qquad (17)$$

where $k$ is the number of steps elapsed from the prediction initial time, $x_i^{(s)}(k)$ is the $i$th nodal value of the $s$th prediction set at time $k$, and $x_i^{nature,(s)}(k)$ is the corresponding value of the Nature Run. Using this metric, we can see how the prediction accuracy degrades as time elapses from the initial time.


Results
We first evaluated the sensitivity of the prediction skill of both LETKF-Ext and RC-Obs to the imperfectness of the observations. Figure S1 and Figure S2 show the effect of the observation error and the observation frequency on the prediction skill, respectively. Both methods showed a similar level of robustness to changes in the observation frequency and the observation error. However, if we reduce the number of observed grid points, the prediction accuracy of RC-Obs becomes catastrophically worse, while LETKF-Ext is robust to the reduced number of observed grid points. Figures 4a and 4b show the sensitivity of the prediction accuracy of LETKF-Ext and RC-Obs, respectively, to the number of observed grid points. Even when only a small part of the system is observed, the accuracy of LETKF-Ext changes only slightly. On the other hand, the accuracy of RC-Obs becomes substantially worse when even a single observed grid point is removed. As anticipated in Section 2.4, we verified that RC-Obs is significantly sensitive to observation sparsity.
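For reference, the mRMSE of equation (17) can be computed from stacked prediction sets in a vectorized way; this is a minimal sketch, and the array layout is an assumption.

```python
import numpy as np

def mrmse(preds, truths):
    """mRMSE (eq. 17): spatial RMSE per set and lead time, then averaged over sets.

    preds, truths: arrays of shape (n_sets, n_steps, n_grid).
    Returns an array of shape (n_steps,), one value per forecast lead time.
    """
    rmse = np.sqrt(np.mean((preds - truths) ** 2, axis=2))   # (n_sets, n_steps)
    return rmse.mean(axis=0)
```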
We tested the prediction skill of our newly proposed method, RC-Anl, under imperfect models and sparse observations. Here, we used the observation error $\sigma = 1.0$. Figure 5 shows the resulting prediction skill. Moreover, when the model used in LETKF is biased, RC-Anl outperforms LETKF-Ext. Figure 6 shows the change of the mRMSE time series when changing the model bias. The number of observed points was set to 4. The forcing term $F$ in equation (1) was changed from the true value $F = 8$ (the value of the model for the Nature Run) to represent the model bias, and the accuracy of LETKF-Ext and RC-Anl is plotted. The accuracy of LETKF-Ext was slightly better than that of RC-Anl when the model was not biased ($F = 8$; green line). However, when the bias is large (e.g. $F = 10$; yellow line), RC-Anl showed better prediction accuracy.
We confirmed this result by comparing the mRMSE values of RC-Anl and LETKF-Ext at a specific forecast lead time. Figure 7 shows the value of $\mathrm{mRMSE}(80)$ (see equation (17)) as a function of the forcing term $F$. The two lines showing the skill of RC-Anl (blue) and LETKF-Ext (red) are both convex downward with a minimum at $F = 8$, meaning that the accuracy of both prediction methods is best when the model is not biased. In addition, as long as the value of $F$ is in the interval $[7.5, 8.5]$, LETKF-Ext has better accuracy than RC-Anl. However, if the model bias becomes larger than that, RC-Anl becomes more accurate than LETKF-Ext. As the bias increases, the difference between the $\mathrm{mRMSE}(80)$ of the two methods becomes larger, and the superiority of RC-Anl becomes more obvious. We found that RC-Anl can predict more accurately than LETKF-Ext when the model is biased.
We also checked the robustness to the training data size. Figure S3 shows the change of the accuracy of RC-Anl when the size of the training data is reduced from 100000 to 1000 timesteps. We confirmed that the prediction accuracy did not change until the size was reduced to 25000 timesteps. Although we have used a large training data set (100000 timesteps; 68 model years) so far, the results are robust to the reduction of the size of the training data.

Discussion
By comparing the prediction skill of RC-Obs and LETKF-Ext, we confirmed that RC-Obs can predict with accuracy comparable to LETKF-Ext if we have perfect observations. This result is consistent with Chattopadhyay et al. (2019), Pathak et al. (2017), and Vlachas et al. (2020), and we can expect that RC has the potential to predict various kinds of spatio-temporal chaotic systems.
However, Vlachas et al. (2020) revealed that the prediction accuracy of RC is substantially degraded when the observed grid points are reduced, compared to other machine learning techniques such as LSTM. Our result is consistent with their study, and we found that the prediction accuracy of RC-Obs was significantly degraded by removing just one observed grid point. In contrast, Chattopadhyay et al. (2019) showed that RC can correctly predict a multiscale chaotic system even though only the largest-scale dynamics is observed. Comparing these results, we suggest that the state at the scale of the dominant dynamics should be observed almost completely for RC to accurately predict the future state.
Therefore, when we use RC to predict spatio-temporal chaotic systems with sparse observation data, we need to interpolate the data to generate appropriate training data. However, the interpolated data inevitably include errors even if the observation data themselves have no error, so it should be verified that RC can predict accurately when trained with data that contain some errors. Previous works (e.g. Chattopadhyay et al., 2019; Vlachas et al., 2020) have not considered the impact of errors in the training data. We found that the prediction accuracy of RC degrades as the error in the training data grows, but the degradation rate is not large (if the training data are available at all grid points). From this result we can expect that RC trained with interpolated observation data can predict accurately to some extent, but the interpolated data should be as accurate as possible.
In this study, LETKF was used to prepare the training data for RC, since LETKF can interpolate the observations and reduce their error at the same time. We showed that our proposed approach works correctly. Brajard et al. (2020) also made a Convolutional Neural Network (CNN) learn the dynamics from sparse observation data and successfully predicted the dynamics of the L96 model. However, as mentioned in the introduction section, Brajard et al. (2020) needed to iterate the learning and data assimilation until convergence, because their approach replaced the model used in data assimilation with the CNN. Although their model-free method has the advantage that it is not affected by the process-based model's reproducibility of the phenomena, it is computationally expensive and probably infeasible in many real-world problems. In contrast, we need to train RC only once, because we use the process-based model (i.e. the data assimilation method) to prepare the training data. We thus overcome the problem of computational feasibility. Note also that the computational cost of training RC is much lower than that of other neural networks.
The good performance of our proposed method supports the suggestion of Dueben & Bauer (2018), in which machine learning should be applied to the analysis data generated by data assimilation methods as the first step of the application of machine learning to weather prediction. As Weyn et al. (2019) did, we successfully trained the machine learning model with the analysis data.
Most importantly, we also found that the prediction by RC-Anl is more robust to model biases than the extended forecast by LETKF (i.e. LETKF-Ext). This result suggests that our method can be beneficial in various real problems, as the model in real applications inevitably contains some biases. Pathak, Wikner, et al. (2018) developed a hybrid prediction system of RC and a biased model. Although Pathak, Wikner, et al. (2018) successfully predicted spatio-temporal chaotic systems using biased models, they needed perfect observations to train their RC. The advantage of our proposed method is that we allow both models and observation networks to be imperfect.
Our study was implemented with the 8-dimensional L96 system, and it is unclear whether our proposed method is applicable to other spatio-temporal chaotic systems with larger state spaces, including real NWP models. However, in previous works, RC has been successfully applied to many other large chaotic systems. In particular, Pathak, Hunt, et al. (2018) indicated that RC can be applied to predict the dynamics of the substantially high-dimensional Kuramoto-Sivashinsky equation using "reservoir parallelization". They divided the state space into local groups and used a different reservoir for each local group. As we did not change the RC architecture itself, our method also has the potential to predict other high-dimensional spatio-temporal chaotic systems by adopting this parallelization strategy.
In NWP problems, it is often the case that homogeneous observation data of high resolution are not available over a wide range of time and space, which can be an obstacle to applying machine learning to NWP tasks (Dueben & Bauer, 2018). We revealed that RC is robust to the temporal sparsity of observations and that RC can be trained with relatively small training data sets. These results imply that our proposed method can be applicable to various realistic problems.

Conclusion
The prediction skills of the extended forecast with LETKF (LETKF-Ext), RC that learned the observation data (RC-Obs), and RC that learned the LETKF analysis data (RC-Anl) were evaluated under imperfect models and observations, using the Lorenz 96 model. We found that the prediction by RC-Obs is substantially vulnerable to the sparsity of the observation network. Our proposed method, RC-Anl, can overcome this vulnerability. In addition, RC-Anl could predict more accurately than LETKF-Ext when the process-based model is biased. Our new method is robust to the imperfectness of both models and observations, making it feasible to apply to real NWP problems.
Further studies on more complicated models or operational atmospheric models are expected.