Efficient ensemble data assimilation for coupled models with the Parallel Data Assimilation Framework: Example of AWI-CM

Data assimilation integrates information from observational measurements with numerical models. When used with coupled models of Earth system compartments, e.g. the atmosphere and the ocean, consistent joint states can be estimated. A common approach for data assimilation is the use of ensemble-based methods, which utilize an ensemble of state realizations to estimate the state and its uncertainty. These methods are far more costly to compute than a single coupled model because of the required integration of the ensemble. However, with uncoupled models, these methods have also been shown to exhibit particularly good scaling behavior. This study discusses an approach to augment a coupled model with data assimilation functionality provided by the Parallel Data Assimilation Framework (PDAF). Using only minimal changes in the codes of the different compartment models, a particularly efficient data assimilation system is generated that utilizes parallelization and in-memory data transfers between the models and the data assimilation functions and hence avoids most of the file reading and writing, as well as model restarts, during the data assimilation process. The study explains the required modifications of the programs using the example of the coupled atmosphere-sea ice-ocean model AWI-CM. The case of assimilating oceanic observations shows that the data assimilation leads only to a small overhead in computing time of about 15% compared to the model without data assimilation, and to very good parallel scalability. The model-agnostic structure of the assimilation software ensures a separation of concerns in that the development of data assimilation methods can be separated from the model application.

Data assimilation (DA) has traditionally been applied to separate models simulating, e.g., the atmospheric dynamics or the ocean circulation. However, in recent years coupled models of different Earth system compartments have become more common. In this case, the compartment models frequently exchange information at the interface of the model domains to influence the integration of the other model compartment. For example, in coupled atmosphere-ocean models the fluxes through the ocean surface are dynamically computed based on the physical state of both the atmosphere and the ocean, which is exchanged between both compartments. For model initialization, DA should be applied to each of the compartments. Here, the DA can either be performed separately in the different compartment domains, commonly called weakly coupled DA, or it can be performed in a joint update, called strongly coupled DA. Only strongly coupled DA is expected to provide fully dynamically consistent state estimates.

A recent overview of methods and issues in coupled DA is provided by Penny et al. (2017). Weakly coupled assimilation is currently the common choice for assimilation into coupled models, and recent studies assess the effect of this assimilation approach. For atmosphere-ocean coupled models, different studies either assimilated observations of one compartment into the observed compartment (e.g. Kunii et al., 2017; Mu et al., 2020) or observations of each compartment into the corresponding one (e.g. Zhang et al., 2007; Liu et al., 2013; Han et al., 2013; Chang et al., 2013; Lea et al., 2015; Karspeck et al., 2018; Browne et al., 2019). The research question considered in these studies is usually to which extent the assimilation into a coupled model can improve predictions in comparison to the assimilation into uncoupled models. Partly, the mentioned studies used twin experiments assimilating synthetic observations to assess the DA behavior.
Strongly coupled DA is a much younger approach, which is not yet well established. Open questions for strongly coupled DA are, for example, how to account for the different temporal and spatial scales in the atmosphere and the ocean. Strongly coupled DA is further complicated by the fact that DA systems for the ocean and the atmosphere have usually been developed separately and often use different DA methods. For example, Laloyaux et al. (2016) used 3D variational DA in the ocean, but 4D variational DA in the atmosphere. The methodology led to a quasi-strongly coupled DA. Frolov et al. (2016) proposed an interface-solver approach for variational DA methods, which leads to a particular solution for the variables close to the interface. Strongly coupled DA was applied by Sluka et al. (2016) in a twin experiment using an EnKF with dynamically estimated covariances.

Ensemble data assimilation

Ensemble DA (EnDA) methods use an ensemble of model state realizations to represent the state estimate (usually the ensemble mean) and the uncertainty of this estimate, given by the ensemble spread. The filters perform two alternating phases: in the forecast phase, the ensemble of model states is integrated with the numerical model until the time when observations are available. At this time, the analysis step is computed. It combines the information from the model state and the observations, taking into account the estimated errors of the two information sources, and computes an updated model state ensemble, which represents the analysis state estimate and its uncertainty.
The currently most widely used ensemble filter methods are ensemble-based Kalman filters that build on the Ensemble Kalman Filter (Evensen, 1994; Houtekamer and Mitchell, 1998; Burgers et al., 1998). When incorporating the observations during the analysis step, these filters assume that the errors in the state and the observations are Gaussian distributed. This allows the analysis step to be formulated using just the two leading moments of the distributions, namely the mean and the covariance matrix.
Another class of EnDA methods are particle filters (e.g., van Leeuwen, 2009). While particle filters do not assume Gaussianity of the error distributions, they are difficult to use with high-dimensional models because particular adaptions are required to avoid that the ensemble collapses to a single member due to the so-called 'curse of dimensionality' (see Snyder et al., 2008). Methods to make particle filters usable for high-dimensional systems were reviewed by van Leeuwen et al. (2019). One strategy is to use the observational information already during the forecast phase to keep the ensemble states close to the observations. This approach requires that some DA functions are already executed during the forecast phase. The realization of this in the implementation strategy will be discussed in Sec. 3.2.

Filter algorithms
To be able to discuss the particularities of coupled DA with respect to ensemble filters, the error-subspace transform Kalman filter (ESTKF, Nerger et al., 2012b) is reviewed here. The ESTKF is an efficient formulation of the EnKF that has been applied in different studies to assimilate satellite data into sea-ice ocean models (e.g. Kirchgessner et al., 2017; Mu et al., 2018; Androsov et al., 2019) and biogeochemical ocean models (e.g. Pradhan et al., 2019; Goodliff et al., 2019).

In the analysis step at the time t_k, the ESTKF transforms a forecast ensemble X^f_k of N_e model states of size N_x, stored in the columns of this matrix, into a matrix of analysis states X^a_k as

X^a_k = x̄^f_k 1^T_{N_e} + X^f_k ( w̄_k 1^T_{N_e} + W̃ ),   (1)

where x̄^f_k is the forecast ensemble mean state and 1_{N_e} is a vector of size N_e holding the value one in all elements. Further, w̄_k is a vector of size N_e which transforms the ensemble mean and W̃ is a matrix of size N_e × N_e which transforms the ensemble perturbations. Below, the time index k is omitted, as all computations in the analysis refer to the time t_k.

The forecast ensemble represents an error subspace of dimension N_e − 1 and the ESTKF computes the ensemble transformation matrix and vector in this subspace. Practically, one computes an error-subspace matrix by L = X^f T, where T is a projection matrix with j = 1, ..., N_e rows and i = 1, ..., N_e − 1 columns defined by

T_{j,i} = 1 − (1/N_e) · 1/(1/√N_e + 1)   for i = j, j < N_e,
T_{j,i} = −(1/N_e) · 1/(1/√N_e + 1)      for i ≠ j, j < N_e,   (2)
T_{j,i} = −1/√N_e                        for j = N_e.

Below, the equations are written using X^f and T rather than L as this leads to a more efficient formulation.

A model state vector x^f and the vector of observations y with dimension N_y are related through the observation operator H by

y = H(x^f) + ε,   (3)

where ε is the vector of observation errors, which are assumed to be a white Gaussian distributed random process with observation error covariance matrix R. For the analysis step, a transform matrix in the error subspace is computed as

A^{−1} = ρ (N_e − 1) I + (H X^f T)^T R^{−1} H X^f T.   (4)

This matrix provides ensemble weights in the error subspace. The factor ρ with 0 < ρ ≤ 1 is called the "forgetting factor" (Pham et al., 1998) and is used to inflate the forecast error covariance matrix. The weight vector w̄ and the matrix W̃ are now given by

w̄ = T A (H X^f T)^T R^{−1} ( y − H x̄^f ),   (5)
W̃ = √(N_e − 1) T A^{1/2} T^T,               (6)

where A^{1/2} is the symmetric square root which is computed from the eigenvalue decomposition U S U^T = A^{−1} such that A^{1/2} = U S^{−1/2} U^T.

For high-dimensional models a localized analysis is computed following Nerger et al. (2006). Here, each vertical column of the model grid is updated independently by a local analysis step. For updating a column, only observations within a horizontal influence radius l are taken into account. Thus, the observation operator is local and computes an observation vector within the influence radius l from the global model state. Further, each observation is weighted according to its distance from the water column to down-weight observations at larger distances (Hunt et al., 2007). The weight is applied by modifying the matrix R^{−1} in Eqs. (4) and (5). The localization weight for the observations is computed from a correlation function with compact support given by a 5th-order polynomial with a shape similar to a Gaussian function (Gaspari and Cohn, 1999). The localization leads to individual transformation weights w̄ and W̃ for each local analysis domain.
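For illustration, assuming a diagonal observation error covariance matrix R with variances r_i and denoting by d_i the distance of observation i from the water column, the localized inverse observation error covariance used in Eqs. (4) and (5) can be written as

( R^{−1}_loc )_{ii} = c(d_i / l) / r_i,     c(0) = 1,   c(d_i / l) = 0 for d_i > l,

where c is the compactly supported 5th-order polynomial of Gaspari and Cohn (1999). This is only a sketch for the diagonal-R case; the general formulation applies the distance-dependent weights to the full matrix R^{−1}.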

Weakly-coupled ensemble filtering
In weakly coupled DA, the EnKF is applied in the coupled model to a single compartment or separately to several of the compartments. Given that the analysis is separate for each involved compartment, the filter is applied as in a single-compartment model. Thus, in practice several EnKFs compute the analysis updates independently before the next forecast phase is started with the updated fields from the different compartments.

Strongly-coupled ensemble filtering
To discuss strongly coupled filtering, let us assume a two-compartment system (e.g. the atmosphere and the ocean). Let x_A and x_O denote the separate state vectors in each compartment. For strongly coupled DA, both are joined into a single state vector x_C.
Using the joint forecast ensemble X^f_C in Eq. (1) of the ESTKF, one sees that the same ensemble weights w̄, W̃ are applied to both x_A and x_O. The weights are computed using Eqs. (4) to (6). These equations involve the observed ensemble HX^f_C, the observation vector y, and the observation error covariance matrix R. Thus, for strongly coupled DA, the weights depend on which compartments are observed. If there are observations of both compartments, they are jointly used to compute the weights. If only one compartment is observed, e.g. having only ocean observations y_O, then we also have HX^f_C = (HX^f)_O and the weights are only computed from these observations. Thus, through Eq. (1), the algorithm can directly update both compartments x_A and x_O using observations of just one compartment.
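Writing Eq. (1) for the joint state makes this explicit (a restatement of the equations above for illustration): with only ocean observations, the weights w̄ and W̃ are computed from (HX^f)_O, y_O, and R_O, but they transform the atmospheric and the oceanic parts of the ensemble simultaneously,

\begin{pmatrix} X^a_A \\ X^a_O \end{pmatrix}
= \begin{pmatrix} \bar{x}^f_A \\ \bar{x}^f_O \end{pmatrix} 1^T_{N_e}
+ \begin{pmatrix} X^f_A \\ X^f_O \end{pmatrix} \left( \bar{w}\, 1^T_{N_e} + \tilde{W} \right),

so that the cross-compartment update of x_A is carried entirely by the ensemble weights, i.e. by the ensemble-estimated cross-covariances between the compartments.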
An interesting aspect is that when one runs separate assimilation systems for the two compartments with the same filter methodology, one can compute a strongly coupled analysis by only exchanging the parts of y, HX^f, and R between both compartments and then initializing the vectors containing observational information from all compartments in the assimilation system of each compartment. If there are only observations in one of the compartments, one can also compute the weights in that compartment and provide them to the other compartment. Given that y and R are initialized from information that is usually stored in files, one can also let the DA code coupled into each compartment model read these data and only exchange the necessary parts of HX^f. While this discussion shows that it is technically straightforward to apply strongly coupled DA with these filter methods, one has to account for the model parallelization, which is discussed in Section 3.3.

Setup of data assimilation program
This section describes the assimilation framework and the setup of the DA program. First, an overview of PDAF is given (Sec. 3.1). The code modifications for the online coupling are described in Sec. 3.2, and the modifications of the parallelization are described in Sec. 3.3. Finally, Sec. 3.4 explains the call-back functions.

The setup builds on the strategy introduced by Nerger and Hiller (2013). Here, the discussion focuses on the particularities of using a coupled model consisting of separate executable programs for each compartment. While we describe the features for both weakly and strongly coupled DA, AWI-CM-PDAF in version 1.0 only implements weakly coupled DA into the ocean. This is motivated by the fact that weakly coupled DA into a coupled climate model already has different properties than DA into an uncoupled model. In particular, the initial errors in the coupled AWI-CM are much larger than in a simulation of FESOM using atmospheric forcing. This is mainly because in FESOM the forcing introduces information about the weather conditions, while AWI-CM only represents the climate state. Thus, studying weakly coupled DA, which is still used in most applications, has a value on its own. Strongly coupled DA will be supported in the AWI-CM-PDAF model binding in the future.

The Parallel Data Assimilation Framework PDAF
PDAF (Nerger and Hiller, 2013, http://pdaf.awi.de) is free open-source software that was developed to simplify the implementation and application of ensemble DA methods. PDAF provides a generic framework containing fully implemented and parallelized ensemble filter and smoother algorithms like the LETKF (Hunt et al., 2007), the ESTKF (Nerger et al., 2012b), or the nonlinear NETF method (Tödter and Ahrens, 2015) and related smoothers (e.g., Nerger et al., 2014; Kirchgessner et al., 2017). Further, it provides functionality to adapt a model parallelization for parallel ensemble forecasts as well as routines for the parallel communication linking the model and the filters. Analogous to many large-scale geoscientific simulation models, PDAF is implemented in Fortran and is parallelized using the Message Passing Interface standard (MPI, Gropp et al., 1994) as well as OpenMP (OpenMP, 2008). This ensures optimal compatibility with these models, while it is still usable with models coded, e.g., in the programming language C.
The filter methods are model-agnostic and only operate on abstract state vectors, as described for the ESTKF in Sec. 2. This allows the DA methods to be developed independently from the model and makes it easy to switch between different assimilation methods.
Any operations specific to the model fields, the model grid, or the assimilated observations are performed in program routines provided by the user based on existing template routines. These routines have a specified interface and are called by PDAF as call-back routines, i.e. the model code calls routines of PDAF, which then call the user routines. This call structure is sketched in Fig. 1. Here, an additional yellow 'interface routine' is used between the model code and the PDAF library routine. This interface routine is used to define parameters for the call to the PDAF library routines, so that these do not need to be specified in the model code. Thus, only a single-line call to each interface routine is added to the model code, which keeps the changes to the model code to a minimum.
The motivation for this call structure is that the call-back routines exist in the context of the model (i.e. the user space) and can be implemented like model routines. In addition, the call-back routines can access static arrays allocated by the model, e.g. through Fortran modules or C header files. For example, this can be used to access arrays holding model fields or grid information. This structure can also be used in the case of an offline coupling using separate programs for the model and the analysis step. However, in this case the grid information is not already initialized by the model and has to be initialized by a separate routine. Using the interfaces and user routines provided by PDAF, the framework can also be used with models implemented in C or C++, or can be combined with Python. For coupled models consisting of multiple executables, this call structure is used for each compartment model.
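As an illustration of such a call-back routine, the following minimal Fortran sketch fills the process-local state vector from model fields that are accessible through a Fortran module. The module and field names used here are hypothetical, and the interface only follows the style of the PDAF template routines; the actual routines for ECHAM or FESOM contain more fields and model-specific index mappings.

module model_fields_demo
   ! Stand-in for a model's field module (names are hypothetical)
   implicit none
   real, allocatable :: ssh(:)      ! process-local sea surface height
   real, allocatable :: temp(:)     ! process-local temperature (1D storage assumed)
   real, allocatable :: salt(:)     ! process-local salinity (1D storage assumed)
end module model_fields_demo

subroutine collect_state_pdaf(dim_p, state_p)
   ! Call-back routine: copy the process-local model fields into the state vector.
   ! dim_p is assumed to equal size(ssh) + size(temp) + size(salt).
   use model_fields_demo, only: ssh, temp, salt
   implicit none
   integer, intent(in)    :: dim_p            ! process-local state dimension
   real,    intent(inout) :: state_p(dim_p)   ! process-local state vector
   integer :: off

   off = 0
   state_p(off+1 : off+size(ssh))  = ssh
   off = off + size(ssh)
   state_p(off+1 : off+size(temp)) = temp
   off = off + size(temp)
   state_p(off+1 : off+size(salt)) = salt
end subroutine collect_state_pdaf

A corresponding routine performs the inverse operation after the analysis step, writing the updated state vector back into the model field arrays.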

Augmenting a coupled model for ensemble data assimilation
Here, only the online coupling for DA is discussed. As described before, the offline coupling uses separate programs for the model and the DA and transfers information about the model states between both programs via model restart files. Generally, the same code for the user routines can be used for online and offline coupled DA. The difference is that in the online coupling, model information like the model grid is initialized by the model code and usually stored, e.g., in Fortran modules. For offline coupled DA one could use the same variable names and the same names for the modules, but one would need to implement routines that initialize these variables.
The strategy to augment a coupled model with DA functionality is exemplified here using the AWI climate model (AWI-CM, Sidorenko et al., 2015). The model consists of the atmospheric model ECHAM6 (Stevens et al., 2013), which includes the land surface model JSBACH, and the finite-element sea-ice ocean model FESOM (Danilov et al., 2004; Wang et al., 2008). Both models are coupled using the coupler library OASIS3-MCT (Ocean-Atmosphere-Sea-Ice-Soil coupler - Model Coupling Toolkit, Valcke, 2013). OASIS3-MCT computes the fluxes between the ocean and the atmosphere and performs the interpolation between both model grids. The coupled model consists of two separate programs for ECHAM and FESOM, which are jointly started on the computer so that they can exchange data via the Message Passing Interface (MPI, Gropp et al., 1994). OASIS3-MCT is linked into each program as a library. For further details on the model, we refer to Sidorenko et al. (2015).
The online coupling for DA was already discussed in Nerger and Hiller (2013) for an earlier version of the ocean model used in the AWI-CM. Here, an updated coupling strategy is discussed that requires fewer changes to the model code. While the general strategy for online coupling of the DA is the same as in the previous study, we provide here a full description for completeness. Further, we discuss the particularities of the coupled model.

The left-hand side of Fig. 2 shows the program flow of the original coupled model. At the start of each program, the parallelization is initialized ('init. parallelization'). After this step, all involved processes of the program are active (for the parallelization aspects see Sec. 3.3). Subsequently, the OASIS coupler initializes the parallelization for the coupled model by separating the processes for ECHAM and FESOM. Thus, after this point, the coupler can distinguish the different model compartments. Now, the model itself is initialized, e.g. the model grid for each compartment is set up and the initial fields are read from files. Further, information for the coupling is initialized, like the grid configuration, which is required by the coupler to interpolate data between the different model grids. This completes the model initialization, and the time stepping is computed. During the time stepping, the coupler exchanges the interface information between the different compartments. After the time stepping, some post-processing can be performed, e.g. writing time averages or restart files to disk.
The right-hand side of Fig. 2 shows the required additions to the model code as yellow boxes. These additions are calls to subroutines that interface between the model code and the DA framework. In this way, only single-line subroutine calls are added, which might be enclosed in preprocessor checks to allow activating or deactivating the data-assimilation extension at compile time. The additions are made in both the codes of ECHAM and FESOM, and here we discuss them in general. The added subroutine calls have the following functionality:

- Init_parallel_PDAF: This routine (init_parallel_pdaf, see Sec. 3.3) modifies the parallelization. While the regular coupled model computes a single model instance ('model task'), the modified model runs an ensemble of model tasks. The routine is inserted directly after the parallelization is started, so that all subsequent operations of the program act on the modified parallelization.
In the coupled model, this routine is executed before the parallelization of the coupler is initialized. In this way, the coupler is also initialized for an ensemble of model tasks.
- Init_PDAF: In this routine, the PDAF framework is initialized. The routine is inserted into the model codes so that it is executed after all regular model initialization is completed, thus just before the time-stepping loop. The routine specifies parameters for the DA, which can be read from a configuration file. Then, the initialization routine of PDAF, named 'PDAF_init', is called, which performs the PDAF-internal configuration and allocates the internal arrays, e.g. the array of ensemble states. Further, the initial ensemble is read from input files. As this reading is model-specific, it is performed by a user-provided routine that is called by PDAF as a call-back routine (see Sec. 3.4). After the framework is initialized, the routine 'PDAF_get_state' is called. This routine writes the information from the initial ensemble into the field arrays of the model. In addition, the length of the initial forecast phase, i.e. the number of time steps until the first analysis step, is initialized. For the coupled model, 'PDAF_init' and 'PDAF_get_state' are called in each compartment.
However, some parameters are distinct for the two compartments. For example, the time step size of ECHAM is 450 s, while it is 900 s for FESOM. Hence, the numbers of time steps in a forecast phase of one day differ between the compartments.

- Assimilate_PDAF: This routine is called at the end of each model time step. For this, it is inserted into the model codes of ECHAM and FESOM at the end of the time-stepping loop. The routine calls a filter-specific routine of PDAF that computes the analysis step of the selected filter method, for example 'PDAF_assimilate_lestkf' for the localized ESTKF. This routine of PDAF also checks whether all time steps of a forecast phase have been computed. Only if this is the case, the analysis step is executed; otherwise the time stepping is continued. If additional operations for the DA are required during the time stepping, like taking into account future observations in the case of the advanced equivalent-weights particle filter (EWPF, van Leeuwen, 2010) or collecting observed ensemble fields during the forecast phase for 4-dimensional filtering (Harlim and Hunt, 2007), these are also performed in this filter-specific routine. For the coupled model, the routine is called in both ECHAM and FESOM. Then, 'PDAF_assimilate_lestkf' checks for the analysis time according to the individual number of time steps in the forecast phase. The analysis step is then executed in each compartment according to the configuration of the assimilation. In the implementation of AWI-CM-PDAF 1.0, the analysis is only performed in FESOM. Thus, while 'PDAF_assimilate_lestkf' is also called in ECHAM, it does not assimilate any data.
- Finalize_PDAF: This routine is called at the end of the program. It includes calls to the routine 'PDAF_print_info', which prints information about the execution times of different parts of the assimilation program as measured by PDAF, as well as information about the memory allocated by PDAF.
Compared to the implementation strategy discussed in Nerger and Hiller (2013), in which the assimilation subroutine is only called after a defined number of time steps, this updated scheme allows DA operations to be performed during the time-stepping loop. To use this updated scheme, one has to execute the coupled model with enough processors so that all ensemble members can be run at the same time. This is nowadays easier than in the past because the number of processor cores of current high-performance computers is much larger.
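The following schematic Fortran sketch illustrates where the four single-line calls are placed. It is not the actual ECHAM or FESOM code: the stub routines only stand in for the existing model routines, and the PDAF interface routines are assumed to be provided by the model binding when the preprocessor macro USE_PDAF is defined. Compiled without USE_PDAF, the sketch reduces to the original program flow.

program coupled_model_sketch
   implicit none
   integer :: istep
   integer, parameter :: nsteps = 96      ! e.g. 96 steps of 900 s = 1 day for FESOM

   call init_mpi()                        ! start MPI ('init. parallelization')
#ifdef USE_PDAF
   call init_parallel_pdaf()              ! added: build ensemble communicators (Sec. 3.3)
#endif
   call init_coupler()                    ! OASIS3-MCT splits the compartment communicators
   call init_model()                      ! grid, initial fields, coupling configuration
#ifdef USE_PDAF
   call init_pdaf()                       ! added: initialize PDAF, read initial ensemble
#endif

   do istep = 1, nsteps
      call integrate_step(istep)          ! one model time step incl. coupler exchanges
#ifdef USE_PDAF
      call assimilate_pdaf()              ! added: compute analysis when forecast phase ends
#endif
   end do

   call postprocessing()                  ! e.g. write time averages and restart files
#ifdef USE_PDAF
   call finalize_pdaf()                   ! added: print PDAF timing and memory information
#endif

contains
   subroutine init_mpi()
   end subroutine init_mpi
   subroutine init_coupler()
   end subroutine init_coupler
   subroutine init_model()
   end subroutine init_model
   subroutine integrate_step(i)
      integer, intent(in) :: i
   end subroutine integrate_step
   subroutine postprocessing()
   end subroutine postprocessing
end program coupled_model_sketch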
Apart from the additional subroutine calls, a few changes were required in the source codes of ECHAM, FESOM, and OASIS3-MCT which are related to the parallelization. These changes are discussed in Sec. 3.3.

Parallelization for coupled ensemble data assimilation
The modification of the model parallelization for ensemble DA is a core element of the online coupling for DA. Here, the parallelization of AWI-CM and the required changes for the DA extension are described. For FESOM, as a single-compartment model, the adaption of the parallelization was described by Nerger et al. (2005) and Nerger and Hiller (2013). A similar parallelization was also described by Browne and Wilson (2015). For the online coupling of PDAF with the coupled model TerrSysMP, the setup of the parallelization was described by Kurtz et al. (2016). While a different coupling strategy was used for TerrSysMP, the parallelization of the overall system is essentially the same as discussed here for AWI-CM. The parallelization for the DA is configured by the routine init_parallel_pdaf. In general this is a template routine, which can be adapted by the user according to the particular needs. Nonetheless, so far the default setup in PDAF has been directly usable in all single-compartment models to which PDAF was coupled. Compared to the default setup in PDAF for a single-compartment model, we have adapted the routine to account for the existence of two model compartments.
Like other large-scale models, AWI-CM is parallelized using the Message Passing Interface standard (MPI, Gropp et al., 1994). MPI allows a program to be computed by several processes with distributed memory. Thus, each process only has access to the data arrays that it allocates itself. Data exchanges between processes are performed in the form of parallel communication, i.e. the data is explicitly sent by one process and received by another process. All parallel communication is performed within so-called communicators, which are groups of processes. When the parallel region of a program is initialized, the communicator MPI_COMM_WORLD is initialized, which contains all processes of the program. In the case of AWI-CM, when the two executables for ECHAM and FESOM are jointly started, they share the same MPI_COMM_WORLD so that parallel communication between the processes running ECHAM and those running FESOM is possible. Further communicators can be defined by splitting MPI_COMM_WORLD. This is used to define groups of processes both for AWI-CM and for the extension with PDAF.
For AWI-CM without the data-assimilation extension, the parallelization is initialized by each program at the very beginning.

Then, a routine of OASIS3-MCT is called which splits MPI_COMM_WORLD into two communicators: one for ECHAM (COMM_ECHAM) and one for FESOM (COMM_FESOM). These communicators are then used in each of the compartment models, and together they build one model task that integrates one realization of the coupled model state. MPI_COMM_WORLD is further used to define one process each for ECHAM and FESOM, which perform the parallel communication to exchange flux information. It is important here that OASIS3-MCT is coded to use MPI_COMM_WORLD to define these communicators.

Each of the compartment models then uses its group of processes for all compartment-internal operations. Each model uses a domain-decomposition, i.e. each process computes a small region of the global domain in the atmosphere or the ocean.
The distribution of the processes is exemplified in Fig. 3(a) for the case of 6 processes in MPI_COMM_WORLD. Here, the communicator is split into 4 processes for COMM_FESOM (green) and 2 for COMM_ECHAM (blue).
For the ensemble DA, the parallelization of AWI-CM is modified. Generally, the introduction of the ensemble adds one additional level of parallelization to a model, which allows us to concurrently compute the ensemble of model integrations, i.e. several concurrent model tasks. In AWI-CM augmented by the calls to PDAF, the routine init_parallel_pdaf modifies the parallelization: MPI_COMM_WORLD is split into a group of communicators for the coupled model tasks (COMM_CPLMOD), as exemplified for an ensemble of 4 model tasks in Fig. 3(b), indicated by the different color shading.
Subsequently, OASIS3-MCT splits each communicator COMM_CPLMOD into a pair COMM_ECHAM and COMM_FESOM (third line in Fig. 3(b)). To be able to split COMM_CPLMOD, the source code of OASIS3-MCT needs to be modified by replacing MPI_COMM_WORLD with COMM_CPLMOD, because OASIS3-MCT uses MPI_COMM_WORLD as the basis for the communicator splitting (see also Kurtz et al., 2016, for the required modifications). With this configuration of the communicators, AWI-CM is able to integrate an ensemble of model states by computing all model tasks concurrently.
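A minimal sketch of the splitting into COMM_CPLMOD is shown below. It assumes that MPI has already been initialized, that all model tasks use the same number of processes, and that the processes of one coupled model task occupy a contiguous block of ranks (which the startup command discussed in Sec. 4.2 ensures); the actual init_parallel_pdaf routine offers further options.

subroutine split_comm_cplmod(n_modeltasks, comm_cplmod, task_id)
   ! Split MPI_COMM_WORLD into one communicator per coupled model task.
   use mpi
   implicit none
   integer, intent(in)  :: n_modeltasks    ! ensemble size N_e
   integer, intent(out) :: comm_cplmod     ! communicator of this coupled model task
   integer, intent(out) :: task_id         ! index of the model task (1..N_e)
   integer :: rank, nprocs, npes_per_task, ierr

   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   npes_per_task = nprocs / n_modeltasks   ! equal-sized model tasks assumed
   task_id = rank / npes_per_task + 1      ! contiguous rank blocks assumed

   ! All processes with the same task_id form one COMM_CPLMOD;
   ! OASIS3-MCT subsequently splits it into COMM_ECHAM and COMM_FESOM.
   call MPI_Comm_split(MPI_COMM_WORLD, task_id, rank, comm_cplmod, ierr)
end subroutine split_comm_cplmod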
Two more communicators are defined in init_parallel_pdaf for the analysis step in PDAF. Here, a configuration is used in which the analysis step is computed by the processes of the first coupled model task. The communicators COMM_COUPLE (line 4 of Fig. 3(b)) each group the corresponding processes of all model tasks; they are used to gather the ensemble information on the first model task before the analysis step and to distribute it back to all model tasks afterwards. Finally, the communicator COMM_FILTER (line 5 of Fig. 3(b)) is defined, which contains all processes of the first model task. Note that compared to the single-compartment case discussed in Nerger et al. (2005) and Nerger and Hiller (2013), the major change is that each model task is split into the communicators COMM_FESOM and COMM_ECHAM, which are, however, only used for the model integration. In addition, COMM_FILTER includes the processes of both model compartments of the first model task.
This configuration is used to perform strongly coupled DA, because it allows communication between processes of ECHAM and processes of FESOM. In a weakly coupled application of DA, COMM_FILTER is initialized so that two separate communicators are created: one for all sub-domains of FESOM and another one for all sub-domains of ECHAM, as shown in Fig. 3(c). In practice one can achieve this by using the already defined communicators COMM_FESOM and COMM_ECHAM of model task 1. Because these two communicators are initialized after executing init_parallel_pdaf, one has to overwrite COMM_FILTER afterwards, e.g. in init_PDAF. With this configuration the assimilation can be performed independently for both compartments.

Call-back routines for handling of model fields and observations
The call-back routines are called by PDAF to perform operations that are specific to the model or the observations. They can be grouped by their functionality, e.g. routines that transfer data between the model fields and the state vector, routines that handle the observations, and routines for the localization. The localization routines include a routine that initializes a local state vector for a single vertical column from the full state vector (and a corresponding routine that writes a local state vector back into the full state vector after the local analysis correction). In addition, there is a routine that determines the number of observations within the influence radius around the vertical column and a routine to fill this local observation vector from a full observation vector.
Pre- and post-processing (blue): To give the user access to the ensemble before and after the analysis step, there is a pre/post-processing routine. Here, one typically computes the ensemble mean and writes it into a file. Further, one could implement consistency checks, e.g. whether variables that have to be positive, like concentrations, are indeed positive, and perform a correction of the state variables if this is not fulfilled.
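The following sketch indicates typical operations of such a pre/post-processing routine: computing the ensemble mean for file output and clipping negative values of variables that must remain positive. The actual PDAF call-back routine has a longer argument list, and which parts of the state vector may be clipped is of course model specific.

subroutine prepoststep_sketch(dim_p, dim_ens, ens_p, mean_p)
   ! Simplified pre/post-step operations on the process-local ensemble array.
   implicit none
   integer, intent(in)    :: dim_p                  ! process-local state dimension
   integer, intent(in)    :: dim_ens                ! ensemble size
   real,    intent(inout) :: ens_p(dim_p, dim_ens)  ! process-local ensemble of state vectors
   real,    intent(out)   :: mean_p(dim_p)          ! ensemble mean state

   ! Ensemble mean, e.g. to be written into an output file by model-specific I/O
   mean_p = sum(ens_p, dim=2) / real(dim_ens)

   ! Consistency check: here the whole vector is clipped only for illustration;
   ! in practice this is restricted to the index range of concentration-like variables.
   where (ens_p < 0.0) ens_p = 0.0
end subroutine prepoststep_sketch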

Scalability
To assess the parallel performance of the assimilation system described above, AWI-CM is run here in a global configuration with ECHAM at spectral resolution T63 and a FESOM mesh that is coarse in the open ocean and regionally refined.

In the initial implementation AWI-CM-PDAF 1.0, the assimilation update is only performed as weakly coupled DA in the ocean compartment. The state vector for the assimilation is composed of the 2-dimensional sea surface height and the 3-dimensional model fields temperature, salinity, and the three velocity components. The DA is started on January 1, 2016, and satellite observations of the sea surface temperature obtained from the European Copernicus initiative (data set SST_GLO_SST_L3S_NRT_OBSERVATIONS_010_010, available at https://marine.copernicus.eu), interpolated to the model grid, are assimilated daily. The assimilation is multivariate so that the SST observations influence the full oceanic model state vector through the ensemble-estimated cross-covariances that are used in the ESTKF. The initial ensemble was generated using second-order exact sampling (Pham et al., 1998) from the model variability of snapshots at every 5th day over one year. The ensemble mean was set to a model state for January 1, 2016 from a historical (climate) run of AWI-CM. No inflation was required in this experiment, i.e. a forgetting factor ρ = 1.0 (see Eq. 4) was used. Even though we only perform weakly coupled DA here, we expect that the compute performance would be similar in the case of strongly coupled DA, as is explained in Sec. 6.

For a fixed ensemble size but varying number of processes for ECHAM and FESOM, the scalability of the program is determined by the scalability of the models (see, e.g., Nerger and Hiller, 2013). To assess the scalability of the assimilation system for varying ensemble size, experiments over 10 days were conducted with ensemble sizes between N_e = 2 and N_e = 46. The length of these experiments is chosen to be long enough so that the execution time is representative for assessing the scalability. However, the assimilation effect will be rather small for these 10 analysis steps. The number of processes for each model task was kept constant at 72 processes for ECHAM and 192 processes for the more costly FESOM. The experiments were conducted on the Cray XC40 system 'Konrad' of the North-German Supercomputer Alliance (HLRN).

In the experiments, the longest forecast time was up to 16% larger than the shortest time, which occurred for N_e = 24. This variability is partly caused by the time for the DA coupling (see discussion below), but also by the fact that the semi-implicit time stepping of FESOM leads to varying execution times. A further influence comes from the parallel communication within each compartment at each time step and from the communication for the model coupling by OASIS3-MCT, which is performed every model hour. The execution time for these operations depends on how the overall program is distributed over the computer. As the computer is also used by other applications, it is likely that the application is widely spread over the computer so that even different compute racks are used.
This can even lead to the situation that the processors for a single coupled model task of ECHAM and FESOM, but also for a single model instance of ECHAM or FESOM, are not placed close to each other. If the processors are distant, e.g. in different racks, the communication over the network will be slower than for a compact placement of the processors. For this reason, the execution time also varies when an experiment for the same ensemble size is repeated. Nonetheless, repeated experiments showed that the timings in Fig. 5 are representative. The time for the DA coupling, i.e. for gathering the ensemble information before the analysis step and distributing it afterwards, shows a larger relative variation. This can be attributed to the fact that here the communication happens in the communicators COMM_COUPLE, which are spread much wider over the computer than the communicators for each coupled model task (COMM_CPLMOD), as is visible in Fig. 3. However, even though the number of ensemble states to be gathered and scattered in the communication for the DA coupling varies between 2 and 46, there is no obvious systematic increase in the execution time. In particular, for N_e = 40 the execution time is almost identical to that of N_e = 2.

Further variation with the ensemble size is visible for the pre-/post-step operations (red line). This variation is mainly due to the operations for writing the ensemble mean state into a file. In contrast, the analysis step shows a systematic time increase. The time for computing the analysis for N_e = 46 is about seven times as long as for N_e = 2. This is expected from the computational complexity of the LESTKF algorithm (see Vetra-Carvalho et al., 2018). However, the LESTKF also performs MPI communication for gathering the observational information from the different process domains. When repeating experiments with the same ensemble size, we found a variation of the execution time for the analysis step of up to 10%.

Performance tuning
To obtain the scalability discussed above, important optimization steps had to be performed. First, it is important that each coupled model instance is, as far as possible, placed compactly in the computer. Second, one has to carefully consider the disk operations performed by the ensemble of coupled model tasks.

For the first aspect, one has to adapt the run script. The coupled model is usually started with a command line (using mpirun or any other suitable starter for an MPI-parallel program) such that FESOM and ECHAM are run using N_O and N_A processes, respectively. For the DA one could simply change this by replacing N_O by N_e × N_O and N_A by N_e × N_A to provide enough processes to run the ensemble. This is analogous to the approach used when running a single-compartment model. However, changing the command line in this way will first place all MPI tasks for the FESOM ensemble in the computer, followed by all MPI tasks for the ECHAM ensemble. Accordingly, each ocean model will be placed distant from the atmospheric model to which it is coupled. Using this execution approach, the time for the forecasts discussed above increased by a factor of four when the ensemble size was increased from 2 to 46. For a more efficient execution, one has to ensure that the ocean-atmosphere pairs are placed close to each other. This is achieved with a command line which contains as many FESOM-ECHAM pairs as there are ensemble members (as sketched below). With this approach, the time increase of the forecast was reduced to about 40% for the increase from N_e = 2 to N_e = 46.
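For illustration, using the MPMD 'colon' syntax of a generic mpirun and hypothetical executable names fesom.x and echam.x (the actual starter, e.g. aprun or srun, and the executable names depend on the system), the two placements read:

Naive placement (all FESOM instances first, then all ECHAM instances):

   mpirun -np <Ne*N_O> ./fesom.x : -np <Ne*N_A> ./echam.x

Pair-wise placement (one FESOM-ECHAM pair per ensemble member, repeated N_e times):

   mpirun -np <N_O> ./fesom.x : -np <N_A> ./echam.x : -np <N_O> ./fesom.x : -np <N_A> ./echam.x : ...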
For the second issue regarding disk operations, one has to take into account that the direct outputs written by each coupled ensemble task are usually not relevant because the assimilation focuses on the ensemble mean state. To this end, one generally wants to deactivate the outputs written by the individual models and replace them by outputs written by the pre-/post-step routine called by PDAF. If the model does not allow the file output to be fully switched off, it usually helps to set the output interval of a model to a high value (e.g. one year for a year-long assimilation experiment). However, in the case of AWI-CM this strategy still resulted in conflicts of the input/output operations, so that the models from the different ensemble tasks tried to write into the same files, which serialized these operations and increased the execution time. To avoid these conflicts, it helped to distribute the execution of the different ensemble tasks over different directories, e.g. combined with a prior operation in the run script to generate the directories and distribute the model executables and input files. This distribution avoids that two model tasks write into the same file and improves the performance of the ensemble DA application. In this configuration, the performance results of Sec. 4.1 were obtained. Another benefit of separate execution directories is that ensemble restarts can be easily realized. Given that each model task writes its own restart files in a separate directory, a model restart is possible from these files without any adaptions to the model code. Note that the approach of separate directories is also possible for the ensemble DA in the case of a single (uncoupled) model, like a FESOM-only simulation using atmospheric forcing data as, e.g., applied by Androsov et al. (2019).

Assimilation experiment
Applications based on the AWI-CM-PDAF 1.0 code are Mu et al. (2020), where the focus is on the effect on sea ice, and Tang et al. (2020), who discuss the reaction of the atmosphere to assimilating ocean observations. Here, we demonstrate the functionality of the data assimilation system in an experiment assimilating SST data over the year 2016. An ensemble of 46 states is used, which is the maximum size used in the scalability experiments discussed above.
The assimilation is performed with a localization radius of 500 km using the regulated localization function by Nerger et al. (2012a). The same SST observations as in Sec. 4.1 are assimilated, which are treated as in Tang et al. (2020). The resolution of the observations is 0.1° and hence higher than the resolution of the model in most regions. Since the model grid is unstructured with varying resolution, super-observations are generated by averaging onto the model grid. The observation error standard deviation for the assimilation was set to 0.8°C, and observations whose difference from the ensemble mean is more than two standard deviations are excluded from the assimilation. This approach excludes about 22% of the observations at the first analysis step. The number of excluded observations shrinks during the course of the assimilation, and after one month less than 5% of the day's observations are excluded. The assimilation further excludes observations at grid points for which the model contains sea ice, because of the mismatch between the satellite data, which represent ice-free conditions, and the ice present on the modeled ocean surface. Two experiments are performed: the experiment FREE runs the ensemble without assimilating observations, while the experiment DA-SST assimilates the SST data.

To validate the assimilation with independent observations, temperature and salinity profiles from the EN4 data set (EN4.2.1) of the UK MetOffice (Good et al., 2013) are used. This collection of in situ data contains about 1000 to 2000 profiles per day at depths between the surface and 5000 m. Figure 8 shows the RMSE of the experiment DA-SST relative to the RMSE of the free run; hence values below one indicate improvements. For the temperature, a gradual improvement is visible during the first 100 days. The error reduction reaches about 40% during the year. On average, the RMSE is reduced by 14% from 1.85°C to 1.40°C. The variations in the RMSE, e.g. the elevated values around day 250, are due to the varying coverage and location of the profiles in the EN4 data set. For the salinity, the effect of the DA is smaller. While the RMSE of the salinity increases during the first month, it is reduced from day 60 onward, but until day 140 it is sometimes larger than at the initial time. At times the RMSE is reduced by up to 23%, e.g. at day 144. On average over the full year of the experiment, the RMSE of salinity is reduced by 5.6%. This smaller effect on the salinity is expected because there are no strong correlations between the SST and the salinity at different depths. The improvements of the model fields by the DA of SST are mainly located in the upper 200 m of the ocean.
For the temperature, the RMSE is reduced by 15.2% in the upper 200 m, but only by 3.0% below 200 m. This is also an expected effect because the correlations between SST and subsurface temperature are largest in the mixed layer of the ocean.

Discussion
The good scalability of the assimilation system allows us to perform the assimilation experiment of Sec. 5 over one full year with daily assimilation in slightly less than 4 hours, corresponding to about 53,000 core-hours. As such, the system is significantly faster than the coupled ensemble DA application by Karspeck et al. (2018), who reported completing one year in 3 to 6 weeks with an ensemble of 30 states and about one million core-hours per simulation year. However, both systems are not directly comparable. Karspeck et al. (2018) used atmospheric and ocean models with 1° resolution. Thus the atmosphere had a higher resolution than used here, while the ocean resolution was comparable to the coarse FESOM resolution in the open ocean, which was then regionally refined. Given that both model compartments in AWI-CM scale to larger processor numbers than we used for the DA experiment, we expect that the DA into AWI-CM with ECHAM at a resolution of T127 (i.e. about 1°) could be run at a similar execution time as for T63, provided that a higher number of processors is used. Further, Karspeck et al. (2018) applied the DA also in the atmosphere, while here only oceanic data was assimilated. Given that the atmospheric analysis step would typically be applied every 6 hours, the time for the DA coupling and the analysis steps would increase. However, we do not expect that a single atmospheric analysis step would require significantly more time than the ocean DA, so that, due to the parallelization, the overall run time should not increase by more than 10-20%. Further, we expect a similar scalability in the case of strongly coupled DA. The major change for strongly coupled DA is to communicate the observations between the compartments as mentioned above. This communication will only be a small part of the analysis time.
Important for the online-coupled assimilation system is that no significant time is required for redistributing the model fields (i.e. the time for the DA coupling discussed in Sec. 4.1). Furthermore, there is no transpose of the ensemble array to be performed, which was reported to be costly by Karspeck et al. (2018). Here, the implementation of the analysis step uses the same domain decomposition as the models, and hence only the full ensemble for each process sub-domain has to be collected by the DA coupling. Thus, only groups of up to 46 processes communicate with each other in this step.
The online-coupled assimilation system avoids any need for frequent model restarts. Actually, the initial model startup of AWI-CM took about 95 seconds, and the finalization of the model with writing restart files took another 15 seconds. Thus, these operations take about 3.3 times longer than integrating the coupled model for one day. If the DA were performed in a separate program coupled to AWI-CM through files, these operations would be required every model day. In addition, the assimilation program would also need to read these restart files and write new restart files after the analysis step. Assuming that these operations take about 15 seconds, like the finalization of the coupled model, the execution time would increase by about a factor of 4 for offline-coupled DA compared to online-coupled DA.
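A rough consistency check of this factor, using the timings quoted above (95 s start-up and 15 s finalization, together about 3.3 times one model day; the overhead of about 15% reported for the online DA; and the assumed 15 s each for restart reading and writing by a separate assimilation program), reads

t_day ≈ (95 s + 15 s) / 3.3 ≈ 33 s,
t_offline / t_online ≈ (33 s + 95 s + 15 s + 2 × 15 s) / (1.15 × 33 s) ≈ 173 s / 38 s ≈ 4.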
The code structure using interface routines inserted into the model code and case-specific call-back routines makes the assimilation framework highly flexible. Further, the abstraction in the analysis step, which uses only state and observation vectors without accounting for the physical fields, allows one to separate the development of advanced DA algorithms from the development of the model. Thus, a separation of concerns is ensured, which is essential for the efficient development of complex model codes and their adaptation to modern computers (Lawrence et al., 2018). The separation ensures that, as soon as a new DA method is implemented, all users with their variety of models can use this method by updating the PDAF library. To ensure compatibility of different versions of the library, the interfaces to the PDAF routines are kept unchanged. However, for a new filter additional call-back routines might be required, e.g. a routine to compute the likelihood of an ensemble member according to the available observations in the case of the nonlinear ensemble transform filter (NETF, Tödter and Ahrens, 2015) or a particle filter. The abstraction in the analysis step and the model-agnostic code structure also allow the assimilation framework to be applied independently of the specific research domain. For example, applications of PDAF with a geodynamo model (Fournier et al., 2013), hydrological applications (Kurtz et al., 2016), ice sheet modeling (Gillet-Chaulet, 2020), and volcanic ash clouds (Pardini et al., 2020) have been published.
The example here uses a parallelization in which the analysis step is computed using the first model task and the same domain decomposition as the model. Other parallel configurations are possible. For example, one could compute the analysis step not only using the processes of model task 1, but on the processes of several or all model tasks. This could be done either by using a finer domain decomposition than in the model integrations, or by, e.g., distributing different model fields onto the processes. These alternative parallelization strategies are, however, more complex to implement and hence not the default in PDAF. A further alternative, which is already supported by PDAF, is to dedicate a separate set of processes to the analysis step. In this case, the DA coupling would communicate all ensemble members to these separate processes. However, these processes would idle during the forecast phase. Hence, separating the processes for the analysis step would mainly be a choice if the available memory on the first model task is not sufficient to execute the analysis step. Also in this case, the distribution of the analysis step over several processes would reduce the required memory. For the parallel configuration of AWI-CM-PDAF in Fig. 3, a particular order of the processes is assumed. This order originates from the startup procedure of MPI and is determined by the command line which starts the program. Thus, for other models one might need a different setup, which can usually be obtained by only modifying the routine init_parallel_pdaf. Further, the default version of this routine splits the communicator MPI_COMM_WORLD.
However, for other models a different suitable communicator might be split if not all processes participate in the time stepping.

This can be the case when, e.g., an I/O server is used that reserves processes exclusively for the file operations. To provide the flexibility to adapt to such requirements, the routine init_parallel_pdaf is compiled with the model and is not part of the core routines of the PDAF library.
While the fully parallel execution of the assimilation program is very efficient, it is limited by the overall job size allowed on the computer. The maximum ensemble size was here limited by the batch job size of the computer used. The model used in the example here can scale even further than, e.g., the 192 processes used for FESOM and 72 processes for ECHAM. Thus, using the same computer, one could run a larger ensemble with fewer processes per model and accordingly a longer run time, or a smaller ensemble with a shorter run time. The number of processes should be set so that the requirements on the ensemble size for a successful assimilation can be fulfilled. Nonetheless, the ensemble DA is computationally demanding, and for larger applications one might need to obtain a compute allocation at larger computing sites, like national compute centers.

Conclusions
This study discussed the Parallel Data Assimilation Framework (PDAF) and its use to create a coupled data assimilation program by augmenting the code of a coupled model and using in-memory data transfers between the model and the data assimilation software. The implementation strategy was exemplified for the coupled ocean-atmosphere model AWI-CM, for which two separate programs for the ocean and the atmosphere were augmented. However, the strategy can easily be used for other model systems consisting of a single or multiple executables.
The implementation of a DA system based on PDAF consists of augmenting the model codes with calls to routines of the assimilation framework. These routines modify the parallelization of the model system so that it becomes an ensemble model. Care has to be taken that in the coupled model the pairs of atmosphere and ocean model compartments are placed close to each other in the computer, which can be achieved by specifying these pairs in the command starting the parallel program. The analysis step is computed on the processes of the first coupled model task, which requires gathering the ensemble information from all model tasks. The time to collect this ensemble information before the analysis step and to distribute it afterwards showed significant variations from run to run. These variations are due to the fact that the large compute application is widely spread over processors of the computer. However, no systematic time increase was observed when the ensemble size was increased, and the time was only up to about 6% of the time required for the forecasting. Distributing the different models over separate directories improved the scalability because it avoided possible conflicts in the file handling, which can be serialized by the operating system of the computer.

PDAF provides a model-agnostic framework for the efficient data assimilation system as well as filter and smoother algorithms. As such it provides the capacity to ensure a separation of concerns between the developments in the model, the observations, and the assimilation algorithms. Functionality to interface between the model, which operates on physical fields, and the assimilation code, which only works on abstract state vectors, has to be provided in a case-specific manner by the users based on code templates. This also holds for the observation handling. While there are typical observational data sets for the different Earth system compartments, the observation operator links the observations with the model fields on the model grid. Thus, the observation operator has to be implemented taking into account the specific character of the model grid, like the unstructured nature of FESOM's grid.
The current implementation of AWI-CM-PDAF only contains the assimilation into the ocean component, while the assimilation into the atmosphere is technically prepared. First studies (Mu et al., 2020; Tang et al., 2020) are based on this implementation.
In future work we plan to add the assimilation of atmospheric observations and to complete the implementation of strongly coupled data assimilation, which requires the exchange of observations between the ocean and the atmosphere.
Code availability. The model binding for AWI-CM-PDAF 1.0 used in this study is archived at Zenodo (Nerger et al., 2019a). The PDAF code (version 1.14 was used here), as well as a full code documentation and a usage tutorial, are available at http://pdaf.awi.de.

Figure 3 (caption excerpt): The colors and lines mark processes that are grouped as a communicator. Different shades of the same color mark the same communicator type (e.g. four orange communicators COMM_FESOM). For COMM_COUPLE each communicator is spread over the model tasks. The numbers mark the rank index of a process in a communicator.

Figure 8 (caption excerpt): Relative RMSEs of experiment DA-SST with respect to the profile observations, for temperature (blue) and salinity (green). Temperature is more strongly improved than salinity.