Concurrent calculation of radiative transfer in the atmospheric simulation in ECHAM-6.3.05p2

. The scalability of the atmospheric model ECHAM6 at low resolution, as used in palaeoclimate simulations, suffers from the limited number of grid points. As a consequence, the potential of current high-performance computing architectures cannot be used at full scale for such experiments, particularly within the available domain decomposition approach. Radiation calculations are a relatively expensive part of the atmospheric simulations, taking up to approximately 50 % or more of the total runtime. This current level of cost is achieved by calculating the radiative transfer only once in every 2 h of simulation. In response, we propose extending the available concurrency within the model further by running the radiation component in parallel with other atmospheric processes to improve scalability and performance. This paper introduces the concurrent radiation scheme in ECHAM6 and presents a thorough analysis of its impact on the performance of the model. It also evaluates the sci-entiﬁc results from such simulations. Our experiments show that ECHAM6 can achieve a speedup of over 1.9 × using the concurrent radiation scheme. By performing a suite of stand-alone atmospheric experiments, we evaluate the inﬂuence of the concurrent radiation scheme on the scientiﬁc results. The simulated mean climate and internal climate variability by the concurrent radiation generally agree well with the classical radiation scheme, with minor improvements in the mean atmospheric circulation in the Southern Hemisphere and the atmospheric teleconnection to the Southern Annular Mode. This empirical study serves as a successful example that can stimulate research on other concurrent components in atmospheric modelling whenever scalability becomes challenging


Introduction
Earth system modelling has traditionally been a computationally demanding domain with a continual increase in complexity and resolution. It is a major application of highperformance computing and represents a large variety of climate processes with diverse computational profiles and disparate performance optimization requirements. A primary subsystem of a typical Earth system model (ESM) is the atmospheric simulation, which resolves several physical processes including radiative transfer.
Radiative transfer is one of the most expensive parts for coarse-and low-resolution atmospheric simulations. This process is resolved to respond to the changing state of the chemical species which interact with the radiation (Balaji et al., 2016;Salby, 1996;Wallace and Hobbs, 2006). Solar energy is the driving force for the atmosphere through radiative transfer, which is the only physical process that is capable of exchanging energy between a planet like the Earth and the rest of the universe (Wallace and Hobbs, 2006). Energy transfer in the atmosphere involves electromagnetic radiation that can be separated into shortwave and longwave parts: shortwave, emitted by the sun, and longwave, emitted by the Earth's surface and the atmosphere (Wallace and Hobbs,7440 M. R. Heidari et al.: Concurrent radiation scheme in the atmospheric model ECHAM-6.3.05p2 -including greenhouse gases, aerosols, and clouds -that interact with electromagnetic radiation through the mechanisms of absorption, scattering, and emission. The level of interaction strongly depends on the state of the atmospherics particles (evolving by advection, cloud processes, and chemistry) and the optical properties (the wavelengths and intensity) of the incident radiation (Wallace and Hobbs, 2006).
In principle, the absorption of solar radiation by the atmosphere and the Earth's surface must be balanced by the longwave emission to space from the terrestrial radiation (Salby, 1996). It is crucial for atmospheric models to accurately represent the radiative transfer process (Rasp, 2019). Solving the problem is in essence straightforward (Wallace and Hobbs, 2006). However, this can be quite computationally demanding in practice, despite the simplifying approximations adopted in the radiation component (Balaji et al., 2016;Wallace and Hobbs, 2006). As a result, in most climate models around the world, this component is not called in every time step (Balaji et al., 2016). On the contrary, it is calculated at a coarser time step than the rest of the atmospheric physics, entirely pursuing a performance improvement rather than fulfilling any other technical objective (Balaji et al., 2016).
Over the years, various techniques have been used to represent radiative transfer in different models and maintain its calculation cost within acceptable limits. Morcrette (2000) discusses the effect of the temporal or spatial sampling techniques of the radiation inputs on the forecasts and analyses at the European Centre for Medium-Range Weather Forecasts (ECMWF). In a different approach, the radiative calculations are computed on a coarser grid than the one on which all other physical processes are implemented. Morcrette et al. (2008) report the implementation of a reduced grid for radiation in the ECMWF Integrated Forecasting System (IFS) model.
Resolving radiation transfer on coarser time and spatial resolutions, however, can lead to errors in weather and climate simulations. A report by Hogan and Bozzo (2015) describes a computationally efficient solution to this problem. It suggests updating the surface longwave and shortwave fluxes in every time step and grid point according to the local skin temperature and albedo. Hogan and Hirahara (2016) propose a careful treatment of the cosine of the solar zenith angle in models that calculate radiation every 3h. This solution aims at reducing the negative impacts of the biases that occur due to discrete sampling of solar zenith angle. A follow-up study by Hogan and Bozzo (2018) introduces a flexible new radiation scheme (ecRAD) in the ECMWF model which is around 41 % faster than the previous package. The report shows some improvements in the skill of weather forecasts by calling the radiation scheme more frequently for the same overall computational cost.
Another proposal is based on the re-organization of the radiation component within the atmospheric models. In this approach, the radiative transfer is calculated in separate tasks in parallel with the rest of the model, in pursuit of improved scalability and performance. In classical atmospheric modelling, all physical processes including solar radiation are resolved sequentially with respect to each other, thus creating a prohibitively long latency in the overall runtime. Reorganizing the radiation component as separate parallel tasks allows for the simultaneous calculation of radiative transfer along with the other atmospheric processes and removes the long response latency from the component. Mozdzynski and Morcrette (2014) report a re-organization of the radiation calculations in IFS at ECMWF and demonstrate the radiation-in-parallel configuration, in which calculating radiative transfer is performed on separate MPI (Message Passing Interface) processes in parallel with the rest of the model.
In a similar effort but in a broader sense, Balaji et al. (2016) propose coarse-grained component concurrency (CCC) to increase the level of concurrency within ESMs. This approach suggests re-organizing more lowerand higher-level components in parallel with each other and exploiting fine-grained parallelism within each component individually. Additionally, the report demonstrates the result of applying this approach to the radiation component of the Geophysical Fluid Dynamics Laboratory (GFDL) Flexible Modeling System (Balaji, 2004). In this use case, the atmospheric radiative transfer is configured to run in parallel with the atmospheric dynamics and all other atmospheric physics. This technique uses shared-memory parallelism and divides the available threads in an OpenMP (Open Multi-Processing) region between the parallel components.
This paper concentrates on the performance optimization of the radiation scheme in ECHAM6, which is the sixth generation of the atmospheric general circulation ECHAM . The model was developed at the Max Plank Institute for Meteorology (MPI-M) in Hamburg. It is the traditional atmospheric component of the coupled Earth system model MPI-ESM, as described by Giorgetta et al. (2013). ECHAM6 benefits from spectral and finitedifference methods in five different grid resolutions, ranging from the coarse (CR) and low resolution (LR) to the very high resolution (XR). The CR or T31 is a truncation to 31 wave numbers in the spectral part and corresponds to a horizontal spatial resolution of 96 × 48 points in longitude and latitude. The LR or T63, however, corresponds to 192 × 96 points. Its special prominence in this research is due to its application in the German climate modelling initiative (PalMod), which aims at simulation of a complete glacial cycle (i.e. about 120 000 years) from the last interglacial to the Anthropocene (https://www.palmod.de, last access: 30 November 2021). However, a serious caveat remains as to the feasibility of such an ambitious project, which should be acknowledged in advance. In particular, a major concern has been raised over the poor performance of ECHAM6 suffering from the limited number of grid points at setups used in palaeoclimate simulation. For this reason, the performance optimization of the model is instrumental in ensuring the viability of such long-time simulations.
Experiments reveal that the radiation component is one of the most expensive computational parts in ECHAM6, at least for palaeoclimate simulation. Endeavours to adopt higher optimized radiative calculations will therefore be opportune for PalMod experiments. In response, two solutions have been investigated in parallel within the PalMod project to alleviate the computational burden of the radiative transfer on the atmospheric simulations in ECHAM6: single-precision arithmetic and the concurrent radiation scheme. The technique of mixed-precision calculations is a practice of reducing the time to solution of scientific algorithms whenever lower accuracy is permitted. Empirical studies have revealed that radiation calculations can benefit from reduced-precision arithmetic. Cotronei and Slawig (2020) discuss the results of applying single-precision arithmetic to the radiation calculations in ECHAM6. They indicate that this mathematical treatment accelerates the component by about 40 %, while the overall runtime of the model is reduced by 18 %.
This paper, on the other hand, presents a report on the concurrent radiation scheme applied to the atmospheric model ECHAM6 and provides a thorough analysis of the performance and stability of the model. Calculating radiative transfer in parallel with other atmospheric processes can potentially affect the model's accuracy since the radiation fields will always lag one more radiation time step behind in comparison with the classical scheme. This lag may have negative impacts on physical processes that benefit from a tighter coupling in time with radiation. The boundary layer clouds, particularly stratocumulus, are a good example. They are maintained by longwave cooling at cloud tops once they are formed. This could explain why Hogan and Bozzo (2018) found that calling radiation more frequently led to more skilful forecasts of near-surface temperature and low cloud cover.
In contrast to the OpenMP approach used by Balaji et al. (2016), the concurrent radiation scheme opts for the MPI parallelization in order to fully exploit the potential of higher concurrency in the model. In addition, encapsulating the radiation calculations in a distinct component and name space realizes the idea of separation of concerns (SoC), which results in more degrees of freedom. An immediate benefit includes the independent development and optimization of the radiation component from the main model pursuing higher throughput, which is essential for the ambitious long simulation runs of the PalMod project. This architectural merit also enables the potential of combining the virtues of the concurrent radiation scheme with other appropriate optimized solutions such as "single-precision arithmetic in ECHAM radiation" (Cotronei and Slawig, 2020) in the future.
The paper is organized as follows: Sect. 2 describes the classical approach to the radiation calculations in ECHAM6. In Sect. 3, we introduce the solution of a new radiation scheme in the model. Finally, the performance analysis and the scientific evaluation of the new scheme are presented in Sects. 4 and 5, respectively.

The classical radiation scheme in ECHAM6
Radiative transfer in ECHAM6 is represented with PSrad/RRTMG (a postscript to the Rapid Radiative Transfer Model for general circulation models -GCMs; Pincus and Stevens, 2013) for both shortwave and longwave parts of the electromagnetic spectrum . The radiation component is one of the most expensive components within atmospheric physics. As a result, this component is stepped forward at a slower rate than the rest of the atmospheric physics in ECAHM6 as well as most of the climate models around the world (Balaji et al., 2016). In its flagship configuration, ECHAM6 updates the optical properties of radiation every 2 h, except for very high-resolution (T255) simulations in which the radiation is calculated hourly. Figure 1 shows the organization of the classical radiation scheme in ECHAM6. As is apparent in this figure, the radiative transfer is generally calculated once in every n normal atmospheric time steps, i.e. t rad = n × t atm .
In this scheme, the radiation results are used beyond the state of the input tracers, which may be as much as t rad − t atm (approximately 2 h) behind. Figure 1 schematically gives a clear account of what happens. As it shows, there are multiple normal atmospheric time steps between two consecutive radiation time steps. The atmospheric calculations are provided with old feedback from the radiation component within the normal time steps. On non-radiation time steps between updates of the optical properties, longwave irradiance is rescaled based on the surface temperature, while shortwave irradiance is rescaled by the zenith angle . Infrequent calculations of the radiative heating may result in numerical instability in climate models, as described by Pauluis and Emanuel (2004). As pointed out by Balaji et al. (2016), the use of the lagged state can be viewed as a potential source of discrepancy between the cloud field and the "cloud shadow field" seen by the radiation component. This therefore introduces numerical errors in atmospheric models and becomes considerably worse at higher resolutions (Xu and Randall, 1995;Balaji et al., 2016). Although the choice of larger radiation time steps ( t rad > t atm ) evidently reduces the overall runtime of simulations, the radiative portion is still considered relatively high for some configurations in ECHAM6. As shown in Fig. 2, the radiative calculations take up almost 40 % to 58 % of the total simulation time at the CR resolution (which is also one of the PalMOD settings), depending on the number of MPI processes assigned to the model.
The entire crux of the problem in the classical radiation scheme can be attributed to the sequential organization of the components inside the model. In this architecture, atmospheric processes are stepped forward one at a time in every 7442 M. R. Heidari et al.: Concurrent radiation scheme in the atmospheric model ECHAM-6.3.05p2 Figure 1. The organization of the classical radiation scheme in ECHAM6: the radiative transfer is resolved sequentially with respect to the other atmospheric physics and dynamics, and it is stepped forward at a slower rate than the other atmospheric processes. time step, and the computation time of each component directly contributes to the overall simulation time. In particular, the radiation component significantly delays the following calculation of other atmospheric physics and dynamics during the entire course of simulation. As is apparent in Fig. 1, this architecture prolongs the radiation time step in proportion to the high computational cost of the radiation component. It will be shown in the next section that this long response time of the radiation calculations, however, is not inevitable and can be avoided by re-organizing the component inside the model. Moreover, the sequential organization of components in the classical ECHAM6 creates another obstacle that hinders the optimization of the model. In fact, ECHAM6 traditionally benefits from MPI and implements domain decomposition parallelism to expedite the computations. As Fig. 3 shows, the radiation calculations display a higher scalability than the main model in this framework. This can be attributed to the column-wise organization of the atmospheric physics and the embarrassingly parallel nature of the workload. In other words, since the individual columns (iterating the k index in an (i, j, k) discretization) have no crossdependency in (i, j ), it allows for fine-grained parallelism (Balaji et al., 2016). This is the reason why the radiation component can intrinsically scale better beyond the limitations of the main model. Such a higher scalability is instrumental in reducing the high computational cost of the radiation calculations. However, the sequential architecture of the classical Figure 3. The scaling curve of the radiation component vs. the atmospheric model ECHAM6 shows that the radiative transfer has a higher scalability and keeps scaling beyond the limitations of the main model. ECHAM6 restrains the benefit by forcing the radiation component to use the same computational resources as the rest of the model. As a consequence, the component is hindered by the limited scalability of the whole model. In the next section, however, it will be shown that re-organization of the components in ECHAM6 is essential to improve the overall performance of the model.

The concurrent radiation scheme
It was discussed in the previous section that, even in light of larger radiation time steps, radiation calculations impose a daunting cost on the atmospheric simulation at coarse and low resolution in ECHAM6. It was also shown that the sequential treatment of resolving atmospheric processes is implicated in the long response time of the radiation component in every radiation time step and restricts the benefit of the higher scalability of the radiation calculations.
The concurrent radiation scheme, on the other hand, puts forward a feasible solution to the problem. It implements an additional level of parallelism inside the model by applying coarse-grained component concurrency to the radiation calculations. This approach eliminates the high response latency from the radiation time step and paves the way for higher scalability in the model. In contrast to the classical scheme, the concurrent radiation scheme starts resolving the radiative transfer much earlier before the next radiation time step ar- In the first (radiation) time step, ATM (the main model) sends the input data to RAD (the radiation component), but it has to wait a long time until it receives the results from RAD. In the following radiation time steps, however, the data exchange takes place immediately one after the other (i.e. ATM first receives the results of radiation calculations and immediately provides the input data to RAD for the next radiation calculations). This way, ATM is supposed to experience a minimum idle time when it interacts with RAD.
rives. As a result, the main model receives feedback from the radiation component much faster upon the request. This technique minimizes the response latency of the component and reduces the overall simulation time. In this approach, the radiative transfer is calculated concurrently with other atmospheric processes along the course of normal time steps. The coupling fields are also exchanged between the radiation component and the main model within the radiation time steps but without the typical delay experienced in the classical scheme. Figure 4 describes a method for casting the radiative transfer as a concurrent component using distributedmemory computing. This technique organizes the radiation component and the main model on separate MPI processes and enables the concurrent calculation of the radiative transfer and other atmospheric processes.
Due to the close time dependency of radiation on evolving model fields, the concurrency between the atmosphere and radiation component is only enacted between consecutive radiation time steps. In addition, the synchronization between the concurrent components takes place during each radiation time step. As is shown in the next section, the interprocess synchronization overhead, however, is negligible compared to the cost of the radiation calculations. The data exchange between the radiation and the atmospheric model benefits from the communication library YAXT (Behrens et al., 2014), which was developed at the German Climate Computing Center (DKRZ) in Hamburg. YAXT simplifies the formulation of the communication problem and generates suitable communication objects to efficiently execute the data exchange. The library is specially suited for bulk communication as given in this use case. This is due to the automatic generation of MPI data types that enable direct access to model data without requiring additional data copies or packing and unpacking overhead to create messages. YAXT is built on top of MPI, takes high-level descriptions of arbitrary domain decomposition, and automatically derives an efficient collective data exchange. Figure 5 shows the use of the YAXT library for coupling two concurrent components with different domain decomposition.
In addition, running the concurrent components on separate MPI processes is a prerequisite to applying arbitrary domain decomposition to the radiation calculations. It offers a way forward to scale the component beyond the limits traditionally imposed by the atmospheric model. Hence, the concurrent radiation scheme prepares the ground for improving the physical consistency between the radiative and physicochemical atmospheric states. This approach can accordingly minimize the discrepancy between t rad and t atm (Pauluis and Emanuel, 2004;Xu and Randall, 1995;Balaji et al., 2016). Furthermore, the independent resource allocation feature is essential for an efficient load balancing between concurrent components and parallel efficiency of the model. The next section presents these merits in more detail with some concrete examples.

Results I: performance analysis
This section presents the performance evaluation of the concurrent radiation scheme in ECHAM6 in comparison with the classical approach. It should be emphasized that the new scheme utilizes almost the same original implementation of radiation calculations with a radically different orchestration. The new organization therefore exerts a major impact on the overall simulation time rather than the pure computational performance of the radiation component. In consequence, the performance evaluation presented in this section explicitly aims at the assessment of the whole model and will not be limited to the radiation component. For the purpose of this study, a new version of ECHAM6 (based on ECHAM-6.3.05p2) is deployed with both classical and concurrent radiation schemes which can be configured to calculate the radiative transfer with or without separate MPI processes. Experiments are performed on the Mistral supercomputer at DKRZ on a machine configuration with Intel Haswell processors (E5-2680v3 12C 2.5 GHz) and a Mellanox FDR Infiniband high-speed interconnect. All runs allocate a layout of one MPI process per CPU core on the computing nodes equipped with two processors which are exclusively dedi-cated to the experiments. The performance results presented in this paper are obtained from the Atmospheric Model Intercomparison Project (AMIP) experiments performing simulations at the CR from 1976 to 1981.
To understand the impact of the concurrent radiation scheme on the overall performance, it is useful to extract the scaling curves of the model with both radiation schemes and study the gained speedup via the concurrent radiation scheme. For this purpose, the frequency at which the radiative transfer is calculated is by default every 2 h in all runs, i.e. t rad = 8 × t atm with the normal time step of 15 min. Additionally, once the model is configured to use the concurrent radiation scheme, an equal number of MPI processes, and thus identical domain decomposition, is assigned to the radiation component and the main model. Figure 6 compares two scaling curves which reflect the performance of the model with the classical (the blue curve) and concurrent radiation schemes (the red curve). The horizontal axis shows the total number of MPI processes allocated by the model. It is worth emphasizing that ECHAM6 (using the classical radiation scheme) uses the same MPI processes to calculate radiative transfer and other atmospheric processes. However, when the model is configured to use the concurrent radiation scheme, half of the allocated MPI processes are exclusively dedicated to the radiation calculations and the other half to the rest of the atmospheric physics and dynamics. The y axis, on the other hand, reflects the throughput of the model in terms of the number of simulated years per day (SYPD). As can be inferred from Fig. 6, ECHAM6 can achieve only a maximum performance of 450 SYPD at 576 MPI processes using the classical radiation scheme at the CR resolution. However, it yields a significant improvement using the concurrent radiation scheme and reaches a maximum performance of 734 SYPD at 1152 MPI processes. It is noteworthy that, due to the limited number of grid points at the CR resolution, running the classical model at higher domain decomposition is not justified theoretically. Needless to say, it does not attain any significant performance improvement in practice either, as asserted in Fig. 3 where the scaling curve of ECHAM6 tends to flatten towards the end. This should explain why the blue curve in Fig. 6 stops at 576 MPI processes as opposed to the red curve scaling beyond. On this account, the concurrent radiation scheme acquires a new significance as it becomes conducive to higher scalability of the model.
The red curve in Fig. 7 displays the methodical speedup of the model using the concurrent radiation scheme. Here, the methodical speedup means improved runtime of the model by making use of the proposed concurrency, in contrast to the classical definition of speedup, whereby additional resources are used for the same computation. The methodical speedup is therefore the ratio of the overall performance of the model using the concurrent radiation scheme (using 2N resources) divided by the performance of the model using the classical radiation scheme (using N resources). The method-ical speedup S is defined as follows.
SYPD concurrent =The number of simulated years per day with the concurrent radiation scheme (using 2N MPI processes) (1) SYPD classical =The number of simulated years per day with the classical radiation scheme (using N MPI processes) On this account, for each point on the speedup curve(s), the number of resources assigned to the model using the classical radiation scheme is half of the resources allocated by the model using the concurrent radiation scheme. Hence, the x axis indicates only the total number of allocated MPI processes to the model if the concurrent radiation scheme is used by the model. However, the model allocates half of the MPI processes shown at the x axis when it adopts the classical radiation scheme. The red curve shows that the model achieves an actual speedup ranging from a minimum of 1.6× to over 1.9× when it benefits from the concurrent radiation scheme. The blue curve, however, shows the asymptotic speedup which the model would achieve if there were no cross-dependency, and thus no communication latency, between the radiation component and the main model.
Moreover, the coupling fields between the radiation component and the rest of the model are exchanged at every radiation time step and potentially contribute to the overall simulation time. There is nevertheless mounting evidence in Fig. 8 indicating that the communication overhead compared to both radiation and total runtime of the model is negligible (less than 1.4 %) and therefore has little impact on the total simulation time. Consequently, the performance of the model (using concurrent radiation scheme) is mainly affected by the relative cost of the radiation calculations, which has a strong dependency on the number of allocated computing resources. Figure 9 shows the efficiency of the resource utilization when the model runs the concurrent radiation scheme. The parallel efficiency of the concurrent radiation scheme is defined as the ratio of the methodical speedup S to the relative number of allocated resources R, as shown below.
Ratio of allocated resources = R = (resources assigned to the concurrent radiation scheme) (resources assigned to the classical radiation scheme) Parallel efficiency = E = S R = S 2 (5) Figure 6. The scaling curves of ECHAM6 using the classical and concurrent radiation schemes. Figure 7. The red curve shows the methodical speedup of ECHAM6 using the concurrent radiation scheme. The blue curve shows the methodical speedup that the model would achieve asymptotically.
As is visible in the figure, the model achieves a parallel efficiency of 80 % or more across the scaling curve. However, attaining the maximum parallel efficiency requires an optimal distribution of workload among MPI processes. A close investigation reveals a load imbalance between the concurrent components inside the model. This problem starts appearing when the radiation and other atmospheric calculations are configured to use identical domain decomposition and allocate an equal number of MPI processes. The measured parallel efficiency in Fig. 9 suggests that the MPI processes assigned to the radiation calculations experience an idle time during the course of simulation. Figure 10 depicts the length of time that the concurrent components have to remain idle while waiting for the slower component to catch up and become ready to exchange the radiation results. The red and blue curves show the respective idle times that the radiation component (denoted by RAD) and other atmospheric processes handled by the main model (denoted by ATM) experience. The idle times appear when the radiative transfer is resolved faster or slower than the other atmospheric calculations. At lower numbers of MPI processes, as shown in Fig. 10, calculating the radiative transfer takes longer and thus forces ATM to wait a relatively long time in each radiation time step for feedback from RAD. At higher numbers of MPI processes, however, RAD scales better and finishes the calculations faster than ATM. It therefore has to wait for the arrival of the next radiation time step so that the radiation results can be transferred to the main model. It should be noted that Fig. 10 shows the maximum length of time that an MPI process (assigned to RAD or ATM) has to wait for its peer MPI process (assigned to the other component) until it catches up. From Fig. 10, it can reasonably be inferred that the total idle time experienced by ATM and RAD becomes minimum at 384 MPI processes. This can also explain why the parallel efficiency in Fig. 9 has an extremum at this point. It is also apparent in Fig. 10 that the radiation component experiences a longer idle time as the number of MPI processes increases. This behaviour accounts for the higher scalability of the radiation component, which was already reflected in Fig. 3.  The load imbalance between the concurrent processes is directly affected by the number of MPI processes assigned to the radiation component as well as the frequency at which the radiative transfer is calculated. Figure 11 schematically illustrates a contrived configuration in which the radiation component is forced to remain idle almost half of the total runtime. To remove such an idle time, Fig. 12 provides an inspiring example. It takes advantage of the higher scalability of the radiation component and reduces the radiation time step to half. This new configuration eliminates the idle time from the MPI processes assigned to the radiation component and hence removes the load imbalance between the concurrent components successfully. As a consequence, the resource utilization of the model improves significantly and attains a parallel efficiency close to 100 % without affecting the achieved speedup. This example presents a viable solution to decrease the gap between t rad and t atm in pursuit of a more consistent atmospheric model.
The concurrent radiation scheme, however, puts forward a general solution to remove the load imbalance between the radiation component and the main model. This solution pro-vides a remedy for the idle time imposed on the main model at some configurations (such as 48, 96, 192, 288, or 384 MPI processes as shown in Fig. 10) which exhibit a suboptimal resource efficiency due to the slow calculation of radiative transfer. In this approach, the radiation component is enabled to adopt finer domain decomposition and allocates a higher number of resources (in comparison to the main model) in order to catch up with the fast calculation of other atmospheric processes. By the same token, Fig. 13 suggests a configuration in which the radiation component adopts coarser domain decomposition and allocates a lower number of MPI processes compared to the main model. This arrangement is also a remedy to remove the load imbalance at the configurations (such as 576, 768, and 1024 MPI processes as shown in Fig. 10) in which the radiation component experiences a long idle time due to the slow calculation of other atmospheric processes.
In addition, the concurrent radiation scheme offers an opportunity for coupling the radiation component to the other atmospheric processes at every normal time step (i.e. t rad = t atm ). This feature can ultimately bring the model to physical consistency between the radiative and physicochemical atmospheric states, although probably with a negligible impact on the model's accuracy. It is notable that the current implementation of the concurrent radiation scheme in ECHAM6 already provides the technical support for the adoption of finer or coarser domain decomposition for the radiation calculations. In particular, the YAXT library simplifies the data exchange between the concurrent components with disparate domain decomposition. The scientific viability of these schemes, however, requires further investigations, and the results will be presented in a follow-up paper.     We evaluate the concurrent radiation scheme in ECHAM-6.3.05 at the CR and LR configurations, hereafter termed as CR CRR and LR CRR , which differ in their horizontal grid spacings. Both CR CRR and LR CRR share the vertical resolution of 47 levels. The classical radiation scheme has been tuned to optimize the simulated climate (Mauritsen et al., 2019). However, the concurrent radiation scheme at these two configurations has not been individually tuned. The parameters of the convection scheme, e.g. the convection conversion rate for cloud water to rain in the concurrent radiation scheme, are the same as in the classical scheme. The evaluation and documentation of the concurrent radiation scheme in ECHAM-6.3 are based on the Atmospheric Model Intercomparison Project (AMIP) historical experiments. Experiments were performed according to the AMIP protocol (Eyring et al., 2016). The historical forcings include the emissions of short-lived species and long-lived greenhouse gases, solar forcing, stratospheric aerosol forcing, monthly sea surface temperatures, and sea ice concentrations. Experiments span from 1960 to 2013. The monthly output from 1980 to 2013 is taken for analysis. AMIP experiments performed with the classical ECHAM-6.3.05, as well as the observations and reanalysis, serves as a reference. Observational and reanalysis data used in this paper are listed in Table 1.

Mean state
We first evaluate the simulated mean climate by the concurrent radiation scheme. CRUTEM4 observations are interpolated to the model grids for the comparison against simulated surface air temperature (SAT). Across all configurations, global annual bias in SAT over land exhibits similar spatial patterns (Figs. 14a, c and S1a, c in the Supplement). A general warm bias occurs over eastern Siberia, central Asia, the Tibetan Plateau, North America, and central Australia, while sporadic cold biases exist in south Asia, Africa, and northern South America. For a rough assessment, global root mean square error (RMSE) is calculated for SAT. Horizontal resolution exhibits a larger impact on the SAT bias than the radiation scheme. Both radiation schemes show a smaller SAT bias in LR relative to CR (Figs. 14a, c and S1a, c). The concurrent radiation scheme illustrates a somewhat ambiguous reduction in the annual SAT bias, although it is insignificant at the 95 % confidence interval (Fig. 14e).
All experiments suffer from severe precipitation bias over the tropics compared to the GPCP satellite measurements (Figs. 14b,d and S1b,d). Specifically, excessive precipitation is simulated over the tropical Indian Ocean, western tropical Pacific, and western tropical Atlantic Ocean, while deficient precipitation occurs over the east tropical Pacific Ocean, South America, and the eastern tropical Atlantic Ocean. Consistent with the response of SAT bias, the global RMSE decreases along with a refinement of horizontal resolution, whereas the concurrent radiation scheme leads to larger bias in the precipitation. Particularly, the precipitation bias over the equatorial region of South America and western equatorial Africa increases in the concurrent radiation (Figs. 14f and S1f), which is likely linked to the intensification of local SAT bias (Figs. 14e and S1e). Clouds play a complex and crucial role in the Earth's radiation budget, which significantly affects the simulated surface climate. To understand the biases in SAT, we evaluate the errors in cloud feedback by estimating cloud radiative forcing (CRF) following (Cess et al., 1990): where F all-sky refers to radiative flux at the top of the atmosphere (TOA) and F clear-sky represents the radiative flux assuming the absence of clouds. CRF is positive when clouds warm the surface atmosphere and vice versa. The radiative effect of clouds consists of two competing components in the radiation budget: warming impact on the surface through the emission of longwave (LW) radiation and cooling impact by shading for the shortwave (SW) radiation. The sign and spatial structure of CRF bias are largely determined by the SW CRF bias, which is partially compensated for by the LW CRF bias (Figs. 15 and S2). CRF exhibits a larger bias in CR relative to LR with both radiation schemes (Figs. 15a,b,d,e,and S2a,b,d,e). In CR CRR and CR SEQ , strong negative SW CRF biases exist over regions of upwelling and the western coast of Australia, whereas strong positive CRF biases occur over the Arctic, the Southern Ocean, the eastern tropical Indian Ocean, the northern equatorial Pacific, and Atlantic Ocean. A refinement of horizontal resolution largely alleviates negative CRF biases over the upwelling regions, whereas the CRF bias over other regions barely changes. Over the tropics, the concurrent radiation scheme shows a slight increase in the bias compared to the classical radiation scheme (Figs. 15g,h and S2g,h). Increased CRF bias in the concurrent radiation scheme is largely attributed to the response of SW CRF errors (Figs. 15g and S2g), with a negligible contribution from the errors in LW CRF (Figs. 15h and S2h). Across all experiments, there are widespread discrepancies between the spatial distribution of biases in the CRF and SAT except over the Tibetan Plateau, where a positive CRF bias agrees well with the warm SAT bias. To further explore such inconsistency, biases in the net SW radiation at TOA are estimated (Figs. 15c, f and S2c, f). SAT biases over North America, the Tibetan Plateau, and central Australia are associated with the excessive net SW radiation at TOA, resulting from the negative biases of the surface albedo (not shown). In contrast, warm SAT biases over eastern Siberia and tropical South America exist along with deficient SW radiation biases, which suggests a dynamical cause for the biased SAT. A misrepresentation of the atmospheric circulation in ECHAM-6.3.05 may be responsible for this discrepancy. The concurrent radiation Table 1. Observation and reanalysis data for model evaluation.

Variable
Validation scheme exhibits strong negative biases in the net SW radiation at TOA in the tropics relative to the classical radiation scheme (Figs. 15i and S2i). This implies that model parameters concerning the cloud formation should be carefully adjusted for the concurrent radiation scheme. Specifically, the relative humidity threshold for cloud formation in the upper troposphere and the lowest model level should be changed to improve the match with observational records (Mauritsen and Roeckner, 2020).
The zonal-mean temperature biases in the troposphere are smaller than in the stratosphere for all experiments (Figs. 16 and S3). In the lower troposphere (between 850 and 1000 hPa), a cold bias amounting to −4 K occurs over the Antarctic for all experiments. Thermal biases in the midtroposphere are relatively small. Moderate warm biases up to 2 K extend from the tropics to the Arctic. In the lower stratosphere (between 250 and 100 hPa), all experiments suffer from prominent warm biases over the tropics and mid- latitudes, as well as severe cold biases over the high latitudes in both hemispheres. Such biases are significantly reduced in LR ( Fig. S3a and c) relative to CR (Fig. 16a and c) experiments, consistent with the notion by Stevens et al. (2013) that zonal-mean temperature bias can be significantly reduced by enhancing the horizontal resolution. The concurrent radiation scheme barely affects the biases in zonal-mean temperature in CR simulations, yet it slightly reduces the biases in the lower stratosphere between 40 and 90 • S (Figs. 16e and S3e).
The patterns of zonal-mean westerly wind biases are common to all configurations (Figs. 16b,d and S3b,d) and reflect the meridional structure of temperature biases (Figs. 16a,c and S3a,c). In CR CRR and CR SEQ , temperature biases drive a northward shift of westerly winds between 30 and 60 • S. LR CRR and LR SEQ exhibit more alleviated westerly wind biases than their CR counterpart due to reduced temperature bias. In the tropics, westerly wind biases are characterized by overly strong easterly biases in the low and mid-troposphere and large westerly biases in the upper troposphere. The concurrent radiation scheme intensifies the easterly wind biases over the tropics in CR and LR (Figs. 16f and S3f). As suggested by Stevens et al. (2013), the tropical bias pattern is an indication of excessive heating associated with the deep convection.

Climate variability
To explore the simulated climate variability by the concurrent radiation, we present an analysis of the El Niño-Southern Oscillation (ENSO) teleconnection and interannual variability in the extratropics in the form of the Northern and Southern Annular Mode.

ENSO feedbacks and teleconnections
The El Niño-Southern Oscillation (ENSO) is the leading mode of interannual variability in the tropical Pacific. ENSO is mainly characterized by variations of sea surface temperature (SST) in the eastern and central equatorial Pacific. Multiple negative and positive coupled atmosphereocean processes that either favour or suppress the growth of SST anomalies govern the ENSO behaviour (Philander, 1989;Neelin, 1998;Jin et al., 2006). Equatorial Pacific SST anomalies associated with ENSO can affect the tropical convection and result in zonal shifts of the Walker circulation (Philander, 1989;Bayr et al., 2014). Further, the changes in convection stimulate Rossby waves that propagates to the middle and high latitudes through the atmospheric bridge. Many studies (Guilyardi et al., 2004;Guilyardi, 2006;Guilyardi et al., 2009) have shown that the atmospheric model dominates the simulated ENSO properties in coupled climate models. Therefore, it is crucial to evaluate the ENSO teleconnections that project the influence of ENSO globally.
The covariance between the global DJF (December-January-February) precipitation anomalies and the DJF precipitation anomalies over the Niño4 region is shown in Figs. 17 and S4. CR SEQ and LR SEQ typically capture the pattern over the tropical Pacific, tropical Atlantic, and North America (Figs. 17c and S4c) as indicated by the GPCP ob-servations. The concurrent radiation scheme exhibits similar teleconnection patterns as the classical radiation scheme, yet it influences the magnitude of the response to ENSO. CR CRR underestimates the magnitude of atmospheric teleconnection over the Maritime Continent (Fig. 17b), whereas LR CRR overestimates the teleconnection over this region (Fig. S4b). Additionally, CR CRR and LR CRR exhibit an artificially positive response in the southern tropical Indian Ocean relative to their counterpart using the classical radiation scheme.
Covariance of DJF geopotential height anomalies at 500 hPa (Z500) and normalized DJF precipitation anomalies over the Niño4 region are calculated to investigate the diabatic forcing of the tropical Pacific on the boreal winter atmospheric circulation in the Northern Hemisphere (Figs. 17 and S4). ERA-Interim depicts positive covariance over Canada and southern Greenland and negative relations over the northeast Pacific, north Atlantic, and Siberia. CR SEQ is able to simulate the pattern over the northeast Pacific, Canada, and Greenland. However, it fails to capture the component over Siberia, and the signal over the North Atlantic is shifted northeastward. Increasing the horizontal resolution to LR shows no improvement for the teleconnection pattern with the classical radiation scheme. CR CCR retains the main features of ENSO teleconnection as shown by CR SEQ , whereas LR CCR exhibits a weak response of atmospheric circulation to the diabatic forcing over the tropical Pacific compared to ERA-Interim (Fig. 17d). This implies that a tuning for the convection scheme in the concurrent radiation scheme is required.

Northern Annular Mode
The Northern and Southern Hemisphere Annular Mode (NAM and SAM, also known as the Arctic and Antarctic Oscillation) are the leading modes of variability of the extratropical circulation in both hemispheres. Both annular modes explain the month-to-month and year-to-year (especially the cold season) variability of the atmospheric circulation, which exhibit pronounced impacts on the climate in the mid-latitudes and the polar region (McAfee and Russell, 2008;Kidston et al., 2009). The NAM is defined as the leading empirical orthogonal function (EOF) of the DJF sea level pressure (SLP) in the Northern Hemisphere. As shown by the ERA-Interim reanalysis data, the NAM is characterized by zonally symmetric structures (Fig. 18a). There are large and positive loadings over the polar cap region, surrounded by zonally ring-shaped negative loadings centred over the northeast Pacific and the North Atlantic. The leading mode explains 30.4 % of total variance. All experiments show larger explained variance of the NAM than ERA-Interim (Figs. 18 and S5). Horizontal resolution does not show a strong influence on the NAM pattern, except that CR CRR and CR SEQ exhibit slightly higher explained variance relative to their LR counterparts. The concurrent radiation scheme captures the centres of action better than the classical scheme, especially over the North Atlantic sector. The centre of action over the North Atlantic is shifted eastward in LR SEQ compared to ERA-Interim and LR CRR .
The atmospheric teleconnection associated with the NAM is calculated by regressing the DJF SAT anomalies on the time series of principal components (PC1) corresponding to the leading EOF (Figs. 18b,d,f and S5b,d,f). Associated with the positive phase of the NAM, warm SAT anomalies occur over Europe, Siberia, and the western United States, while cold temperature anomalies exist over western Alaska, far eastern Russia, Greenland, and eastern Canada (Fig. 18b). CR SEQ significantly underestimates the response over Greenland, Baffin Bay, and Siberia but overestimates the response over western America. Additionally, an artificially negative response occurs over western Canada (Fig. 18b). CR CCR shares similarities with CR SEQ , with minor differences in the magnitude of the response, which is also seen in the LR experiments. The atmospheric teleconnection simulated by two LR experiments generally agrees better with ERA-Interim than the CR simulations, which reproduce the widespread negative SAT anomalies over Europe and Siberia ( Fig. S5d  and f).

Southern Annular Mode
Similarly, the JJA (June-July-August) SAM is defined as the leading EOF of JJA SLP anomalies in the Southern Hemisphere. Similar to the NAM, the SAM depicts large loadings over the Antarctic and three surrounding centres of ac-tion over the southern Pacific, Indian, and Atlantic Ocean at approximately 45 • S (Fig. 19a). CR CCR and CR SEQ simulate slightly westward-shifted centres of action over the southern Indian Ocean (Fig. 19c and e). The explained variances by the leading EOFs are overestimated in these two experiments. A refinement of horizontal resolution improves the simulation of the SAM pattern. Among all experiments, LR CCR shows the best agreement with ERA-Interim (Figs. 19a and S6c). This may be linked to small biases in the mean atmospheric circulation (Fig. S3b).
The SAT response to the positive phase of SAM is computed by regressing the SAT anomalies on the PC1 of SLP in the Southern Hemisphere (Figs. 19 and S6). There are warm temperature anomalies over most of the Antarctic and Ross Sea, while cold anomalies exist over the Weddell. LR CCR and LR SEQ underestimate the SAT response over the Weddell Sea, whereas CR CCR and CR SEQ fail to reproduce the SAT response over both the Ross Sea and Weddell Sea (Figs. 19d,f and S6d,f). Overall, an increase in the horizontal resolution improves the simulated climate variability and its atmospheric teleconnections in both the Northern and Southern Hemisphere; however, the concurrent radiation scheme barely changes these features.

Conclusions
This paper presents the implementation of the concurrent radiation scheme in the atmospheric model ECHAM6 and demonstrates its impact on the performance and stability of the model. A detailed analysis shows that the radiative transfer is a relatively expensive component of the model, especially for the CR resolution and LR resolution, which are also used in palaeoclimate simulation. Although the component exhibits a higher scalability profile, the study reveals that it cannot freely scale to its full potential due to the sequential architecture of the classical ECHAM6. The concurrent radiation scheme, on the other hand, organizes the radiation component in parallel with the rest of the atmospheric physics and dynamics and hides its long computation time. The experiments explicitly asserted a noticeable model speedup across the scaling curve with a strong dependency on the relative computational cost of the radiative transfer to the other atmospheric processes. Unlike the classical scheme, this approach enables the radiation component to adopt any viable domain decomposition arbitrarily, which may differ from the main model's configuration. The component can accordingly follow a different scaling scheme and benefit from the higher scalability of radiation calculations. This salient feature can eventually decrease the discrepancy between the radiation time step t rad and normal atmospheric time step t atm with the objective of creating more physical consistency in the model.
The simulated mean climate and internal climate variability by the classical and concurrent radiation scheme have been evaluated. A suite of AMIP experiments has been performed on CR and LR configurations using the two radiation schemes. In terms of long-term mean state biases, e.g. biases in land surface temperature, precipitation, cloud radiative forcings, and zonal-mean temperature and wind, a refinement of horizontal resolution exhibits better agreement with the observations. The concurrent radiation scheme generally yields similar results as the classical radiation, except some minor improvements in the mean atmospheric circulation in the Southern Hemisphere. Regarding the climate variability and associated atmospheric teleconnections, LR simulations agree better with ERA-Interim than their CR counterparts. The concurrent radiation scheme on LR improves the atmospheric teleconnection to the SAM, which is likely linked to the alleviated bias in the mean circulation. On the other hand, the classical radiation scheme on LR shows better atmospheric teleconnection to ENSO than the concurrent radiation. One possible reason is that the classical radiation scheme in ECHAM6 has been properly tuned towards the observations. To conclude, the concurrent radiation scheme presented in this study substantially improves the scalability of ECHAM6, with the major features in the mean climate and internal variability retained.
Code and data availability. The source code of the atmospheric model ECHAM6 adopted for the project PalMod (for the concurrent execution of radiative transfer) and used for generating the plots presented in this paper is available under https://doi.org/10.35089/WDCC/SC_PalMod_ECHAM6 (MPI-M and DKRZ, 2021). The output data for generating the plots presented in this paper are available under https://doi.org/10.5281/zenodo.4589140 (Heidari et al., 2021).
Author contributions. MRH wrote the Abstract and Sects. 1, 2, 3, and 4, and he contributed to the Conclusions. He also conducted the performance analysis and performed the required AMIP experiments. ZS wrote Sect. 5 and contributed to the Conclusions. He performed the AMIP experiments required in Sect. 5 and analysed the output. ED contributed to the AMIP experiment required for the performance analysis. JB wrote the excerpt on the YAXT library and provided detailed advice on the performance analysis. HB had the major role in funding acquisition and supervision of the project leading to this publication. He also provided detailed advice on the direction of the paper and the performance analysis. All co-authors contributed to the research and the text in this paper.
Competing interests. The contact author has declared that neither they nor their co-authors have any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.