Reply on RC3

“...Our experiments show that ECHAM6 can achieve a speedup over 1.9x using the concurrent radiation scheme. By performing a suite of stand-alone atmospheric experiments, we evaluate the influence of the concurrent radiation scheme on the scientific results. The simulated mean climate and internal climate variability by the concurrent radiation generally agree well with the classical radiation scheme, with minor improvements in the mean atmospheric circulation in the Southern Hemisphere and atmospheric teleconnections associated with the Southern Annular Mode. This empirical study serves as a successful example ...”

"Radiative transfer is one of the most expensive parts for coarse and low-resolution atmospheric simulations."

(3.) Referee: L21: It is an overstatement to say that the shortwave and longwave are *widely* separated; in fact there is around 12 W m-2 of solar energy at wavelengths longer than 4 microns, which is traditionally regarded as the longwave domain.
Authors: L21 was modified to read: "Energy transfer in the atmosphere involves electromagnetic radiation that can be separated into short and long wave parts."

Authors: Thanks to the referee's suggestions, we add the following paragraph after L40 to cite the other studies: "Resolving radiation transfer on coarser time and spatial resolutions can however lead to errors in weather and climate simulations. Hogan and Hirahara (2016) examine the biases that occur due to discrete sampling of the solar zenith angle in models which calculate radiation every 3 h and propose a careful treatment of the cosine of the solar zenith angle to mitigate the negative impacts. A report by Hogan and Bozzo (2015) describes a computationally efficient solution to the problems raised in models that call the radiation scheme infrequently in time or on a reduced spatial grid. They suggest updating the surface longwave and shortwave fluxes at every time step and grid point according to the local skin temperature and albedo. A follow-up study by Hogan and Bozzo (2018) introduces a flexible new radiation scheme (ecRAD) for the ECMWF model which is around 41% faster than the previous package. The report shows some improvements in the skill of weather forecasts by calling the new radiation scheme more frequently for the same overall computational cost."

(5.) Referee: L46: Some mention must be made here of the potential downside of radiation in parallel, which is that the fluxes and heating rates fed to the rest of the model will be "older" by around one radiation timestep than in the traditional approach of radiation in series. The impact on forecast skill was not really studied by Mozdzynski & Morcrette, but could be important.
In the ECHAM context, the classical configuration involves radiation fields computed at a particular time being used in the rest of the model for the following 0-2 hours (with some corrections for surface temperature and sun position, but not for clouds). In the concurrent scheme, the radiation fields are not 0-2 hours but 2-4 hours old. The impact on model fields is something you address later in this paper, but it needs to be mentioned here in the introduction as an important consideration. One physical process that benefits from a tighter coupling in time with radiation is boundary-layer clouds, particularly stratocumulus: when they form they are maintained by longwave cooling at cloud top. This could have been one of the reasons why Hogan & Bozzo (2018, Fig. 6) found that calling radiation more frequently led to more skillful forecasts of near-surface temperature *and* low cloud cover.
Authors: Following the referee's recommendation, we revise the paragraph starting at L80 to convey this point and augment it with the example and reasoning suggested by the referee: "This paper, on the other hand, presents a report on the concurrent radiation scheme applied to the atmospheric model ECHAM6 and provides a thorough analysis of the performance and accuracy of the model. Calculating radiative transfer in parallel with other atmospheric processes can potentially affect the model's accuracy since the radiation fields will always lag one more radiation time step behind in comparison with the classical scheme. This lag may have negative impacts on physical processes that benefit from a tighter coupling in time with radiation. Boundary-layer clouds, particularly stratocumulus, are a good example. They are maintained by longwave cooling at cloud tops once they are formed. This could explain why Hogan and Bozzo (2018) found that calling radiation more frequently leads to more skillful forecasts of near-surface temperature and low cloud cover."
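The lag discussed in this exchange can be illustrated with a minimal sketch. It assumes a 2-hour radiation time step, as implied by the referee's "0-2 hours" versus "2-4 hours" example, and it assumes that the concurrent scheme delivers its result exactly one radiation time step later than the classical one. The function name and parameters are hypothetical and are not taken from ECHAM6.

```python
def radiation_field_age(t_hours, dt_rad_hours=2.0, concurrent=False):
    """Age in hours of the radiation fields used by the model at time t_hours.

    Assumption: radiation is computed at multiples of dt_rad_hours.
    Classical (serial) case: the result is available immediately.
    Concurrent case: the result arrives one radiation time step later,
    so every field the model sees is dt_rad_hours older.
    """
    since_last_call = t_hours % dt_rad_hours
    extra_lag = dt_rad_hours if concurrent else 0.0
    return since_last_call + extra_lag


# Compare both schemes at a few model times (hours since start).
for t in (0.0, 0.5, 1.5, 2.0, 3.5):
    classical_age = radiation_field_age(t)
    concurrent_age = radiation_field_age(t, concurrent=True)
    print(f"t={t:3.1f} h  classical age={classical_age:3.1f} h  "
          f"concurrent age={concurrent_age:3.1f} h")
```

Under these assumptions the classical ages stay within 0-2 hours and the concurrent ages within 2-4 hours, which is the range the referee describes.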
(6.) Referee: Fig. 1 reproduces Fig. 1 of Giorgetta et al. (2013), except for the addition of a small radiation box - in the interests of shortening the paper it should be removed. Fig. 5 is a small change that doesn't really illustrate the concept of concurrent radiation - all you need is Figs. 2 and 6, which could be combined into a single figure with two panels.

Referee: I understand that the red line in Fig. 9 should be the ratio of the red and blue lines in Fig. 8, but it doesn't look like that in that it is always larger than 1.6, when Fig. 8 shows that concurrent radiation is sometimes slower than classical radiation. Is this because the X axis is different, i.e. in one it is the total number of MPI processes and in the other it is the number used for just one part of the model? Surely it should be the total number of MPI processes allocated in both instances, but perhaps I misunderstand something. This needs to be clarified, and a fair comparison shown.

Authors:
The curves in Fig 9 show the methodical speedup of the model using the concurrent RAD scheme. The methodical speedup means the improved runtime of the model by making use of the concurrent radiation scheme, in contrast to the classical definition of speedup, where additional resources are used for the same computation. The methodical speedup is therefore the ratio of the overall performance of the model using the concurrent radiation scheme divided by the performance of the model using the classical radiation scheme. The X axis shows the number of MPI processes assigned to the concurrent RAD scheme. Half of the resources (shown by X axis) are assigned to the model when it adopts the classical scheme. For each point on the curves, we do the following.
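The ratio defined above can be sketched in a few lines of Python. This is an illustration only: the function name `methodical_speedup` and the SYPD values are hypothetical placeholders, not measurements from the experiments. The one assumption carried over from the text is that the classical run uses half the MPI processes of the concurrent run.

```python
def methodical_speedup(sypd_concurrent: float, sypd_classical: float) -> float:
    """Ratio of model throughputs in simulated years per day (SYPD).

    sypd_concurrent: throughput with the concurrent radiation scheme,
                     measured on the process count shown on the X axis.
    sypd_classical:  throughput with the classical radiation scheme,
                     measured on HALF of that process count.
    """
    if sypd_classical <= 0:
        raise ValueError("classical SYPD must be positive")
    return sypd_concurrent / sypd_classical


# Hypothetical SYPD values for illustration only (not results from the paper).
# Key:   X-axis value, i.e. MPI processes used by the concurrent run.
# Value: (SYPD of the concurrent run, SYPD of the classical run on half the processes).
hypothetical_runs = {
    96: (38.0, 20.0),
    192: (68.0, 40.0),
}

for n_procs, (sypd_conc, sypd_clas) in sorted(hypothetical_runs.items()):
    ratio = methodical_speedup(sypd_conc, sypd_clas)
    print(f"{n_procs} processes (classical on {n_procs // 2}): "
          f"methodical speedup {ratio:.2f}")
```

Because the two runs use different process counts, a value above 1 here measures the benefit of spending the extra processes on concurrent radiation, not a like-for-like runtime comparison at equal resources.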
The model is configured with the concurrent RAD scheme and allocated the number of resources shown on the X-axis. We measure the performance of the model (simulated years per day, SYPD) as SYPD_concurrent. Then, the model is configured with the classical RAD scheme and allocated HALF of the number of resources shown on the X-axis. We measure the performance of the model (simulated years per day, SYPD) as SYPD_classical.

Methodical speedup = SYPD_concurrent / SYPD_classical

We modify the text at L203 as follows (note that Fig. 9 is renumbered): "The red curve in Figure 7 displays the methodical speedup of the model using the concurrent radiation scheme. Here, the methodical speedup means the improved runtime of the model by making use of the proposed concurrency, in contrast to the classical definition of speedup, where additional resources are used for the same computation. The methodical speedup is therefore defined as the ratio of the overall performance of the model using the concurrent radiation scheme (using 2N resources) divided by the performance of the model using the classical radiation scheme (using N resources). On this account, for each point on the speedup curve(s), the number of resources assigned to the model using the classical radiation scheme is half the resources allocated to the model using the concurrent radiation scheme. Hence, the X-axis indicates the total number of MPI processes allocated to the model only when the concurrent radiation scheme is used. The model allocates half of the MPI processes shown on the X-axis when it adopts the classical radiation scheme. The red curve shows that…"

"The concurrent radiation scheme, however, puts forward a general solution to remove the load imbalance between the radiation component and the main model.
This solution provides a remedy for the idle time imposed on the main model in some configurations (such as 48, 96, 192, 288 or 384 MPI processes as shown in Figure 10) which exhibit a suboptimal resource efficiency due to the slow calculation of radiative transfer. In this approach, the radiation component adopts a finer domain decomposition and allocates a higher number of resources than the main model in order to catch up with the fast calculation of other atmospheric processes. By the same token, Figure 13 suggests a configuration in which the radiation component adopts a coarser domain decomposition and allocates a lower number of MPI processes than the main model. This arrangement also removes the load imbalance in configurations (such as 576, 768 and 1024 MPI processes as shown in Figure 10) in which the radiation component experiences a long idle time due to the slow calculation of other atmospheric processes.
In addition, the concurrent radiation scheme offers an opportunity for coupling the radiation component to the other atmospheric processes at every normal time step (i.e. ∆t_rad = ∆t_atm). This feature can ultimately bring the model to physical consistency between the radiative and physicochemical atmospheric states, albeit probably with a negligible impact on the model's accuracy. It is notable that the current implementation of the concurrent radiation scheme in ECHAM6 already provides the technical support for the adoption of finer or coarser domain decomposition for the radiation calculations. In particular, the YAXT library simplifies the data exchange between the concurrent components with disparate domain decomposition. The scientific viability of these schemes, however, requires further investigations and the results will be presented in a follow-up paper."

(9.) Referee: In the evaluation of the concurrent radiation scheme (Figs. 17-30) for a particular variable, the bias is shown for the concurrent and classical model versions, and the reader is expected to try to pick out the differences by eye, which is not really possible. Far more useful would be to show the bias for just one of these versions, and then the difference between concurrent and classical, plus, crucially, some stippling to show where the changes are statistically significant. A particular area of interest would be in the marine stratocumulus regions where radiation and cloud processes are coupled on quite a fast timescale. From what I can see in the figures shown, there appears to be no significant effect of concurrent radiation on any of these variables (except possibly in Fig. 22), but it would really help to show difference plots to be sure.

Authors:
The differences between the concurrent and the classical radiation experiments are added to the figures as suggested, along with hatching indicating significance (Figures R1 and R2). The referee is correct that the concurrent radiation does not exhibit much significant effect on the surface temperature or precipitation, nor on the zonal mean temperature and zonal wind.

Figure captions (partially recovered): [...] (d) indicate the climatological zonal mean zonal wind for ERA-Interim. Differences in (e) SAT and (f) precipitation between the concurrent and classical radiation experiments. Hatching indicates that the differences are significant at the 95% confidence level according to Student's t-test.
(10.) Referee: Figs. 19-21: I don't see the need to show the total cloud radiative effect in addition to the longwave and shortwave components, since the latter two fingerprint specific cloud errors in models, whereas the total is simply a messy mixture of the two. Therefore I suggest Fig. 19 is removed. The captions for Figs. 20 and 21 are misleading as they should say they are the bias in cloud radiative forcing rather than in fluxes.