WRF-ML v1.0: A Bridge between WRF v4.3 and Machine Learning Parameterizations and its Application to Atmospheric Radiative Transfer

In numerical weather prediction (NWP) models, physical parameterization schemes are the most computationally expensive components, despite being greatly simplified. In the past few years, an increasing number of studies have demonstrated that machine learning (ML) parameterizations of subgrid physics have the potential to accelerate and even outperform conventional physics-based schemes. However, as ML models are commonly implemented using ML libraries written in Python, very few ML-based parameterizations have been successfully integrated into NWP models due to the difficulty of embedding Python functions into Fortran-based NWP models. To address this issue, we developed a coupler that allows ML-based parameterizations to be coupled with a widely used NWP model, i.e., the Weather Research and Forecasting (WRF) model. Similar to the WRF I/O methodologies, the coupler provides the options to run the ML model inference either on processors reserved exclusively for inference or on the same processors used for the WRF calculations. In addition, to demonstrate the effectiveness of the coupler, ML-based radiation emulators are trained and successfully coupled with the WRF model.


1 Introduction
Numerical weather prediction (NWP) models have become the most important tools for operational weather forecasting and have many applications in different domains, including energy, traffic, logistics, and planning (Coiffier, 2011; Pu and Kalnay, 2018). Physics-based parameterizations are essential in NWP models as they approximate processes that are either too small in scale to be explicitly resolved at the model grid resolution or too complex and not fully understood. Although many simplifications are made in parameterizations to reduce the computational cost, the calculations of physical parameterizations still account for a significant portion of the total computational time of NWP models (Wang et al., 2019). Also, the parameterization schemes often contain uncertain parameters estimated from more faithful high-resolution simulations with statistical models (Stensrud, 2013). For example, the parameters in radiative transfer parameterizations can be fitted to the output of the most accurate line-by-line models (Clough et al., 2005), and parameters in cloud turbulence parameterizations can be inferred from large-eddy simulations (LES) (Mason, 1989). However, other parameters can only be learned from observations, as the related governing equations are unknown (Schneider et al., 2017).

One alternative is to train machine learning (ML) models to replace the traditional physics-based parameterization schemes.
ML-based parameterizations have the potential to outperform traditional parameterizations with higher computational efficiency. For example, the radiative transfer parameterization is one of the most computationally expensive components of NWP models, and it has the longest history of ML-based emulator development. Chevallier et al. (1998, 2000) developed an NN-based longwave radiation parameterization (NeuroFlux), which has been used operationally in the European Centre for Medium-Range Weather Forecasts (ECMWF) four-dimensional variational data assimilation system. The NeuroFlux is seven times faster than the previous scheme with comparable accuracy (Janisková et al., 2002). Recently, Song and Roh (2021) developed and used neural network (NN) based radiation emulators in the operational weather forecasting model of the Korea Meteorological Administration. They demonstrated that calling NN-based emulators frequently can improve real-time weather forecasting in terms of accuracy and speed compared to the original method, which calls the original radiation parameterization infrequently.
Similarly, ML-based emulators have been developed for other parameterization schemes in NWP models. Rasp et al. (2018) successfully developed and coupled an NN-based convection parameterization into an aquaplanet general circulation model (GCM). They showed that the NN-based parameterization was able to perform multi-year simulations whose results were close to those of the original simulations. For planetary boundary layer (PBL) parameterization, Wang et al. (2019) used the inputs and outputs of the Yonsei University (YSU) PBL scheme of the Weather Research and Forecasting (WRF) model to develop NN-based parameterizations. They showed that the NN-based parameterization successfully simulated the vertical profiles of velocities, temperature, and water vapor within the PBL over the entire diurnal cycle. Urban land surface models (ULSMs) are reasonably fast, but none of them is best at predicting all the main surface fluxes (Grimmond et al., 2010, 2011). One solution is to run an ensemble of ULSMs coupled to the NWP model to improve predictions, which is technically difficult to implement and would require more computational time due to running multiple ULSMs simultaneously. Instead, Meyer et al. (2022a) developed an urban NN (UNN) to emulate the ensemble mean of 22 ULSMs. They demonstrated that the WRF model coupled with the UNN is more accurate than WRF with a single well-known ULSM. Grundner et al. (2021) demonstrated the potential of an NN-based cloud cover parameterization to accurately reproduce the sub-grid-scale cloud fraction and thereby improve low-resolution climate model predictions.
In many cases, ML-based parameterization schemes are only evaluated in offline settings due to the practical challenge of coupling ML-based emulators with NWP models. However, the offline performance of ML-based emulators does not necessarily reflect their online performance when coupled with the other components of NWP models. ML-based parameterizations are often trained using high-level programming languages like Python and easy-to-use ML libraries such as PyTorch (Paszke et al., 2017) and Keras (Gulli and Pal, 2017). However, NWP codes are mainly written in Fortran, making it difficult to integrate them directly with ML-based emulators. Researchers have used multiple approaches to circumvent these issues. Krasnopolsky (2014) developed single-layer NN software for developing NN-based radiation emulators in NWP models (Belochitski and Krasnopolsky, 2021; Krasnopolsky et al., 2010; Song and Roh, 2021). Some researchers wrote Fortran modules that reproduce NN architectures using matrix-matrix multiplication with the weights saved from offline training (Hatfield et al., 2021; Rasp et al., 2018); this approach only works well for simple NN architectures such as fully connected (FC) NNs and becomes increasingly difficult to implement for more complex NN structures. Ott et al. (2020) produced an open-source software library, the Fortran-Keras Bridge (FKB), which enables users to access many features of the Keras API in Fortran and implement NNs in Fortran code. However, the FKB only supports NNs with FC (dense) layers. Therefore, researchers cannot fully use the most advanced NN structures, such as convolution, self-attention, and variational autoencoder structures, to build powerful ML-based emulators for online applications. Wang et al. (2022) proposed an NN-GCM coupler to allow Python-based NNs to communicate with a Fortran-based GCM. Although this coupler allows ML-based emulators to be deployed, its reliance on data transfer through the hard disk makes it difficult to achieve a speedup.

To address the abovementioned problems, we developed a WRF-ML coupler that allows any ML-based parameterization scheme to be coupled with the WRF model. The WRF model is selected as it is a popular open-source NWP model used by many researchers and operational organizations. To demonstrate the applicability of the coupler, we train ML-based radiation emulators and couple them with the WRF model. Among all the physical parameterization schemes, ML-based radiation emulators have been studied the most, and in previous studies they have been coupled with the ECMWF Integrated Forecasting System (IFS), the National Centers for Environmental Prediction (NCEP) Climate Forecast System (CFS), and the WRF model (Song and Roh, 2021; Meyer et al., 2022b). The rapid radiative transfer model for general circulation models (RRTMG; Iacono et al., 2008) is selected for radiation as it is widely used in weather and climate models, such as the NCEP Global Forecast System (GFS) and the China Meteorological Administration (CMA) Global/Regional Assimilation and Prediction System (GRAPES).
To train the ML-based radiation emulators, the WRF model is run to generate a dataset for training and evaluation. We also illustrate the performance of the ML-based radiation emulators in the online setting. Lastly, we compare the accuracy and computational costs of the WRF simulations coupled with the ML-based radiation emulators against those of the WRF simulations using the original RRTMG schemes.
The remainder of the paper is organized as follows. Section 2 describes the original and ML-based RRTMG modules. Section 3 introduces the WRF I/O quilt processors and the WRF-ML coupler. Section 4 briefly presents the WRF model setup and the ML-based radiation emulators. Section 5 presents the analysis and results in the online setting. Conclusions and discussions are given in Section 6.
2 Description of the Original and ML-based RRTMG Module

Description of Original RRTMG Module
The RRTMG is developed by Atmospheric and Environmental Research (AER) and comprises shortwave (SW) and longwave (LW) schemes that calculate the SW and LW radiation fluxes and heating rates, respectively. Figure 1 presents the pseudo-code for the original radiation module in the WRF model. At the beginning of the radiation driver module, the input and output variables are initialized. Then, the corresponding radiation calculations are started according to the parameters (i.e., ra_lw_physics and ra_sw_physics) specified in the configuration file (i.e., namelist.input). When the RRTMG schemes are selected, the subroutine rrtmg_lwinit is called to perform the calculations needed to initialize the longwave scheme. Next, the subroutines rrtmg_lwrad and rrtmg_swrad are called to perform the RRTMG LW and SW radiation calculations, respectively. The original radiation computations process one atmospheric column at a time and loop over all the horizontal grid points.
The three-dimensional variables associated with a grid cell (indexed as (i, j, k) in the WRF model output files) are indexed as (i, k, j) in memory by WRF to make the access pattern more cache friendly and increase the cache hit rate. The indices i, j, and k represent the west-east, south-north, and vertical directions, respectively. With the ML-based RRTMG model, inference can be implemented on a batch of data, which requires the input data to be packed into batches. The input features of the ML-based radiation emulators have dimensions of W × H, where W and H represent the number of features and vertical levels, respectively. Therefore, the original three-dimensional variables are reshaped to be indexed as (i, j, k) to match the dimension requirements of the ML-based input features (done by preprocess_layout in Figure 2). The ML model inference is completed within the function infer_run.
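The repacking performed by preprocess_layout can be illustrated with NumPy. The array sizes below are illustrative only, not the actual WRF dimensions, and this is a sketch of the layout transformation rather than the coupler's implementation:

```python
import numpy as np

# Illustrative WRF-style array stored as (i, k, j):
# 10 west-east points, 50 vertical levels, 8 south-north points.
ni, nk, nj = 10, 50, 8
field_ikj = np.arange(ni * nk * nj, dtype=np.float32).reshape(ni, nk, nj)

# Move the vertical dimension last, (i, k, j) -> (i, j, k) ...
field_ijk = np.transpose(field_ikj, (0, 2, 1))

# ... then collapse the horizontal dimensions so every grid column
# becomes one sample in a batch of shape (ni*nj, nk).
batch = field_ijk.reshape(ni * nj, nk)

assert batch.shape == (80, 50)
# The column at (i=3, j=5) of the original array is row 3*nj + 5 of the batch.
assert np.array_equal(batch[3 * nj + 5], field_ikj[3, :, 5])
```

The inverse operation (reshape back to (ni, nj, nk), then transpose to (i, k, j)) corresponds to the post-processing step that restores the WRF memory layout after inference.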

Unlike physics-based radiation schemes, ML-based schemes do not need to calculate intermediate variables such as optical depth. Instead, the ML-based model accomplishes the mapping from the input features to the outputs directly. After inference is finished, the ML model outputs are post-processed to match the dimensions of the original WRF data arrays so that the model simulation can continue.

3 WRF-ML coupler
This subsection briefly describes the WRF I/O quilt server techniques as a similar method is adopted in the WRF-ML coupler to have exclusive servers for ML model inference. Then the methodologies of the WRF-ML coupler are described in detail.

Description of WRF I/O quilt servers
The WRF model is designed to run efficiently on parallel distributed-memory computing systems. To do this, WRF applies domain decomposition to split the entire domain into nprocs (equal to nx × ny) subdomains so that the total amount of work is divided into nprocs compute tasks, where nx and ny are the numbers of processors along the west-east and south-north directions. Furthermore, the WRF model contains an I/O layer that gathers all the distributed data arrays onto the master Message Passing Interface (MPI) rank 0 while the other MPI ranks block until the master completes the write. However, as the domain size or resolution increases, the amount of data to write grows, and the time cost of writing output files becomes a bottleneck for the overall performance. Therefore, the WRF model also provides an option to use asynchronous I/O (quilt) servers that deal exclusively with I/O so that the compute processors can continue without waiting for the output files to be written. More specifically, WRF allows users to specify the number of groups of I/O servers to allocate (nio_groups) and the number of I/O ranks per group (nio_tasks_per_group). Similar to the I/O quilt servers, the WRF-ML coupler provides the option to use several processors exclusively for ML model inference, of which more details are described in the following subsection. While GPUs are typically more powerful than CPUs, both CPUs and GPUs are supported for inference. CPUs are cheaper than GPUs and more widely available, and using CPUs for inference also avoids extra costs due to GPU-CPU data transfer. The number of inference server groups and the number of inference ranks per group are specified via ninfer_groups and ninfer_tasks_per_group in the namelist.input file. The difference between the inference processors and the asynchronous I/O servers is that the inference processors work synchronously, as the WRF calculations run sequentially and the subsequent calculations of the WRF model rely on the outputs from the ML-based parameterizations.
Usually, the number of inference processors is much smaller than the number of WRF compute processors, so a single inference processor must process data from multiple compute processors.
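This many-to-one assignment of compute ranks to inference ranks can be illustrated with a small helper. The mapping below is hypothetical: the paper does not specify the coupler's actual assignment scheme, and a simple block assignment is assumed here for illustration:

```python
def inference_rank_for(compute_rank: int, n_compute: int, n_infer: int) -> int:
    """Map a WRF compute rank to the inference rank that serves it.

    Hypothetical block assignment: consecutive compute ranks share
    one inference server.
    """
    per_server = -(-n_compute // n_infer)  # ceiling division
    return compute_rank // per_server

# With 24 WRF compute processes and 4 inference processes (the online
# configuration used later in the paper), each server handles 6 ranks:
assert [inference_rank_for(r, 24, 4) for r in (0, 5, 6, 23)] == [0, 0, 1, 3]
```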

Description of the WRF-ML coupler
The inference processor listens for and receives data from the corresponding WRF compute processors through MPI transmission while the compute processes block (see the right part of Figure 4). After inference is finished, the inference process sends the outputs of the ML-based schemes back to the compute processors, which then continue the standard WRF calculations. In addition, ninfer_tasks_per_group and ninfer_groups can be set to -1 to execute the ML model inference on the same processors as the WRF compute processes. In that case, no memory copy is needed between WRF compute processors and inference processors.
The functions for implementing the ML model inference are all written in Python, which is commonly used by the machine learning community. To allow the Fortran-based WRF code to call those Python functions, it is first necessary to link the Fortran executables to the system Python library. iso_c_binding is an intrinsic module defined in Fortran 2003 for interoperability with C. As shown in the left part of Figure 3, iso_c_binding is imported within the Fortran code. Although no C code needs to be written, the C Foreign Function Interface (CFFI) for Python is used to generate C-style binding interfaces within the Python script (see the middle part of Figure 3). The C-function interfaces are also defined in the Python scripts, which are used to generate a dynamic library that the Fortran modules can link against.
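The Python side of this pattern can be sketched with CFFI's embedding API. The sketch is illustrative only: the symbol name infer_run follows the function named in the text, while the exact C signature, the module name ml_coupler, and the copy-through "inference" body are assumptions, not the coupler's actual interface:

```python
from cffi import FFI

ffibuilder = FFI()

# C-style interface that the Fortran code calls through iso_c_binding:
# flat double arrays plus a length keep the ABI simple.
ffibuilder.embedding_api("""
    void infer_run(double *inputs, double *outputs, int n);
""")

ffibuilder.set_source("ml_coupler", "")

# Python implementation that runs behind the exported C symbol.
ffibuilder.embedding_init_code("""
    from ml_coupler import ffi

    @ffi.def_extern()
    def infer_run(inputs, outputs, n):
        # Placeholder "inference": copy inputs to outputs.
        for i in range(n):
            outputs[i] = inputs[i]
""")

# ffibuilder.compile(verbose=True)  # would emit the shared library
#                                   # (requires a C compiler at build time)
```

On the Fortran side, an interface declared with bind(c, name="infer_run") via iso_c_binding can then call directly into the Python function once the executable is linked against the generated library.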
This WRF-ML coupler allows ML models to be coupled with the WRF model with minimal effort. It also allows researchers to make full use of state-of-the-art ML model structures. The methodology is made open source in the hope that more machine learning researchers will participate in improving traditional NWP models.

4 Experiment Setup
The ML-based radiation emulator is coupled with the WRF model to demonstrate the coupler's practicality. This section first describes the WRF model setup, the network structures of the ML-based radiation emulators, and the offline evaluation metrics.

In this work, the WRF model version 4.3 is compiled using the GNU Fortran (gfortran version 6.5.1) compiler with the "dmpar" option. To generate a dataset for model training, the WRF model is run using the domain configuration shown in Figure 5. The physics parameterizations include the Bougeault-Lacarrère PBL scheme (Bougeault and Lacarrère, 1989), the RUC land surface model (Smirnova et al., 1997, 2000), and the RRTMG scheme for both SW and LW radiation (Iacono et al., 2008). The WRF simulations were run for three days per month, and the initialization days were randomly selected. The first two days of each three-day simulation are used for training, and the last day is used for testing. The radiation inputs and outputs are saved every 30 minutes of model time.
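The sampling and train/test split described above can be sketched as follows. This is a schematic only: the paper does not specify its day-selection procedure beyond the days being random, and the start-day cap is a simplification of this sketch:

```python
import random

random.seed(0)  # reproducibility of this sketch

train_days, test_days = [], []
for month in range(1, 13):
    # Pick a random 3-day window; capping the start day at 26 sidesteps
    # month-length bookkeeping in this sketch.
    start = random.randint(1, 26)
    train_days += [(month, start), (month, start + 1)]  # first two days
    test_days.append((month, start + 2))                # third day

# 12 months x 2 training days and 12 months x 1 testing day, disjoint.
assert len(train_days) == 24 and len(test_days) == 12
assert all(day not in train_days for day in test_days)
```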

ML-based radiation emulators and offline evaluations
Many researchers (Chevallier et al., 2000; Krasnopolsky et al., 2010; Song and Roh, 2021) have previously used FC networks to replace the radiation schemes within operational NWP models. Additionally, Ukkonen (2022) showed that bidirectional recurrent NNs (RNNs) are more accurate than feed-forward NNs (FNNs) for emulating atmospheric radiative transfer. Yao et al. (2022) also demonstrated that the bidirectional long short-term memory (Bi-LSTM) achieves the best accuracy in radiative transfer parameterization. Therefore, this study has trained and tested both FC networks and Bi-LSTM models. As increasing the number of trainable parameters of an ML model increases both its accuracy and its computational cost, finding a balance is crucial for applying ML models in operational NWP models. The detailed model configurations and comparisons can be found in Yao et al. (2022). All the ML-based emulators have been converted to ONNX and run using the ONNX Runtime library version 1.7.
Table 2 (excerpt). RMSE on the testing data: model D - SW flux 5.355 W/m2, LW flux 1.305 W/m2, SW heating rate 9.943 × 10^-2 K/day, LW heating rate 1.614 × 10^-1 K/day, surface SW flux 5.854 W/m2, TOA SW flux 4.804 W/m2; model E - SW flux 6.157 W/m2, LW flux 1.274 W/m2, SW heating rate 5.091 × 10^-2 K/day, LW heating rate 9.520 × 10^-2 K/day, surface SW flux 6.871 W/m2, TOA SW flux 5.576 W/m2.
Given a similar number of parameters, the FC networks run faster than the Bi-LSTM models. Similarly, although the number of parameters of model A is about 10.3 and 1.03 times that of models C and D (see Table 1), model A is still faster than models C and D.
The performance of the different ML-based radiation emulators is evaluated on the testing data in the offline setting. Table 2 summarizes the RMSE of the fluxes and heating rates, which are the final outputs of the radiation schemes. Since the SW fluxes at the surface and at the top of the atmosphere (TOA) are particularly important for climate and weather prediction, their RMSE is also shown in Table 2. Model E is the most accurate, with RMSE of SW and LW fluxes of about 6.157 and 1.274 W/m2, RMSE of SW and LW heating rates of about 5.091 × 10^-2 and 9.520 × 10^-2 K/day, and RMSE of surface and TOA SW fluxes of about 6.871 and 5.576 W/m2. Model D has accuracy comparable to the best-performing model E while having only 1/14 of the parameters and a much shorter inference time. Furthermore, model C, which has the smallest number of parameters, still achieves relatively good accuracy, although its RMSE of SW heating rate is five times higher than that of model E. On the other hand, models A and B have at least four times higher RMSE than model E in terms of heating rates, with RMSE of SW fluxes at the surface and TOA at least 20 times higher than model E. In summary, the Bi-LSTM models are more accurate than the FC networks for radiation emulation, and model D achieves the best balance between accuracy and computational efficiency.
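The RMSE metric used throughout this offline evaluation can be written in a few lines. This is a generic sketch, not the authors' evaluation code, and the array shapes are illustrative:

```python
import numpy as np

def rmse(pred: np.ndarray, truth: np.ndarray) -> float:
    """Root-mean-square error over all samples and vertical levels."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# Toy check with synthetic flux profiles (4 columns x 3 levels):
truth = np.zeros((4, 3))
pred = truth + 2.0                # a uniform 2 W/m2 error ...
assert rmse(pred, truth) == 2.0   # ... gives an RMSE of exactly 2 W/m2
```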
Since the performance of the ML-based emulators in the online setting is difficult to infer from the offline evaluation, the WRF model is also run with the original RRTMG schemes as a reference against which to evaluate the ML-based emulators online.

5 Online Evaluation Results
The ML-based radiation emulators are coupled with the WRF model using the WRF-ML coupler. The WRF simulations coupled with the ML-based emulators were run using three configurations: no exclusive inference processes, four exclusive inference processes on CPUs, and four exclusive inference processes on GPUs. The total number of WRF processes is kept at 24 to ensure fair comparisons. The calculation time of the radiation driver using the different ML-based radiation emulators is compared. The WRF model is initialized at 12 UTC on a randomly selected day (November 9th, 2021) and runs for three days; this period is not part of the previously used dataset, so the ML-based emulators are evaluated on unseen data.

Finally, the temperature at 2 meters (T2m) and the wind speed at 10 meters (WS10m) output from the WRF model are compared.

Computational efficiency evaluation of radiation driver
The calculation time of the WRF radiation driver using the original RRTMG schemes and the ML-based radiation emulators is profiled using the timing function rsl_internal_microclock provided by the WRF model. Table 3 shows that it takes about 1441 milliseconds to run the radiation driver using the original RRTMG schemes. For the ML-based radiation emulators, using GPUs for inference is significantly faster than using CPUs, especially for the Bi-LSTM model, as it is less computationally efficient than the FC networks given a similar number of parameters. When GPUs are used for inference, all ML-based emulators except model E are at least three times faster than the original RRTMG schemes. However, whether this speed-up is due to applying the ML-based emulators or to the upgrade in hardware (using GPUs instead of CPUs) is unclear, and resolving this is beyond the scope of this paper. On the other hand, when CPUs are used for inference, having four processes exclusively for inference is slower than executing inference on the WRF compute processors, probably because the increase in time cost from copying data between CPUs exceeds the decrease in time cost from having CPUs exclusively for inference.
Model B is more accurate than model A in the offline setting but performs much worse than model A in the online evaluation. The MAD of T2m and WS10m is greater than 5.6 and 1.3, respectively, for model B, while the MAD of T2m and WS10m is slightly less than 2.3 and 1.0, respectively, for model A. Model C performs best among the three FC networks, with MAD only slightly worse than that of models D and E. Models D and E have a similar spatial distribution of differences, although model E has about ten times as many parameters as model D and slightly higher accuracy in the offline validation.
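The MAD (mean absolute difference) statistic quoted above can be sketched in the same spirit as the offline RMSE. This is a generic sketch; the field shapes and values are illustrative only:

```python
import numpy as np

def mean_abs_diff(field_ml: np.ndarray, field_ref: np.ndarray) -> float:
    """Domain-averaged mean absolute difference between two 2-D fields."""
    return float(np.mean(np.abs(field_ml - field_ref)))

# Toy 5x5 T2m fields: the ML-coupled run offset by 0.25 K everywhere.
t2m_ref = np.full((5, 5), 288.0)
t2m_ml = t2m_ref + 0.25
assert mean_abs_diff(t2m_ml, t2m_ref) == 0.25
```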

Prognostic validation
Overall, model E demonstrates the best prognostic performance, with domain-averaged MAD smaller than 0.10 for both the temperature at 2 meters and the wind speed at 10 meters. Both model D and model E produce forecasts comparable to the original WRF simulation over the entire domain. Nevertheless, model D is more suitable for replacing the original RRTMG schemes as it is more computationally efficient than model E: using model D with GPU inference processors, the radiation driver is about seven times faster.
In summary, the WRF-ML coupler enables all the ML-based radiation emulators to be coupled with the WRF model efficiently. Moreover, the WRF simulations coupled with the ML models run successfully using either CPUs or GPUs for inference.
Furthermore, the WRF-ML coupler can be a valuable tool for researchers to evaluate their ML-based parameterization in online settings.
Figure 8. Differences in wind speed at 10 meters at forecast horizons of 24, 48, and 72 hours between WRF simulations coupled with ML models A, B, C, D, and E and the WRF simulation with the original RRTMG schemes for radiation.

6 Summary and Conclusions
As demonstrated in many previous studies, emulating subgrid physics using ML models has the potential to be faster or, when trained on observations or detailed high-resolution models, even more accurate than conventional parameterization schemes. However, the fact that ML models are commonly implemented in Python while NWP models are coded in Fortran makes it challenging to implement ML-based parameterizations in NWP models for operational applications.

Previously, researchers usually hard-coded the operations of NNs into Fortran, which is time-consuming and error-prone, as even minor changes to the ML model require rewriting the Fortran code. Some researchers used simple Fortran-based NN libraries, which can convert existing models trained in Python to models usable in a Fortran framework. However, these methods are only feasible for simple, dense-layer-based NNs. In this paper, we propose a WRF-ML coupler that provides researchers with a tool to implement ML models in NWP models with minimal effort. The advantage of this coupler is that more complex and advanced ML model structures are supported, as it communicates directly with Python-based modules. The coupler is made open source in the hope that more researchers can integrate ML-based parameterizations to improve traditional NWP models for more accurate weather and climate forecasts. In addition, the methodology allows the integration of arbitrary Python scripts into the WRF model; an example of coupling primitive Python subroutines into the WRF code is provided in the code repository.

Furthermore, the capability of the WRF-ML coupler was demonstrated by coupling the ML-based radiation parameterizations into the WRF model. It was also illustrated that the computational efficiency and accuracy of ML-based parameterizations should be balanced to achieve the best overall performance for operational applications.
Code and data availability. The source code and data used in this study are available at https://doi.org/10.5281/zenodo.7407487.

Competing interests. The authors declare no conflict of interest.
Acknowledgements. This work was supported in part by the Zhejiang Science and Technology Program under Grant 2021C01017.