the Creative Commons Attribution 4.0 License.
Improving the fine structure of intense rainfall forecast by a designed generative adversarial network
Zuliang Fang
Qi Zhong
Haoming Chen
Xiuming Wang
Zhicha Zhang
Hongli Liang
Accurate short-term quantitative precipitation forecasting (QPF) is critical for disaster prevention, mitigation, and socio-economic activities. However, due to the inherent limitations of numerical weather prediction (NWP) models, precipitation forecasts still exhibit substantial inaccuracies. In recent years, deep learning (DL) techniques have been increasingly applied to improve precipitation forecasts, yet these approaches often produce overly smoothed outputs that fail to meet operational requirements for detail and accuracy. In this study, we propose a Generative Fusion Residual Network (GFRNet), a generative adversarial network (GAN)-based framework that integrates multi-source NWP forecasts to generate 3-hourly quantitative precipitation forecasts for North China up to 24 h in advance. GFRNet employs an adversarial learning mechanism to enhance spatial structure reconstruction, combined with a weighted loss function and carefully designed sampling strategies to address the long-tailed distribution of precipitation and improve model training efficiency. Using independent rainy-season datasets from 2022–2024, we comprehensively evaluate the performance of GFRNet against three NWP models, a linear ensemble method (MSEM), and a deep learning baseline model (FRNet). Results show that GFRNet consistently outperforms the NWPs and baseline models across light, moderate, and heavy rainfall thresholds. Compared to the China Meteorological Administration's highest-resolution regional model (CMA-3KM), GFRNet improves Threat Scores (TS) by 4 %, 28 %, 35 %, and 19 % at the 0.1, 10, 20, and 40 mm thresholds, respectively, and improves Fractions Skill Scores (FSS) by 13 %, 18 %, and 15 % at the 10, 20, and 40 mm thresholds. Moreover, GFRNet consistently achieves the highest Multi-Scale Structural Similarity (MS-SSIM) scores and significantly reduces RMSE, demonstrating robust spatial structure recovery, stable intensity control, and strong generalization ability. 
These advantages are particularly pronounced in systemic high-impact heavy rainfall events, underscoring the model's operational value. FRNet shows advantages in forecasting heavy precipitation but suffers from high BIAS and weaker generalization, limiting its practical applicability. MSEM exhibits robust performance in light and moderate precipitation scenarios but degrades significantly under extreme precipitation conditions. Overall, GFRNet dynamically fuses multi-source NWP information, balances precipitation intensity and spatial structure, achieves higher forecasting skill, and improves forecast quality across diverse precipitation regimes.
Numerical Weather Prediction (NWP) serves as a fundamental tool in routine precipitation forecasting. However, its accuracy is constrained by various factors, including initial condition errors, limited spatial resolution, incomplete physical parameterizations, and approximate boundary conditions, all of which contribute to persistent forecast uncertainties (Sun et al., 2014; Boeing, 2016). As a result, it is challenging for any single numerical model to accurately capture the location, intensity, and structural evolution of precipitation.
In recent years, deep learning (DL), a core technique in artificial intelligence, has been increasingly applied in meteorology, including NWP post-processing, large-scale data assimilation, super-resolution, downscaling, and spatiotemporal prediction (Yang et al., 2022). In the domain of precipitation forecasting, DL has achieved significant progress. For nowcasting (0–6 h), purely data-driven DL methods based on radar and satellite data have demonstrated substantial superiority over numerical models and optical flow methods (Shi et al., 2015; Wang et al., 2018b; Sønderby et al., 2020; Ayzel et al., 2020; Espeholt et al., 2022; Tan et al., 2024). For short-term forecasting within the 6–24 h range, precipitation prediction primarily relies on post-processing of NWP outputs. For example, Zhang et al. (2020) developed an LSTM-based correction model for 12 h accumulated precipitation over eastern China using ECMWF ensemble control forecasts, demonstrating superior performance for both light rain (<5 mm per 12 h) and heavy rain (>30 mm per 12 h) compared to frequency matching and SVM-based algorithms. Similarly, Chen et al. (2021) constructed an hourly precipitation correction model using a Convolutional Neural Network (CNN) applied to mesoscale forecasts from the East China Regional Numerical Center (CMA-SH9), achieving better skill than probability matching.
Moreover, Zhou et al. (2022) utilized a 3D CNN to learn the nonlinear relationship between basic meteorological variables from the ECMWF's fifth-generation reanalysis dataset (ERA5) and corresponding 3 h accumulated precipitation. Their model, when applied to ECMWF high-resolution forecasts, significantly improved the Threat Score (TS) at the 20 mm per 3 h threshold for lead times up to 72 h. In another study, Kim et al. (2022) used basic meteorological variables and precipitation from numerical model forecasts as input features for a DL model, achieving positive correction effects for light and moderate precipitation, though the improvements diminished for precipitation exceeding 10 mm. Chen et al. (2023) employed a U-Net architecture with a weighted loss function to correct 6 h accumulated precipitation predictions from the ECMWF, using 0.25° ERA5 precipitation data as the ground truth. This approach showed improvements across various precipitation intensities, from light rain (≥0.1 mm per 6 h) to rainstorms (≥20 mm per 6 h), in TS scores compared to the ECMWF forecast. Sun et al. (2023) developed a DABU-Net model combining data augmentation with deep learning to improve GFS wintertime precipitation forecasts over southeastern China. The model significantly enhanced Threat Scores (TS) across multiple thresholds, with TS at the 20 mm d−1 threshold increasing by up to 100 % at a 72 h lead time. Despite these advances, grid-based DL precipitation correction models generally perform better for light to moderate precipitation. Improvements in TS for heavy rainfall are often accompanied by overly smoothed predictions that lack well-defined spatial structures. Additionally, corresponding BIAS scores frequently exceed 1, indicating systematic overestimation and reducing the operational applicability of such methods.
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), a typical deep generative model (DGM), replace the explicit evaluation of an intractable likelihood function with a pair of neural networks trained against each other. By learning through competition between a generator and a discriminator, GANs produce outputs that closely resemble the distribution of real data. GANs have been widely used in image super-resolution tasks (Wang et al., 2018a) and have shown great promise in addressing challenges in short-term forecasting, such as excessive smoothing and the degradation of intensity over time (Ravuri et al., 2021; Zhang et al., 2023). GANs have also demonstrated strong performance in statistical downscaling within the meteorological field (Leinonen et al., 2021; Price and Rasp, 2022; Singh et al., 2019). Recent studies have explored the use of GANs in post-processing NWP-based precipitation forecasts. For example, Price and Rasp (2022) utilized a 4 km resolution radar precipitation product to train a conditional GAN (CGAN) model for correcting and downscaling 6 h precipitation forecasts from the 32 km ECMWF ensemble. The CGAN model outperformed CNN baselines and achieved skill comparable to high-resolution regional ensemble forecasts, especially for heavy precipitation events (≥30 mm per 6 h). Similarly, Harris et al. (2022) aimed to generate high-resolution ensemble precipitation forecasts by post-processing ECMWF forecasts at 10 km resolution using GAN and VAE-GAN methods, targeting 1 h accumulated precipitation products at 1 km resolution. Compared to traditional methods, the GAN approach showed significant advantages in preserving precipitation structure and predicting heavy precipitation (≥5 mm per 1 h).
However, most existing applications of GANs focus on probabilistic ensemble forecasts rather than deterministic quantitative precipitation forecasts, and few studies directly address severe storm precipitation, an area of critical operational importance due to the associated risks.
Short-term heavy precipitation is typically characterized by sudden onset, short duration, small spatial scale, and high localization. These features demand precipitation forecasts with finer temporal and spatial resolutions to meet operational needs. In this study, we employ a GAN-based model, GFRNet, to generate deterministic forecasts of 3 h accumulated precipitation over the next 24 h in North China, using multiple NWP model outputs as input and targeting a resolution of 5 km. Compared with previous research, this study introduces the following key advancements:
- GAN-based generative fusion framework. We propose GFRNet, a novel GAN-based model that dynamically integrates multi-source NWP forecasts (ECMWF, CMA-SH9, CMA-3KM), enhancing fine-scale precipitation structure reconstruction and mitigating the blurriness common in deep learning precipitation forecasts.
- Targeted evaluation of high-impact precipitation. Beyond conventional thresholds, this study adopts a stringent 40 mm per 3 h criterion and introduces a Top 10 % coverage-based subset to explicitly assess model performance in organized, high-impact precipitation events.
- Comprehensive multi-year validation and statistical analysis. GFRNet is systematically evaluated across three independent summer seasons (2022–2024) using diverse metrics (TS, FSS, RMSE, MS-SSIM) and paired t tests, providing robust evidence of skill improvements and clarifying the sources of these gains.
2.1 Data
This study focuses on North China (35–44.55° N, 112–121.55° E), as illustrated in Fig. 1. Administratively, this region includes Beijing, Tianjin, Hebei, Shanxi, and the Inner Mongolia Autonomous Region, with the southeastern part encompassing Shandong and the Bohai Sea region. The target area features complex topography, dominated by the Taihang Mountains, which extend from the southwest to the northeast. To the southeast lies the North China Plain, characterized by an average elevation below 400 m. West of the Taihang Mountains is the Loess Plateau, and to the north is the Inner Mongolia Plateau, with elevations exceeding 800 m and local peaks reaching up to 2000 m.
Figure 1. Topography distribution (shaded; in units of m) of the North China domain (35–45° N, 112–122° E). The vast area with an altitude of less than 400 m in the middle and southeast of the figure is the North China Plain, which reaches the southern foot of Yanshan Mountain in the north, leans on Taihang Mountain in the west, and borders the Bohai Sea in the east. It includes Beijing (red triangle), Tianjin, Shandong, and most of Hebei.
This study uses the CMA Multi-source merged Precipitation Analysis System (CMPAS) as the ground truth for precipitation fields. CMPAS is a comprehensive precipitation product developed by the National Meteorological Information Center of the China Meteorological Administration. It integrates ground automatic station data, satellite observations, and radar observations using methods such as Probability Density Function (PDF) matching, Bayesian Model Averaging (BMA), Optimal Interpolation (OI), and Downscaling (DS) (Pan et al., 2018). CMPAS provides hourly temporal resolution and a spatial resolution of 0.05°×0.05°.
For numerical models, considering operational usage, model resolution, and performance, this study uses the precipitation forecasts of the following three NWP models:
- ECMWF: the high-resolution global model forecast from the European Centre for Medium-Range Weather Forecasts, with a horizontal resolution of approximately 9 km in the China region and a temporal resolution of 3 h.
- CMA-SH9: the mesoscale forecast from the East China Regional Numerical Center (Zhang et al., 2021), with a horizontal resolution of 9 km and a temporal resolution of 1 h.
- CMA-3KM: the high-resolution regional numerical forecast independently developed by the Numerical Prediction Center of the China Meteorological Administration (Shen et al., 2020), with a horizontal spatial resolution of about 3 km and a temporal resolution of 1 h.
Forecasts are taken from the initial times of 00:00 and 12:00 UTC, retaining a 24 h forecast range. Spatially, numerical model forecasts are interpolated to a uniform grid of 0.05°×0.05° using a bilinear interpolation algorithm, corresponding to a target area size of 192×192 grid points.
Based on the data described above, we produce 3 h accumulated precipitation forecasts for the next 24 h. Table 1 details the feature selection, which draws on five sources of features. Let r3(T) denote the 3 h accumulated precipitation at time T; the learning target is the corresponding CMPAS r3(T) observation. The input features consist of r3(T) and r3(T−3) from ECMWF, CMA-SH9, and CMA-3KM. Given that precipitation formation, development, and movement are closely linked to topography and location, META features including elevation, latitude, and longitude are also incorporated into the model. The performance of numerical model forecasts varies with the forecast cycle and lead time. To account for this, temporal information such as forecast cycle and lead hour is encoded using trigonometric functions and included as features in the deep learning model. The cycle takes the value 0 or 1, corresponding to the initial forecast times of 00:00 and 12:00 UTC for the numerical models. For each cycle, only the forecast lead times at 3, 6, 9, 12, 15, 18, 21, and 24 h are considered.
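The text states that cycle and lead hour are encoded with trigonometric functions but does not give the exact form. The sketch below shows one common choice, a sine/cosine pair over a 24 h period; the period, the function names `encode_cyclic` and `time_features`, and the mapping of the cycle flag to an initialization hour are all assumptions made for illustration.

```python
import math

def encode_cyclic(value, period):
    """Map a periodic quantity to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * (value / period)
    return math.sin(angle), math.cos(angle)

def time_features(cycle, lead_hour):
    """Return (sin_init, cos_init, sin_lead, cos_lead).

    cycle: 0 for the 00:00 UTC run, 1 for the 12:00 UTC run.
    lead_hour: one of 3, 6, ..., 24.
    The 24 h period is an assumption, not taken from the paper.
    """
    if cycle not in (0, 1):
        raise ValueError("cycle must be 0 or 1")
    if lead_hour not in range(3, 25, 3):
        raise ValueError("lead_hour must be one of 3, 6, ..., 24")
    init_hour = 12 * cycle  # hypothetical mapping of the cycle flag to an hour
    return encode_cyclic(init_hour, 24) + encode_cyclic(lead_hour, 24)
```

In a gridded model each of these scalars would typically be broadcast to a constant 192×192 channel and stacked with the precipitation and META channels.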
This study uses precipitation data from 2019 to 2022 and divides the dataset into training, validation, and test sets. In North China, precipitation is predominantly concentrated in the summer months, particularly July and August. Therefore, the period from 10 July to 20 August 2021 was designated as the validation set (637 samples) for model tuning and parameter selection, and the period from 16 June to 31 August 2022 was designated as the initial test set (1196 samples). To further evaluate the model's generalization, we added the rainy seasons of 2023 and 2024 (1093 and 1211 samples, respectively) as independent test sets.
The training set spans multiple summer periods and initially contains many non-precipitation images, which would otherwise cause the model to waste computational resources and potentially degrade its learning. Therefore, we applied an image-level sampling strategy: for each image, if the proportion of pixels exceeding a rainfall threshold t is below a predefined ratio r, the image is discarded; otherwise, it is retained. We set t=1 mm and r=2 % in this study, ensuring that low-level precipitation is captured while maintaining sufficient sample representativeness. After sampling, the training set was reduced from 4645 to 2885 samples. It is important to note that no sampling or filtering was applied to the validation or test sets (Table 2).
Table 2. Sample distribution across training, validation, and test sets. The training set (2019–2022) is subject to image-level sampling to increase the proportion of samples containing measurable precipitation. The validation set (2021) and test sets (2022–2024) are not sampled to objectively assess model performance and generalization.
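The image-level sampling rule can be sketched as follows, assuming each sample is a 2-D list of 3 h precipitation values in mm. The function names are hypothetical, and reading "exceeding" as a strict inequality is an assumption.

```python
def rain_fraction(field, threshold):
    """Fraction of grid points whose value exceeds `threshold` (mm)."""
    total = 0
    wet = 0
    for row in field:
        for value in row:
            total += 1
            if value > threshold:  # strict ">" is an assumption
                wet += 1
    return wet / total

def keep_sample(field, t=1.0, r=0.02):
    """Discard the image if the wet fraction is below r, otherwise keep it.

    The defaults follow the values quoted in the text (t = 1 mm, r = 2 %).
    """
    return rain_fraction(field, t) >= r
```

Applied to the training set only, a filter of this kind would drop mostly dry images while leaving the validation and test sets untouched.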
Since the sampling was performed at the image level, we categorized precipitation intensity for each image based on the proportion of pixels exceeding specific thresholds. Specifically, an image is classified as light rain if more than 10 % of its pixels have precipitation ≥0.1 mm, as moderate rain if more than 0.5 % of its pixels have precipitation ≥10 mm, as heavy rain if more than 0.2 % of its pixels have precipitation ≥20 mm, and as a rainstorm if more than 0.1 % of its pixels have precipitation ≥40 mm.
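A minimal sketch of this image-level labelling is given below. Because the category proportions reported for the training set sum to more than 100 %, the sketch assumes that the categories are not mutually exclusive and returns every label whose criterion is met; the names `CATEGORY_RULES` and `image_categories` are hypothetical.

```python
# (threshold in mm per 3 h, minimum pixel fraction), as quoted in the text
CATEGORY_RULES = {
    "light rain": (0.1, 0.10),
    "moderate rain": (10.0, 0.005),
    "heavy rain": (20.0, 0.002),
    "rainstorm": (40.0, 0.001),
}

def image_categories(field):
    """Return all category labels satisfied by a 2-D precipitation field."""
    values = [value for row in field for value in row]
    n = len(values)
    labels = []
    for name, (threshold, min_fraction) in CATEGORY_RULES.items():
        fraction = sum(1 for value in values if value >= threshold) / n
        if fraction > min_fraction:  # "more than" the stated proportion
            labels.append(name)
    return labels
```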
Figure 2a illustrates how the proportion of different image-level precipitation categories changed after sampling: the proportion of light rain samples increased from 40 % to 63 %, moderate rain samples increased from 25 % to 40 %, and heavy rain and rainstorm samples rose from 16 % and 7 % to 26 % and 11 %, respectively. This adjustment in sample proportions is expected to facilitate more stable and efficient model training by increasing the representation of precipitation cases across different intensity levels.
Figure 2. Changes in precipitation sample composition in the training set before and after sampling: (a) image-level sample proportions across different precipitation intensities; (b) pixel-level precipitation distribution, showing the persistent long-tail characteristic.
However, Fig. 2b shows the pixel-level precipitation distribution, revealing a persistent long-tail pattern: the proportion of pixels with precipitation ≥20 and ≥40 mm remains extremely low (only 0.29 % and 0.05 %, respectively). While sampling improves the composition of training images, it cannot fundamentally change the imbalance of precipitation intensity at the pixel level. This imbalance motivated the design of the customized weighted loss function (Sect. 2.2.3).
2.2 Methodology
2.2.1 GFRNet
The core idea of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) is to use adversarial training to enable the Generator (G) to learn the distribution of real data and generate synthetic data that closely approximates real data. Simultaneously, the Discriminator (D) strives to improve its ability to distinguish between real data from the dataset and data generated by the generator. In this study, we propose a Generative Fusion Residual Network (GFRNet) for multi-NWP precipitation post-processing. As illustrated in Fig. 3, GFRNet consists of two main components: the Generator and the Discriminator.
Figure 3. Model architecture: (a) generator of GFRNet (also referred to as FRNet); (b) discriminator of GFRNet.
The core structure of the Generator in GFRNet is inspired by a U-Net with an encoder–decoder architecture (Ronneberger et al., 2015). The input to the model is a multi-channel tensor defined on the 192×192 target grid, and the output is a precipitation field on the same grid. The encoder comprises four Down-ConvBlocks, which gradually reduce the spatial dimensions of the feature maps while extracting deep feature information. The decoder, conversely, consists of four Up-ConvBlocks that progressively restore the spatial dimensions of the feature maps through upsampling operations. The specific sizes of the feature maps are illustrated in Fig. 3a. Skip connections are introduced between the encoder and decoder, connecting the output of a layer in the encoder directly to the input of the corresponding layer in the decoder, which helps preserve and reuse the features extracted at different levels. The activation function of the generator's final layer is set to ReLU (Agarap, 2019), which keeps the regression output non-negative. Each ConvBlock module integrates four key components:
- Convolution operation: this transforms the size of the feature map and is used for either upsampling or downsampling.
- Batch Normalization (BN) (Ioffe and Szegedy, 2015), ReLU, and Dropout (Srivastava et al., 2014) layers: these are used to accelerate the training process, improve model robustness, and prevent overfitting.
- Residual module (He et al., 2016): this backbone consists of two convolutional layers with BN and a dropout layer in between. The final output is obtained by adding the block input to the output of the second convolutional layer through a skip connection.
- SE-Block: this is a channel-attention module composed of two sub-modules, Squeeze and Excitation (Hu et al., 2018). The squeeze operation compresses the feature map of each channel via global pooling to obtain channel-wise importance coefficients, and the excitation operation reweights each channel according to these coefficients.
The Generator's U-Net-like structure can effectively capture the geographic and spatial dependencies of precipitation distribution. The residual structure in the ConvBlock helps prevent gradient disappearance and explosion in deep-layer networks, enhancing model performance and accelerating training. Moreover, it improves the reuse and transmission of features. The SE attention mechanism helps the model focus on the feature channels that contribute significantly to the prediction of precipitation.
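To make the squeeze-and-excitation step concrete, the following dependency-free sketch applies channel attention to a small stack of feature maps. It assumes global average pooling and a two-layer bottleneck (ReLU, then sigmoid) as in Hu et al. (2018); the weight matrices `w1` and `w2`, the omission of biases, and the function names are illustrative assumptions rather than the configuration used in GFRNet.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def se_block(feature_maps, w1, w2):
    """Reweight channels of `feature_maps` (list of C channels, each H x W).

    w1: C_reduced x C weights of the bottleneck layer (hypothetical).
    w2: C x C_reduced weights of the expansion layer (hypothetical).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]
    # Excitation: bottleneck with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    scales = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Reweight every value of a channel by that channel's scale.
    return [
        [[value * scale for value in row] for row in channel]
        for channel, scale in zip(feature_maps, scales)
    ]
```

In a trained network the weights are learned, so channels that help predict precipitation receive scales close to 1 and less useful channels are suppressed.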
Radford et al. (2016) significantly improved the training stability of GANs and the quality of generated images by introducing the Deep Convolutional GAN (DCGAN) structure. Inspired by DCGAN, the main architecture of our discriminator consists of four ConvBlocks that perform progressive spatial downsampling and channel expansion on the single input image, enabling richer semantic feature extraction. This is followed by a Dense layer and a Sigmoid activation, which outputs the probability that the input image is a real sample.
2.2.2 Baseline model: MSEM
Traditional multi-model fusion correction methods have undergone extensive development and validation, demonstrating widespread application and reliable performance in numerical weather prediction (Dai et al., 2018). To provide a meaningful baseline for evaluating the proposed GFRNet, we introduce the Multi-Model Similarity Ensemble Method (MSEM). This method mimics forecasters' reasoning process by quantifying the similarity among different forecasts and assigning higher weights to ensemble members that exhibit higher similarity.
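Let $\mathbf{f}_i \in \mathbb{R}^n$ denote the flattened forecast field of the $i$th ensemble member ($i = 1, 2, \ldots, N$), where $n$ is the total number of grid points. MSEM begins by constructing a similarity matrix among all ensemble members using cosine similarity:

$$S_{ij} = \frac{\mathbf{f}_i \cdot \mathbf{f}_j}{\lVert \mathbf{f}_i \rVert \, \lVert \mathbf{f}_j \rVert}$$

Next, the similarity-based weight $w_i$ for the $i$th member is computed as the average similarity between this member and all others:

$$w_i = \frac{1}{N-1} \sum_{j \neq i} S_{ij}$$

Finally, the ensemble prediction is obtained through a weighted average:

$$\mathbf{f}_{\mathrm{ens}} = \frac{\sum_{i=1}^{N} w_i \, \mathbf{f}_i}{\sum_{i=1}^{N} w_i}$$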
This approach adaptively emphasizes ensemble members with higher similarity, which are assumed to be more reliable. Compared to standard ensemble averaging, MSEM provides enhanced robustness and interpretability, particularly in capturing localized high-impact rainfall events, as evidenced in previous studies (Chen et al., 2005; Lin et al., 2013). It serves as a strong, physically motivated baseline for evaluating GAN-based correction methods such as GFRNet.
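The three MSEM steps (cosine similarity matrix, average-similarity weights, weighted average) can be sketched in a few lines. This is an illustration under stated assumptions, not the operational implementation: fields are flattened lists, the weights are normalized by their sum, and an all-zero field is given zero similarity; the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two flattened fields; 0 if either is all zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a dry field carries no similarity information
    return dot / (norm_a * norm_b)

def msem(members):
    """Return (fused_field, weights) for a list of flattened forecasts."""
    n = len(members)
    if n < 2:
        raise ValueError("MSEM needs at least two ensemble members")
    sim = [[cosine_similarity(members[i], members[j]) for j in range(n)]
           for i in range(n)]
    # Weight of member i: mean similarity to all other members.
    weights = [sum(sim[i][j] for j in range(n) if j != i) / (n - 1)
               for i in range(n)]
    total = sum(weights)
    if total == 0.0:
        # Fallback to a plain average when no member resembles another.
        weights = [1.0 / n] * n
        total = 1.0
    size = len(members[0])
    fused = [sum(w * m[k] for w, m in zip(weights, members)) / total
             for k in range(size)]
    return fused, weights
```

With three members of which two agree and one is orthogonal to both, the outlier receives zero weight and the fused field equals the agreeing pair.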
2.2.3 Model training and optimization
During the GAN training process, the generator and discriminator continuously compete and collaborate, driving mutual evolution. The generator aims to produce samples that resemble real data, while the discriminator receives both real and generated data as input and outputs a probability value indicating its confidence in the input being real. In this study, the optimization objectives for the discriminator and generator are as follows:
LD and LG represent the loss functions of the discriminator and generator, respectively. The parameters of the corresponding neural networks are denoted by θD and θG. The input to the generator and the predicted results are represented by x and ŷ, respectively, while y denotes the real labels. Wasserstein GANs (WGANs) (Arjovsky et al., 2017; Gulrajani et al., 2017) address the gradient vanishing problem commonly encountered in traditional GANs. Following the principles of WGAN, we adopt a loss function with a gradient penalty term to optimize the discriminator. As shown in Eq. (6), D(y) and D(ŷ) denote the scores assigned by the discriminator to real samples and to samples generated by the generator, respectively. The latter part of the equation represents the gradient penalty term, where the weight γ is set to 10, and the interpolated samples are random weighted averages of the real label y and the generator's prediction ŷ, with the mixing weight ε drawn from a uniform distribution between 0 and 1.
The loss function LG for the generator consists of two components. The first part, based on D(ŷ), is the confidence score given by the discriminator to the generated images, indicating how closely they resemble real samples. We aim for this score to be as high as possible. The second part, Lcontent, is the content loss, which is a weighted combination of Mean Squared Error (MSE) and Mean Absolute Error (MAE) loss functions. By setting the weight λ to 50, we ensure that the values of both loss components are of the same magnitude.
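For orientation, the standard WGAN-GP objectives of Gulrajani et al. (2017), combined with a content term, take the form below. This is a sketch consistent with the description above, not a verbatim copy of the paper's equations: here $\hat{y} = G(x)$ is the generator output, $\tilde{y}$ is the interpolated sample, and placing λ on the content term (rather than on the adversarial term) is an assumption.

```latex
\begin{aligned}
L_D &= \mathbb{E}\left[D(\hat{y})\right] - \mathbb{E}\left[D(y)\right]
       + \gamma\,\mathbb{E}\left[\left(\lVert \nabla_{\tilde{y}} D(\tilde{y}) \rVert_2 - 1\right)^2\right],
       \qquad \tilde{y} = \varepsilon\, y + (1-\varepsilon)\,\hat{y},
       \quad \varepsilon \sim U(0,1), \\
L_G &= -\,\mathbb{E}\left[D(\hat{y})\right] + \lambda\, L_{\mathrm{content}},
       \qquad \gamma = 10,\ \lambda = 50 .
\end{aligned}
```

Minimizing $L_D$ raises the scores of real samples relative to generated ones while keeping the gradient norm close to 1; minimizing $L_G$ raises the score of generated samples while keeping them close to the observations.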
In Lcontent, the MSE part emphasizes larger errors and provides a smoother gradient, while the MAE is less affected by outliers. Combining MSE and MAE helps balance large and small errors, enhancing the model's robustness and stability. Additionally, considering the long-tail distribution of r3 intensity shown in Fig. 2, where significant precipitation events are rare but critical, it is crucial to assign higher loss weights to samples with strong precipitation intensity (Bihlo, 2023). This strategy mitigates gradient vanishing or explosion and ensures the model learns to predict these rare, high-impact events effectively. Using the exponential loss weighting function in Eq. (10), we empirically found that the parameters a=4.3 and b=0.8 yielded the best performance. To determine these values, we first tuned a and b on the FRNet configuration (generator-only) to balance performance across moderate-to-heavy and extreme rainfall events, selecting the combination that improved heavy rainfall detection without overestimating light rain. After fixing these weights, we introduced the discriminator to form GFRNet and subsequently adjusted the gradient penalty coefficient γ to ensure stable adversarial training, selecting a value that produced a smooth and consistent decline in the generator's loss curve.
Both the generator and discriminator are optimized using the Adam optimizer (Kingma and Ba, 2017) with betas set to (0.9, 0.999) and a weight decay of 0.01. The learning rate follows a CosineAnnealingLR schedule (Loshchilov and Hutter, 2017), oscillating between 0.001 and 0 over a period of 20 epochs. During training, we observed that the discriminator initially improved slowly, necessitating a reduction in the generator's update frequency. Experimental results showed that updating the generator every 9 steps stabilized training for both networks. Model training was monitored using the validation loss, and early stopping was employed. Training was halted if the validation loss failed to decrease for 30 consecutive epochs. All evaluation results presented below are based on the model checkpoint with the minimum validation loss.
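The training schedule described above can be sketched without any deep learning framework. The closed-form cosine annealing below assumes that "a period of 20 epochs" means a half-cycle length of 20 epochs (the `T_max` convention of common implementations); that reading, and all function and class names, are assumptions.

```python
import math

def cosine_annealing_lr(epoch, lr_max=1e-3, lr_min=0.0, t_max=20):
    """Closed-form cosine annealing between lr_max and lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / t_max))

def update_generator(step, every=9):
    """Update the generator once every `every` discriminator steps."""
    return step % every == 0

class EarlyStopping:
    """Stop when the validation loss has not decreased for `patience` epochs."""

    def __init__(self, patience=30):
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `cosine_annealing_lr(epoch)` would set the optimizer learning rate at the start of each epoch, and the checkpoint saved whenever `best` improves would be the one used for evaluation.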
The generator and discriminator contain 4.46 M and 0.72 M parameters, respectively. Training and inference were conducted using the NVIDIA CUDA library and Tesla GPUs. With a single NVIDIA A100 GPU, the training process completes in approximately 3 h, and inference for 1000 samples takes just 2 min, satisfying operational time constraints.
To further evaluate GFRNet, we conducted an ablation study using only the generator, without adversarial training, referred to as FRNet. The content loss, dataset, and training strategies for FRNet remain consistent with those used for GFRNet.
2.3 Evaluation metrics
To comprehensively evaluate the predictive performance of the proposed model, we adopt a combination of categorical, neighborhood-based, and continuous/structural verification metrics. This diverse set of metrics enables an in-depth assessment from multiple perspectives, including precipitation occurrence, magnitude, and spatial structural realism.
2.3.1 Binary verification metrics
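The binary metrics (TS, POD, FAR, and BIAS) are calculated based on a confusion matrix constructed at a given precipitation threshold. These metrics provide insight into the forecast's ability to correctly detect rainfall events. The specific definitions are as follows:

$$\mathrm{TS} = \frac{h}{h+f+m}, \qquad \mathrm{POD} = \frac{h}{h+m}, \qquad \mathrm{FAR} = \frac{f}{h+f}, \qquad \mathrm{BIAS} = \frac{h+f}{h+m}$$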
The definitions of h, f, and m (hits, false alarms, and misses) follow the confusion matrix shown in Table 3. The TS, POD, and FAR values range between 0 and 1. Higher TS and POD values and lower FAR values indicate better forecast performance. A BIAS value of 1 indicates an unbiased forecast, while values between 0 and 1 indicate under-prediction, and values greater than 1 indicate over-prediction.
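A minimal sketch of these scores for flattened forecast and observation sequences is shown below, using the standard contingency-table definitions. Treating a value equal to the threshold as an event (`>=`) and returning NaN for undefined ratios are assumptions; the function names are hypothetical.

```python
def contingency(forecast, observed, threshold):
    """Count hits (h), false alarms (f), and misses (m) at a threshold."""
    h = f = m = 0
    for fc, ob in zip(forecast, observed):
        fc_yes = fc >= threshold
        ob_yes = ob >= threshold
        if fc_yes and ob_yes:
            h += 1
        elif fc_yes:
            f += 1
        elif ob_yes:
            m += 1
    return h, f, m

def binary_scores(h, f, m):
    """Return TS, POD, FAR, and BIAS; NaN where the denominator is zero."""
    nan = float("nan")
    return {
        "TS": h / (h + f + m) if (h + f + m) else nan,
        "POD": h / (h + m) if (h + m) else nan,
        "FAR": f / (h + f) if (h + f) else nan,
        "BIAS": (h + f) / (h + m) if (h + m) else nan,
    }
```

For example, two hits, one false alarm, and one miss give TS = 0.5 and BIAS = 1, which shows that an unbiased forecast can still have limited skill.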
2.3.2 Neighborhood-based metric: FSS
The above binary metrics are all computed by comparing individual pixel values. Even if the predicted rainfall structure and intensity match the actual conditions, a slight positional deviation of the predicted rain band from the observed location can produce a high FAR and a low POD, and hence a low TS. Such scores do not objectively reflect the true forecasting ability of the model. To address this, neighborhood spatial verification methods such as the Fractions Skill Score (FSS) (Roberts and Lean, 2008) have been developed. FSS evaluates forecast performance by comparing the fraction of grid points exceeding a certain threshold within a neighborhood in both forecast and observation fields. This approach enables a more objective assessment of high-resolution models' ability to capture spatial structures. Additionally, FSS is easy to implement and is not sensitive to parameters such as threshold filters or smoothing radii, which contributes to consistent evaluation results. FSS is now widely used and has been adopted by ECMWF as a standard metric for precipitation evaluation, replacing many traditional skill scores. The FSS is derived from the Fractions Brier Score (FBS) and is calculated as follows:

$$\mathrm{FBS} = \frac{1}{N} \sum_{N} (M_r - O_r)^2$$

$$\mathrm{FSS} = 1 - \frac{\mathrm{FBS}}{\frac{1}{N} \left[ \sum_{N} M_r^2 + \sum_{N} O_r^2 \right]}$$

Here, N is the total number of grid points within the evaluation domain, and Mr and Or represent the ratio of grid points exceeding a threshold to the total number of grid points within a given window size for the forecast and observation fields, respectively. First, we use a modified Brier score to compare the precipitation frequency between forecasts and observations, known as the Fractions Brier Score (FBS). Then, employing the variance skill score concept, we derive the Fractions Skill Score (FSS), which ranges from 0 to 1, where 0 indicates no match and 1 indicates a perfect match. FSS typically increases with larger neighborhood sizes. From the definitions of FBS and BIAS, it can be observed that if the BIAS within the given window is significantly greater or less than 1, the FBS value increases, leading to a lower FSS score. This indicates that FSS penalizes both under-prediction (BIAS<1) and over-prediction (BIAS>1).
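The FSS computation can be sketched with plain nested lists. The sketch assumes a square window of odd width and, at the domain edges, divides by the number of in-domain points only; other edge treatments (for example zero padding) are equally common, and the function names are hypothetical.

```python
def neighborhood_fractions(field, threshold, window):
    """Fraction of points >= threshold in a square window around each point."""
    rows, cols = len(field), len(field[0])
    half = window // 2
    binary = [[1 if value >= threshold else 0 for value in row] for row in field]
    fractions = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            i0, i1 = max(0, i - half), min(rows, i + half + 1)
            j0, j1 = max(0, j - half), min(cols, j + half + 1)
            count = sum(binary[a][b] for a in range(i0, i1) for b in range(j0, j1))
            out_row.append(count / ((i1 - i0) * (j1 - j0)))
        fractions.append(out_row)
    return fractions

def fss(forecast, observed, threshold, window):
    """Fractions Skill Score; NaN when neither field contains an event."""
    m = neighborhood_fractions(forecast, threshold, window)
    o = neighborhood_fractions(observed, threshold, window)
    pairs = [(mv, ov) for m_row, o_row in zip(m, o) for mv, ov in zip(m_row, o_row)]
    n = len(pairs)
    fbs = sum((mv - ov) ** 2 for mv, ov in pairs) / n
    reference = sum(mv ** 2 + ov ** 2 for mv, ov in pairs) / n
    if reference == 0.0:
        return float("nan")
    return 1.0 - fbs / reference
```

A rain cell displaced by one grid point scores 0 with a 1×1 window but a positive value with a 3×3 window, which illustrates why FSS is more tolerant of small position errors than pixel-wise scores.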
2.3.3 Continuous and structural metrics: RMSE and MS-SSIM
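To evaluate the overall prediction error in a continuous manner, we use the Root Mean Square Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$$

where $\hat{y}_i$ and $y_i$ are the forecast and observed precipitation at grid point $i$, and $N$ is the number of grid points.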
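In addition, to assess the spatial structural consistency between predicted and observed precipitation, we adopt the Multi-Scale Structural Similarity Index (MS-SSIM) (Wang et al., 2003, 2004). Unlike RMSE, which only reflects pixel-wise magnitude differences, MS-SSIM evaluates perceptual similarity in luminance, contrast, and spatial structure. It is particularly suited for high-resolution precipitation forecasts. The full formulation is:

$$\mathrm{MS\text{-}SSIM}(x, y) = \left[ l_M(x, y) \right]^{\alpha_M} \prod_{j=1}^{M} \left[ c_j(x, y) \right]^{\beta_j} \left[ s_j(x, y) \right]^{\gamma_j}$$

where x and y are the predicted and observed fields, j indexes the scale (typically M=5), and the exponents $\alpha_M$, $\beta_j$, and $\gamma_j$ set the relative weight of each component. The three image components (luminance, contrast, and structure) are defined as:

$$l(x, y) = \frac{2 \mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad c(x, y) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$$

where μx, μy are local means, σx, σy are standard deviations, and σxy is the covariance. Constants C1, C2, and C3 are small values to avoid instability.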
MS-SSIM values range from 0 to 1. A higher MS-SSIM value indicates stronger agreement in precipitation spatial structure and better preservation of morphology. MS-SSIM has been widely used in nowcasting and precipitation forecasting as a metric for evaluating the spatial quality of forecasts, and some studies have even applied it as a loss function to further improve model outputs (Yin et al., 2021; Tan et al., 2024). Compared to RMSE and BIAS, MS-SSIM provides a more perceptually aligned evaluation of spatial realism.
For the categorical and neighborhood-based evaluations (i.e., TS, POD, FAR, BIAS, and FSS), we uniformly apply four precipitation thresholds – 0.1, 10, 20, and 40 mm per 3 h – corresponding to light rain, moderate rain, heavy rain, and rainstorm events, respectively. These thresholds are used to assess the model's ability to detect and spatially represent different intensities of precipitation, and to ensure consistent and interpretable comparisons across all models and rainfall regimes.
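As an illustration of how the four thresholds enter the categorical scores, the sketch below computes TS, POD, FAR, and BIAS from a contingency table using their standard definitions. The function name, the dictionary output, and the use of "greater than or equal to" for exceedance are assumptions made for this example rather than details taken from the study.

```python
import numpy as np

# Thresholds in mm per 3 h: light rain, moderate rain, heavy rain, rainstorm.
THRESHOLDS_MM = (0.1, 10.0, 20.0, 40.0)


def categorical_scores(forecast, observed, threshold):
    """Standard contingency-table scores for one precipitation threshold."""
    f = np.asarray(forecast) >= threshold
    o = np.asarray(observed) >= threshold
    hits = int(np.sum(f & o))
    false_alarms = int(np.sum(f & ~o))
    misses = int(np.sum(~f & o))

    def ratio(num, den):
        # Undefined scores (empty denominator) are returned as NaN.
        return num / den if den > 0 else float("nan")

    return {
        "TS": ratio(hits, hits + misses + false_alarms),
        "POD": ratio(hits, hits + misses),
        "FAR": ratio(false_alarms, hits + false_alarms),
        "BIAS": ratio(hits + false_alarms, hits + misses),
    }


# Toy 2 x 3 fields in mm per 3 h.
fcst = np.array([[0.0, 12.0, 25.0], [45.0, 0.0, 5.0]])
obs = np.array([[0.0, 15.0, 8.0], [50.0, 11.0, 0.0]])

for thr in THRESHOLDS_MM:
    print(thr, categorical_scores(fcst, obs, thr))
```

A BIAS above 1 means the forecast exceeds the threshold at more grid points than observed (overforecasting), and a BIAS below 1 means the opposite.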
The statistical evaluation results on the test set are given below.
3.1 Overall performance evaluation
Figure 4 presents a comprehensive evaluation of six models over the rainy seasons of 2022, 2023, and 2024 for 3 h accumulated precipitation forecasts. The evaluation metrics include the Fractions Skill Score (FSS), Threat Score (TS), Probability of Detection (POD), False Alarm Ratio (shown as 1−FAR), and BIAS (scaled by 0.5 for visual consistency). These metrics are computed across four precipitation thresholds (0.1, 10, 20, and 40 mm), reflecting each model's performance in terms of spatial pattern reconstruction, intensity detection, and generalization stability across rainfall regimes. For detailed metric scores of each model across the rainy seasons of 2022–2024, please refer to the numerical tables provided in Appendix A.
Figure 4. Evaluation scores of different models during the (a) 2022, (b) 2023, and (c) 2024 rainy seasons for 3 h accumulated precipitation forecasts. Metrics include the Fractions Skill Score (FSS), Threat Score (TS), Probability of Detection (POD), False Alarm Ratio (shown as 1−FAR), and Bias score (scaled by 0.5 for visualization consistency). Evaluations are conducted at multiple precipitation thresholds (0.1, 10, 20, and 40 mm per 3 h) across six models: ECMWF, CMA-SH9, CMA-3KM, MSEM, FRNet, and GFRNet.
For numerical weather prediction (NWP) models, ECMWF demonstrates overall stability, particularly at the light precipitation threshold (0.1 mm), with relatively high FSS and TS, reflecting its strength in capturing large-scale weak precipitation. However, its performance deteriorates significantly for moderate to heavy precipitation (r3≥10 mm), where both POD and BIAS decline, indicating a systematic bias of overforecasting light rain and underforecasting heavy rainfall. CMA-SH9 exhibits a consistent overestimation tendency across all thresholds, likely due to an overly aggressive deep convection parameterization scheme. CMA-3KM benefits from higher spatial resolution, achieving better FSS and TS than the other NWPs, with relatively reasonable BIAS. Nevertheless, its skill in detecting heavy precipitation remains limited.
The MSEM model, a similarity-based ensemble constructed from multiple NWP forecasts, shows notable and stable improvements for light to moderate precipitation. Its FSS and TS at the 0.1 and 10 mm thresholds outperform all three NWP models. At 20 mm, MSEM maintains competitive TS and POD scores, slightly surpassing CMA-3KM, highlighting the effectiveness of its weighted integration strategy under moderate rainfall conditions. However, under the 40 mm threshold, MSEM's performance declines, constrained by the limitations of its base NWP forecasts. Its TS and FSS scores fall behind deep learning models, especially when the underlying NWP (e.g., CMA-3KM) struggles with heavy precipitation. BIAS analysis reveals a layered behavior: MSEM tends to overpredict light precipitation (0.1 mm), stays near 1 for moderate thresholds (10–20 mm), and underestimates heavy rainfall (r3≥40 mm), highlighting its limited responsiveness to extreme events.
Table 4. The RMSE and MS-SSIM of ECMWF, CMA-SH9, CMA-3KM, MSEM, FRNet, and GFRNet for 3-hourly precipitation predictions over the 2022–2024 rainy seasons. The best, second-best, and third-best scores for each metric are shown in bold, underlined, and italic, respectively.
FRNet consistently achieves the highest TS and POD scores across the three years, indicating strong detection ability for moderate to heavy precipitation. However, it suffers from systematically high BIAS across all thresholds – notably, BIAS values at 20 and 40 mm reached as high as 2.191 and 2.480 in 2024, respectively – indicating significant overforecasting. This issue leads to weaker FSS and 1-FAR compared to GFRNet. Its performance in 2024 deteriorates noticeably, suggesting reduced generalization across years.
GFRNet offers the most balanced performance across all evaluation metrics. Its FSS scores remain consistently superior across all precipitation levels. Although its TS scores in 2022 are slightly lower than FRNet's, GFRNet matches or outperforms FRNet in 2023 and 2024. Taking 2024 as an example, GFRNet achieves TS values of 0.237, 0.173, and 0.101 at the 10, 20, and 40 mm thresholds, representing relative improvements of 22.8 %, 38.4 %, and 46.4 % over CMA-3KM. The corresponding FSS scores are 0.570, 0.488, and 0.350, exceeding those of CMA-3KM by 9.6 %, 30.1 %, and 37.8 %, respectively. GFRNet also maintains reliable intensity estimation: its BIAS scores at the 20 and 40 mm thresholds are 1.067 and 0.911, much closer to 1 than those of FRNet and CMA-SH9, avoiding both systematic overprediction and underdetection.
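The percentage gains quoted above can be read as relative improvements, that is, 100 × (score − baseline) / baseline. The short sketch below, with hypothetical helper names, applies that reading to the 2024 TS values and back-computes the CMA-3KM scores they imply; the back-computed baselines are derived quantities for illustration, not values quoted from the paper's tables.

```python
def relative_improvement(score, baseline):
    """Percentage gain of `score` over `baseline`."""
    return 100.0 * (score - baseline) / baseline


def implied_baseline(score, improvement_pct):
    """Baseline score implied by a score and its quoted relative improvement."""
    return score / (1.0 + improvement_pct / 100.0)


# 2024 GFRNet TS values and quoted gains over CMA-3KM, keyed by threshold (mm per 3 h).
gfrnet_ts_2024 = {10: (0.237, 22.8), 20: (0.173, 38.4), 40: (0.101, 46.4)}

for threshold, (ts, gain_pct) in gfrnet_ts_2024.items():
    baseline = implied_baseline(ts, gain_pct)
    print(f"{threshold} mm: implied CMA-3KM TS = {baseline:.3f}")
```

The implied baselines come out at approximately 0.193, 0.125, and 0.069, which are clean three-decimal values and therefore consistent with the relative-improvement reading. It also shows that a large percentage gain at 40 mm corresponds to a small absolute TS difference.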
GFRNet's outstanding performance is attributed to its generative adversarial framework, which introduces a discriminator to enforce distributional similarity between the predicted and observed precipitation fields. This helps improve the structural realism of rare and intense convective rainfall events. More importantly, GFRNet exhibits minimal interannual fluctuations in FSS and TS, highlighting its robust generalization across years.
We further evaluate the temporal stability of model performance over different lead times (3–24 h) in the 2024 rainy season as shown in Fig. 5. All models exhibit decreasing FSS and TS with increasing lead time, consistent with the accumulation of forecast errors over time. GFRNet maintains consistently high FSS and TS across all lead times, especially for moderate and heavy rainfall. Its BIAS remains within a stable and reasonable range (0.8–1.2), indicating robust intensity estimation under varying temporal conditions.
Figure 5. Temporal evolution of (a) FSS, (b) TS, and (c) BIAS scores across different precipitation thresholds (0.1, 10, 20, and 40 mm per 3 h) during the 2024 rainy season. Results are shown for six models at 3–24 h lead times (3 h interval). GFRNet consistently maintains high FSS and TS with relatively stable BIAS across precipitation intensities and forecast ranges.
In contrast, NWP models (ECMWF, CMA-SH9, CMA-3KM) exhibit rapid performance degradation at longer lead times, especially for higher thresholds. FRNet performs well in TS for short lead times but consistently shows high BIAS (>1.5). As a result, its FSS scores under the 20 and 40 mm thresholds lag behind MSEM, and even fall below CMA-3KM during some lead hours (e.g., 15–21 h), suggesting limited structural reconstruction capability. MSEM retains advantages under light to moderate thresholds (0.1–20 mm) but is clearly outperformed at 40 mm, where it falls behind CMA-3KM, again reflecting its limited skill in extreme rainfall scenarios.
GFRNet stands out with consistently superior TS and FSS across precipitation thresholds and lead times, along with robust BIAS control. It exhibits the best overall generalization. FRNet, while strong in detection (TS), suffers from high BIAS and limited spatial accuracy. MSEM remains effective in moderate rainfall but lacks responsiveness in extreme cases.
To further evaluate each model's overall ability to capture precipitation intensity and spatial structure, Table 4 summarizes the RMSE and multi-scale structural similarity index (MS-SSIM) scores for all models during the rainy seasons of 2022–2024. ECMWF consistently achieves the lowest RMSE in 2022–2023, reflecting reliable average intensity forecasts in weak precipitation regimes. However, its MS-SSIM remains low (maximum of 0.693), indicating spatial structure mismatches under convective conditions. CMA-SH9 shows high RMSE and limited MS-SSIM improvements, consistent with its overestimation tendency and parameterization biases. CMA-3KM benefits from resolution (MS-SSIM is 0.754 in 2022) but suffers from poor RMSE in 2024 (3.652), suggesting accumulated errors in convective scenarios.
MSEM maintains second-tier RMSE across years, indicating robust average intensity control. However, its MS-SSIM consistently ranks low, suggesting poor spatial structure reconstruction. FRNet achieves strong MS-SSIM (0.754–0.784), reflecting its ability to represent convective-scale structures, but suffers from a high RMSE, consistent with its high BIAS and overforecasting. GFRNet demonstrates the best balance: top MS-SSIM scores in all three years and competitive RMSE (2022–2024: 2.264–2.857), validating its ability to capture both spatial structure and precipitation magnitude, with strong generalization capability.
3.2 Spatial performance analysis
To further evaluate the spatial performance of different models, Fig. 6 presents the spatial distribution of FSS scores for each model during the 2024 rainy season across four precipitation thresholds (0.1, 10, 20, and 40 mm). Figure 7 shows the FSS improvement of GFRNet relative to the other reference models.
Figure 6. Spatial distribution of Fractions Skill Score (FSS) during the 2024 rainy season across four precipitation thresholds (0.1, 10, 20, and 40 mm per 3 h, from top to bottom). The six columns correspond to ECMWF, CMA-SH9, CMA-3KM, MSEM, FRNet, and GFRNet. Higher FSS values indicate better spatial consistency between predictions and observations. The black lines show 500 m elevation contours.
Figure 7. Spatial distribution of FSS improvement by GFRNet over other models during the 2024 rainy season. Each column represents the FSS difference between GFRNet and a baseline model (from left to right: ECMWF, CMA-SH9, CMA-3KM, MSEM, and FRNet). Each row corresponds to a rainfall threshold (0.1, 10, 20, and 40 mm per 3 h, from top to bottom). Red areas indicate regions where GFRNet outperforms the corresponding model, while blue areas indicate regions where it underperforms. The black lines show 500 m elevation contours.
At the light precipitation threshold of 0.1 mm, all models exhibit relatively high FSS values with spatially uniform distributions and minor inter-model differences. ECMWF shows slightly lower scores over the Taihang Mountains, while the CMA series and deep learning models (FRNet and GFRNet) perform more stably, indicating that all models effectively capture light rainfall patterns.
As the threshold increases to 10 mm (moderate rain), spatial performance differences among models become more pronounced. In general, NWP models achieve higher FSS scores in plains and coastal regions than in mountainous areas, highlighting the challenge of modeling precipitation over complex terrain. ECMWF maintains competitive performance in the Taihang Mountain region, and CMA-3KM and CMA-SH9 show stable performance over the North China Plain and Shandong Peninsula. They also perform well over the gently varying topography in the northeastern highlands (upper-right region of the figures).
At the 20 mm threshold (heavy rain), spatial differentiation becomes more significant. ECMWF retains relatively high scores in central regions but degrades elsewhere. CMA-SH9 and CMA-3KM continue to show stable performance in the central–eastern regions, reflecting their ability to capture mesoscale precipitation structures. In contrast, MSEM and deep learning models begin to demonstrate advantages, effectively fusing multi-source forecast information to enhance the representation of localized heavy precipitation.
Under the 40 mm threshold (rainstorm), the spatial coverage of high FSS scores from NWP models shrinks noticeably, and the well-performing regions become sparse. Overall FSS values drop significantly, indicating the persistent challenges these models face in forecasting extreme rainfall events. In contrast, GFRNet maintains relatively high scores across multiple key regions, particularly in southern Hebei, western Shandong, and the eastern foothills of Shanxi, suggesting robust spatial generalization and capability for modeling high-impact events.
From the perspective of multi-model integration, MSEM, FRNet, and GFRNet all demonstrate the ability to leverage NWP guidance at moderate to heavy rainfall levels. However, MSEM tends to be more conservative as precipitation intensity increases, and FRNet shows diminished learning capability under extreme events. GFRNet, by contrast, consistently exhibits superior spatial adaptability and integration capability across all rainfall thresholds, with notably higher FSS scores at 10, 20, and 40 mm compared to other models.
The FSS improvement maps in Fig. 7 further illustrate the spatial regions where GFRNet improves over the reference models. At thresholds above 10 mm, GFRNet shows substantial positive gains (highlighted in red) across key areas such as most of Shandong, eastern and central Hebei, the eastern foothills of the Taihang Mountains, and gently elevated plateau regions. Some of these areas are known to be particularly challenging for NWP models due to their complex precipitation structures and less reliable forecasts. Notably, even in regions where ECMWF and the CMA models already perform well, GFRNet still provides consistent gains, underscoring its robustness and adaptability under diverse geographical conditions.
3.3 Case studies of heavy rainfall events
3.3.1 Heavy rainfall event on 5 July 2022
On 5 July 2022, a significant precipitation event impacted southern Hebei and western Shandong in North China. This event was associated with the interaction between a weakening tropical cyclone (the remnants of Typhoon Chaba) and an upper-level trough. The rainfall exhibited both convective and stratiform characteristics and was distributed across a broad region, posing substantial challenges for accurate forecasting, particularly regarding the initiation and development of convective systems.
To comprehensively evaluate model performance, we analyzed the 3-hourly accumulated precipitation forecasts from +3 to +24 h lead times (Fig. 8) and compared the results using standard verification metrics (Fig. 9).
Figure 8. Precipitation forecasts of all models initialized at 00:00 UTC on 5 July 2022. Panels show 3 h accumulated precipitation at +3 to +24 h lead times from observations (CMPAS) and six forecast models (ECMWF, CMA-SH9, CMA-3KM, MSEM, FRNet, and GFRNet). This event was associated with a weakening tropical cyclone (the remnants of Typhoon Chaba) and an upper-level trough over North China, resulting in a widespread heavy rainfall event affecting parts of Shandong and Hebei provinces.
Figure 9. Verification scores of all models for the precipitation event on 5 July 2022, evaluated over four thresholds (0.1, 10, 20, and 40 mm per 3 h). Metrics include the Fractions Skill Score (FSS), Threat Score (TS), and Bias Score (BIAS). To maintain visual comparability across different metrics, BIAS values are scaled by a factor of 0.5 (i.e., BIAS/2 is shown). For BIAS, a value closer to 1 indicates better performance.
In the early forecast stages (from +3 to +12 h), observations indicated scattered convective rainfall over southwestern Shandong and southern Hebei. These localized convective cells gradually evolved into a narrow southwest–northeast-oriented rainband. GFRNet effectively captures the core locations and general evolution of the precipitation at this early stage. In contrast, ECMWF and MSEM significantly underestimated both the intensity and spatial extent of the rainfall. CMA-SH9 and CMA-3KM produced reasonable forecasts but exhibited slight spatial deviations in the initial precipitation patterns.
From +15 h onward, the rainfall band intensified and propagated northeastward. By +21 and +24 h, it split into two distinct clusters, forming a clear double-center structure. Most models captured this structural evolution to varying degrees. ECMWF exhibited a delayed response to moderate and heavy rainfall, failing to forecast the northern rain cluster but reasonably predicting southern rainfall at later lead times (e.g., +18 to +24 h), albeit with weakened intensity. CMA-SH9 and CMA-3KM effectively reproduced the spatial distribution and evolution of the two rainfall centers, although both models showed positional shifts relative to observations. MSEM provided spatially smoothed forecasts with moderate accuracy but relatively low TS and FSS scores. FRNet precisely located the southern rainband but tended to over-smooth rainfall structures, leading to higher false alarm rates and limited generalization at higher thresholds. GFRNet consistently captured both the spatial structure and intensity evolution, accurately reproducing the development of the double-center rainfall pattern and achieving improved spatial fidelity.
It is important to note that GFRNet's forecasting capability is built upon the input guidance of three NWP models. For instance, the ECMWF forecast provided a reasonably accurate depiction of the southern rainfall location after +18 h, but with substantially weaker intensity. In contrast, CMA-SH9 and CMA-3KM both provided relatively better predictions of intensity but with moderate spatial shifts. GFRNet effectively leveraged the complementary strengths of these models by dynamically learning and integrating both spatial and intensity-related features. This fusion mechanism allowed GFRNet to provide a more accurate and balanced forecast, particularly for the evolving rainband structures at later lead times.
The verification scores presented in Fig. 9 further support these findings. GFRNet achieved the highest TS and FSS values across all precipitation thresholds, particularly at 20 and 40 mm per 3 h, indicating its advantage in forecasting heavy rain and rainstorm events. Its BIAS values remained close to 1 (noting that BIAS/2 is shown in the figure), suggesting balanced precipitation intensity forecasts. In contrast, CMA-SH9 and FRNet showed a stronger tendency toward overestimation at higher thresholds, while ECMWF consistently underestimated precipitation.
Overall, this case study demonstrates that GFRNet is capable of accurately forecasting both the spatial distribution and intensity of precipitation in a challenging convective environment. This capability stems from its dynamic assimilation and integration of multi-source NWP information, yielding improvements over traditional NWP models and baseline deep learning approaches.
3.3.2 Organized rainstorm over Beijing–Tianjin–Hebei region on 25 July 2024
This case focuses on a typical summer heavy precipitation event that occurred between 00:00 UTC on 24 July and 00:00 UTC on 25 July 2024, significantly impacting the Beijing–Tianjin–Hebei region. The event was characterized by high organization, abrupt onset, and extremity. The formation of this heavy rainfall process was driven by a combination of favorable large-scale conditions: persistent control of the subtropical high, continuous moisture transport from the outer circulation of Typhoon Gaemi, the eastward progression of a mid-latitude trough, the presence of a low-level shear line, and orographic lifting associated with the Taihang and Yanshan Mountains. The 5880 gpm ridge of the subtropical high remained quasi-stationary over northern China, facilitating sustained moisture accumulation and convective instability. Meanwhile, southeasterly flow at 850 hPa, with specific humidity values reaching 16–18 g kg⁻¹, provided abundant water vapor from the East China Sea. The superposition of the mid-level trough, low-level convergence, and orographic forcing contributed to the rapid development and structural organization of the convection.
Figure 10. Precipitation forecasts of all models initialized at 00:00 UTC on 25 July 2024. Panels show 3 h accumulated precipitation at +3 to +24 h lead times from observations (CMPAS) and six forecast models (ECMWF, CMA-SH9, CMA-3KM, MSEM, FRNet, and GFRNet).
The evolution of the precipitation system can be broadly divided into two stages: the early stage (+03 to +12 h) was dominated by scattered deep convection, while the later stage (+18 to +24 h) transitioned into a well-organized, banded precipitation system. Observations show that by +03 h, multiple localized heavy rainfall centers emerged in southeastern Hebei and northeastern parts of the domain. By +06 h, two convective cells developed in eastern Hebei, which further organized into two southwest–northeast (SW–NE) oriented narrow rainbands by +09 h. At +12 h, the western band weakened, and the eastern band moved offshore. By +15 h, the heavy rainfall temporarily ceased. Subsequently, from +18 to +24 h, new convective cells developed rapidly over northeastern Hebei, Tianjin, and western Shandong. By +24 h, these cells merged to form a prominent SW–NE oriented rainband spanning multiple provinces, illustrating the high degree of organization and rapid evolution of this event (Fig. 10).
During this process, numerical weather prediction (NWP) models exhibited stage-dependent performance. In the early phase characterized by scattered convection, ECMWF forecasts generally underestimated precipitation intensity and failed to capture convective development. In contrast, CMA-SH9 and CMA-3KM tended to overestimate rainfall intensity and exhibited substantial spatial biases, with premature development and false alarms in certain regions. In the later stage with more organized precipitation, all three NWP models predicted the emergence of the SW–NE oriented rainband as early as +15 h, while in reality the structure was not observed until after +21 h. This indicates a common “premature triggering” issue in system-scale precipitation forecasting. Among them, ECMWF provided more accurate spatial placement but underestimated intensity, while CMA-SH9 and CMA-3KM captured stronger precipitation but suffered from high bias and spatial overextension.
Figure 11. Verification scores of all models for the precipitation event on 25 July 2024, evaluated over four thresholds (0.1, 10, 20, and 40 mm per 3 h). Metrics include the Fractions Skill Score (FSS), Threat Score (TS), and Bias Score (BIAS). To maintain visual comparability across different metrics, BIAS values are scaled by a factor of 0.5 (i.e., BIAS/2 is shown). For BIAS, a value closer to 1 indicates better performance.
The MSEM method, based on weighted ensemble integration of three NWP models using inter-model similarity, demonstrated robust performance on light and moderate precipitation scenarios. It achieved the highest FSS scores at the 0.1, 10, and 20 mm thresholds (Fig. 11), highlighting the advantage of ensemble averaging in mitigating individual model biases under moderate conditions. However, due to the absence of structural correction and nonlinear representation capabilities, MSEM significantly underestimated extreme rainfall, with notably lower TS and FSS scores at the 40 mm threshold, revealing limited generalization to high-impact events.
FRNet outperformed most NWPs in terms of TS scores, yet it exhibited a clear tendency toward systematic overprediction. Strong and extensive rainfall belts emerged as early as +06 and +09 h, with further intensification during +21 and +24 h. These features led to a substantial positive bias, indicating excessive spatial coverage and rainfall intensity. While FRNet enhanced structural representation, it lacked sufficient physical constraints on extreme rainfall, making it prone to overfitting under severe weather conditions.
GFRNet, by dynamically integrating ECMWF's strength in spatial placement with CMA-type models' responsiveness to heavy rainfall, achieved more balanced and physically realistic forecasts. It accurately captured the dual-band structure at +06 and +09 h, and its eastern rainband at +12 h closely matched observations. Although residual spurious rainfall persisted at +15 h, it was markedly weaker than in other models. During the +21 to +24 h period, GFRNet effectively reconstructed the newly formed main rainband, both in structure and intensity, without the inflated patterns observed in FRNet. As shown in Fig. 11, GFRNet achieved the highest TS and FSS scores at the 40 mm threshold, while maintaining a near-unity BIAS value, validating its generalization capability and forecast stability in extreme rainfall scenarios. However, similar to the NWPs, GFRNet also exhibited a tendency to predict the emergence of organized rainbands prematurely (around +15 h), indicating that its temporal modeling still inherits timing biases from the input NWP forecasts and thus requires further refinement.
This case study highlights the key characteristics, strengths, and limitations of the different models under a complex extreme rainfall scenario, and demonstrates the enhanced structural and intensity prediction capabilities of GFRNet under a multi-source fusion framework.
3.4 Statistical significance testing and sample stratification analysis
In the overall statistical evaluation of the verification metrics, the deep learning models and MSEM consistently outperformed the numerical models across multiple precipitation thresholds. However, to further verify whether these performance differences are statistically significant and to clarify their primary sources, we conducted a statistical significance analysis. In precipitation forecast evaluation, aggregate scores are often dominated by a large number of weak-precipitation samples. These weak samples typically consist of only scattered or marginal precipitation pixels; although numerous, they contribute little to disaster prevention and mitigation and can therefore “dilute” the models' demonstrated performance in key precipitation events (e.g., organized rainbands and frontal rainfall systems).
To more precisely assess model skill in high-impact precipitation scenarios, we stratified the analysis into two levels:
- All-sample set: includes all samples with at least one pixel exceeding the specified precipitation threshold.
- Top 10 % coverage subset: within the all-sample set, samples were ranked by the number of pixels exceeding the threshold, and the top 10 % were selected. These samples represent cases with the largest precipitation coverage and clearest signals at each threshold, providing a focused evaluation of model performance in major precipitation events.
For the 0.1, 10, 20, and 40 mm thresholds, the all-sample set contained 3471, 2812, 2435, and 1813 samples, respectively, with corresponding Top 10 % subsets of 347, 281, 243, and 181 samples. Paired t tests were applied to the scores within each sample set to assess statistical significance between models, and significance levels (p<0.05, 0.01, 0.001) were annotated in the boxplots with star markers.
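The stratification and paired comparison described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' analysis code: the function names are hypothetical, the subset size is assumed to be rounded down (which reproduces the reported counts), and only the paired t statistic is computed here.

```python
import math

import numpy as np


def top_coverage_subset(coverage, percent=10):
    """Indices of the `percent` % of samples with the most pixels above threshold.

    Assumption: the subset size is rounded down (3471 samples -> 347).
    """
    coverage = np.asarray(coverage)
    k = len(coverage) * percent // 100
    order = np.argsort(coverage)[::-1]  # largest coverage first
    return order[:k]


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-sample score differences (n - 1 degrees of freedom)."""
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return float(d.mean() / (d.std(ddof=1) / math.sqrt(d.size)))


def significance_stars(p_value):
    """Star annotation matching the levels p < 0.05, 0.01, and 0.001."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


# Subset sizes implied by the reported all-sample counts.
for n in (3471, 2812, 2435, 1813):
    print(n, "->", n * 10 // 100)

# Toy per-sample TS values for two models on the same four samples.
ts_model_a = [0.30, 0.42, 0.25, 0.51]
ts_model_b = [0.20, 0.30, 0.20, 0.40]
print(paired_t_statistic(ts_model_a, ts_model_b))
```

The p-value follows from the Student t distribution with n − 1 degrees of freedom; where SciPy is available, `scipy.stats.ttest_rel(ts_model_a, ts_model_b)` returns both the statistic and the p-value. The test is paired because every model is scored on the same samples, so it is the per-sample differences that are tested.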
Figure 12 presents the TS (Threat Score) distributions for the models at thresholds of 0.1, 10, 20, and 40 mm, for both the all-sample and Top 10 % subsets, along with the corresponding significance annotations. At the light precipitation level (≥0.1 mm), MSEM showed the most stable performance, with TS scores clearly higher than most models, followed by ECMWF. FRNet and GFRNet exhibited no obvious advantage at this level, suggesting that their generalization in weak-precipitation contexts remains limited.
Figure 12. Boxplots of TS distributions for all models at four precipitation thresholds (0.1, 10, 20, and 40 mm). Panels (a), (c), (e) and (g) show results for the all-sample set, including all samples containing at least one pixel above the given threshold. Panels (b), (d), (f) and (h) present results for the Top 10 % coverage subset, defined as the top 10 % of samples ranked by the number of pixels exceeding the threshold, representing organized high-impact rainfall events. Stars indicate significance levels from paired t tests between models (*: p<0.05, **: p<0.01, ***: p<0.001). Compared to NWPs and MSEM, GFRNet shows statistically significant advantages in the Top 10 % subset for 20–40 mm thresholds, while differences are less pronounced for light precipitation cases.
For moderate and heavy rainfall, GFRNet already demonstrated superior performance over the NWP models and MSEM in the all-sample set, and this advantage became even more pronounced in the Top 10 % subset. FRNet also showed significant gains in the Top 10 % subset, but for heavy rainfall in the all-sample set, its TS median and upper quartile were slightly lower than those of CMA-3KM, indicating that its higher TS scores largely stem from a small number of high-impact samples.
At the extreme rainfall threshold (≥40 mm), all models performed poorly for dispersed strong-precipitation samples. CMA-3KM, benefiting from its higher resolution, performed slightly better than others in the all-sample set. However, in the Top 10 % subset (representing typical organized extreme rainfall events), FRNet and GFRNet achieved median and upper-quartile TS values significantly higher than those of the CMA models and MSEM, underscoring their advantage in organized extreme rainfall situations.
From this analysis, several important insights emerge. Firstly, in comparing the three NWP models, we find that for moderate and heavy rainfall events, CMA-3KM consistently demonstrates stable and superior performance across both the all-sample and Top 10 % subsets, underscoring the robustness and advantages of high-resolution mesoscale modeling, with CMA-SH9 ranking second. In contrast, ECMWF shows markedly weaker forecast skill for rainfall above 20 mm, highlighting the limitations of the global model in these scenarios. Secondly, deep learning models show a clear advantage for organized extreme rainfall. GFRNet's strong performance in the Top 10 % subset suggests that its generative fusion strategy is particularly effective for complex precipitation processes, such as typhoon rainbands and Mei-yu fronts, leading to substantial improvements in these high-impact scenarios. However, dispersed heavy rainfall remains challenging. In diverse, complex background conditions, FRNet has not demonstrated consistent advantages, and GFRNet also shows weaknesses in handling weak-signal samples, even in extreme rainfall cases (≥40 mm). This suggests that future work should target weak-precipitation scenarios with dedicated training strategies or loss function designs, to prevent the model from becoming overly “aggressive” or neglecting weak signals.
Furthermore, mean scores can obscure differentiated capabilities. Model evaluation should not rely solely on overall scores; it is essential to incorporate analyses of the Top 10 % subset to avoid average statistics masking a model's true strengths in high-impact events. Differentiating between the all-sample and Top 10 % subsets helps diagnose the model's “core capability”. Finally, future development should consider incorporating additional thermodynamic and dynamic predictors to enable the model to better characterize precipitation generation when NWP guidance is weak or fails, thereby enhancing its generalization in weak-precipitation scenarios.
This study developed the GFRNet model, a generative adversarial network (GAN)-based framework designed to produce 3-hourly quantitative precipitation forecasts (QPFs) for North China up to 24 h ahead. GFRNet ingests forecasts from one global model (ECMWF) and two regional models (CMA-SH9 and CMA-3KM) as inputs, and employs a tailored sampling strategy alongside a weighted loss function to improve model training efficiency and address precipitation's long-tailed distribution. We systematically evaluated GFRNet's performance for the 2022, 2023, and 2024 summer rainy seasons, comparing it against three NWP models (ECMWF, CMA-SH9, CMA-3KM), a similarity-based ensemble approach (MSEM), and a non-generative deep learning baseline (FRNet). Across three independent rainy seasons, GFRNet demonstrated consistently superior spatial structure reconstruction and robust intensity control, achieving the highest FSS and MS-SSIM scores together with TS scores comparable to or better than those of the other approaches and competitive RMSE, highlighting its strong generalization capability and operational applicability. FRNet exhibited strong detection of heavy precipitation but suffered from high BIAS and weaker generalization, while MSEM performed well in moderate rainfall but deteriorated in extreme precipitation conditions.
The comparative analyses highlight phase-dependent strengths and weaknesses of the NWP models. For scattered, convective precipitation, all models exhibited notable spatial displacement and intensity biases; the higher-resolution CMA-3KM captured localized convection more effectively but remained prone to false triggers and overestimation. In contrast, during organized rainfall events, NWPs captured banded structures relatively well, yet often predicted them too early or too late and with biased intensity. Among the NWP models, CMA-3KM consistently delivered the most reliable performance across both scattered and organized rainfall, demonstrating the value of high-resolution regional models; CMA-SH9 followed, while ECMWF underperformed for events exceeding 20 mm but retained relatively accurate positional guidance due to its large-scale control fields. Compared to the NWPs, GFRNet delivered improvements across moderate to heavy rainfall events, with particularly clear gains for organized high-impact rainfall, demonstrating that its generative fusion mechanism is especially well suited for complex systems (e.g., typhoon outer rainbands and Mei-yu fronts). However, its performance in non-organized, highly localized heavy rainfall remains an area for improvement.
The sources of model skill provide further insights. MSEM, a similarity-weighted ensemble method, showed stable performance in the 0.1–20 mm range but exhibited substantially poorer skill for ≥40 mm events due to its lack of structural correction and nonlinear representation, limiting its generalization to extreme precipitation. FRNet, trained with content-based loss optimization, primarily focused on systematic rainfall events, often adopting an “aggressive fusion” strategy that over-predicted rainfall to maximize detection. While this yielded higher POD and TS scores, it resulted in systematic overforecasting (elevated BIAS) and distorted precipitation structures, and gave insufficient attention to rare, localized heavy rainfall. GFRNet, by introducing adversarial training, reinforced learning of realistic spatial structures and dynamically fused complementary strengths of multiple NWPs, reducing false precipitation while retaining fine-scale structure. This contributed to its stronger spatial fidelity and generalization, which remained stable even on independent 2023–2024 rainy season data. Spatial FSS maps further confirmed this: GFRNet consistently improved FSS both in areas where NWPs struggled and where they already performed well.
Despite these advances, several limitations and future research directions emerge. First, the evaluation metric system warrants refinement. While the Fractions Skill Score (FSS) better reflects spatial displacement and structure errors than TS, it can over-reward overly smooth precipitation fields, potentially distorting assessments. A multidimensional metric framework integrating pixel-wise accuracy with structural fidelity would provide more robust evaluations. Furthermore, aggregate metrics can mask model-specific strengths and weaknesses: averaged scores risk “diluting” performance in high-impact rainfall scenarios. Evaluations should explicitly distinguish between all samples and subsets representing the most influential rainfall events to better diagnose model “core competence”.
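The displacement tolerance of the FSS can be made concrete with a small sketch following the definition of Roberts and Lean (2008). It is an illustration rather than the verification code used here: the function names, the square neighborhood with zero padding at the domain edge, and the "greater than or equal to" exceedance convention are assumptions.

```python
def neighborhood_fractions(binary, n):
    """Fraction of exceedance points in the n x n window (n odd) centred on
    each grid point; points outside the domain are counted as zero."""
    rows, cols = len(binary), len(binary[0])
    half = n // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        total += binary[ii][jj]
            out[i][j] = total / (n * n)
    return out


def fss(forecast, observed, threshold, n):
    """Fractions Skill Score for one threshold and one neighborhood size."""
    fb = [[1 if v >= threshold else 0 for v in row] for row in forecast]
    ob = [[1 if v >= threshold else 0 for v in row] for row in observed]
    pf = neighborhood_fractions(fb, n)
    po = neighborhood_fractions(ob, n)
    # The 1/N factors of the two mean-square terms cancel in the ratio.
    mse = sum((a - b) ** 2 for rf, ro in zip(pf, po) for a, b in zip(rf, ro))
    ref = sum(a * a for r in pf for a in r) + sum(b * b for r in po for b in r)
    return 1.0 - mse / ref if ref > 0 else float("nan")


# A single 25 mm cell, forecast one grid point east of where it was observed.
observed = [[0.0] * 5 for _ in range(5)]
forecast = [[0.0] * 5 for _ in range(5)]
observed[2][2] = 25.0
forecast[2][3] = 25.0
fss_point = fss(forecast, observed, threshold=20.0, n=1)
fss_neigh = fss(forecast, observed, threshold=20.0, n=3)
```

With `n=1` the displaced cell scores zero, exactly as a point-wise score would; with `n=3` the overlapping neighborhoods give partial credit. The same spatial averaging is why broad, smooth fields can score well on FSS even when fine structure is missing.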
Second, the current loss functions remain pixel-level MSE and MAE variants. Even with weighting schemes, they tend to neglect weak, isolated rainfall signals during training, leading models to either over-aggressively forecast or under-represent weak signals. Tailored loss functions and training strategies are required to address this gap.
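A generic intensity-weighted pixel loss of the kind discussed here can be sketched as follows. This is not the loss used to train GFRNet: the threshold/weight pairs, the equal mixing of squared and absolute error, and the function names are hypothetical choices for illustration.

```python
# Hypothetical (threshold in mm per 3 h, weight) pairs; not the paper's values.
WEIGHT_BINS = ((0.1, 1.0), (10.0, 2.0), (20.0, 5.0), (40.0, 10.0))


def rain_weight(obs_mm, bins=WEIGHT_BINS):
    """Weight of one pixel: that of the highest threshold the observation reaches."""
    weight = 1.0
    for threshold, w in bins:
        if obs_mm >= threshold:
            weight = w
    return weight


def weighted_mse_mae(pred, obs):
    """Mean over pixels of weight * (squared error + absolute error)."""
    if len(pred) != len(obs) or not obs:
        raise ValueError("pred and obs must be non-empty and of equal length")
    total = 0.0
    for p, o in zip(pred, obs):
        err = p - o
        total += rain_weight(o) * (err * err + abs(err))
    return total / len(obs)


obs = [0.0, 45.0]
loss_light_error = weighted_mse_mae([2.0, 45.0], obs)  # 2 mm error on the dry pixel
loss_heavy_error = weighted_mse_mae([0.0, 43.0], obs)  # 2 mm error on the 45 mm pixel
```

The same 2 mm error costs ten times more on the heavy-rain pixel than on the dry one. The weight depends only on the observed amount, so weak, isolated signals receive little emphasis, which is the limitation noted above.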
Third, while GFRNet nonlinearly integrates multi-NWP information, its performance is ultimately constrained by NWP guidance. In cases where none of the three NWP models capture precipitation, GFRNet similarly fails to reconstruct realistic structures, illustrating that current deep learning post-processing is still largely dependent on the underlying NWPs. To overcome this limitation, future steps include introducing more thermodynamic and dynamic variables (e.g., temperature, humidity, wind fields, geopotential height) as auxiliary inputs, enabling the model to directly learn the complex nonlinear relationships between physical factors and precipitation generation, thereby enhancing its capabilities in forecasting nascent convection and systematic organizational structures.
Finally, limitations inherent to GAN-based frameworks merit attention. GAN training can suffer from mode collapse, instability, and difficulty learning rare-event distributions, which are critical for extreme rainfall. Future research could leverage diffusion models – next-generation generative frameworks that have demonstrated superior stability and distribution learning in image reconstruction and remote sensing. Conditional diffusion models, which iteratively “denoise” toward realistic outputs under NWP constraints, could gradually generate refined precipitation fields and naturally support probabilistic outputs, enabling uncertainty quantification. Hybrid GAN–diffusion architectures may balance the efficiency of GANs with the stability of diffusion models, improving realism without compromising speed.
It is worth noting that, although MSEM provides a strong and interpretable baseline for multi-model fusion, this study has not yet carried out a systematic comparison against a broader set of classical statistical or regression-based methods (e.g., locally weighted regression, analogue techniques, or topography-aware interpolation schemes). A more comprehensive benchmark including such lower-complexity and more transparent approaches would further clarify the practical added value of GAN-based post-processing. We regard this as an important direction for future work, particularly in the context of operational implementation and user-facing interpretability.
In summary, GFRNet illustrates the potential of generative modeling for precipitation correction, delivering marked gains for organized and high-impact rainfall events. Future work combining more informative physical predictors (e.g., thermodynamic and dynamic fields), advanced generative architectures (e.g., conditional diffusion), and probabilistic output frameworks offers a clear path to further advance GFRNet's capabilities, enhancing its ability to forecast a wider range of rainfall types and intensities with greater fidelity.
This appendix provides detailed evaluation scores for model performance during the 2022, 2023, and 2024 rainy seasons. The tables present verification results for 3 h precipitation forecasts using a set of categorical and neighborhood-based metrics: Threat Score (TS), Bias Score (BIAS), False Alarm Ratio (FAR), Probability of Detection (POD), and Fractions Skill Score (FSS). Evaluations are performed at four thresholds: 0.1, 10, 20, and 40 mm per 3 h, corresponding to light, moderate, heavy, and extreme rainfall. For each metric, the best and second-best scores are shown in bold and underlined text, respectively.
Table A1. 2022 rainy season: evaluation results of ECMWF, CMA-SH9, CMA-3KM, MSEM, FRNet, and GFRNet for r3 prediction. TS, BIAS, FAR, POD, and FSS are listed. For each metric, the best, second-best, and third-best scores are highlighted using bold, underline, and italic font styles, respectively.
Table A2. 2023 rainy season: evaluation results of ECMWF, CMA-SH9, CMA-3KM, MSEM, FRNet, and GFRNet for r3 prediction. FSS, TS, BIAS, FAR, and POD are listed. For each metric, the best, second-best, and third-best scores are highlighted using bold, underline, and italic font styles, respectively.
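The categorical metrics listed above follow from the standard 2×2 contingency table of hits, misses, and false alarms. The sketch below uses the textbook definitions. The function name, the flattened grid-point lists, and the "greater than or equal to" exceedance convention are assumptions and do not reproduce the evaluation scripts of this study.

```python
def categorical_scores(forecast, observed, threshold):
    """TS, BIAS, FAR and POD for one threshold from paired grid-point values."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_yes, o_yes = f >= threshold, o >= threshold
        if f_yes and o_yes:
            hits += 1
        elif o_yes:
            misses += 1
        elif f_yes:
            false_alarms += 1

    def ratio(num, den):
        # Undefined scores (empty denominator) are returned as NaN.
        return num / den if den else float("nan")

    return {
        "TS": ratio(hits, hits + misses + false_alarms),
        "BIAS": ratio(hits + false_alarms, hits + misses),
        "FAR": ratio(false_alarms, hits + false_alarms),
        "POD": ratio(hits, hits + misses),
    }


# Six grid points, verified at the 10 mm threshold (synthetic values).
forecast = [12.0, 0.0, 15.0, 3.0, 25.0, 0.0]
observed = [11.0, 10.5, 2.0, 0.0, 30.0, 0.0]
scores = categorical_scores(forecast, observed, threshold=10.0)
```

Here there are two hits, one miss, and one false alarm, so TS is 0.5 and BIAS is 1.0. An unbiased event frequency does not imply correct placement, which is why TS, FAR, and POD are reported alongside BIAS.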
B1 Training process ablation analysis
Ablation experiments were conducted to assess the effects of two architectural components in GFRNet: the Squeeze-and-Excitation (SE) block and the weighted loss function. Specifically, the performance of GFRNet was compared with that of a variant without SE blocks (GFRNet_wo_SE) and another using standard MSE/MAE loss instead of the weighted loss (GFRNet_wo_WeightedLoss). The results are summarised in Table B1.
The SE blocks were found to have limited influence on light rain prediction, as reflected by the comparable TS scores of GFRNet and GFRNet_wo_SE (0.406 vs. 0.408). However, for thresholds of 10 mm and above, GFRNet consistently outperformed the variant without SE blocks. For instance, at the 20 and 40 mm thresholds, the TS scores increased from 0.134 and 0.052 to 0.145 and 0.056, respectively. These results suggest that SE blocks play a notable role in capturing the structural details associated with heavier precipitation. In contrast, the use of standard loss functions led to improved TS scores for light rain (0.431 vs. 0.406), indicating better performance in this regime. Nevertheless, for higher thresholds, the weighted loss function significantly enhanced model accuracy. At the 20 and 40 mm thresholds, the TS scores of GFRNet_wo_WeightedLoss dropped to 0.115 and 0.028, compared to 0.145 and 0.056 for GFRNet. This demonstrates the effectiveness of the weighted loss in improving the model's sensitivity to moderate and heavy rainfall.
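The TS values quoted above can be converted to relative changes to compare the two components on a common scale. The snippet below is simple arithmetic on those quoted numbers. The dictionary layout and the helper `relative_gain` are illustrative, and the 10 mm threshold is omitted because its values are not quoted in the text.

```python
# TS values quoted in the text (thresholds in mm per 3 h).
ts = {
    "GFRNet": {0.1: 0.406, 20: 0.145, 40: 0.056},
    "GFRNet_wo_SE": {0.1: 0.408, 20: 0.134, 40: 0.052},
    "GFRNet_wo_WeightedLoss": {0.1: 0.431, 20: 0.115, 40: 0.028},
}


def relative_gain(full, variant):
    """Percent change in TS of the full model relative to an ablated variant."""
    return 100.0 * (full - variant) / variant


gains = {
    name: {thr: relative_gain(ts["GFRNet"][thr], value) for thr, value in scores.items()}
    for name, scores in ts.items()
    if name != "GFRNet"
}

for name, by_threshold in gains.items():
    for thr, gain in by_threshold.items():
        print(f"{name:>24s}  {thr:>4} mm: {gain:+6.1f} %")
```

By this measure, the SE blocks raise TS by roughly 8 % at both the 20 and 40 mm thresholds, whereas the weighted loss raises it by about 26 % at 20 mm and doubles it at 40 mm, at the cost of about 6 % at 0.1 mm.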
In summary, both SE blocks and the weighted loss function are essential to GFRNet's performance in forecasting moderate to heavy precipitation. The SE blocks enhance spatial feature representation, while the weighted loss strengthens the model's focus on high-impact events. These findings confirm the utility of the proposed components in improving the robustness and accuracy of precipitation forecasts.
B2 Input source contribution analysis
We conducted a series of ablation experiments to systematically evaluate the contribution of each input source to the precipitation forecasting performance of GFRNet. The influence of each input was quantified using a Relative Importance Score (RIS), which is defined as follows:
Table B1. TS scores for different rain thresholds in the block ablation experiments. Note: the best score is indicated in bold and the second-best score is underlined.
The TS scores of the ablation experiments are presented in Table B2. The results indicate that, even with the removal of any single input, GFRNet consistently outperforms the three NWP baselines in forecasting moderate, heavy, and storm precipitation. This demonstrates the model's robustness and its capacity to produce reliable corrections even when certain data sources are unavailable.
Figure B1. Relative importance score (RIS) of each input source (ECMWF, CMA-SH9, CMA-3KM, Meta, and Time) at different precipitation thresholds (0.1, 10, 20, and 40 mm per 3 h). RIS values quantify the contribution of each source to GFRNet's performance; positive values indicate a beneficial effect on model forecasts, while negative values indicate a slight degradation.
Further analysis of the RIS values (Fig. B1) shows that, except for META and temporal features – which exhibit a minor negative effect on light rain forecasts – all inputs contribute positively across precipitation categories. The ECMWF input is particularly beneficial for moderate and heavy rainfall, although its contribution is smaller in light and storm precipitation forecasts. This is consistent with its status as a global high-resolution model with advanced physical parameterizations (e.g., cloud microphysics and boundary-layer schemes), which enhances its skill in simulating mesoscale precipitation processes.
The CMA-3KM input yields substantial improvements across moderate, heavy, and storm precipitation forecasts, with particularly strong impact on moderate and heavy rain. As a high-resolution regional model, CMA-3KM is capable of resolving finer-scale convective structures and local precipitation evolution, thereby enhancing forecast accuracy in these regimes. In contrast, CMA-SH9 contributes modestly to moderate and heavy rainfall forecasts, but its impact on light and storm precipitation is limited – likely due to its lower spatial resolution and less detailed physical process representations.
META and temporal features improve forecasts for moderate to storm precipitation but slightly degrade performance for light rainfall, possibly due to increased noise. Heavier precipitation events tend to exhibit clearer spatial patterns and more distinct temporal evolution, which can be effectively leveraged by topographic and temporal features.
Overall, by integrating multiple NWP model outputs and auxiliary features, GFRNet substantially improves the accuracy and resolution of precipitation forecasts. The ablation results highlight the model's effectiveness in forecasting moderate to extreme precipitation and demonstrate its robustness to missing input sources, further underscoring its practical applicability.
The gridded precipitation ground truth data and model forecast outputs used in this study are freely accessible at https://doi.org/10.57760/sciencedb.09821 (Zuliang and Qi, 2024). The codes for training GFRNet and FRNet, as well as for evaluating model performance, are available at https://doi.org/10.5281/zenodo.14652556 (Fang and Zhong, 2025).
ZLF, QZ, HMC, and XMW initiated the study, and QZ supervised and administered the project. ZLF, ZZC, and HLL prepared all the data and wrote the training and evaluation scripts together. All authors contributed to the writing and editing of the paper.
The contact author has declared that none of the authors has any competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
We extend our heartfelt thanks to Dan Zhang for their patient guidance during the writing process, the Tianhe team for providing computational resources and technical support, and the China Meteorological Administration for providing valuable data.
This research has been supported by the National Natural Science Foundation of China (grant nos. U2142214 and 42030611) and the National Key Research and Development Program of China (grant no. 2023YFC3007502).
This paper was edited by Nicola Bodini and reviewed by two anonymous referees.
Agarap, A. F.: Deep Learning using Rectified Linear Units (ReLU), arXiv [preprint], https://doi.org/10.48550/arXiv.1803.08375, 2019.
Arjovsky, M., Chintala, S., and Bottou, L.: Wasserstein Generative Adversarial Networks, Proceedings of Machine Learning Research, 70, 214–223, https://proceedings.mlr.press/v70/arjovsky17a.html (last access: 2 December 2025), 2017.
Ayzel, G., Heistermann, M., and Winterrath, T.: RainNet v1.0: a convolutional neural network for radar-based precipitation nowcasting, Geosci. Model Dev., 13, 2631–2644, https://doi.org/10.5194/gmd-13-2631-2020, 2020.
Bihlo, A.: Key factors for quantitative precipitation nowcasting using deep learning, Geosci. Model Dev., 16, 5895–5916, https://doi.org/10.5194/gmd-16-5895-2023, 2023.
Boeing, G.: Visual Analysis of Nonlinear Dynamical Systems: Chaos, Fractals, Self-Similarity and the Limits of Prediction, Systems, 4, 37, https://doi.org/10.3390/systems4040037, 2016.
Chen, L.-Q., Zhou, X.-S., and Yang, S.: A Quantitative Precipitation Forecasts Method for Short-range Ensemble Forecasting, T. Atmos. Sci., 28, 543–548, 2005.
Chen, P. J., Feng, Y. R., Meng, W. G., Wen, Q. S., Pan, N., and Dai, G. F.: A correction method of hourly precipitation forecast based on convolutional neural network, Meteorol. Mon., 47, 60–70, https://doi.org/10.7519/j.issn.1000-0526.2021.01.006, 2021.
Chen, Y., Huang, G., Wang, Y., Tao, W., Tian, Q., Yang, K., Zheng, J., and He, H.: Improving the heavy rainfall forecasting using a weighted deep learning model, Front. Environ. Sci., 11, https://doi.org/10.3389/fenvs.2023.1116672, 2023.
Dai, K., Zhu, Y., and Bi, B.: The review of statistical post-process technologies for quantitative precipitation forecast of ensemble prediction system, Acta Meteorol. Sin., 76, 493–510, https://doi.org/10.11676/qxxb2018.015, 2018.
Espeholt, L., Agrawal, S., Sønderby, C., Kumar, M., Heek, J., Bromberg, C., Gazen, C., Carver, R., Andrychowicz, M., Hickey, J., Bell, A., and Kalchbrenner, N.: Deep Learning for Twelve Hour Precipitation Forecasts, Nat. Commun., 13, 5145, https://doi.org/10.1038/s41467-022-32483-x, 2022.
Fang, Z. and Zhong, Q.: Improving the fine structure of intense rainfall forecast by a designed adversarial generation network, Zenodo [code], https://doi.org/10.5281/zenodo.14652556, 2025.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y.: Generative Adversarial Networks, in: Advances in Neural Information Processing Systems 27 (NIPS 2014), edited by: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. Q., Curran Associates, Inc., Red Hook, NY, USA, https://papers.nips.cc/paper_files/paper/2014/hash/f033ed80deb0234979a61f95710dbe25-Abstract.html (last access: 2 December 2025), 2014.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A.: Improved Training of Wasserstein GANs, in: Advances in Neural Information Processing Systems 30 (NIPS 2017), edited by: Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., Curran Associates, Inc., Red Hook, NY, USA, https://papers.nips.cc/paper_files/paper/2017/hash/892c3b1c6dccd52936e27cbd0ff683d6-Abstract.html (last access: 2 December 2025), 2017.
Harris, L., McRae, A. T. T., Chantry, M., Dueben, P. D., and Palmer, T. N.: A Generative Deep Learning Approach to Stochastic Downscaling of Precipitation Forecasts, J. Adv. Model. Earth Syst., 14, e2022MS003120, https://doi.org/10.1029/2022MS003120, 2022.
He, K., Zhang, X., Ren, S., and Sun, J.: Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016, https://doi.org/10.1109/CVPR.2016.90, 2016.
Hu, J., Shen, L., Albanie, S., Sun, G., and Wu, E.: Squeeze-and-Excitation Networks, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018, https://doi.org/10.1109/CVPR.2018.00745, 2018.
Ioffe, S. and Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Proceedings of Machine Learning Research, 37, 448–456, https://proceedings.mlr.press/v37/ioffe15.html (last access: 2 December 2025), 2015.
Kim, T., Ho, N., Kim, D., and Yun, S.-Y.: Benchmark Dataset for Precipitation Forecasting by Post-Processing the Numerical Weather Prediction, arXiv [preprint], https://doi.org/10.48550/arXiv.2210.02797, 2022.
Kingma, D. P. and Ba, J.: Adam: A Method for Stochastic Optimization, arXiv [preprint], https://doi.org/10.48550/arXiv.1412.6980, 2017.
Leinonen, J., Nerini, D., and Berne, A.: Stochastic Super-Resolution for Downscaling Time-Evolving Atmospheric Fields With a Generative Adversarial Network, IEEE T. Geosci. Remote, 59, 7211–7223, https://doi.org/10.1109/TGRS.2020.3032790, 2021.
Lin, J., Zong, Z.-P., and Jiang, X.: The verification report of multi-model integrated QPF products from 2010–2011, Weather Forecast. Rev., 5, 67–74, 2013.
Loshchilov, I. and Hutter, F.: SGDR: Stochastic Gradient Descent with Warm Restarts, arXiv [preprint], https://doi.org/10.48550/arXiv.1608.03983, 2017.
Pan, Y., Gu, J., Yu, J., Shen, Y., Shi, C., and Zhou, Z.: Test of merging methods for multi-source observed precipitation products at high resolution over China, Acta Meteorol. Sin., 76, 755–766, https://doi.org/10.11676/qxxb2018.034, 2018.
Price, I. and Rasp, S.: Increasing the accuracy and resolution of precipitation forecasts using deep generative models, Proceedings of Machine Learning Research, 151, 10 555–10 571, https://proceedings.mlr.press/v151/price22a/price22a.pdf (last access: 2 December 2025), 2022.
Radford, A., Metz, L., and Chintala, S.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, arXiv [preprint], https://doi.org/10.48550/arXiv.1511.06434, 2016.
Ravuri, S., Lenc, K., Willson, M., Kangin, D., Lam, R., Mirowski, P., Fitzsimons, M., Athanassiadou, M., Kashem, S., Madge, S., Prudden, R., Mandhane, A., Clark, A., Brock, A., Simonyan, K., Hadsell, R., Robinson, N., Clancy, E., Arribas, A., and Mohamed, S.: Skillful Precipitation Nowcasting Using Deep Generative Models of Radar, Nature, 597, 672–677, https://doi.org/10.1038/s41586-021-03854-z, 2021.
Roberts, N. M. and Lean, H. W.: Scale-Selective Verification of Rainfall Accumulations from High-Resolution Forecasts of Convective Events, Mon. Weather Rev., 136, 78–97, https://doi.org/10.1175/2007MWR2123.1, 2008.
Ronneberger, O., Fischer, P., and Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, 9351, 234–241, Springer, Cham, https://doi.org/10.1007/978-3-319-24574-4_28, 2015.
Shen, X., Wang, J., Li, Z., Chen, D., and Gong, J.: China's independent and innovation development of numerical weather prediction, Acta Meteorol. Sin., 78, 451–476, https://doi.org/10.11676/qxxb2020.030, 2020.
Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., kin Wong, W., and chun Woo, W.: Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, arXiv [preprint], https://doi.org/10.48550/arXiv.1506.04214, 2015.
Singh, A. K., Albert, A., and White, B.: Downscaling Numerical Weather Models with GANs, in: AGU Fall Meeting Abstracts, 2019, GC43D–1357, AGU, https://agu.confex.com/agu/fm19/meetingapp.cgi/Paper/496182 (last access: 3 December 2025), 2019.
Sønderby, C. K., Espeholt, L., Heek, J., Dehghani, M., Oliver, A., Salimans, T., Agrawal, S., Hickey, J., and Kalchbrenner, N.: MetNet: A Neural Weather Model for Precipitation Forecasting, arXiv [preprint], https://doi.org/10.48550/arXiv.2003.12140, 2020.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, J. Mach. Learn. Res., 15, 1929–1958, 2014.
Sun, D., Huang, W., Yang, Z., Luo, Y., Luo, J., Wright, J. S., Fu, H., and Wang, B.: Deep Learning Improves GFS Wintertime Precipitation Forecast Over Southeastern China, Geophys. Res. Lett., 50, e2023GL104406, https://doi.org/10.1029/2023GL104406, 2023.
Sun, J., Xue, M., Wilson, J. W., Zawadzki, I., Ballard, S. P., Onvlee-Hooimeyer, J., Joe, P., Barker, D. M., Li, P.-W., Golding, B., Xu, M., and Pinto, J.: Use of NWP for Nowcasting Convective Precipitation: Recent Progress and Challenges, B. Am. Meteorol. Soc., 95, 409–426, https://doi.org/10.1175/BAMS-D-11-00263.1, 2014.
Tan, J., Huang, Q., and Chen, S.: Deep learning model based on multi-scale feature fusion for precipitation nowcasting, Geosci. Model Dev., 17, 53–69, https://doi.org/10.5194/gmd-17-53-2024, 2024.
Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Loy, C. C., Qiao, Y., and Tang, X.: ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks, in: The European Conference on Computer Vision Workshops (ECCVW), Lecture Notes in Computer Science, 11133, Springer, Cham, 63–79, https://doi.org/10.1007/978-3-030-11021-5_5, 2018a.
Wang, Y., Gao, Z., Long, M., Wang, J., and Yu, P. S.: PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning, arXiv [preprint], https://doi.org/10.48550/arXiv.1804.06300, 2018b.
Wang, Z., Simoncelli, E. P., and Bovik, A. C.: Multiscale structural similarity for image quality assessment, in: The 37th Asilomar Conference on Signals, Systems & Computers, vol. 2, 1398–1402, https://doi.org/10.1109/ACSSC.2003.1292216, 2003.
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P.: Image quality assessment: from error visibility to structural similarity, IEEE T. Image Process., 13, 600–612, https://doi.org/10.1109/TIP.2003.819861, 2004.
Yang, X., Dai, K., and Zhu, Y.: Progress and challenges of deep learning techniques in intelligent grid weather forecast, Acta Meteorol. Sin., 80, 649–667, https://doi.org/10.11676/qxxb2022.051, 2022.
Yin, J., Gao, Z., and Han, W.: Application of a Radar Echo Extrapolation‐Based Deep Learning Method in Strong Convection Nowcasting, Earth Space Sci., 8, e2020EA001621, https://doi.org/10.1029/2020EA001621, 2021.
Zhang, C.-J., Zeng, J., Wang, H.-Y., Ma, L.-M., and Chu, H.: Correction Model for Rainfall Forecasts Using the LSTM with Multiple Meteorological Factors, Meteorol. Appl., 27, e1852, https://doi.org/10.1002/met.1852, 2020.
Zhang, X., Yang, Y., Chen, B., and Huang, W.: Operational Precipitation Forecast Over China Using the Weather Research and Forecasting (WRF) Model at a Gray-Zone Resolution: Impact of Convection Parameterization, Weather Forecast., 36, 915–928, https://doi.org/10.1175/WAF-D-20-0210.1, 2021.
Zhang, Y., Long, M., Chen, K., Xing, L., Jin, R., Jordan, M. I., and Wang, J.: Skilful Nowcasting of Extreme Precipitation with NowcastNet, Nature, 619, 526–532, https://doi.org/10.1038/s41586-023-06184-4, 2023.
Zhou, K., Sun, J., Zheng, Y., and Zhang, Y.: Quantitative Precipitation Forecast Experiment Based on Basic NWP Variables Using Deep Learning, Adv. Atmos. Sci., 39, 1472–1486, https://doi.org/10.1007/s00376-021-1207-7, 2022.
Zuliang, F. and Qi, Z.: Precipitation observation and forecast in North China in 2022 by numerical model and deep learning model, Science Data Bank [data set], https://doi.org/10.57760/sciencedb.09821, 2024.