Discrete k-nearest neighbor resampling for simulating multisite precipitation occurrence and model adaption to climate change

Stochastic weather simulation models are commonly employed in water resources management, agricultural applications, forest management, transportation management, and recreational activities. Stochastic simulation of multisite precipitation occurrence is a challenge because of its intermittent characteristics as well as spatial and temporal cross-correlation. This study proposes a novel simulation method for multisite precipitation occurrence employing a nonparametric technique, the discrete version of the knearest neighbor resampling (KNNR), and couples it with a genetic algorithm (GA). Its modification for the study of climatic change adaptation is also tested. The datasets simulated from both the discrete KNNR (DKNNR) model and an existing traditional model were evaluated using a number of statistics, such as occurrence and transition probabilities, as well as temporal and spatial cross-correlations. Results showed that the proposed DKNNR model with GAsimulated multisite precipitation occurrence preserved the lagged cross-correlation between sites, while the existing conventional model was not able to reproduce lagged crosscorrelation between stations, so long stochastic simulation was required. Also, the GA mixing process provided a number of new patterns that were different from observations, which was not feasible with the sole DKNNR model. When climate change was considered, the model performed satisfactorily, but further improvement is required to more accurately simulate specific variations of the occurrence probability.


6.
Thanks for agreeing to this comment. In that case, there is an over-emphasis in the title regarding the "adaptation to climate change". If the methodology is not addressing the climate change, please remove the section or modify it accordingly. Section 5.4 still claims "Adaptation to climate change". This can be addressed along with the next comment (7 th comment), where again authors justify the changing of these probabilities to address the climate change. It is not clear, how tuning of crossover and mutation probabilities could handle the non-stationarity (or climate change according to authors) in the time series of multiple stations?
Reply: The authors consider that the major parts of the climate change adaptation studies in a stochastic generator for multisite precipitation occurrence is the capability to simulate the occurrence series with its changing probability. Even if only the marginal and transition probabilities were tested in the current study in section 3.2 and section 6.4, the current model can be further developed to handle the climate change issue. The authors believe that tuning crossover and mutation probabilities could handle this issue for each station, but not multiple stations at the same time. As mentioned in the previous reply, this tuning process cannot handle the change of correlation structure in future climate scenarios. However, the key change of marginal and transition probabilities can be adapted in the DKNNR with GA model by tuning the crossover and mutation probabilities as tested in Figure 11 and Figure 12. We agree that the tuning probabilities must be further studied to clarify whether the model works reasonably well. Also note that the crossover probabilities might affect the stations each other while the mutation probabilities do not.
To express how this tuning procedure is able to address future climate adaptation, the following description is added. "Assume that the occurrence probability (P1) of the control period is 0.26 (see the dotted line with cross on the bottom panel of Figure 11 and Figure 12) and GCM output indicates that the occurrence probability (P1) increases up to 0.27. This can be achieved with increasing either the crossover probability to 0.1 or the mutation probability to 0.05. Note that the crossover probabilities might affect the stations each other while the mutation probabilities do not." If this reviewer considers that this experimental part for climate change adaption is not good enough and still think this part must be removed, the authors will remove the whole part and change the title. However, the authors prefer leaving as is to indicate future development of the current model.

7.
I could not find much difference between simple KNNR model and KNNR model with GA mixing (Figures s2-s6 and Figures 5-9). Both produce almost same results. Does that mean, the incorporation of GA has not added much value?
Reply: Note that the simple DKNNR model obtains the patterns from the historical data. Its simulated data are compared with the historical statistics. Therefore, DKNNR with GA is difficult to add much value more than the simple DKNNR except that the simulated data can have different patterns from the historical ones. However, the value of the DKNNR with GA is critical, since one of the major reasons for simulating weather data is to generate all possible cases to compare and prepare such cases.

8.
Please see comment 6 in this document, regarding the adaptation to climate change.
Reply: See Reply 6. The simulation can be done by comparing TPM with a uniform random number (ut s ) as 142 where s i p 1 is the selected probability from TPM regarding the previous condition i (i.e. either 0 or 144 1). Wilks (1998)  where ε is a uniform random number between 0 and 1. From this crossover, a 206 new occurrence vector whose elements are similar to the historical ones is generated. 207 (6-3) Mutation: Replace each element (i.e., each station, s=1,…, S) with one selected 208 from all the observations of this element for i=1,…, n with probability m P , i.e., 209 with equal probability for i=1,…, n 211 and ε is a uniform random number between 0 and 1. This mutation procedure 212 allows to generate a multisite occurrence combination that is totally different 213 from the historical records. Without this procedure, multisite occurrences 214 always similar to historical combinations are generated, which is not feasible 215 for a simulation purpose. 216 (7) Repeat steps (1)-(6) until the required data are generated. 217 The selection of the number of nearest neighbors (k) has been investigated by Lall and 218 Sharma (1996) and Lee and Ouarda (2011). A simple selection method was applied in the current 219 study as n k  . For hydrometeorological stochastic simulation, this heuristic approach of the k 220 selection has been employed (Lall and Sharma, 1996;Lee and Ouarda, 2012;Lee et al., 2010b;221 Prairie et al., 2006;Rajagopalan and Lall, 1999). One can use generalized cross-validation (GCV) 222 as shown in Sharma and Lall (1996) and Lee and Ouarda 2011 by treating this simulation as a 223 prediction problem. However, the current multisite occurrence simulation does not necessarily 224 require an accurate value prediction and not much difference in simulation using the simple 225 heuristic approach has been reported. Also, this heuristic approach of the k selection has been 226 popularly employed for hydrometeorological stochastic simulations (Lall and Sharma, 1996;Lee 227 and Ouarda, 2012;Lee et al., 2010b;Prairie et al., 2006;Rajagopalan and Lall, 1999). 228 In Appendix A, an example of the DKNNR simulation procedure is explained in detail. For example, the probability of P11 can be increased with the cross-over probability Pcr by 238 adding the condition to increase the probability of P11 as: 239 can increase the marginal distribution as much as 244 Pm×P1. This has been tested in a case study. 245

246
For testing the occurrence model, 12 weather stations were selected from Yeongnam province 247 which is located in the southeastern part of South Korea, as shown in Figure 1Figure  Administration with the area name (third column) is shown in Table 1Table 1. The employed 251 precipitation dataset presents strong seasonality, since this area is dry from late fall to early autumn 252 and humid and rainy during the remaining seasons, especially in summer. The employed stations 253 are not far from each other, at most 100 km apart, and not much high mountains are located in the 254 current study area. Therefore, this region can be considered as a homogeneous region ( It is important to analyze the impact of weather conditions for planning agricultural 266 operations and water resources management, especially during the summer season, because around 267 50-60 percent of the annual precipitation occurs during the summer season from June to September. 268 The length of daily precipitation data record ranges from 1976 to 2015 and the summer season 269 record was employed, since a large number of rainy days occur during summer and it is important 270 to preserve these characteristics. Also, the whole year dataset was tested and other seasons were 271 15 further applied but the correlation coefficient was relatively high and its correlation matrix 272 estimated was not a positive semi-definite matrix for the MONR model. 273

274
To analyze the performance of the proposed DKNNR model, the occurrence of precipitation 275 was simulated. The DKNNR simulation was compared with that of the MONR model. For each 276 model, 100 series of daily occurrence with the same record length were simulated. The key 277 statistics of observed data and each generated series, such as transition probabilities (P11, P01 , and 278 P1) and cross-correlation (see Eq. (10)(10)), were determined. The MONR model underestimated 279 the lag-1 cross-correlation, as indicated by Wilks (1998). In the current study, this statistic was 280 analyzed, since a synoptic scale weather system often results in lagged cross-correlation for daily 281 precipitation data (Wilks, 1998). It was formulated as 282 Statistics from 100 generated series were evaluated by the root mean square error (RMSE) 284 expressed as: 285 where N is the number of series (here 100), G m  is the statistic estimated from the m th generated 287 series, while h  is the statistic for the observed data. Note that lower RMSE indicates better 288 performance, represented by ing the summarized error of a given statistic of generated series from 289 the statistic of the observed data.  However, all three probabilities (P11, P01, and P1) have relatively small RMSEs in Pm =0.003. 310 Therefore, the parameter set 0.02 and 0.003 was chosen for Pcr and Pm, respectively, and employed 311 in the current study. We also tested the simulation without the GA mixing procedure (results not 312 17 shown). The results showed that no better result could be found from the simulation without GA 313 mixing. The necessity of the GA mixing is further discussed in the following. 314 We further tested and discuss why the GA mixing is necessary in the proposed DKNNR 315 model as follows. For example, assume that three weather stations are considered and observed 316 data only has the occurrence cases of 000, 001,011,010, 011,100,111 among 2 3 =8 possible cases. 317 In other words, no patterns for 110 and 101 is found in the observed data. Note that 0 is dry day 318 and 1 is rainy (or wet) day. The KNNR is a resampling process in that the simulation data is 319 resampled from the observations. Therefore, no new patterns such as 110 and 101 can be found in 320 the simulated data. 321 This can be problematic for the simulation purpose in that one of the major simulation 322 purposes is to simulate sequences that might possibly happen in the future. The wet (1) or dry (0)  323 for multisite precipitation occurrence is decided by the spatial distribution of a precipitation 324 weather system. A humid air mass can be distributed randomly, relying on wind velocity and 325 direction as well as the surrounding air pressure. In general, any combinations of wet and dry 326 stations can be possible, especially when the simulation continues infinitely. Therefore, the 327 patterns of simulated data must be allowed to have any possible combinations, here 4096 even if 328 it has not been observed from the historical records. Also, its probability to have this new pattern 329 must not be high, since it has not been observed in the historical records and this can be taken into 330 account by low probability of the crossover and mutation. 331 This drawback of the KNNR model frequently happens in multisite occurrence as the 332 number of stations increases. Note that the number of patterns increases as 2 n where n is the number 333 of stations. If n=12, then 4096 cases must be observed. However, among 4096 cases, observed 334 18 cases are limited, since the number of data is limited. The GA process can mix two candidate 335 patterns to produce new patterns. For example, in the three station case, a new pattern 101 can be 336 produced from two observed occurrence candidates of 001 and 100 by the crossover of the first 337 value of 001 to the first value of 100 (i.e. 001 101), which is not in the observed data. 338 Note that the data employed in the case study are 40 years and 122 days (summer months) 339 in each year. The total number of the observed data is 4880 and the number of possible cases is 340 4096. We checked the number how many of possible cases that were are not found in the observed 341 data. The result shows that 3379 cases weare not observed at all for the entire cases as shown in 342  7 for the observed data and the data generated from the DKNNR and MONR models. In Table  356 2Table 2, the observed statistic shows that P11 is always higher than P01 and P1 is between P11 and 357 P01. Site 6 shows the lowest P11 and P1 and site 12 shows the highest P11. 358 As shown in Figure 5Figure 5, the probability P11 of the observed data shows that sites 6, 7, 359 8, and 9 located in the northern part of the region exhibited lower consistency (i.e. consecutive 360 rainy days) than did the other sites, while sites 5 and 12 had higher probability of P11 than did other 361 sites. Both models preserved well the observed P11 statistic. It seems that the MONR model had a 362 slightly better performance, since this statistic is parameterized in the model as shown in section 363 2.2 and that is the same for P01 and P1 as shown in Figure  In the DKNNR modeling procedure, the simple distance measurement in Eq. (11)(11) allows 370 to preserve transition probabilities in that the following multisite occurrence is resampled from the 371 historical data whose previous states of multisite occurrence (xi s ) are similar to the current 372 simulation multisite occurrence (Xc s ). This summarized distance (Di) is an essential tool in the 373 proposed DKNNR modeling. The condition of the current weather system is memorized and the 374 system is conditioned on simulating the following multisite occurrence with the distance 375 measurement like a precipitation weather system dynamically changes but often it impacts the 376 system of the following day. 377 20 As shown in Figure 6Figure 6, the P01 probability showed a slightly different behavior such 378 that sites 1, 2, and 3 located in the middle part of the Yeongnam province showed a higher 379 probability than did other sites. A slight underestimation was seen for sites 2 and 11 but it was not 380 critical, since its observed value with a cross mark was close to the upper IQR representing 75 381 percentile. 382 The behavior of P1 was found to be the same as that of the P11 probability. It can be seen in 383 Figure 7Figure 7 that no significant underestimation is seen for the DKNNR model (top panel). 384 The P1 statistic was fairly preserved by both DKNNR and MONR models. Note that the MONR 385 model parameterized the P1 statistic through the transition probabilities as in Eq. (5)(5), while the 386 DKNNR model did not. Although the DKNNR model used almost no parameters for simulation, 387 the P1 statistic was preserved fairly well. 388

389
Cross-correlation is a measure of the relationship between sites. The preservation of cross-390 correlation is important for the simulation of precipitation occurrence and is required in the 391 regional analysis for water resources management or agricultural applications. Furthermore, 392 lagged cross-correlation is also essential as much as is cross-correlation (i.e. contemporaneous 393 correlation). For example, the amount of streamflow for a watershed from a certain precipitation 394 event is highly related with lagged cross-correlation. 395 Daily precipitation occurrence, in general, shows the strongest serial correlation at lag-1 and 396 its correlation decays as the lag gets longer. This is because a precipitation weather system moves 397 according to the surrounding pressure and wind direction that dynamically change within a day or 398 week. Therefore, we analyzed the lag-1 cross-correlation in the current study as the representative 399 lagged correlation structure. 400 The cross-correlation of observed data is shown in Table 3Table 3. High cross-correlation 401 among grouped sites, such as sites 6, 7, and 8 (northern part) and sites 3, 4, and 5, as well as 12 402 (southeast coastal area, 0.68-0.87), was found. As expected, sites 5 and 12 had the highest cross-403 correlation (0.87) due to proximity. The northern sites and coastal sites showed low cross-404 correlation. This observed cross-correlation was well preserved in the data generated from both 405 DKNNR and MONR models, as shown in Figure 8Figure 8 as well as Table 4Table 4 and Table  406 5Table 5. However, consistently slight but significant underestimation of cross-correlation was 407 seen for the data generated by the MONR model (see the bottom panel of Figure 8Figure 8). Note 408 that the errorbars are extended to upper and lower lines of the circles to 1.95×standard deviation. 409 The difference of RMSE in Table 6Table 6 showed this characteristic, as most of the values were 410 positive, indicating that the proposed DKNNR model performed better for cross-correlation. 411 The lag-1 cross-correlation of observed data, as shown in Table 7Table Table 8Table 8 reflects this behavior. In the bottom panel of Figure 9Figure 9, 421 some of the lag-1 cross-correlations were well preserved, that wereas aligned with the base line. 422 From Table 8Table 8, the MONR model reproduced the autocorrelations well with the shaded 423 values. It is because the lag-1 autocorrelation was indirectly parameterized with the transition 424 probabilities of P11 and P01 as in Eq. (6)(6). Other than this autocorrelation, the lag-1 cross-425 correlation was not reproduced well with the MONR model. This shortcoming was mentioned by 426 Wilks (1998). Meanwhile, the proposed DKNNR model preserved this statistic without any 427 parameterization. 428 We further tested the performance measurements of MAE and Bias whose . The estimates 429 showed that MAE had no difference from RMSE. In addition, Bias of the lag-1 correlation 430 presented significant negative values implying its underestimation for the simulated data of the 431 MONR model as shown in Table 9Table 9, while Table 10Table 10 of the DKNNR model showed 432 a much smaller bias. 433 Also, the whole year data instead of the summer season data was tested for model fitting. 434 Note that all the results presented above were for the summer season data (June-September) as 435 mentioned in section 4 on data description. The lag-1 cross-correlation is shown in Figure 10Figure  436 10 which indicates that the same characteristic was observed as for the summer season, such that 437 the proposed DKNNR model preserved better the lagged cross-correlation than did the existing 438 MONR model. Other statistics, such as correlation matrix and transition probabilities, exhibited 439 the same results (not shown). Also, other seasons were tried but the estimated correlation matrix 440 was not a positive semi-definite matrix and its inverse cannot be made for multivariate normal 441 distribution in the MONR model. It was because the selected stations were close to each other 442 (around 50-100 km) and produced high cross-correlation, especially in the occurrence during dry 443 seasons. Special remedy for the existing MONR model should be applied, such as decreasing 444 cross-correlation by force, but further remedy was not applied in the current study since it was not 445 within the current scope and focus. 446

447
Model adaptability to climate change in hydro-meteorological simulation models is a critical 448 factor, since one of the major applications of the models is to assess the impact of climate change. 449 Therefore, we tested the capability of the proposed model in the current study by adjusting the 450 probabilities of cross-over and mutation as in Eqs. (15)(15) and (16)(16)

. A number of variations 451
can be made with different conditions. 452 In Figure 11Figure 11, the changes of transition and marginal probabilities are shown along 453 with the increase of ing the crossover probability Pcr from 0.01 to 0.2 with the condition that that 454 the candidate value is one and the previous value is also one as in Eq. (15)(15) for the selected 5 455 stations among the 12 stations (from station 1 to station 5, see Table 1Table 1 for details). The 456 stations were limited in this analysis due to computational time. In each case 100 series were 457 simulated. The average value of the simulated statistics is presented in the figure. It is obvious that 458 the transition probability P11 increased as intended along with the increase of Pcr . As expected 459 from Eq. (5)(5), P1 presents that the change of P1 is highly related to P11. However, the probability 460 P01 fluctuated along with the increase of Pcr. Elaborate work to adjust all the probabilities is 461 however required. 462 The changes in transition and marginal probabilities are presented in Figure 12Figure 12 463 with increasing mutation probability Pm from 0.01 to 0.2 under the condition that the candidate 464 value is one so that the marginal probability P1 increased. P01 also increased along with increasing 465 24 P1. The change of P11 was not related with other probabilities. The combination of the adjustment 466 of Pcr and Pm with a certain condition to the previous state will allow the specific adaptation for 467 simulating future climatic scenarios. 468 As an example, assume that the occurrence probability (P1) of the control period is 0.26 (see 469 the dotted line with cross on the bottom panel of Figure 11 and Figure 12) and GCM output 470 indicates that the occurrence probability (P1) increases up to 0.27. This can be achieved with 471 increasing either the crossover probability to 0.1 or the mutation probability to 0.05. Note that the 472 crossover probabilities might affect the stations each other, while the mutation probabilities do not 473 Climate change, however, may refer to a larger phenomenon, which cannot be addressed 474 directly through modifying only the marginal and transition probabilities as in the current study. 475 Further modeling development on systematically varying temporal and spatial cross-correlations 476 is required to properly address the climate change of the regional precipitation system. 477

478
In the current study, the discrete version of a nonparametric simulation model, based on 479 KNNR, is proposed to overcome the shortcomings of the existing MONR model, such as long 480 stochastic simulation for parameter estimation and underestimation of the lagged crosscorrelation 481 between sites as well as testing the adaptability for climatic change. Occurrence and transition 482 probabilities and cross-correlation as well as lag-1 cross-correlation are estimated for both models. 483 Better preservation of cross-correlation and lag-1 cross-correlation with the DKNNR model than 484 the MONR model is observed. For some cases (i.e., the whole year data and other seasons than the 485 summer season), the estimated cross-correlation matrix is not a positive semi-definite matrix so 486 25 the multivariate normal simulation is not applicable for the MONR model, because the tested sites 487 are close to each other with high cross-correlation. 488 Results of this study indicate that the proposed DKNNR model reproduces the occurrence 489 and transition probabilities satisfactorily fairly and preserves the cross-correlations better than does 490 the existing MONR model. Furthermore, not much effort is required to estimate the parameters in 491 the DKNNR model, while the MONR model requires a long stochastic simulation just to estimate 492 each parameter. Thus, the proposed DKNNR model can be a good alternative for simulating 493 multisite precipitation occurrence. 494 We tested further the enhancement of the proposed model for adapting to climate change by 495 modifying the mutation and crossover probabilities Pm and Pcr. The rResults showed that the 496 proposed DKNNR model has the capability to adapt to the climate change scenarios, but further 497 elaborate work is required to find the best probability estimation for climate change. Also, only 498 the marginal and transition probabilities cannot address the climate change of regional 499 precipitation. The variation of temporal and spatial cross-correlation structure must be considered 500 to properly address the climate change of the regional precipitation system. Further study on 501 improving the model adaptability to climate change will be followed in the near future. Also, the 502 simulated multisite occurrence can be coupled with a multisite amount model to produce 503 precipitation events, including zero values. Further development can be made for multisite amount 504 models with a nonparametric technique, such as KNNR and bootstrapping. 505 In this appendix, one example of DKNNR simulation is presented with observed dataset in 517 for GA (i.e. Pc and Pm) are 0.1 and 0.01, respectively. Simulation can be made as follows: 522

Code and Data Availability
(1) Estimate the distance Di between i x and c X for i=1,…,n-1 as in Eq.(11)(11). For 523 example, for i=1, 524 All the estimated distances are shown in the last column of Table A  (3) Simulate a uniform random number (u) between 0 and 1. Say u=0.321. This value must 532 be compared with the cumulative weighted probabilities in the last column of Table A  Therefore, the selected day, p, is 14. The occurrences of the following day p+1=15 for 12 536 stations are selected as in the second row of Table A  one among all six days is selected with equal probability. Assume that p=4 is selected and 541 the following occurrences are selected, as shown in the third row of Table A 4Table A 4. 542 (5) With two sets, crossover and mutation process is performed as follows: 543 (5-1) Crossover: For each station, a uniform random number (ε) is generated and 544 compared with Pc=0.1 here. Say ε =0.345, then skip since ε =0.345> Pc=0.1. For 545 s=6, assume the generated random number, ε (=0.051)< Pc(=0.1) and then switch 546 the 6 th station value of Set 1 into the value of Set 2 (see Table A 4Table A 4). The 547 28 occurrence state of s c X 1  turns into 1 from 0 as shown in the fourth row of Table A  548 4Table A 4 as well as station 8. 549 (5-2) Mutation: For each station, a uniform random number (ε) is generated and compared 550 with Pm=0.01. For s=12, assume ε =0.009< Pm=0.01 and switch the 12 th station 551 value of Set 1 with the one selected among all the observed 12 th station values with 552 equal probability (here the last column, s=12, of the bottom part of Table A Table 3. Cross-correlation of observed data for 12 stations from Yeongnam province, South 668 Korea. 669 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S1 1  S9 S10 S11 S12  Table 5. Averaged cross-correlation of 100 simulated series from the MONR model for 12 678 stations from Yeongnam province. 679 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S1 1  S9 S10 S11 S12  shows underestimation. 714 S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12  Day S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 Day S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12  precipitation occurrences for 12 stations and the rows below show the absolute difference 809 between the current occurrences (Xc) and the observed data in Table A 1Table A 1. The last  810 column presents the distances in Eq. (11)(11). 811 day S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 Dist