Neural networks for data assimilation of surface and upper-air data in Rio de Janeiro

de Almeida, Vinícius Albuquerque; de Campos Velho, Haroldo Fraga; França, Gutemberg Borges; Ebecken, Nelson Francisco Favilla

doi:10.5194/gmd-2022-50

Preprints

https://doi.org/10.5194/gmd-2022-50

Submitted as: development and technical paper

09 Sep 2022

Status: this preprint was under review for the journal GMD. A final paper is not foreseen.

Abstract. The practical feasibility of neural network models for data assimilation of local observations in the WRF model is evaluated for the Rio de Janeiro metropolitan region in Brazil. Surface and multi-level variables retrieved from airport meteorological stations are used: air temperature, relative humidity, and wind (speed and direction). In addition, 6-hour forecasts from high-resolution WRF simulations are used, with the domain centered on the city of Rio de Janeiro and nested grids of 8 and 2.6 km. Periods of 168 h from 2015–2019 are used, with 6 h and 12 h assimilation cycles for surface and upper-air data, respectively, applied to 6-hour forecast fields. The observed data (interpolated to grid points close to the airport locations, with their influence computed in the surroundings) and the short-range forecasts are used as inputs for training the models, and the 3D-Var analysis of the 6-hour forecast fields at each grid point is used as the target variable. The neural network models are built using two different approaches: the WEKA multilayer perceptron model and TensorFlow's deep learning implementation. The year 2019 is used as an independent dataset for validating forecasts from the trained models. The neural network models applied to 6-hour forecast fields are able to emulate the 3D-Var results for surface and multi-level variables, with better results for the NN-TensorFlow implementation. The main result is the CPU time reduction enabled by the neural network models: data assimilation CPU time is reduced by factors of 121 and 25 for NN-TensorFlow and NN-WEKA, respectively, in comparison to the 3D-Var method under the same hardware configuration.
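
The training setup described in the abstract (observation and short-range forecast as inputs, 3D-Var analysis as target) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the data are synthetic, the error standard deviations `sigma_b` and `sigma_o` are invented, the analysis is the scalar 3D-Var solution at a single grid point, and the network is a plain NumPy perceptron rather than the WEKA or TensorFlow models used by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): one scalar variable at one grid point.
n = 2000
truth = 25.0 + 5.0 * rng.standard_normal(n)
sigma_b, sigma_o = 2.0, 1.0  # assumed background / observation error std
forecast = truth + sigma_b * rng.standard_normal(n)  # short-range forecast
obs = truth + sigma_o * rng.standard_normal(n)       # observation at the grid point

# Scalar 3D-Var analysis (minimiser of the quadratic cost), used as target.
gain = sigma_b**2 / (sigma_b**2 + sigma_o**2)
analysis = forecast + gain * (obs - forecast)

# Standardise inputs and target.
X = np.column_stack([forecast, obs])
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
y_mean, y_std = analysis.mean(), analysis.std()
yn = ((analysis - y_mean) / y_std)[:, None]

# Hold out the last samples as an independent test set.
n_train = 1500
X_tr, y_tr = Xn[:n_train], yn[:n_train]
X_te, y_te = Xn[n_train:], yn[n_train:]

# One-hidden-layer perceptron (tanh), trained by full-batch gradient descent.
hidden = 8
W1 = 0.5 * rng.standard_normal((2, hidden))
b1 = np.zeros(hidden)
W2 = 0.5 * rng.standard_normal((hidden, 1))
b2 = np.zeros(1)
lr = 0.05


def predict(x_in):
    """Forward pass of the perceptron in standardised units."""
    return np.tanh(x_in @ W1 + b1) @ W2 + b2


for _ in range(3000):
    h = np.tanh(X_tr @ W1 + b1)
    grad_out = 2.0 * (h @ W2 + b2 - y_tr) / n_train  # d(MSE)/d(output)
    grad_h = (grad_out @ W2.T) * (1.0 - h**2)        # back through tanh
    W2 -= lr * (h.T @ grad_out)
    b2 -= lr * grad_out.sum(axis=0)
    W1 -= lr * (X_tr.T @ grad_h)
    b1 -= lr * grad_h.sum(axis=0)

# Compare against the analysis on the held-out samples, in physical units.
rmse_nn = float(np.sqrt(np.mean((predict(X_te) - y_te) ** 2)) * y_std)
rmse_fc = float(np.sqrt(np.mean((forecast[n_train:] - analysis[n_train:]) ** 2)))
print(f"RMSE vs. analysis: emulator {rmse_nn:.3f}, raw forecast {rmse_fc:.3f}")
```

The emulator is scored against the 3D-Var analysis, not against the truth, because the analysis is what it was trained to reproduce. Once trained, applying it costs only a couple of matrix products per grid point, which is where a CPU-time saving of the kind reported would come from.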

This preprint has been withdrawn.

Received: 20 Feb 2022 – Discussion started: 09 Sep 2022

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Interactive discussion

Status: closed

CEC1:
'Comment on gmd-2022-50', Juan Antonio Añel, 25 Oct 2022

Dear authors,
Unfortunately, after checking your manuscript, it has come to our attention that it does not comply with our "Code and Data Policy".

https://www.geoscientific-model-development.net/policies/code_and_data_policy.html
First, your manuscript does not contain the mandatory Code Availability section. You have included some information about WEKA and TensorFlow inline in the text, but it belongs in the specific section required at the end of the manuscript. Also, the text points to web pages that are not trustworthy repositories for scientific archival. You must therefore publish the WEKA and TensorFlow codes in a new repository, one from our list of suitable ones. The same applies to WRF. You must also indicate the specific version of the model that you use.
Second, none of the repositories that you provide in your current Data Availability section complies with our requirements. This is especially striking in the case of GitHub. GitHub is not a suitable repository, and it instructs authors to use other alternatives for long-term archival and publishing, such as Zenodo (where you can create an archive directly from GitHub).
Therefore, please publish all the software used in your manuscript in one of the appropriate repositories, and reply to this comment with the relevant information (link and DOI) as soon as possible, as it should be available for the Discussions stage. Please also include the relevant primary input/output data, and provide the DOI and link for those repositories in your reply so that they are available during the Discussions stage (as requested). Moreover, please include in any revised version of your manuscript the modified 'Code and Data Availability' section with the requested information.
Please be aware that failing to comply promptly with this request will result in the rejection of your manuscript for publication.
Regards,
Juan A. Añel
Geosci. Model Dev. Exec. Editor

Citation: https://doi.org/10.5194/gmd-2022-50-CEC1
- AC1: 'Reply on CEC1', Vinícius Almeida, 31 Dec 2022
  
  Dear Editor,
  Ref.: https://gmd.copernicus.org/preprints/gmd-2022-50/
  
  Ref.: https://doi.org/10.5194/gmd-2022-50
  A "Code Availability" section was included after the "Conclusions" section. (revised manuscript)
  
  It is important to point out that in July 2022 the data were migrated from GitHub (https://github.com/aa-vinicius/data-assimilation-nn) to Zenodo (https://doi.org/10.5281/zenodo.6806170) in order to comply with the journal's rules.
  
  This modification had already been made and sent to Polina Shvedko <polina.shvedko@copernicus.org> on 11 July.
  
  I look forward to further observations.
  
  Best Regards,
  
  Vinícius.
  
  Citation: https://doi.org/10.5194/gmd-2022-50-AC1
RC1:
'Comment on gmd-2022-50', Anonymous Referee #1, 21 Nov 2022
The motivation to replace data assimilation with a neural network is attractive. Data assimilation does demand high computational cost, e.g., propagating ensemble members in EnKF, or maintaining the adjoint and running the optimization in 4D-Var. In this work, simple MLP models are tested as a replacement for 3D-Var assimilation in a relatively small city region with a limited number of observations. I suggest that the authors make substantial modifications before submitting it again.

Major ones:

Assimilation methods like 4D-Var or EnKF do require huge computational effort. However, the computational complexity of 3D-Var is proportional to the size of the model or of the observations, and it is usually trivial, as illustrated in Table 4 (several seconds). Even for larger models or massive data such as remote sensing observations, the issue could easily be solved through regional analysis. The choice of 3D-Var only weakly supports the motivation.

In Figures 3 and 4: the authors provide very limited samples or snapshots of the analysis for testing their trained NN model, without stating the overall performance over the whole testing dataset.

Page 9, line 206: only 5 airport measurements are assimilated for the analysis. Meanwhile, are these same data used to generate the pseudo-observations for validating the analysis? That is not the correct way to use the measurements. Cross-validation is required. Please check the reference: Peter Rayner, Data assimilation using an ensemble of models: a hierarchical approach, ACP, 2020.

In Table 3, NN-TensorFlow outperforms 3D-Var? This is not solid; after all, is the 3D-Var analysis not the learning target of the NN? Performance should be examined in depth.

Minor:

Since the CPU times for assimilation with 3D-Var, NN-TF, and NN-WEKA are given in Table 4, it is essential to state the size of the problem (the vectors x and y in Eq. (1)) and the solver/environment for 3D-Var and the NN. Otherwise, the comparison is unfair.

How the NN is trained is unclear. What is the output, actually? The analysis over the whole model domain? Or is it trained grid point by grid point? How many samples are in their 4-year dataset?
Citation: https://doi.org/10.5194/gmd-2022-50-RC1
- AC2: 'Reply on RC1', Vinícius Almeida, 12 Jan 2023
  
  Dear Sir/Madam,
  
  The authors would like to thank the editors for comments/suggestions/corrections, helping to improve the present version of the paper. We have carefully revised the manuscript.
  
  Parts of the text were rewritten and reorganized. Attached, we present a document with point-by-point answers to all questions.
  
  Citation: https://doi.org/10.5194/gmd-2022-50-AC2
RC2:
'Comment on gmd-2022-50', Anonymous Referee #2, 03 Jan 2023

I liked this paper, easy to read and understand. I have given this major corrections just due to my major review questions. If they are easy to answer (which I imagine they are), then it's not really anything major at all, but if they are not then it's kind of major, so I put major just to be safe. See my attached reviews.

Citation: https://doi.org/10.5194/gmd-2022-50-RC2
- AC3: 'Reply on RC2', Vinícius Almeida, 12 Jan 2023
  
  Dear Sir/Madam,
  
  The authors would like to thank the editors for comments/suggestions/corrections, helping to improve the present version of the paper. We have carefully revised the manuscript.
  
  Parts of the text were rewritten and reorganized. Attached, we present a document with point-by-point answers to all questions.
  
  Citation: https://doi.org/10.5194/gmd-2022-50-AC3

Viewed

Total article views: 2,307 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
1,626	608	73	2,307	83	123

Views and downloads (calculated since 09 Sep 2022)

Month	HTML	PDF	XML	Total
Sep 2022	194	45	5	244
Oct 2022	78	23	4	105
Nov 2022	80	20	3	103
Dec 2022	52	24	1	77
Jan 2023	83	20	8	111
Feb 2023	50	14	0	64
Mar 2023	53	7	0	60
Apr 2023	45	10	0	55
May 2023	26	8	1	35
Jun 2023	11	5	0	16
Jul 2023	7	22	0	29
Aug 2023	10	11	2	23
Sep 2023	27	16	1	44
Oct 2023	30	11	1	42
Nov 2023	13	10	0	23
Dec 2023	15	7	3	25
Jan 2024	4	10	0	14
Feb 2024	21	9	2	32
Mar 2024	13	15	1	29
Apr 2024	25	10	4	39
May 2024	19	12	5	36
Jun 2024	25	8	3	36
Jul 2024	11	4	2	17
Aug 2024	15	9	3	27
Sep 2024	15	2	0	17
Oct 2024	9	9	0	18
Nov 2024	23	4	1	28
Dec 2024	12	7	0	19
Jan 2025	13	10	0	23
Feb 2025	12	10	2	24
Mar 2025	16	10	1	27
Apr 2025	8	16	1	25
May 2025	10	5	1	16
Jun 2025	17	14	1	32
Jul 2025	18	15	0	33
Aug 2025	41	20	2	63
Sep 2025	285	16	2	303
Oct 2025	42	30	4	76
Nov 2025	50	42	2	94
Dec 2025	26	28	1	55
Jan 2026	23	12	3	38
Feb 2026	43	8	1	52
Mar 2026	50	12	2	64
Apr 2026	6	8	0	14

Viewed (geographical distribution)

Total article views: 2,221 (including HTML, PDF, and XML) Thereof 2,221 with geography defined and 0 with unknown origin.

Latest update: 09 Apr 2026

Short summary

The paper focuses on data assimilation for the WRF model using a neural network. The supervised ML technique was designed to emulate 3D-Var in a regional atmospheric model. The proposed technique has the potential to significantly reduce the computational effort of data assimilation. Indeed, in the worked example the neural network scheme was more than 70 times faster than the 3D-Var method, with similar quality for the analysis.
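
For reference, the quantity that such an emulator is trained to reproduce is the 3D-Var analysis, conventionally defined as the minimizer of the cost function below. This is the textbook form; whether it matches the notation of the preprint's own equations is an assumption. Here $\mathbf{x}_b$ is the background (the short-range forecast), $\mathbf{y}$ the observation vector, $H$ the observation operator, and $\mathbf{B}$ and $\mathbf{R}$ the background and observation error covariance matrices.

```latex
J(\mathbf{x}) =
  \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\mathsf T}\mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_b)
  + \tfrac{1}{2}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)^{\mathsf T}\mathbf{R}^{-1}\bigl(\mathbf{y}-H(\mathbf{x})\bigr),
\qquad
\mathbf{x}_a = \arg\min_{\mathbf{x}} J(\mathbf{x}).

% For a linear observation operator H(x) = Hx the minimizer is explicit:
\mathbf{x}_a = \mathbf{x}_b + \mathbf{K}\,(\mathbf{y}-\mathbf{H}\mathbf{x}_b),
\qquad
\mathbf{K} = \mathbf{B}\mathbf{H}^{\mathsf T}\bigl(\mathbf{H}\mathbf{B}\mathbf{H}^{\mathsf T}+\mathbf{R}\bigr)^{-1}.
```

A network trained on pairs of inputs $(\mathbf{x}_b, \mathbf{y})$ and targets $\mathbf{x}_a$ learns an approximation of this mapping and replaces the minimization with a single forward pass.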

