Submitted as: development and technical paper
27 Jun 2022
Status: this preprint is currently under review for the journal GMD.

The Impact of Altering Emission Data Precision on Compression Efficiency and Accuracy of Simulations of the Community Multiscale Air Quality Model

Michael S. Walters1,2 and David C. Wong1
  • 1Atmospheric and Environmental Systems Modeling Division, Center for Environmental Measurement and Modeling, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, USA
  • 2Oak Ridge Associated Universities, Oak Ridge, TN, USA
  • These authors contributed equally to this work.

Abstract. The Community Multiscale Air Quality Model (CMAQ) has been a vital tool for air quality research and management at the United States Environmental Protection Agency (U.S. EPA) and at government environmental agencies and academic institutions worldwide. CMAQ requires a significant amount of disk space to store and archive input and output. For example, an annual simulation over the contiguous United States with a horizontal grid cell spacing of 12 km requires 2–3 TB of input data and can produce 7–45 TB of output data, depending on the modelling configuration and the desired post-processing output (e.g., for evaluations or graphics). After a simulation is complete, model data are archived for several years, or even decades, to ensure the replicability of the research. As a result, careful disk space management is essential to optimize resources and ensure the uninterrupted progress of ongoing research and applications requiring large-scale air quality modelling. Proper disk space management may include applying data compression techniques to the input and output files of all CMAQ simulations. Several utilities compress files losslessly, including (but not limited to) GNU Gzip (gzip) and bzip2. This study proposes a new approach that reduces the precision of the air quality model emissions input in order to reduce storage requirements (after a lossless compression utility is applied) and accelerate runtime. The approach is tested using CMAQ simulations and post-processed CMAQ output to examine its impact on air quality model performance. Four simulations were conducted, and nine cases were post-processed from direct simulation output, to determine disk space efficiency, runtime efficiency, and model (predictive) accuracy. Three simulations were run with emissions input containing only five, four, or three significant digits. To enhance the analysis of disk space efficiency, output from the altered-emissions CMAQ simulations was additionally post-processed to contain five, four, or three significant digits. The fourth and final simulation was run using the full-precision emissions files with no alteration. Thus, in total, 13 gridded products (four simulations and nine cases) were analysed in this study.
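The idea of keeping a fixed number of significant digits before lossless compression can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`round_sig`, `pack_float32`, `compressed_sizes`), the synthetic log-uniform "emissions" values, and the 32-bit float storage are all assumptions made for illustration, and the sizes it prints say nothing about the percentages reported for real CMAQ files.

```python
import bz2
import gzip
import math
import random
import struct


def round_sig(value: float, digits: int) -> float:
    """Round value to `digits` significant digits (zero and non-finite values pass through)."""
    if value == 0.0 or not math.isfinite(value):
        return value
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


def pack_float32(values):
    """Serialize values as little-endian 32-bit floats (assumed storage type)."""
    return struct.pack(f"<{len(values)}f", *values)


def compressed_sizes(values):
    """Return raw, gzip, and bzip2 sizes in bytes for the packed values."""
    raw = pack_float32(values)
    return {
        "raw": len(raw),
        "gzip": len(gzip.compress(raw)),
        "bzip2": len(bz2.compress(raw)),
    }


random.seed(0)
# Hypothetical stand-in for an emissions field: positive values
# spanning several orders of magnitude.
emissions = [10 ** random.uniform(-6, 2) for _ in range(50_000)]

for digits in (None, 5, 4, 3):
    data = emissions if digits is None else [round_sig(v, digits) for v in emissions]
    label = "full precision" if digits is None else f"{digits} significant digits"
    print(label, compressed_sizes(data))
```

Rounding reduces the number of distinct values in the field, which is what gives a lossless compressor more repetition to exploit; the raw (uncompressed) size is unchanged.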

Results demonstrate that, with the bzip2 compression utility, the altered emission files reduced the disk space footprint by 6 %, 25 %, and 48 % relative to the unaltered emission files for files containing five, four, or three significant digits, respectively. Similarly, the altered output files reduced the required disk space by 19 %, 47 %, and 69 % relative to the unaltered CMAQ output files, again using bzip2 and for five, four, or three significant digits, respectively. For both datasets, bzip2 produced smaller compressed files than gzip, by 5–27 % for emission data and 15–28 % for CMAQ output, for files containing five, four, or three significant digits. Additionally, CMAQ runtime was reduced by 2–7 % for simulations using reduced-precision emission files in a non-dedicated environment. Finally, the model-estimated pollutant concentrations from the four simulations were compared to observed data from the U.S. EPA Air Quality System (AQS) and the Ammonia Monitoring Network (AMON). Model performance statistics were negligibly impacted (e.g., normalized mean bias differed by less than 0.01 % for all altered simulations and cases). In summary, reducing the precision of CMAQ emissions data to five, four, or three significant digits slightly reduced the simulation runtime in a non-dedicated environment and substantially reduced disk space usage, while model accuracy remained relatively unchanged compared to the base CMAQ simulation. This suggests that the precision of the emissions data could be reduced to use computing resources more efficiently while minimizing the impact on CMAQ simulations.
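The abstract does not define normalized mean bias (NMB). A commonly used definition is 100 × Σ(M − O) / Σ(O), where M are modelled values and O are paired observations. The Python sketch below assumes that definition and uses invented concentrations purely to show how the statistic from a reduced-precision case could be compared with a base case; `normalized_mean_bias`, `round_sig`, and all numbers are hypothetical and are not taken from the study.

```python
import math


def normalized_mean_bias(modeled, observed):
    """NMB in percent, assuming the definition 100 * sum(M - O) / sum(O)."""
    if len(modeled) != len(observed):
        raise ValueError("modeled and observed must be paired")
    total_observed = sum(observed)
    if total_observed == 0:
        raise ValueError("sum of observations must be non-zero")
    bias = sum(m - o for m, o in zip(modeled, observed))
    return 100.0 * bias / total_observed


def round_sig(value, digits):
    """Round value to `digits` significant digits (zero passes through)."""
    if value == 0.0:
        return value
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


# Invented paired values (arbitrary units), for illustration only.
observed = [40.0, 55.0, 30.0, 62.0]
base_case = [42.1234, 51.9876, 33.4567, 60.2468]
# Hypothetical post-processed case: the same output kept to 3 significant digits.
reduced_case = [round_sig(v, 3) for v in base_case]

nmb_base = normalized_mean_bias(base_case, observed)
nmb_reduced = normalized_mean_bias(reduced_case, observed)
print(f"NMB base: {nmb_base:.4f} %")
print(f"NMB reduced: {nmb_reduced:.4f} %")
print(f"difference: {abs(nmb_base - nmb_reduced):.4f} percentage points")
```

Because rounding errors of opposite sign partly cancel in the summed bias, an aggregate statistic like this tends to move much less than any individual value does.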

Status: final response (author comments only)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on gmd-2022-82', Anonymous Referee #1, 01 Aug 2022
  • RC2: 'Comment on gmd-2022-82', Anonymous Referee #2, 19 Oct 2022
  • AC2: 'Responses to Referee #2', David Wong, 27 Oct 2022

Latest update: 27 Jan 2023
Short summary
Numerical simulations typically involve large amounts of input and output data, and applying popular compression software such as gzip or bzip2 is one good way to mitigate the data storage burden. This article proposes a simple technique that alters the input, the output, or both by keeping a specific number of significant digits in the data. It demonstrates improved compression efficiency on the altered data while maintaining similar statistical performance of the numerical simulation.