These authors contributed equally to this work.
The Community Multiscale Air Quality (CMAQ) model has been a vital tool for air quality research and management at the United States Environmental Protection Agency (US EPA) and at government environmental agencies and academic institutions worldwide. The CMAQ model requires a significant amount of disk space to store and archive input and output files. For example, an annual simulation over the contiguous United States (CONUS) with horizontal grid-cell spacing of 12 km requires 2–3 TB of input data and can produce anywhere from 7–45 TB of output data, depending on modeling configuration and desired post-processing of the output (e.g., for evaluations or graphics). After a simulation is complete, model data are archived for several years, or even decades, to ensure the replicability of conducted research. As a result, careful disk space management is essential to optimize resources and ensure the uninterrupted progress of ongoing research and applications requiring large-scale air quality modeling. Proper disk-space management may include applying optimal data-compression techniques to the input and output files of all CMAQ simulations. Several utilities compress files losslessly, including GNU Gzip (gzip) and bzip2. A new approach is proposed in this study that reduces the precision of the emission input for air quality modeling to reduce storage requirements (after a lossless compression utility is applied) and accelerate runtime. The new approach is tested using CMAQ simulations and post-processed CMAQ output to examine the impact on the performance of the air quality model. In total, four simulations were conducted, and nine cases were post-processed from direct simulation output to determine disk-space efficiency, runtime efficiency, and model (predictive) accuracy.
Three simulations were run with emission input containing only five, four, or three significant digits. To enhance the analysis of disk-space efficiency, the output from the altered precision emission CMAQ simulations was additionally post-processed to contain five, four, or three significant digits. The fourth, and final, simulation was run using the full precision emission files with no alteration. Thus, in total, 13 gridded products (4 simulations and 9 altered precision output cases) were analyzed in this study.
Results demonstrate that the altered precision emission files reduced the disk-space footprint by 6 %, 25 %, and 48 % compared to the unaltered emission files when using the bzip2 compression utility for files containing five, four, or three significant digits, respectively. Similarly, the altered output files reduced the required disk space by 19 %, 47 %, and 69 % compared to the unaltered CMAQ output files when using the bzip2 compression utility for files containing five, four, or three significant digits, respectively. For both compressed datasets, bzip2 performed better than gzip, in terms of compression size, by 5 %–27 % for emission data and 15 %–28 % for CMAQ output for files containing five, four, or three significant digits. Additionally, CMAQ runtime was reduced by 2 %–7 % for simulations using emission files with reduced precision data in a non-dedicated environment. Finally, the model-estimated pollutant concentrations from the four simulations were compared to observed data from the US EPA Air Quality System (AQS) and the Ammonia Monitoring Network (AMoN). Model performance statistics were impacted negligibly. In summary, by reducing the precision of CMAQ emission data to five, four, or three significant digits, the simulation runtime in a non-dedicated environment was slightly reduced, disk-space usage was substantially reduced, and model accuracy remained relatively unchanged compared to the base CMAQ simulation, which suggests that the precision of the emission data could be reduced to more efficiently use computing resources while minimizing the impact on CMAQ simulations.
The Community Multiscale Air Quality (CMAQ) model (Byun and Schere, 2006) is a sophisticated, 3D Eulerian (gridded) numerical modeling system based on message passing interface (MPI) that uses scientific first principles to simulate the chemical transformation and transport of ozone, particulate matter, toxic compounds, and acid deposition. Since the formation and transformation of chemical species are functions of complex atmospheric and chemical interactions, two primary input types are required to initialize CMAQ simulations: meteorology and emissions. First, meteorological data (such as temperature, wind, cloud formation, and precipitation rate) provide atmospheric conditions to drive CMAQ. The second required input field, which is the focal point of this study, is emission data (i.e., emission rates from emission sources) that characterize pollutants from both man-made and naturally occurring sources.
The CMAQ model typically requires multiple emission datasets which occupy a significant amount of disk space. Although disk space is becoming progressively cheaper and more affordable, the research and computational needs are rapidly increasing and becoming more complex. For instance, the total sizes of emission and meteorological datasets are about 7.0 and 6.8 GB, respectively, for a 1 d CMAQ simulation for the contiguous United States (CONUS) with a horizontal resolution of 12 km. The total disk-space size for 1 d of output is 20 GB (for a typical output configuration considering only surface output and neglecting extra diagnostic output). Including 3D fields and diagnostic output, however, the total output disk-space size can easily be tripled. Most studies with CMAQ on this scale create at least a full year's worth of data, so aggressive disk-space management is justifiable to minimize overall costs associated with running CMAQ. Aggressive disk-space management could be a substantial cost-saving measure, regardless of whether simulations are conducted on-site (such as with a high-performance computing architecture or a Linux cluster) or by using cloud computing, where data retrievals can quickly elevate costs. Here, we propose optimizing disk space by compressing CMAQ emission datasets as one practical consideration to maximize storage capacity. If successful, this option could be extended to other input types with large disk-space needs, such as meteorological data.
Compression algorithms can be described as either lossless or lossy. Lossless compression algorithms reduce disk space by replacing repeated sequences with a smaller, unique identifier. Thus, the entire dataset can be retrieved, once uncompressed, without any alteration (hence the name, lossless). Lossy algorithms for numeric arrays, by contrast, reduce disk space by manipulating the mantissa of individual floating-point numbers. Typically, the trailing (least significant) bits are replaced with a sequence of zeros or ones. As a result, data are compressed at the cost of numerical differences between the original dataset and the compressed dataset.
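The mantissa manipulation described above can be illustrated with a minimal Python sketch. It covers only the simplest variant, setting the trailing mantissa bits of a 32-bit float to zero, and is not the algorithm of any particular tool; the function name `shave_mantissa` and the parameter `keep_bits` are hypothetical names chosen for this illustration.

```python
import struct


def shave_mantissa(value: float, keep_bits: int) -> float:
    """Zero the trailing (23 - keep_bits) mantissa bits of a float32 value.

    Illustrative only: a lossy step that makes the low-order bits of every
    number identical, so a lossless compressor finds more repetition.
    """
    if not 0 <= keep_bits <= 23:
        raise ValueError("a float32 mantissa has 23 explicit bits")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Keep sign, exponent, and the leading mantissa bits; clear the rest.
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    (shaved,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return shaved


if __name__ == "__main__":
    for keep in (23, 16, 8):
        print(keep, shave_mantissa(3.14159, keep))
```

Because the cleared bits are replaced by zeros, the altered value can never be larger in magnitude than the original, which is the source of the numerical differences mentioned above.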
The concept of maximizing disk space by altering netCDF datasets has been
examined previously by Zender (2016) and Kouznetsov (2021). Zender (2016)
created a versatile toolset that compresses data based on user
specifications that are applied to the mantissa of floating-point datasets.
The first notable algorithm developed by Zender (2016) is
precision trimming, which is publicly available in the netCDF operators
(NCOs,
Excluding analyses conducted on datasets via lossy compression algorithms,
the authors are unaware of any studies that have been conducted on the
compression efficiency of floating-point datasets with respect to
All input and output files in this study are 32-bit, binary, netCDF files
which inherently contain seven or eight significant digits at most. To
perform this study, we created a simple tool written in Fortran to truncate
floating-point data in netCDF files by keeping
Examples of precision-reducing transformations of floating points from their original forms (first column) to their altered precision forms (second to fourth column).
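As a rough illustration of this kind of transformation, the following Python sketch truncates a value to a chosen number of significant decimal digits. It is not the authors' Fortran tool; the function name `keep_significant_digits` is hypothetical, and the sketch works on Python's 64-bit floats rather than on the 32-bit values stored in the netCDF files.

```python
import math
from decimal import Decimal, ROUND_DOWN


def keep_significant_digits(value: float, digits: int) -> float:
    """Truncate (not round) `value` to `digits` significant decimal digits."""
    if digits < 1:
        raise ValueError("digits must be at least 1")
    if value == 0.0 or not math.isfinite(value):
        return value  # nothing to truncate for zero, inf, or NaN
    d = Decimal(repr(value))
    # adjusted() is the decimal exponent of the most significant digit.
    quantum = Decimal(1).scaleb(d.adjusted() - digits + 1)
    # ROUND_DOWN discards the remaining digits, i.e. truncates toward zero.
    return float(d.quantize(quantum, rounding=ROUND_DOWN))


if __name__ == "__main__":
    sample = 123.456789  # hypothetical emission rate
    for n in (5, 4, 3):
        print(n, keep_significant_digits(sample, n))
```

Going through `Decimal` avoids the rounding that string formatting such as `"%.4e"` would apply, so the sketch stays a truncation.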
For this study, CMAQ v5.3.1 (USEPA, 2019; Appel et al., 2021) was run with 459 columns, 299 rows, and 35 vertical layers with a horizontal grid-scale resolution of 12 km (Fig. 1a). Emission input files consist of two area sources and nine point sources (hourly). The area-source emission files contain 57 and 62 variables, and the point-source files contain anywhere from 54 to 58 variables (containing one vertical layer). Ten CMAQ output files (nine of them are hourly) were generated in this study: three output files were generated for simulation-restart purposes (SOILOUT, CGRID which contains only 1 h data, and MEDIA), two files contained average (APMDIAG and ACONC) and hourly (CONC) species concentrations, three files held wet deposition (WETDEP1; 140 variables), dry deposition (DRYDEP; 174 variables), and deposition velocity (DEPV; 104 variables) output, and lastly, the final file contained biogenic emission diagnostic output (B3GTS).
Regions for spatial and temporal stratification
In total, we conducted four annual CMAQ simulations for 2016: one with
unaltered emission data (simulation orig) and three with altered precision emission data by setting
Setup of all simulations (orig, A05, A04, and A03) and cases analyzed in this study.
Simulated numerical, or predictive, accuracy was analyzed against
concentrations of particulate matter with diameter less than 2.5
Typical statistical metrics including mean bias (MB), correlation
coefficient (
The CMAQ input and output data are stored for future analyses and to ensure the reproducibility of modeling studies, and this storage demands a tremendous amount of disk space. Therefore, we propose easing the disk-space burden by utilizing efficient compression algorithms. For this section of the analysis, two popular, reliable, and efficient compression utilities, gzip and bzip2, were used to determine compression efficiency with respect to the emission input files (described in Sect. 2.) and the CMAQ output files (described in Sect. 2., including CGRID, CONC, and SOILOUT). Both compression utilities were applied daily to compress emission input and CMAQ output files throughout the entirety of the 2016 simulation (Fig. 2).
Relative compression size of two utilities, gzip (solid line) and bzip2 (dotted line), on daily emission files (labeled as Emiss.) and direct CMAQ output (labeled as CMAQ) for 2016 with reduced precision settings: 5, 4, and 3 (labeled as Altered 05, Altered 04, and Altered 03, respectively). Negative values indicate better compression efficiency.
With gzip, the compressed emission files were smaller than the compressed orig emission files by 1 %, 5 %, and 21 % on average for A05, A04, and A03, respectively. This translates into differences of about 5, 26, and 111 GB between the compressed orig case and the compressed A05, A04, and A03 emission datasets for the entire year of 2016. The reduction in file size (using gzip) was more substantial when applied to reduced precision CMAQ output, with average reductions of 4 %, 19 %, and 46 %, or about 167, 839, and 2016 GB between the orig case and FX05, FX04, and FX03, respectively, for the entire year. With bzip2, the reductions were larger than with gzip: on average 6 %, 25 %, and 48 % for the A05, A04, and A03 emission files (about 27, 126, and 241 GB, respectively) and 19 %, 47 %, and 69 % for the compressed CMAQ output (about 856, 2142, and 3115 GB, respectively). Thus, bzip2 was more effective than gzip by roughly 5, 20, and 27 percentage points for emission data and by 15, 28, and 23 percentage points for CMAQ output when keeping five, four, and three significant digits, respectively (reduced precision emissions and reduced precision output data).
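The effect reported here can be reproduced in miniature with Python's standard `gzip` and `bz2` modules. The sketch below is an assumption-laden toy, not the study's workflow: it uses a synthetic random field instead of CMAQ files, packs values as raw 32-bit floats rather than netCDF, and the helper names (`truncate_digits`, `compressed_sizes`, `relative_size_change`) are hypothetical.

```python
import bz2
import gzip
import random
import struct
from decimal import Decimal, ROUND_DOWN


def truncate_digits(value: float, digits: int) -> float:
    """Keep `digits` significant decimal digits, truncating toward zero."""
    if value == 0.0:
        return 0.0
    d = Decimal(repr(value))
    quantum = Decimal(1).scaleb(d.adjusted() - digits + 1)
    return float(d.quantize(quantum, rounding=ROUND_DOWN))


def compressed_sizes(values):
    """Pack values as little-endian float32 and compress with both utilities."""
    payload = struct.pack(f"<{len(values)}f", *values)
    return {"gzip": len(gzip.compress(payload)),
            "bzip2": len(bz2.compress(payload))}


def relative_size_change(original, digits):
    """Percent change in compressed size; negative means smaller files."""
    altered = [truncate_digits(v, digits) for v in original]
    before = compressed_sizes(original)
    after = compressed_sizes(altered)
    return {name: 100.0 * (after[name] - before[name]) / before[name]
            for name in before}


# Synthetic stand-in for one emission variable (assumption, not real data).
random.seed(0)
synthetic_field = [random.uniform(0.001, 100.0) for _ in range(20000)]

if __name__ == "__main__":
    for digits in (5, 4, 3):
        print(digits, relative_size_change(synthetic_field, digits))
```

The percentages printed by this toy will not match those in the text, since real emission fields contain far more structure (zeros, repeated values, smooth gradients) than uniform random numbers.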
We examined daily runtime (captured by an MPI function called
MPI_WTIME) for CMAQ using emission data prepared with
truncations of A05, A04, and A03 compared with running CMAQ with unaltered (
Relative daily runtime with respect to different adjusted emission input for the A03, A04, and A05 simulations for 2016.
The accuracy of each case is first examined grid-to-point between modeled
output and in situ observations (Fig. 1; AQS and AMoN) for all available
model–measurement pairs throughout 2016. In general, to gauge the accuracy
of CMAQ, bulk statistical metrics of bias, NMB,
Annual bulk statistical metrics for all grid–point pairs for the unaltered simulation (orig) binned by species (row) and statistic (column).
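For readers unfamiliar with these bulk metrics, the following Python sketch computes mean bias, normalized mean bias, root-mean-square error, and the Pearson correlation coefficient for paired model and observed values. It assumes the conventional definitions of these statistics; the function names and sample values are hypothetical and are not taken from the study.

```python
import math


def mean_bias(model, obs):
    """MB: average of (model - observation)."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)


def normalized_mean_bias(model, obs):
    """NMB in percent: summed bias divided by summed observations."""
    return 100.0 * sum(m - o for m, o in zip(model, obs)) / sum(obs)


def rmse(model, obs):
    """Root-mean-square error of the model-observation pairs."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def pearson_r(model, obs):
    """Pearson correlation coefficient between model and observations."""
    mean_m = sum(model) / len(model)
    mean_o = sum(obs) / len(obs)
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, obs))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_m * var_o)


if __name__ == "__main__":
    modeled = [10.0, 12.0, 14.0]   # hypothetical model values at monitor sites
    observed = [9.0, 12.0, 15.0]   # hypothetical paired observations
    print(mean_bias(modeled, observed), normalized_mean_bias(modeled, observed))
    print(rmse(modeled, observed), pearson_r(modeled, observed))
```

Comparing an altered case with the orig simulation then amounts to computing each metric for both against the same observations and taking the difference.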
Absolute differences in bulk statistical metrics for daily
PM
Stacked bar plots of RMSE (
Bulk statistical results with respect to in situ observations and compared
to the orig simulation (Fig. 4) are encouraging; differences are small, ignoring regional or temporal stratification. To determine if statistical results fluctuate spatially (by region) and/or temporally (by season), RMSE was computed for nine different subregions (regions are portrayed in Fig. 1)
across the United States for four seasons (winter, spring, summer, and fall)
from the mentioned observation and model pairs. Each region's RMSE was
stacked together, by simulation and case, and plotted as “accumulated RMSE”
by species. Likewise, results are negligible for daily PM
Results indicate that all simulations and cases have negligible differences
in terms of bulk statistical metrics across the United States and considering
regional and temporal stratifications. Statistical results conducted on in
situ observations were redone (methodologically) at the grid level for
hourly PM
Stacked bar plots of changes to RMSE (
Maximum absolute bias (versus the orig simulation) for PM
Maximum absolute bias (versus the orig simulation) for O
Maximum absolute bias (versus the orig simulation) for NH
Additionally, the maximum absolute bias for all grid cells was determined
spatially between the orig simulation and the altered simulations and cases
throughout 2016 for PM
Total absolute bias difference between the orig simulation and the altered cases and simulations by deposition rate (row) throughout 2016 utilizing hourly output.
Maximum and minimum biases (altered – orig) calculated from hourly CMAQ output for all simulations and cases with respect to the orig simulation across all grid cells.
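The grid-level summaries named in these captions (maximum and minimum bias, maximum absolute bias, and total absolute difference relative to the orig simulation) can be sketched as below. This is a minimal Python illustration that assumes each field has been flattened into a list over all grid cells and hours; the function name `bias_summary` and the sample values are hypothetical.

```python
def bias_summary(orig, altered):
    """Summarize grid-cell differences (altered - orig) between two fields."""
    if len(orig) != len(altered):
        raise ValueError("fields must have the same number of values")
    diffs = [a - o for o, a in zip(orig, altered)]
    return {
        "max_bias": max(diffs),                       # largest positive change
        "min_bias": min(diffs),                       # largest negative change
        "max_abs_bias": max(abs(d) for d in diffs),   # worst-case cell
        "total_abs_diff": sum(abs(d) for d in diffs), # summed over all values
    }


if __name__ == "__main__":
    orig_field = [1.0, 2.0, 3.0, 4.0]       # hypothetical orig output
    altered_field = [1.5, 2.0, 2.0, 4.25]   # hypothetical altered output
    print(bias_summary(orig_field, altered_field))
```

Reporting the maximum and minimum separately, in addition to the absolute value, shows whether the changes are one-sided or scattered around zero.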
The final aspect of this evaluation explores differences of important
deposition rates using bar plots which depict the sum of hourly absolute
differences (for all cells across the domain) between the orig simulation and
the altered simulations and cases. Bar plots were created for the wet-deposition rates of sodium (Na), ammonium (NH
No error accumulation due to the non-systematic changes in model inputs
(changing precision introduces both positive and negative changes in a
spatially and temporally random manner) can occur over the course of the
annual simulation for chemical species of interest such as O
We have demonstrated that altering data by keeping a specified number of significant digits, in emission input and/or simulated output, increased compression efficiency for two popular compression utilities (gzip and bzip2). bzip2 performed better than gzip: compared to the orig case, it reduced compressed file sizes, on average, by 6 %, 25 %, and 48 % for emission data and by 19 %, 47 %, and 69 % for output data for the A05, A04, and A03 cases, respectively. In terms of daily simulation runtime for the entire simulation year, the A05, A04, and A03 simulations were faster than the orig simulation on a non-dedicated HPC system for most simulation days.
As for accuracy, results for all studied simulations, either with
altered precision emission only, or with altered precision emission plus
altered precision output, produced numerically insignificant differences.
For example, the maximum absolute, bulk statistical difference between the
orig simulation and the altered cases and simulations for daily PM
Statistical inconsistencies arise when comparing grid–grid values of hourly
PM
In summary, altering datasets by truncation to retain fewer significant
digits significantly improved data compression and slightly improved
runtime. Based on the thorough, yet spatially limited, in situ evaluation,
this study has shown this proposed technique did not compromise model
accuracy based on an evaluation of simulations and cases at in situ
locations compared to current air quality thresholds for daily PM
The source code of the tool to alter data by keeping a specific number of
significant digits and a run script which includes usage instructions for
this tool, is available from
MSW conducted the runs, performed data analysis, created graphics, and wrote the first draft of the manuscript and worked with DCW to improve it. DCW originated and oversaw this work, coded the tool to alter data by keeping a specific number of significant digits, created scripts to run the entire experiment, outlined the first draft of the manuscript, and contributed to writing and improving the manuscript.
The contact author has declared that none of the authors has any competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This paper was edited by Sergey Gromov and reviewed by two anonymous referees.