Remote sensing observations in the mid-infrared spectral region (4–15

Mid-infrared radiative transfer, covering the spectral range from 4 to 15

Today, mid-infrared radiance measurements provided by satellite instruments are often assimilated directly for global forecasting, climate reanalyses,
or air quality monitoring by national and international weather services and research centers. For example, observations by the fleet of Infrared
Atmospheric Sounding Interferometer (IASI) instruments

The vast number of remote sensing observations from next-generation satellite sensors poses a big data challenge for numerical weather prediction and
Earth system science. Fast and accurate radiative transfer models (RTMs) for the Earth's atmosphere are a key component for the analysis of these
observations. For example, the Radiative Transfer for TOVS (RTTOV) model

However, many traditional RTMs suffer from large computational costs, typically requiring high-performance computing resources to process multiyear
satellite missions. In this study, we focus on the porting and optimization of radiative transfer calculations for the mid-infrared spectral region to
graphics processing units (GPUs). GPUs bear a high potential for accelerating atmospheric radiative transfer calculations. For instance,

An alternative approach to perform atmospheric infrared radiative transfer calculations on field-programmable gate arrays (FPGAs) was proposed by

In this study, we will discuss porting and performance analyses of radiative transfer calculations based on the emissivity growth approximation (EGA)
method as implemented in the JUelich RApid Spectral SImulation Code (JURASSIC) to GPUs. The EGA method as introduced by

The JURASSIC model was first described by

The GPU-enabled version of the JURASSIC radiative transfer model described here is referred to as JURASSIC-GPU. The first version of JURASSIC-GPU was
developed and introduced by

In Sect.

The first step in numerical modeling of infrared radiative transfer is the definition of the atmospheric state. In JURASSIC, the atmosphere is assumed
to be homogeneously stratified, and field quantities such as pressure

For coordinate transformations between spherical and Cartesian coordinates, JURASSIC assumes the Earth to be spherical with a fixed mean radius

Once the atmospheric state is defined, the ray paths through the atmosphere need to be calculated. Refraction in the Earth's atmosphere bends the ray paths towards the Earth's surface and must be taken into account. In the limb sounding geometry, this effect lowers real tangent heights in the troposphere by up to several hundred meters compared with geometric tangent heights calculated without refraction.

The positions along a single ray path

Here,

The step size

For the limb sounding geometry, it is of particular interest to know the real tangent height of the ray paths when refraction is taken into account. For this purpose, JURASSIC applies a parabolic interpolation based on the three points of the ray path closest to the Earth's surface, which allows a more accurate determination of the tangent point. The error of the tangent heights estimated by the parabolic interpolation method was found to be 1–2 orders of magnitude below the accuracy with which this quantity can typically be measured.
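The parabolic tangent height estimate can be illustrated with a minimal C sketch. It is not taken from the JURASSIC source: the function name, the use of a path coordinate `s` as the independent variable, and the fallback to the lowest sampled altitude are assumptions made for illustration only.

```c
#include <assert.h>

/* Altitude of the lowest point of the parabola through three ray-path points.
   s[] holds the path coordinate and z[] the altitude of each point (same
   length unit). All names are illustrative, not JURASSIC identifiers. */
static double tangent_height_parabolic(const double s[3], const double z[3]) {
  assert(s[0] < s[1] && s[1] < s[2]);

  /* Newton divided differences of the interpolating parabola. */
  const double d01 = (z[1] - z[0]) / (s[1] - s[0]);
  const double d12 = (z[2] - z[1]) / (s[2] - s[1]);
  const double d012 = (d12 - d01) / (s[2] - s[0]);

  /* Lowest sampled altitude, used as a fallback (assumption of this sketch). */
  double zmin = z[0];
  if (z[1] < zmin) zmin = z[1];
  if (z[2] < zmin) zmin = z[2];

  /* The parabola has a minimum only if it opens upwards. */
  if (d012 <= 0.0) return zmin;

  /* Vertex position from dz/ds = d01 + d012 * (2 s - s0 - s1) = 0. */
  const double sv = 0.5 * (s[0] + s[1]) - d01 / (2.0 * d012);

  /* Do not extrapolate beyond the three sampled points. */
  if (sv < s[0] || sv > s[2]) return zmin;

  /* Evaluate the parabola at its vertex. */
  return z[0] + d01 * (sv - s[0]) + d012 * (sv - s[0]) * (sv - s[1]);
}
```

For three points sampled from z = (s - 1)^2 + 5, the function recovers the minimum altitude of 5 even though none of the sampled points lies at the vertex, which is the benefit over simply taking the lowest sampled point.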

The propagation of monochromatic radiance

In the case of local thermodynamic equilibrium and if scattering of radiation can be neglected, the source function corresponds to the Planck function,

The transmissivity of the atmosphere is determined by specific molecular rotational–vibrational wavebands of the trace gases and by a series of
continuum processes. For molecular emitters, the transmissivity along the ray path is related to the absorption coefficients

Some species are significantly affected by line mixing, whereby the interaction or overlap of spectral lines can no longer be described by simple
addition

Satellite instruments measure radiance spectra at a given spectral resolution. The mean radiance

The most accurate method for calculating the monochromatic emissivity

Instead of detailed monochromatic calculations of the radiance, emissivity, and Planck function, this approximation uses only the spectral averages (indicated by bars) within the spectral range defined by the filter function. Accordingly, the method is referred to as the band transmittance approximation. This method can become computationally very fast if the spectrally averaged emissivities are determined by means of a band model or, as is the case for JURASSIC, by using a set of look-up tables that have been precalculated by means of a line-by-line model.

Equations (

Another error of the band transmittance approximation results from the fact that the spectral correlations of different emitters are neglected. From
Eq. (

The residual

Such correlation terms are neglected in the JURASSIC model. The associated errors remain small if at least one of the emitters has a relatively
constant spectral response. The approximation

For the numerical integration of the approximated radiative transfer equation, Eq. (

Integration of radiance along a ray path. Contributions

It is essential to note that although the numerical integration scheme in Eq. (

The full advantage in terms of speed of the approximated radiative transfer calculations can be obtained by using look-up tables of spectrally
averaged emissivities, which have been prepared for JURASSIC by means of line-by-line calculations. In subsequent radiative transfer calculations, the
spectral emissivities are determined by means of simple and fast interpolation from the look-up tables. For the calculation of the emissivity look-up
tables, any conventional radiative transfer model can be used, which allows calculating the transmission of a homogeneous gas cell depending on
pressure

The pressure, temperature, and column density values in the emissivity look-up tables need to cover the full range of atmospheric conditions. If the
coverage or the sampling of the tables is too low, this could significantly worsen the accuracy of the radiative transfer calculations. We calculated
the look-up tables for pressure levels

Since even a single forward calculation for a remote sensing observation may require thousands of interpolations on the emissivity look-up tables,
this process must be implemented as efficiently as possible. Direct index calculation is applied for regularly gridded data (temperature), and the bisection
method
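The two grid search strategies mentioned here can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the JURASSIC implementation: the function names are hypothetical, grids are assumed to be sorted in ascending order, and out-of-range queries are clamped to the first or last interval.

```c
#include <assert.h>

/* Regular grid: the interval index follows directly from the grid spacing. */
static int locate_regular(const double *x, int n, double xq) {
  assert(n >= 2);
  const double r = (xq - x[0]) / (x[1] - x[0]);
  if (r < 0.0) return 0;              /* clamp below the grid */
  if (r >= (double)(n - 1)) return n - 2; /* clamp above the grid */
  return (int)r;
}

/* Irregular ascending grid: bisection needs O(log n) comparisons. */
static int locate_irregular(const double *x, int n, double xq) {
  assert(n >= 2);
  int lo = 0, hi = n - 1;
  while (hi - lo > 1) {
    const int mid = lo + (hi - lo) / 2;
    if (xq >= x[mid])
      lo = mid;
    else
      hi = mid;
  }
  return lo; /* x[lo] <= xq < x[lo + 1] for queries inside the grid */
}

/* Linear interpolation of y within interval i of the grid x. */
static double interp_on_grid(const double *x, const double *y, int i,
                             double xq) {
  return y[i] + (y[i + 1] - y[i]) * (xq - x[i]) / (x[i + 1] - x[i]);
}
```

Both search functions return an index between 0 and n - 2, so the interpolation helper can always access the grid points i and i + 1.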

As an example, Fig.

Log–log plot of spectral mean emissivity curves for carbon dioxide (

Spectral mean emissivities of an inhomogeneous atmospheric path can be obtained from the look-up tables in different ways. In JURASSIC, the EGA method

For a given ray path of

The emissivity of the extended path of

The basic assumption of the EGA method is that the total emissivity of

The principle of the EGA method is further illustrated in Fig.

Illustration of the EGA method. For the first segment of a ray path, the EGA method will apply the emissivity curve
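The recursion of the EGA method can be sketched in C as shown below. The sketch follows the general idea of the method (find the pseudo column density at which the emissivity curve of the current segment reproduces the emissivity accumulated so far, then continue along that curve by the column density of the segment). Everything else is an assumption for illustration: the analytic curve `eps_curve` is a hypothetical stand-in for an interpolated look-up table, each segment is characterized by a single parameter `k` instead of pressure and temperature, and the inversion uses plain bisection.

```c
#include <assert.h>

/* Hypothetical emissivity curve of a homogeneous path: rises monotonically
   from 0 towards 1 with column density u. Stands in for a look-up table. */
static double eps_curve(double k, double u) {
  return k * u / (1.0 + k * u);
}

/* Pseudo column density u* with eps_curve(k, u*) = eps, found by bisection
   on [0, u_max]. Requires eps < eps_curve(k, u_max). */
static double pseudo_column(double k, double eps, double u_max) {
  assert(eps >= 0.0 && eps < eps_curve(k, u_max));
  double lo = 0.0, hi = u_max;
  for (int it = 0; it < 200; it++) {
    const double mid = 0.5 * (lo + hi);
    if (eps_curve(k, mid) < eps)
      lo = mid;
    else
      hi = mid;
  }
  return 0.5 * (lo + hi);
}

/* Total emissivity of a path of n segments with curve parameters k[i]
   and column densities u[i], accumulated segment by segment. */
static double ega_path_emissivity(const double *k, const double *u, int n,
                                  double u_max) {
  double eps = 0.0;
  for (int i = 0; i < n; i++) {
    /* Where on the curve of segment i do we already have emissivity eps? */
    const double u_star = pseudo_column(k[i], eps, u_max);
    /* Follow that curve further by the column density of segment i. */
    eps = eps_curve(k[i], u_star + u[i]);
  }
  return eps;
}
```

A useful sanity check of such a scheme is the homogeneous case: if all segments share the same curve, the recursion must reproduce the emissivity of the combined column density.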

JURASSIC is written in the C programming language and makes use of only a few library dependencies. In particular, the GNU Scientific Library (GSL) is
used for linear algebra in the retrieval code provided along with JURASSIC. In order to connect the GPU implementation of JURASSIC seamlessly to the
reference implementation

For the GPU programming, we selected the Compute Unified Device Architecture (CUDA) programming model, which exclusively addresses NVIDIA graphical
processors. CUDA is a dialect of C/C++ and the user has to write compute kernels in a specific CUDA syntax.

This minimal example already shows some of the most important features of the CUDA programming model. There are kernels with the attribute

Inside our example CUDA kernel

The example so far assigns zero to the values of array

The inbuilt variable

Graphical processors are equipped with their own memory, typically based on a slightly different memory technology compared to standard CPU memory
(SDRAM versus DRAM). Therefore, pointers to be dereferenced in a CUDA kernel need to reside in GPU memory, i.e., need to be allocated with

In the unified memory model (supported by all modern NVIDIA GPUs)

For production codes, one might accept the extra burden of maintaining a separate GPU version of an application code. However, despite being considered ready for production, JURASSIC remains a research code under continuous development; i.e., it should be possible to add new developments at any time without excessive programming effort. Therefore, a single source policy has been pursued as much as possible. Having a single source also provides the practical advantage that a CPU and a GPU version can be compiled and linked into the same executable.

Retaining a single source is a driving force for directive-based GPU programming models such as OpenACC, wherein CPU codes are converted to GPU codes by code annotations, comparable to OpenMP pragmas for CPUs. On the one hand, CUDA kernels are considered mandatory in order to obtain the best performance and to acquire maximum control over the hardware. On the other hand, the coding complexity outlined above (kernels, drivers, launch parameters, grid stride loops, GPU pointers, memory transfers) should remain hidden to some extent from those developers of the code, such as domain scientists or students, who are not familiar with all peculiarities of GPU programming. This poses a challenge for enforcing a single source policy.

For these reasons, the GPU-enabled version of JURASSIC is structured as follows: all functionality related to ray tracing and radiative transfer that
should run on both CPU and GPU is defined in inline functions in a common header file

Common source code approach for CPUs and GPUs. The complete functionality of the JURASSIC forward model is provided via a joint header file included by both the specific CPU and GPU drivers.
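The common source approach can be sketched as follows. This is a minimal illustration, not an excerpt from JURASSIC: the macro `HOST_DEVICE`, the function names, and the choice of a linear interpolation routine as shared functionality are hypothetical. The sketch relies on the fact that the CUDA compiler defines `__CUDACC__`, so the same inline function compiles as plain C for the CPU driver and as a host and device function for the GPU driver.

```c
#include <assert.h>

/* Expands to CUDA function qualifiers only when compiled by nvcc. */
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

/* Shared functionality: would live in the common header. */
static inline HOST_DEVICE double interp_lin(double x0, double y0, double x1,
                                            double y1, double x) {
  return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
}

/* CPU driver: a plain loop over all tasks. */
static void interp_cpu(const double *x, double *y, int n, double x0,
                       double y0, double x1, double y1) {
  assert(n >= 0);
  for (int i = 0; i < n; i++)
    y[i] = interp_lin(x0, y0, x1, y1, x[i]);
}

#ifdef __CUDACC__
/* GPU driver: a kernel with a grid stride loop calling the same function.
   The pointers x and y must refer to memory accessible from the GPU. */
__global__ void interp_gpu(const double *x, double *y, int n, double x0,
                           double y0, double x1, double y1) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    y[i] = interp_lin(x0, y0, x1, y1, x[i]);
}
#endif
```

With this pattern, a change to `interp_lin` automatically applies to both versions, and only the thin drivers differ between CPU and GPU.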

Using profile-guided analysis of execution runs of the JURASSIC reference code

The look-up tables are densely sampled so that linear interpolation for all continuous quantities (

Depending on the number of emitters and detector channels, the total memory consumption of the look-up tables can become quite large. A typical configuration of sampling points leads to 3 MiB per gas and per instrument channel. Random access to data sets of this size usually leads to inefficient usage of the memory caches attached to the CPUs and GPUs. The idea of caches is to reduce the memory access latency and potentially also the required bandwidth towards the main memory. A cache miss leads to a request for a cache line from the main memory, which typically incurs an access latency at least an order of magnitude larger than a read from cache. If the memory accesses of an application follow a predominant pattern, a large step towards computing and energy efficiency is to restructure the data layout such that cache misses are avoided as much as possible. Throughput-oriented architectures like GPUs work optimally when sufficiently many independent tasks are kept in flight such that the device memory access latencies can be hidden behind computations on different tasks.

Random access to memory for reading a single

The current JURASSIC reference implementation

On GPUs, branch divergence is an important source of inefficiency. Therefore, the mapping of parallel tasks onto lanes (CUDA threads) and CUDA blocks
has been chosen to avoid divergent execution as much as possible. For the computation of the EGA kernels, lanes are mapped to the different detector
channels

Furthermore, coalesced loads are critical to exploit the available GPU memory bandwidth. Although it seems counterintuitive, the GPU version of
JURASSIC therefore features a restructured data layout for the look-up tables:

Compared to the data layout in Eq. (

On CPUs, simultaneous multithreading, also known as hyper-threading, is achieved by assigning more than one thread to a core and by time-sharing of the execution time. This means that CPU threads can execute for a while until the operating system tells them to halt. Then, the thread context is stored and the context of the next thread to execute is loaded. GPUs can operate in the same manner; however, storing and loading the context can become a bottleneck, as this means extra memory accesses. The best operating mode for GPUs is reached if the entire state of all blocks in flight can be kept inside the register file. Only then do context switches come at no extra cost. Consequently, GPU registers are a limited resource that should be monitored when tuning performance-critical kernels. The CUDA compiler can report the register usage and the spill loads and stores of each CUDA kernel.

Data flow graph for the most important sub-kernels in JURASSIC. Data items are shown as ovals independent of whether they are stored in memory or exist only as intermediate results. Sub-kernels are depicted as rectangles. The arrow labels indicate the data sizes in units of kibibytes. The example refers to a nadir use case considering a single trace gas (

Figure

In the reference code of the forward model, the computation of continuum emissions requires many registers. The exact number of registers depends on
the combination of trace gases yielding continuum emissions in the given spectral range. A gas continuum is considered relevant if any of the detector
channels of the radiative transfer calculations fall into their predefined wavenumber window. JURASSIC implements the

Table

Register counts for the 16 possible combinations of switching on or off the

In the following sections, we discuss the verification and performance analysis of the JURASSIC-GPU code. All performance results reported here were
obtained on the Jülich Wizard for European Leadership Science (JUWELS) supercomputing system at the Jülich Supercomputing Centre, Germany

In the study of

Nadir

In this assessment, we aim for rather extensive coverage of the mid-infrared spectral range. By means of line-by-line calculations with RFM, we
prepared emissivity look-up tables for 27 trace gases with 1

In order to verify the model, we continuously compared GPU and CPU calculations during the development and optimization of JURASSIC-GPU. For the test
case presented in this study, it was found that the GPU and CPU calculations do not provide bit-identical results. However, the relative
differences between the calculated radiances from the GPU and CPU code remain very small (

Reference spectra for midlatitude atmospheric conditions at 650 to 2450

It needs to be considered that many trace gases cover only limited wavebands throughout the mid-infrared spectrum (Fig.

Spectral coverage of 24 selected gases between 650 and 2450

Figure

GPU runtime per ray as a function of wavenumber. Bundles of 32 wavenumbers from 650 to 2449

In order to quantify the correlation between the GPU runtime and the average number of active look-up tables, the data are presented as a scatter plot
in Fig.

Linear scaling model to estimate GPU runtime: the GPU runtime per ray can be modeled as a linear function of the number of active look-up tables. For each ray, a V100 GPU processing 32 channels needs approximately 16.4

The average number of active look-up tables in the wavenumber interval from 650 to 2450

As pointed out in Sect.

In order to understand the effect of cold caches, we repeated the measurements and looked specifically at the first timing result. With
cold caches, the slope of the linear model for the runtime increased from 11.3 to 12.5

From this, we can deduce the runtime increase due to memory page misses in the GPU memory. Table

GPU runtime measured and modeled for 16 896 limb ray paths with 12.78 of 27 look-up tables active. In the unified memory model, data are transferred into GPU memory on demand so a performance penalty is observed at first access (referred to as cold cache).
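A linear runtime model of this kind can be sketched in C as follows. The sketch assumes an ordinary least-squares fit of the runtime per ray against the number of active look-up tables; the fitting procedure actually used for the figures is not stated in the text, and the function names are hypothetical. Coefficients and runtimes are in whatever time unit the caller uses.

```c
#include <assert.h>

/* Ordinary least-squares fit of y = offset + slope * x to n samples,
   e.g. x = number of active look-up tables, y = runtime per ray. */
static void fit_linear(const double *x, const double *y, int n,
                       double *offset, double *slope) {
  assert(n >= 2);
  double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
  for (int i = 0; i < n; i++) {
    sx += x[i];
    sy += y[i];
    sxx += x[i] * x[i];
    sxy += x[i] * y[i];
  }
  const double det = n * sxx - sx * sx;
  assert(det > 0.0); /* requires at least two distinct x values */
  *slope = (n * sxy - sx * sy) / det;
  *offset = (sy - *slope * sx) / n;
}

/* Modeled total runtime for n_rays ray paths with, on average,
   n_active look-up tables active per ray. */
static double model_runtime(double offset, double slope, double n_active,
                            double n_rays) {
  return n_rays * (offset + slope * n_active);
}
```

Comparing the modeled total with a measured runtime for a warm and a cold start then isolates the penalty of the first access to the look-up tables.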

In this section, we investigate the scaling behavior of the GPU runtime with respect to the number of ray paths and the number of instrument
channels. The GPU runtime results regarding the scaling with respect to the number of ray paths in Fig.

Scaling of the GPU runtime in the limb case. The left panel shows the runtime as a function of the number of ray paths for 1, 2, 4, 8, 16, and 32 channels. The right panel shows the same data in a different projection: the different black lines refer to

Figure

During earlier tuning efforts of the JURASSIC GPU version, much effort was spent on radiative transfer calculations for the 4.3 and
15

The performance data for the limb case shown in Fig.

The code restructuring described above and by

The best performance results for the CPU version are achieved when running large workloads with 32 channels on two OpenMP threads per core. We find
that the runtime is close to proportional to the number of ray paths over the entire range from 64 to 65

Scaling of the CPU runtime with respect to the numbers of ray paths. See the caption of Fig.

The benchmarks of the reference version (REF) have been restricted to workloads of up to 4096 ray paths, as the code has not been optimized to handle
larger workloads. However, Fig.

Scaling of the (REF) reference implementation runtime with the number of ray paths and channels. In contrast to Fig.

We extracted the best performance from GPU, CPU, and REF implementations and summarized them in Table

Time to solution and estimated energy to solution comparison between GPU version, CPU version, and the reference implementation

A direct comparison of GPU runtimes to CPU runtimes is in most cases hardly meaningful as we need a CPU to operate a GPU. Therefore, we tried to
estimate the power consumption of the compute node with and without GPUs active. We assume a thermal design power (TDP) envelope of 300

From Table

High-performance computing using graphics processing units (GPUs) is an essential tool for advancing computational science. Numerical modeling of infrared radiative transfer on GPUs can achieve considerably higher throughput compared to standard CPUs. In this study, we found that this also applies to the emissivity growth approximation (EGA), which allows us to efficiently estimate band-averaged radiances and transmittances for a given state of the atmosphere, avoiding expensive line-by-line calculations.

In order to enable the GPU acceleration, including ray tracing and the EGA method, a major redesign of the radiative transfer model JURASSIC has been necessary. Besides the goal of maximizing the GPU's throughput, the code base has been transformed to offer both a GPU and a CPU version of the forward model; the number of duplicate source code lines has been minimized, facilitating better code maintenance.

The GPU version of JURASSIC has been tuned to deliver outstanding performance for the nadir geometry in earlier work. In the nadir case, only

In order to find a figure of merit to evaluate the application porting and restructuring efforts for JURASSIC, we tried to assess the performance ratio of GPUs over CPUs. In terms of energy to solution, we found the GPU version to be about 9 times more energy efficient than its CPU counterpart. The CPU version, in turn, is about 14 times faster than the reference implementation from which the porting project started.

Although there are further ideas for performance tuning and code optimization, including the idea to implement analytic Jacobians for data assimilation and retrieval applications, the given achievements in terms of improved CPU performance and utilization of GPUs are considered an important step forward in order to prepare the JURASSIC radiative transfer model for large-scale data processing of upcoming satellite instruments.

The most recent version of the JURASSIC-GPU model is available at

PFB and LH developed the concept for this study. PFB is the main developer of the GPU implementation and LH the main developer of the CPU reference implementation of JURASSIC. PFB conducted the verification tests and performance analyses of the JURASSIC-GPU code. Both authors made equal contributions to writing the paper.

The contact author has declared that neither they nor their co-author has any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was made possible by efforts conducted in the framework of the POWER Acceleration and Design Center. We thank the Jülich Supercomputing Centre for providing access to the JUWELS Booster. We acknowledge the consultancy of Jiri Kraus (NVIDIA) and earlier contributions by Benedikt Rombach (IBM) and Thorsten Hater (JSC) to the software development as well as vivid discussions with Sabine Grießbach (JSC).

The article processing charges for this open-access publication were covered by the Forschungszentrum Jülich.

This paper was edited by Sylwester Arabas and reviewed by two anonymous referees.