Articles | Volume 19, issue 9
https://doi.org/10.5194/gmd-19-3893-2026
Model description paper | 13 May 2026

psit 1.0: a system to compress Lagrangian flows

Alexander Pietak, Langwen Huang, Luigi Fusco, Michael Sprenger, Sebastian Schemm, and Torsten Hoefler
Abstract

Meteorological simulations produce large amounts of data, which can be challenging to store, share, and analyze. As weather and climate models increasingly simulate the atmosphere at higher spatio-temporal resolution, it becomes imperative to compress the data effectively. While compression algorithms exist for weather data stored in a gridded Eulerian frame, there are, to date, no specialized alternatives for data stored in the Lagrangian frame. In this study, we present psit, a system to compress weather data stored in the Lagrangian frame. The system works by mapping the trajectories to a grid structure, applying additional encodings to these grids, and passing them to either the JPEG 2000 image compression algorithm or SZ3. The specialty of the algorithm is the mapping phase and the subsequent encodings, which generate the grids in a way that allows the aforementioned compression algorithms to perform well. To gauge the performance of psit, we evaluate a variety of metrics. We demonstrate that in the majority of cases, psit attains compression performance equivalent or superior to naive compression with ZFP or SZ3. We also compare compression errors with measurement inaccuracies. Here, we show that the density of 168 hour long trajectories compressed with a ratio in the range of 30 to 40 behaves similarly to that of trajectories calculated from uncompressed wind fields with additional random perturbations of magnitude 0.1 m s−1 in the horizontal and around 6×10−3 Pa s−1 in the vertical component. Additionally, we conduct two case studies in which we discuss the impact of compression on the study of warm conveyor belts associated with extratropical cyclones and on the radioactive plume prediction of the Fukushima incident in 2011.

1 Introduction

With advancements in parallel computing and the introduction of larger and more powerful HPC systems, the size and complexity of today's meteorological models have increased significantly. Such an increase in complexity leads to a rise in the amount of output data, reaching the range of dozens of petabytes (Hoefler et al.2024, 2023). Storing and analysing such a massive amount of data poses significant challenges. Consequently, different approaches have been developed to address this issue. One approach works by reducing the number of elements that are stored (Tintó Prims et al.2024). In the case of weather data, this can be achieved, for example, by reducing the grid resolution of the stored data or decreasing the temporal interval at which data is stored. This approach is already in use for the ERA5 weather dataset (Tintó Prims et al.2024). A second, commonly used approach focuses not on reducing the number of elements but on decreasing the space occupied by each of them. This is the field of compression algorithms, which reduce data size by exploiting structure present in the underlying data, a technique commonly used in multimedia, as seen in formats like JPEG (Wallace1991) and H.265 (Sullivan et al.2012). For the compression of scientific data, we usually prefer lossy compression, as, due to the inherent noise present in the data, lossless compression algorithms achieve only modest compression ratios (in the range of 2 to 4). We will therefore focus on lossy compression techniques.

Several lossy compression algorithms for the compression of multidimensional floating-point data have been developed. A non-exhaustive list includes SZ3 (Liang et al.2023b, 2018; Zhao et al.2021), ZFP (Lindstrom2014; Diffenderfer et al.2019), and THRESH (Ballester-Ripoll et al.2019; Ballester-Ripoll and Pajarola2015). Previous research by Tintó Prims et al. (2024) has demonstrated the effectiveness of these floating-point compression algorithms in compressing weather data in the Eulerian frame. There are also specialized compression algorithms developed specifically for weather data, such as VAEformer (Han et al.2024), an autoencoder-based compression algorithm designed for the compression of the ERA5 weather dataset. Hence, a wide variety of compression algorithms exist that can be utilized for the compression of weather data in the Eulerian frame.

On the other hand, meteorologists not only work with the Eulerian representation but also often consider a Lagrangian one. In this representation, we do not have a multidimensional grid but instead track the position and properties, like temperature, of small air parcels carried by the wind, resulting in so-called trajectories. This representation is crucial for certain types of analysis, such as identifying warm conveyor belts (Wernli and Davies1997b; Joos and Wernli2012; Schemm et al.2013) or stratosphere-troposphere exchanges (Holton et al.1995; Škerlak et al.2014). However, here we also face the problem of an increase in output data and therefore seek to compress it. Compared to multidimensional grid data, the landscape of compression algorithms for weather data trajectories is much poorer. There are mainly two related fields. The first is the compression of trajectory data, where research mostly focuses on data generated from GPS tracking of vessels or, more generally, the movement of objects in space over many time steps (Makris et al.2021). Examples of such algorithms include Dead-Reckoning, Douglas-Peucker, and Time-Ratio (Makris et al.2021), all of which reduce data volume by strategically removing individual time steps while preserving the general trajectories. However, these methods are not suitable for compressing trajectories originating from weather data for two main reasons. First, the number of time steps in weather trajectories is typically much lower than in GPS trajectories. For GPS trajectories, the number of time steps might be in the 10 000 range, while for weather trajectories, depending on the use case, one might work with only a few dozen (e.g., hourly to six-hourly temporal resolution). Additionally, removing time steps from weather trajectory data may not be permissible due to the specific requirements of subsequent research. Secondly, weather trajectory data often includes additional variables, such as temperature or potential vorticity, that are of interest along the trajectories. The above-mentioned trajectory compression algorithms only account for the position, not for these additional variables, which makes them unsuitable for compressing such weather data. The second related research field is the compression of unstructured mesh data. Different methods exist (Liang et al.2023a; Ren et al.2024; Wu et al.2025), and a recent survey has been made by Di et al. (2025). However, many of them are not directly designed for time series data but focus on other aspects such as point clouds (Quach et al.2022) or medical data (Al-Salamee and Al-Shammary2021). The research that does focus on time series data (Chiarot and Silvestri2023) mostly lies in the field of the Internet of Things rather than weather data. There thus appears to be a gap when it comes to the compression of trajectories originating from weather data.

To address this gap, we present psit, a lossy compression method for Lagrangian flow data. Psit works by transforming the trajectory data into 2D grids and passing those grids to existing compression algorithms, specifically JPEG 2000 (Skodras et al.2001) and SZ3, where JPEG 2000 has shown its potential in internal research and is used by the visualization community due to its adaptive scalability property (Woodring et al.2011). This approach leverages the refined performance of these compression algorithms to create a compression pipeline specifically designed for trajectory weather data.

2 Method

The general idea behind psit is to take the trajectories and intelligently transform them so that they can be passed to existing compression algorithms. This general pipeline is illustrated in Fig. 1. The input expected by the dense data compressors (JPEG 2000 and SZ3) consists of 2D arrays (i.e., grids), which ideally should be smooth, i.e., the difference between neighboring entries of the array should be small. Note that a single 2D array (representing one channel) can be passed to the JPEG 2000 algorithm, as it is able to handle an arbitrary number of channels (individual 2D arrays) and is not limited to the default three color channels (red, green, and blue) used for images. Therefore, in the first step (the mapping phase), the trajectories are transformed into such grids. During this mapping, it is crucial to ensure the smoothness of the resulting grids, as greater smoothness allows both compression algorithms to perform more effectively. Once the mapping step is completed, encoding schemes can be applied to the grids, namely delta and color encoding. These encoding schemes modify the grids in a way that leads to better compression performance for the previously mentioned compression algorithms. In the next step, the encoded grids are passed to a dense data compressor to perform the compression, and the compressed data is subsequently stored to disk. These four steps form the basis of psit and are discussed in more detail in the rest of this section.

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f01

Figure 1The compression pipeline of psit. In the first step, the mapping phase, trajectory data is converted into grid data. After these grids have been created, they can be manipulated to improve compression performance, namely through delta and color encoding. Then, as a final step, the grids are compressed with either JPEG 2000 or SZ3 and stored to disk. Decompression works the same way, but in the opposite direction.


The decompression phase mirrors the compression process: the data is loaded from disk and then decompressed using the same algorithm employed during compression (e.g., JPEG 2000). After this, the encoded grids are decoded (delta and color decoding) to obtain the mapped 2D grids. Finally, these 2D grids are transformed back into trajectories using an inverse mapping of the original one. After this step, the decompression is complete, and the trajectory data should closely match the original data, with only the errors introduced by the compression algorithm differentiating them.

2.1 Input data

In the following, we will provide a definition of what a trajectory and a possible input file are in order to establish a common basis for later discussions. We define a trajectory as a sequence of positional variables (longitude, latitude, and pressure, which indicates the height) defining a 3D path on Earth, along with additional sequences that store variables like temperature or humidity along it. Each entry in these sequences represents one time step. Multiple such trajectories (in the range of a few million) can then be collected together to form a trajectory file, which would function as input to our compression algorithm.

For example, an input file might consist of one million trajectories spanning 13 time steps, with each trajectory also storing temperature and humidity. At the first time step, the trajectories are uniformly initialized over 26 pressure levels spanning the entire globe, with some predefined distance between the different starting points. This would result in a file storing a total of five million individual sequences (five per trajectory: 3 positional and 2 data variables), where each sequence stores 13 elements.
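
For illustration, such a file can be pictured as a set of named sequences, one per data variable and trajectory. The following minimal Python sketch holds this structure in memory as NumPy arrays; it is purely illustrative, the array names and value ranges are hypothetical, and it is not the file format psit actually reads.

import numpy as np

n_traj, n_steps = 1_000_000, 13   # one million trajectories, 13 time steps each

rng = np.random.default_rng(0)
trajectory_file = {
    # positional variables: one sequence of n_steps values per trajectory
    "lon": rng.uniform(-180.0, 180.0, (n_traj, n_steps)).astype(np.float32),
    "lat": rng.uniform(-90.0, 90.0, (n_traj, n_steps)).astype(np.float32),
    "p":   rng.uniform(50.0, 550.0, (n_traj, n_steps)).astype(np.float32),   # pressure in hPa
    # additional data variables traced along the trajectories
    "T":   rng.uniform(200.0, 320.0, (n_traj, n_steps)).astype(np.float32),  # temperature
    "Q":   rng.uniform(0.0, 0.02, (n_traj, n_steps)).astype(np.float32),     # humidity
}

# Five sequences per trajectory (3 positional + 2 data variables), 13 elements each,
# i.e. five million individual sequences in total.
print(len(trajectory_file) * n_traj, trajectory_file["lon"].shape)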

2.2 Mapping

The mapping phase is the first step in the compression pipeline and focuses on transforming the trajectories into grids while ensuring that the resulting grids are smooth. To achieve this, a method for storing the trajectory data as grids must be established. A simple approach is used: each trajectory is associated with at least one grid point, as visually represented in Fig. 2. The next step consists of translating the data from the trajectories into grids. For this, we first create an individual grid for each time step, and because a trajectory consists of multiple data variables (3 positional and an arbitrary number of additional ones), we also create an individual grid for each of those. After all these grids are created, we simply set the value of each grid point to the value of the associated trajectory's data variable at that specific time step. This way, a collection of 2D grids can be created that stores the same information as the trajectories but is inherently 2D. For example, the previously introduced trajectory file consists of one million trajectories with 13 time steps each, which also store temperature and humidity. This file would be converted into 65 individual grids, where each grid would have at least one million entries, e.g., a dimension of 1500×750.
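
Once a mapping between trajectories and grid points exists, the grid-filling step itself is a simple lookup. The following sketch illustrates it for one data variable, assuming the mapping has already been computed; the names and shapes are illustrative and not those of the actual psit implementation.

import numpy as np

def build_grids(traj_data, mapping):
    """Turn per-trajectory sequences into one 2D grid per time step.

    traj_data : (n_traj, n_steps) values of one data variable, e.g. temperature
    mapping   : (height, width) index of the trajectory associated with each grid point
    returns   : (n_steps, height, width) stack of grids
    """
    grids = traj_data[mapping, :]        # (height, width, n_steps) via fancy indexing
    return np.moveaxis(grids, -1, 0)     # (n_steps, height, width)

# Example: 6 trajectories with 3 time steps, mapped onto a 2x3 grid.
traj_data = np.arange(18, dtype=float).reshape(6, 3)
mapping = np.array([[0, 1, 2],
                    [3, 4, 5]])
print(build_grids(traj_data, mapping).shape)   # (3, 2, 3)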

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f02

Figure 2Visual representation of a mapping: each trajectory is assigned to some grid points based on its initial position; note that for visual clarity only some mappings are drawn. The way in which this association is done forms an integral part of our pipeline and is based on solving a minimal weight full bipartite matching problem.

While creating such an initial association can be done trivially, there are constraints that make the task more difficult. The first and most crucial constraint is that the resulting grids need to be smooth. Secondly, we must ensure that every trajectory maps to at least one grid point. If this is not the case, a trajectory would have no representation in the compressed data and would therefore be lost during compression. These constraints, especially the smoothness one, prevent the use of naive mapping methods and require the use of more sophisticated techniques.

In our research, we analysed different mapping methods. The one that yields the best results is based on solving a minimal weight full bipartite matching problem, originating in graph theory (Burkard et al.2012). A second promising approach is based on solving a linear program (LP) (Bertsimas and Tsitsiklis1997). Solving this LP becomes computationally infeasible for large images, but we still include it for theoretical analysis.

2.2.1 Bipartite Mapping

For this mapping method (results in Fig. 3, with an example in Fig. 4), the problem is solved using concepts from graph theory, more specifically by solving a minimal weight full bipartite matching problem. Hence, following this approach, we have two sets of nodes: one called “workers” and the other called “tasks”, with weighted edges between them constraining which workers may be assigned to which tasks and what the cost of this assignment is. The goal is to match each worker to exactly one task while trying to keep the overall cost of the matchings as low as possible.

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f03

Figure 3Latitude grids with 550 hPa starting pressure created with the bipartite mapping method. Three different time steps are displayed: the leftmost image is at the first time step, the second one is 12 h later, and the rightmost one is 24 h later. In the leftmost image we have very good smoothness, which originates from the mapping method; over time this smoothness starts to degrade, as the trajectories start to diverge.


https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f04

Figure 4Example of how the bipartite mapping works. For this, 464 trajectories starting at two different pressure levels are considered. In a first step, we extract the starting positions of all the trajectories individually for the two different pressure levels. We then take these starting positions and create a mapping to a grid such that the resulting grids are smooth. This mapping is then used to generate a collection of grids over time for each of the data variables present on our trajectories.

This minimum weight full bipartite matching problem can be easily applied to our use case: the workers become the trajectories and the tasks the grid points. The weights are then defined as the Euclidean distance between the trajectories and the grid points at the first time step, after both are translated into 3D Cartesian space. To transform the grid points into Cartesian space, a projection onto a 3D sphere is used. This projection should distribute the grid points homogeneously over the globe, which makes a simple longitude-latitude mapping unsuitable; we chose the projection presented in Sect. 4 of the paper by Calhoun et al. (2008). The trajectories are transformed into Cartesian space by taking their spherical coordinates of longitude, latitude, and pressure and converting them to Cartesian x, y, and z coordinates. This transformation creates a spherical shell, representing a mismatch in dimensionality between the grid points and the trajectories. Therefore, the spherical shell is binned into discrete levels over its pressure, resulting in one sphere per bin. This means that, in addition to a grid being created for each time step and data variable, individual grids are also created for each of these pressure levels. If the algorithm then minimizes the Euclidean distance between the trajectories and the grid points, the resulting grids should become smooth: trajectories with similar starting positions are mapped to grid points that are close to one another, and trajectories starting at similar spatial positions behave rather similarly over time, so locating them next to each other results in a smoother grid. So, by representing our mapping problem as such a graph problem and solving it, we should end up with a valid and smooth mapping.
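
As a small-scale illustration of this construction (not the actual psit implementation, which restricts each trajectory to its closest neighbors and uses the Calhoun et al. (2008) projection), the following sketch builds a dense cost matrix for the trajectories of one pressure bin and solves the matching with SciPy; all positions are random placeholders.

import numpy as np
from scipy.optimize import linear_sum_assignment

def to_cartesian(lon_deg, lat_deg, radius=1.0):
    """Convert longitude/latitude (degrees) to 3D Cartesian coordinates on a sphere."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.stack([radius * np.cos(lat) * np.cos(lon),
                     radius * np.cos(lat) * np.sin(lon),
                     radius * np.sin(lat)], axis=-1)

rng = np.random.default_rng(0)

# Starting positions of the trajectories in one pressure bin (random placeholders).
traj_xyz = to_cartesian(rng.uniform(-180, 180, 500), rng.uniform(-90, 90, 500))

# Grid points of a 30x20 target grid, also projected onto the sphere (a plain
# lon-lat grid here; psit uses the projection of Calhoun et al., 2008).
glon, glat = np.meshgrid(np.linspace(-180, 180, 30), np.linspace(-90, 90, 20))
grid_xyz = to_cartesian(glon.ravel(), glat.ravel())

# Cost matrix: Euclidean distance between every trajectory and every grid point.
cost = np.linalg.norm(traj_xyz[:, None, :] - grid_xyz[None, :, :], axis=-1)

# Minimum-weight matching: every trajectory (row) is assigned one distinct grid point.
rows, cols = linear_sum_assignment(cost)
print(rows.shape, cols.shape)   # (500,) (500,); the remaining 100 grid points stay unmapped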

One problem with this mapping via minimal weight full bipartite matching is that, if the number of trajectories is less than the number of grid points, some grid points will not be mapped, leading to holes in the resulting grid. While it theoretically is possible to have the same number of trajectories as grid points, tests showed that this can become computationally infeasible and that in practice we need more grid points than trajectories. Therefore, we need to devise a way to handle this. The hole filling works by taking all the grid points that do not get mapped and mapping their closest trajectory (Euclidean distance in Cartesian space) to them (note that this means a trajectory may map to multiple grid points at the same time). With this method, the holes can be filled in an easy manner that does not require any costly computations.

2.2.2 LP Mapping

We also considered using linear programming. While this approach has high theoretical appeal, its practical usability is limited by the fact that the resulting LP becomes very large and requires a lot of resources to solve, while delivering nearly identical results to the approach based on minimal weight full bipartite matching. For our use case, we will model the mapping problem as an integer linear program (ILP), for which we then show that its LP formulation is integral, i.e., solving the LP formulation instead of the ILP one delivers the same results. This allows us to use a much cheaper LP solver, resulting in a large decrease in computational resources. In order to define an integer linear program, we need to specify an objective and a set of constraints that create a valid and smooth mapping from trajectories to grid points.

We first start with an intuitive definition of the objective and the constraints. The objective function is designed to minimize the sum of distances between the trajectories and their associated grid points. For the constraints, we say that each trajectory must map to at least one grid point and that each grid point must be mapped to. By the property of spatial continuity, a minimization of the objective function should lead to a smooth grid, while the constraints ensure that the mapping is valid. The rigorous mathematical definition is then given by:
Let the binary variables $x_{t,p} \in \{0,1\}$ denote whether a trajectory $t \in T$ is mapped to grid point $p \in P$ (0 if not mapped, 1 if mapped), and let the variable $d_{t,p} \in \mathbb{R}_{\geq 0}$ denote the Euclidean distance in 3D Cartesian space (defined analogously to the bipartite mapping method) between a trajectory $t \in T$ and a grid point $p \in P$. The resulting ILP is then given in Eq. (1):

(1) $\text{minimize} \quad \sum_{(t,p)\in T\times P} d_{t,p}\,x_{t,p}$
(2) $\text{s.t.} \quad \sum_{t\in T} x_{t,p} = 1 \;\; \forall p \in P, \qquad \sum_{p\in P} x_{t,p} \geq 1 \;\; \forall t \in T$

While the variables $x_{t,p}$ are defined to be binary, making the problem an integer LP, we can prove (see Appendix A) that the LP formulation, obtained by defining $x_{t,p} \in [0,1]$, is integral. This means that instead of an ILP solver, a much cheaper LP solver can be used during calculation.
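
For a tiny instance, the LP relaxation of Eq. (1) can be written down and solved directly; the sketch below uses scipy.optimize.linprog with random placeholder distances and is meant only to illustrate the formulation. Because the relaxation is integral (Appendix A), the returned solution is 0/1-valued.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_t, n_p = 4, 6                       # 4 trajectories, 6 grid points
d = rng.random((n_t, n_p))            # placeholder distances d_{t,p}

c = d.ravel()                         # objective: sum d_{t,p} x_{t,p}, x flattened as x[t*n_p + p]

# Equality constraints: each grid point is mapped to by exactly one trajectory.
A_eq = np.zeros((n_p, n_t * n_p))
for p in range(n_p):
    A_eq[p, p::n_p] = 1.0
b_eq = np.ones(n_p)

# Inequality constraints (written as <=): each trajectory maps to at least one grid point.
A_ub = np.zeros((n_t, n_t * n_p))
for t in range(n_t):
    A_ub[t, t * n_p:(t + 1) * n_p] = -1.0
b_ub = -np.ones(n_t)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
x = res.x.reshape(n_t, n_p)
print(np.round(x, 3))                 # a 0/1-valued assignment of trajectories to grid points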

2.3 Color Encoding

To motivate color encoding, we must examine the longitude grids created by the previously explored mappings (Fig. 5 top); which mapping we look at (LP or bipartite) is not important, as they deliver very similar results. As we can see from Fig. 5 top, there is a discontinuity line around the position of the date line at longitude 180°, which begins to warp as time progresses. This discontinuity arises because trajectories with a longitude value of −180 (black pixels) are adjacent to trajectories with longitude values of 180 (white pixels). Such jumps in pixel values lead to worsened compression performance, as shown by experiments in which we observed a decrease in RMSE error in the longitude variable by a factor of 1.4 when compressing with color encoding compared to no color encoding (using a compression factor of 15).

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f05

Figure 5Top: Longitude grids with 550 hPa starting pressure displayed over different time steps (0, 12, 24 h). At the date line there is a discontinuity which starts to warp over time. This discontinuity leads to worsened compression performance and we can use color encoding to fix this. Bottom: HSV (left) and XYZ (right) color encoded longitude grids with 550 hPa starting pressure displayed after 24 h. The date line discontinuity is gone.


https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f06

Figure 6Qualitative demonstration of the discontinuity line at the date line. On the left, only a single variable (black to white) is used to represent the longitude, and a discontinuity is created. On the right, two variables and a sine-cosine mapping are used, displayed as red and green. Using two variables, it is possible to create a continuous color cycle with no discontinuity.


We eliminate the discontinuity line by using multiple variables instead of the single one that maps the [−180°, 180°] longitude range to black-to-white pixel values. Hence, we map [−180°, 180°] to a multidimensional space. If we use two or three variables, this multidimensional space can be represented by a color bar. This concept is illustrated in Fig. 6, where on the left, we have a single variable mapping the longitude from black to white, generating a discontinuity in the color wheel. On the right, we use two variables (red and green) with a sine-cosine mapping to create a color bar, which results in a continuous color wheel. This simple concept can define a variety of different color encoding methods. In our research, we explored several such methods, and the ones that provided the best performance are based on utilizing three variables, covering the entire RGB range. We call them HSV and XYZ color encoding, and they are explained in the following sections.
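
The two-variable sine-cosine idea of Fig. 6 can be sketched in a few lines; this is only the conceptual two-channel illustration, not one of the three-channel encodings actually used by psit, which are described below.

import numpy as np

def encode_lon(lon_deg):
    """Map longitude in [-180, 180] to two continuous channels in [0, 1]."""
    phi = np.radians(lon_deg)
    return 0.5 * (1.0 + np.cos(phi)), 0.5 * (1.0 + np.sin(phi))

def decode_lon(c0, c1):
    """Invert the encoding back to a longitude in [-180, 180]."""
    return np.degrees(np.arctan2(2.0 * c1 - 1.0, 2.0 * c0 - 1.0))

# -180 and 180 degrees map to the same point, so the date-line jump disappears.
lons = np.array([-180.0, -90.0, 0.0, 90.0, 179.9])
print(np.round(decode_lon(*encode_lon(lons)), 3))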

2.3.1 HSV color encoding

The HSV color encoding method (Fig. 5 bottom left) uses the HSV (Smith1978) color representation, which, like RGB, is a way to describe color. However, instead of using values for red, green, and blue, it uses hue, saturation, and value to represent color. For our use case, we set the saturation and value to 1 and only vary the hue. For constant saturation and value, a varying hue generates a continuous color wheel. Another reason for setting saturation and value to 1 is that the JPEG 2000 algorithm might handle dark or washed-out colors differently, as the human eye cannot distinguish them as precisely as strong, bright colors.

The function $f(x)$ for converting a longitude value in the normalized range $[0,1]$ to an RGB value using an HSV representation is given in Eq. (3). The inverse mapping $f^{-1}([r,g,b]^T)$, which maps an RGB color value to a hue angle, is provided in Eq. (4).

(3) $f(x) = [h(5),\, h(3),\, h(1)]^T \quad \text{with} \quad h(l) = 1 - \max\bigl(0, \min(k,\, 4-k,\, 1)\bigr),\; k = (l + 6x) \bmod 6$
(4) $f^{-1}([r,g,b]^T) = \begin{cases} 0 & \text{if } c = 0 \\ \frac{1}{6}\bigl(\frac{g-b}{c} \bmod 6\bigr) & \text{if } v = r \\ \frac{1}{6}\bigl(\frac{b-r}{c} + 2\bigr) & \text{if } v = g \\ \frac{1}{6}\bigl(\frac{r-g}{c} + 4\bigr) & \text{if } v = b \end{cases} \quad \text{with } v = \max(r,g,b),\; k = \min(r,g,b),\; c = v - k$
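
A direct transcription of Eqs. (3) and (4) could look as follows; the input is assumed to be a longitude already normalized to [0, 1], and the function names are illustrative.

import numpy as np

def hsv_encode(x):
    """Eq. (3): map x in [0, 1] to an (r, g, b) triple on the HSV color wheel (S = V = 1)."""
    def h(l):
        k = (l + 6.0 * x) % 6.0
        return 1.0 - np.maximum(0.0, np.minimum(np.minimum(k, 4.0 - k), 1.0))
    return np.stack([h(5.0), h(3.0), h(1.0)], axis=-1)

def hsv_decode(rgb):
    """Eq. (4): recover the hue x in [0, 1) from an (r, g, b) triple."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = np.maximum(np.maximum(r, g), b)
    k = np.minimum(np.minimum(r, g), b)
    c = v - k
    safe_c = np.where(c == 0, 1.0, c)          # avoid division by zero for gray pixels
    hue = np.where(c == 0, 0.0,
          np.where(v == r, ((g - b) / safe_c) % 6.0,
          np.where(v == g, (b - r) / safe_c + 2.0,
                           (r - g) / safe_c + 4.0)))
    return hue / 6.0

x = np.linspace(0.0, 1.0, 7, endpoint=False)
print(np.allclose(hsv_decode(hsv_encode(x)), x))   # True: the encoding round-trips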

2.3.2 XYZ color encoding

The XYZ color encoding (Fig. 5 bottom right) method considers both the longitude and latitude data variables, combining them into a single three-channel color grid. To this end, the longitude and latitude values are converted into Cartesian coordinates, and the r, g, b color channels are created from the x, y, z positional coordinates.

The function $f([x_\mathrm{lon}, x_\mathrm{lat}]^T)$, which maps both longitude and latitude in the normalized range $[0,1]$ to RGB values, is given in Eq. (5). The inverse function $f^{-1}([r,g,b]^T)$, which converts RGB values back to longitude and latitude, is given by Eq. (6).

(5) $f([x_\mathrm{lon}, x_\mathrm{lat}]^T) = \frac{1}{2}\bigl(1 + [\cos\hat{x}_\mathrm{lat}\cos\hat{x}_\mathrm{lon},\; \cos\hat{x}_\mathrm{lat}\sin\hat{x}_\mathrm{lon},\; \sin\hat{x}_\mathrm{lat}]^T\bigr)$ where $\hat{x}_\mathrm{lat} = \pi(x_\mathrm{lat}-\frac{1}{2})$, $\hat{x}_\mathrm{lon} = 2\pi(x_\mathrm{lon}-\frac{1}{2})$
(6) $f^{-1}([r,g,b]^T) = \bigl[\frac{1}{2\pi}(\operatorname{arctan2}(\hat g, \hat r) + \pi),\; \frac{1}{\pi}(\operatorname{arctan2}(\hat b, \sqrt{\hat r^2 + \hat g^2}) + \frac{\pi}{2})\bigr]^T$ where $\hat r = 2r-1$, $\hat g = 2g-1$, $\hat b = 2b-1$.
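
Analogously, Eqs. (5) and (6) translate directly into code; longitude and latitude are assumed to be normalized to [0, 1], and the function names are illustrative.

import numpy as np

def xyz_encode(x_lon, x_lat):
    """Eq. (5): map normalized lon/lat to an (r, g, b) triple via 3D Cartesian coordinates."""
    lon = 2.0 * np.pi * (x_lon - 0.5)
    lat = np.pi * (x_lat - 0.5)
    return 0.5 * (1.0 + np.stack([np.cos(lat) * np.cos(lon),
                                  np.cos(lat) * np.sin(lon),
                                  np.sin(lat)], axis=-1))

def xyz_decode(rgb):
    """Eq. (6): recover normalized lon/lat from an (r, g, b) triple."""
    x = 2.0 * rgb[..., 0] - 1.0
    y = 2.0 * rgb[..., 1] - 1.0
    z = 2.0 * rgb[..., 2] - 1.0
    x_lon = (np.arctan2(y, x) + np.pi) / (2.0 * np.pi)
    x_lat = (np.arctan2(z, np.hypot(x, y)) + np.pi / 2.0) / np.pi
    return x_lon, x_lat

x_lon, x_lat = np.array([0.1, 0.5, 0.9]), np.array([0.2, 0.5, 0.8])
print(np.allclose(xyz_decode(xyz_encode(x_lon, x_lat)), (x_lon, x_lat)))   # True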

2.4 Delta Encoding

We also looked into using delta encoding as part of our compression pipeline. The principle behind delta encoding is that instead of storing time-sequenced data directly, the delta between subsequent time steps is calculated and saved. While the calculation of the delta itself does not reduce the amount of data, it might reduce entropy, which can beneficially impact the performance of compression algorithms. We will formulate this more rigorously in the rest of this section.

We will start with a naive implementation of delta encoding and show that it is not suitable. We will use the following notation: at time step $i$, the full grid is denoted as $g_i$, and the delta between the grids at time steps $i$ and $i-1$ is denoted as $\Delta_i$. It is calculated by subtracting the previous grid from the current one: $\Delta_i = g_i - g_{i-1}$. Then, instead of passing the full grid sequence $(g_1, g_2, \ldots, g_N)$ to the dense data compressors, we use the delta sequence $(g_1, \Delta_2, \Delta_3, \ldots, \Delta_N)$. The problem with this simple delta encoding approach is that the error starts to accumulate over time. To see this, we define

(7) $g_i' = g_i + \epsilon_i$
(8) $\Delta_i' = \Delta_i + \eta_i = g_i - g_{i-1} + \eta_i$

as the full frames and the delta frames that have gone through compression and therefore have errors $\epsilon_i$ and $\eta_i$ applied to them. The grid at time step $i$ that gets reconstructed from these is:

(9) $g_i^r = g_1' + \sum_{j=2}^{i} \Delta_j' = g_i + \epsilon_1 + \sum_{j=2}^{i} \eta_j$

Therefore the delta frame compression errors $\eta_j$ start to accumulate over time. We can solve this problem rather easily. Instead of defining a delta frame from the “perfect” grid $g_{i-1}$, we use the reconstructed grid $g_{i-1}^{r*}$:

(10) $\Delta_i^{*\prime} = \Delta_i^* + \eta_i = g_i - g_{i-1}^{r*} + \eta_i, \qquad g_{i-1}^{r*} = g_1' + \sum_{j=2}^{i-1} \Delta_j^{*\prime}$

The $g_i^{r*}$ can then be reformulated to

(11) $g_i^{r*} = g_1' + \sum_{j=2}^{i} \Delta_j^{*\prime} = g_i + \eta_i$

As we can see, here the error does not accumulate over time; therefore, this method is to be preferred over the naive one.
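
A minimal sketch of this closed-loop variant of delta encoding is given below; compress stands in for any lossy grid compressor (here a simple quantizer), and the grid data is random. Because each delta is taken against the previously reconstructed grid, the final error stays bounded by the per-step compression error instead of accumulating.

import numpy as np

def delta_encode(grids, compress):
    """grids: (n_steps, h, w). Returns the compressed first grid followed by compressed deltas."""
    encoded = [compress(grids[0])]
    reconstructed = encoded[0]
    for i in range(1, len(grids)):
        delta = compress(grids[i] - reconstructed)   # Eq. (10): delta against g^{r*}_{i-1}
        encoded.append(delta)
        reconstructed = reconstructed + delta        # g^{r*}_i = g^{r*}_{i-1} + delta
    return encoded

def delta_decode(encoded):
    grids = [encoded[0]]
    for delta in encoded[1:]:
        grids.append(grids[-1] + delta)
    return np.stack(grids)

# Toy "compressor": round to one decimal, i.e. a bounded error of 0.05 per grid.
compress = lambda g: np.round(g, 1)
grids = np.cumsum(np.random.default_rng(0).normal(size=(10, 4, 4)), axis=0)
err = np.abs(delta_decode(delta_encode(grids, compress)) - grids).max()
print(err <= 0.05 + 1e-12)   # True: the error does not accumulate over the 10 time steps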

2.5 Implementation details

For the implementation, we had to take certain shortcuts and make adjustments to the theory in order to remain within reasonable computational limits. The primary simplification we made was to limit the possible number of grid points a trajectory may map to, for both the bipartite and the LP mapping, to a set of its closest neighbors. For bipartite mapping, this was the closest 200 neighbors, while for the LP mapping, it was the closest 50. However, this introduces the additional constraint of needing the starting locations of the trajectories to be evenly distributed. If they are not evenly distributed, i.e., if there are regions of low and high trajectory density, the mapping phase will fail, as we reach a point where we have locally more trajectories than possible grid points to which they may map. How limiting this constraint is depends on the type of research. For some studies, the initial configuration is uniform, e.g., a global uniformity in Stoffels et al. (2025), Sprenger et al. (2017), and Bakels et al. (2025), or local uniformity in Pérez-Muñuzuri et al. (2018) and Wendisch et al. (2024). But there is also research, such as Keune et al. (2022), Schielicke and Pfahl (2022), and Dey et al. (2023), where non-uniform initial configurations are used and which therefore cannot be handled by psit. A second shortcut is that in practice we need more grid points than trajectories; this is due both to considering only the closest neighbors and to the fact that the algorithmic implementation we chose for solving the minimal weight full bipartite matching problem performs much faster when there are more grid points than trajectories. In practice, we have around 50 % more grid points than trajectories. These are the main considerations we had to make during implementation in order to strike a balance between computational performance and compression capabilities. In the future, it might be advisable to revisit these shortcuts, as removing them could broaden the applicable range of psit.

Some other minor implementation details are that we only implemented color encoding in combination with the JPEG 2000 compression algorithm, and that when compressing local rather than global trajectories (i.e., starting positions are bounded inside a box instead of spanning the entire globe), a longitude-latitude projection is used to map the grid points into 3D Cartesian space. This is because there is no simple inverse function for the presented projection method. Neither of these differences has a large impact on the compression performance of psit, but both could still be explored in future research.

3 Results

Previous research (Baker et al.2014, 2016; Poppick et al.2020) has already shown that simple error metrics are not enough to argue about the effectiveness of lossy compression algorithms in the case of weather and climate data. Therefore in this section we explore a set of experiments which should cover a range of different metrics and should give a general overview of how psit performs. In total we will carry out five different experiments:

  1. Error metrics and comparison to ZFP (pyzfp v0.5.5) and SZ3 (v3.2.1) in order to compare performance to existing alternatives,

  2. error value distribution in order to discern the creation of bias in the error,

  3. trajectory density comparison against trajectories produced from perturbed wind fields in order to compare the impact of compression with the impact of data assimilation inaccuracies,

  4. two case studies, one about warm conveyor belts and one about the Fukushima disaster, in order to see how psit performs in a more realistic environment,

  5. and throughput and memory usage to observe the computational performance.

This should provide the reader with an insight into how psit behaves in different situations.

3.1 Input Data

Table 1The minimal and maximal values for the different data variables of the tra_20200101_00 file.


Table 2The minimal and maximal values for the different data variables of the tra_20000101_00 file.


In the following experiments we will use these trajectory files:

  • tra_20200101_00. A file produced using Lagranto (Wernli and Davies1997a; Sprenger and Wernli2015), based on ERA5, starting on 1 January 2020, 00:00 UTC and going until 2 January 2020, 00:00 UTC, with a total of 13 time steps (each two hours long). It consists of 5 289 105 trajectories, distributed across 26 pressure levels at the first time step with horizontal spacing of 50 km, and vertically ranging from 50 to 550 hPa in 20 hPa increments. The additional data variables (ranges given in Table 1) it stores are potential temperature (TH) and potential vorticity (PV). The file is 1.3 GB in size.

  • tra_20200101_00_permuted. This is the same trajectory file as tra_20200101_00, but with the trajectory order randomly permuted. Everything else remains the same. The order of the trajectories in the tra_20200101_00 dataset follows the grid and pressure levels, which leads to continuity (i.e., neighbouring trajectories are similar) if the trajectories are traversed in order. By permuting them, this continuity is removed.

  • tra_20000101_00. A trajectory file produced using Lagranto, based on ERA5, starting on 1 January 2000, 00:00 and extending 168 h until 8 January 2000, 00:00. It consists of 5 867 016 trajectories distributed over the default 37 ERA5 pressure levels from 1000 to 1 hPa, with a spacing of 40 km. The additional data variables traced along the trajectories (ranges given in Table 2) are temperature (T), potential vorticity (PV), and specific humidity (Q). The file is 23 GB in size.

  • tra_20000101_00_permuted. The same trajectory file as tra_20000101_00, but with the trajectory order randomly permuted. The reasoning behind this is the same as for the tra_20200101_00_permuted file.

3.2 Performance of psit

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-l01

Listing 1Configuration file for psit used for the runs presented in Sect. 3.2. The compression ratios are replaced by placeholders <ratio> to indicate that they are changed between the different runs; the compression ratio for the pressure variable is 1.5 times larger than for the other ones.


Table 3Ablation study for psit on the tra_20200101_00 trajectory file. For different configurations, the file is compressed to a compression ratio of around 20, after which the RMSE error for the different data variables is considered. Picking the lowest value in each error column yields the optimal configuration: JPEG 2000 with delta encoding for all data variables except pressure, where SZ3 with no delta encoding should be used, combined with XYZ color encoding. The lowest value is always highlighted in bold font.


There are different ways in which psit can be configured, leading to different compression behaviors. Here, we present one configuration stack that led to good results with the presented input data files. We found this optimal configuration by performing an ablation study in which we ran psit in different configurations and then combined the best-performing parts. A summary of this can be seen in Table 3, where we compress the tra_20200101_00 file in different configurations, but always with a ratio of around 20. We then look at the RMSE error of the different data variables in order to determine the best configuration. In this configuration we use the JPEG 2000 compression algorithm combined with delta encoding for every data variable except for pressure, where we use SZ3 (v3.2.1) with no delta encoding. Additionally, we utilize the XYZ color encoding scheme. This configuration is given by the configuration file in Listing 1; for a description of how these configuration files work, please refer to the user manual (Pietak2025a). The compression factor, a metric that determines how strong the compression should be, is varied between the different runs and is the same for every data variable except for pressure, where it is 1.5 times larger. We will use this configuration for all of the following experiments, except for the case study on the Fukushima disaster, as we are working with a different input file there.

3.2.1 Error metrics and comparison with ZFP and SZ3

In this experiment, a comparison between compression ratio and compression error for psit, ZFP, and SZ3 is conducted. The compression pipeline for ZFP works by treating each data variable as a 2D array over the time steps and the trajectories and compressing each of them individually; for ZFP we are using pyzfp v0.5.5 in fixed-accuracy mode. The pipeline for SZ3 is identical to the one used for ZFP; we are using the Python interface pysz, version 3.2.1, running in the relative error bound mode. The error norms we use in this study are the L1, RMSE, and L-infinity errors and the peak signal-to-noise ratio (PSNR) between the original and the compressed/decompressed data, which are obtained by calculating the errors defined by Eqs. (12), (13), (14), and (15) between each data variable of the original and compressed/decompressed data file.

(12) $l_1 = \frac{1}{n}\sum_{i=1}^{n} |x_i - y_i|$
(13) $l_\mathrm{rmse} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - y_i)^2}$
(14) $l_\infty = \max_{i=1,\ldots,n} |x_i - y_i|$
(15) $l_\mathrm{psnr} = 20\log_{10}(d) - 20\log_{10}(l_\mathrm{rmse})$

where $n$ is the number of data points, $x$ the original data, $y$ the compressed/decompressed data, and $d$ the extent of the data. As we are using color encoding, we need to handle the longitude and latitude variables differently. Instead of using the difference, we use the central angle between the input and the output (defined as the angle on a great circle between two points on a sphere). The amount of compression error, and therefore the compression ratio that is acceptable, depends on the case in which the data is used. We therefore provide a wide range of compression ratios such that the reader is able to evaluate the general behavior and then, based on this, determine which compression ratios are well suited for their needs. Such a comparison between psit and the other methods provides us with an overview of how psit compares to already existing alternatives and therefore allows us to assess its viability.
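
Assuming the standard definitions above, these metrics can be computed as follows; the central-angle helper is the variant we use for the color-encoded longitude/latitude variables, and the function names are illustrative.

import numpy as np

def l1(x, y):
    return np.mean(np.abs(x - y))

def rmse(x, y):
    return np.sqrt(np.mean((x - y) ** 2))

def linf(x, y):
    return np.max(np.abs(x - y))

def psnr(x, y):
    d = x.max() - x.min()                    # extent of the original data
    return 20.0 * np.log10(d) - 20.0 * np.log10(rmse(x, y))

def central_angle(lon1, lat1, lon2, lat2):
    """Great-circle angle (radians) between original and decompressed positions."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    cos_angle = (np.sin(lat1) * np.sin(lat2)
                 + np.cos(lat1) * np.cos(lat2) * np.cos(lon1 - lon2))
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

x = np.linspace(0.0, 1.0, 1000)
y = x + np.random.default_rng(0).normal(0.0, 1e-3, x.size)   # synthetic "decompressed" data
print(l1(x, y), rmse(x, y), linf(x, y), psnr(x, y))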

In the first run, the tra_20200101_00 and tra_20200101_00_permuted files are considered. An excerpt of this can be seen in Fig. 7, with the full plots given in Fig. B1, a normalized version in Fig. B2, and the exact values in Tables B1, B2, B3, B4, and B5. When looking at both ZFP and SZ3, we see that the performance differs significantly between the different input data files. The performance on the permuted trajectory file is much worse compared to the non-permuted one. This is because, for the tra_20200101_00 file, the trajectories have been initialized on an equidistant grid and are stored in the file in an ordered manner. Therefore, by converting them into the 2D grids over time steps and trajectories, a naive smoothness (originating from spatial continuity) of the grid is created. When we permute the trajectories, this naive smoothness is destroyed, and the compression performance drops. This means that the compression performance of ZFP and SZ3 depends not only on the data, but also on the way in which it is stored, something from which psit does not suffer. This also makes it difficult to predict and quantify the behavior of ZFP and SZ3 on input files. Comparing them with psit, we see that psit gives consistent results independent of the input file. In the smaller compression range, psit does not perform as well; this could be because the mapping phase artificially adds more points. We also see that, in general, psit performs worse for pressure than it does for lon–lat; such data-variable-dependent performance is something we see quite often.

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f07

Figure 7Excerpt of the comparison between psit, ZFP, and SZ3 for the tra_20200101_00 and tra_20200101_00_permuted files. The configuration of our program is described in Sect. 3.2 and the corresponding config file is given in Listing 1. The RMSE and L-infinity errors are compared to the achieved compression ratio for the central angle error (lon–lat) and pressure (p). In the plots, the shaded area for ZFP and SZ3 corresponds to the range in compression performance between the two files. Note that psit performs the same for both; therefore, only one line is plotted.


In the second run, we consider the tra_20000101_00 and tra_20000101_00_permuted files. This can be seen in Fig. B3, with a normalized version in Fig. B4 and exact values given in Tables B6, B7, B8, B9, and B10. For psit against ZFP, we observe a similar, though less pronounced, behavior. Now, the difference between the two compression methods is less distinct, and psit only starts to perform better than ZFP for larger compression ratios. The performance of SZ3 now looks better, and it outperforms psit in some cases. This decrease in performance for this larger trajectory file indicates that psit, in general, struggles with longer time ranges. A reason for this could be that over time, trajectories start to diverge, and therefore, the smoothness in the images, which is based on their initial position, gets lost (see e.g., Fig. 3).

From the above observations, we can make some general statements about the performance of psit. For smaller time ranges, psit performs on par with or better than ZFP and SZ3, with a notable exception being the pressure variable, due to its inherently noisy behavior. For longer time ranges, the performance of psit starts to degrade. Additionally, for cases where keeping a small L-infinity error is important, we should use SZ3 over JPEG 2000, as it bounds the L-infinity error. Moreover, the performance of ZFP and SZ3 is highly dependent on the way the input data is structured, something from which psit does not suffer. We therefore conclude that psit is a viable alternative to ZFP and SZ3 and can deliver consistent results.

3.2.2 Error Value Distribution

The next experiment we conduct concerns the error distribution. For this, we compress the tra_20200101_00_permuted and tra_20000101_00_permuted files with psit, ZFP, and SZ3 at compression ratios of around 2.5 and 15, then calculate the difference from the original file and put normalized versions of these differences into a histogram. We chose these compression ratios as they represent both a small and a medium compression ratio. Based on the shape of the corresponding histogram, we can then gauge whether the compression introduces a bias.

The distribution for the tra_20200101_00_permuted file is shown in Fig. 8, while the one for the tra_20000101_00_permuted file is given in Fig. B5. For all variables, the mean for psit is nearly perfectly at 0, and the shape of the psit distribution tends to be very similar across the different compression ratios. The other compression methods mostly also manage to keep the mean at zero, but tend to have vastly different (often non-smooth) shapes, especially for the higher compression ratios. This indicates that psit generally does not introduce a bias.

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f08

Figure 8Area-normalized error distribution based on values for the tra_20200101_00_permuted dataset using the compression setup for psit of Sect. 3.2. We compress with compression ratios of 2.5 and 15. The areas of the densities have been normalized, and the x axis is cropped to 5 standard deviations of the psit distribution in all cases, except for pressure with 2.5 times compression, where it is cropped to 1 standard deviation. The data variables are pressure (p), potential temperature (TH), and potential vorticity (PV).


3.3 Trajectory Density and Comparison to Perturbed Wind Fields

In this experiment, we examine the impact compression has on the position of the trajectories over time. To examine this impact, we calculate the densities of the trajectories starting at different pressure levels (1, 500, 1000 hPa) for increasing time steps (72, 120, 168 h). In addition to comparing uncompressed and compressed trajectories, we also look at uncompressed trajectories calculated from perturbed wind fields, simulating the impact of measurement uncertainties. In total, we consider the following trajectory datasets:

  • Uncompressed trajectories,

  • the tra_20000101_00 trajectory file compressed with psit for compression ratios of 1.67, 12.8, 44.5,

  • uncompressed trajectories (similar to tra_20000101_00) calculated from wind fields to which uniform noise has been added to the horizontal/vertical components with magnitudes of 0.01 m s−1/6.67×10−4 Pa s−1, 0.05 m s−1/3.33×10−3 Pa s−1, and 0.1 m s−1/6.67×10−3 Pa s−1.

The compression ratios are chosen such that the entire range of compression ratios explored in Sect. 3.2.1 is covered. We then calculate 2D densities over longitude and latitude (number of trajectories over a given area) and 1D densities over pressure (number of trajectories over a given pressure range), for which we consider the Jensen–Shannon divergence. We calculate the Jensen–Shannon divergence for the 2D case by partitioning the globe into equally sized bins and counting the number of trajectories per bin, and for the 1D case by using a histogram. This can be seen, for example, in Fig. 9. Such a density comparison between these different datasets and the uncompressed one allows us to see the impact compression has on the trajectory positions and also how compression behaves compared to perturbed wind fields.
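
The density comparison can be sketched as follows; for brevity the 2D binning here uses a plain longitude-latitude histogram rather than equally sized bins on the globe, and the positions are random placeholders. Note that SciPy returns the Jensen–Shannon distance, which is squared to obtain the divergence.

import numpy as np
from scipy.spatial.distance import jensenshannon

def density_divergence(lon_a, lat_a, lon_b, lat_b, bins=(72, 36)):
    """Jensen-Shannon divergence between the binned 2D densities of two trajectory sets."""
    box = [[-180.0, 180.0], [-90.0, 90.0]]
    h_a, _, _ = np.histogram2d(lon_a, lat_a, bins=bins, range=box)
    h_b, _, _ = np.histogram2d(lon_b, lat_b, bins=bins, range=box)
    # jensenshannon normalizes the histograms and returns the distance; square it.
    return jensenshannon(h_a.ravel(), h_b.ravel(), base=2) ** 2

rng = np.random.default_rng(0)
lon, lat = rng.uniform(-180, 180, 10_000), rng.uniform(-90, 90, 10_000)
lon_perturbed = np.clip(lon + rng.normal(0.0, 2.0, lon.size), -180, 180)
print(density_divergence(lon, lat, lon_perturbed, lat))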

The Jensen–Shannon divergence is shown in Tables B11, B12, and B13. From the tables, we can see that over time, the errors start to increase. The same (unsurprisingly) also holds for increasing compression ratio and perturbation magnitude. Furthermore, the error in pressure for the 500 and 1000 hPa pressure levels tends to be large and very similar between all runs, indicating that it is a very sensitive metric, as any perturbation leads to large errors. For the 2D divergence metric, we find that trajectories compressed with psit in the compression-ratio range of 12.8 to 44.5 behave very similarly to trajectories from perturbed wind fields with perturbation magnitudes of 0.05 m s−1/3.33×10−3 Pa s−1 to 0.1 m s−1/6.67×10−3 Pa s−1. The lowest compression ratio of 1.67 tends to outperform all the perturbed wind fields. This experiment demonstrates that psit behaves very similarly to perturbed wind fields, showing that the error incurred by compression lies within the inaccuracies generated by data assimilation techniques.

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f09

Figure 9Reference image for how we calculate both the 1D and the 2D densities. For this, we consider trajectories starting at 500 hPa and calculate the densities over multiple time steps. The 2D density is given over longitude and latitude, and the 1D density over pressure.

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f10

Figure 10Comparison of the densities of trajectories starting in the box from 0 to 15° E and 45 to 60° N after 168 h. The trajectories we consider are uncompressed trajectories, compressed trajectories, and trajectories calculated from perturbed wind fields.

Instead of looking at global trajectory densities, we can also look at local ones. For this, we take all the trajectories starting in the box from 0 to 15° E and 45 to 60° N, and plot their density alongside the difference to the uncompressed trajectories after 168 h (Fig. 10). From these plots, we can see that the general features are preserved under compression and that compression behaves very similarly to adding random perturbations to the wind field, again confirming that psit and perturbed wind fields lead to similar behavior of the trajectories.

3.4 Case Studies

3.4.1 Warm Conveyor Belts

The Warm Conveyor Belt (WCB) is one of three characteristic air streams in extratropical cyclones (Browning et al.1973; Madonna et al.2014). WCBs ascend in the cyclones' warm sector from the boundary layer to upper-tropospheric levels. This ascent, which occurs within 48 h and extends over 600 hPa, is associated with substantial cloud and precipitation formation (Pfahl et al.2014). WCBs also influence the large-scale circulation and the weather predictability downstream of their interaction with the upper-level flow (Grams et al.2011; Rodwell et al.2018).

In order to carry out this experiment, we use the tra_20000101_00_permuted file and select all the trajectories which, in the first 48 h, rise by at least 600 hPa. We do this for uncompressed data and for data compressed with psit (ratios of 1.67, 12.8, and 44.5, covering the compression range of Sect. 3.2.1). This is displayed in Fig. 11, with an additional comparison with ZFP and SZ3 given in Fig. B6. For psit, the general location and shape of the WCBs remain the same for all compression ratios. With increasing compression ratio, the number of false positives increases faster than the number of false negatives, with these false positives mostly located in regions with already existing WCB activity. This asymmetric behavior is most likely due to the fact that the dense data compression algorithms smooth the data. Specifically, if a region containing the majority of the selected trajectories gets smoothed out, the few unselected trajectories will start to behave like the selected ones, i.e., they will also be selected, leading to an increase in false positives. This observation is in line with the previous one that psit struggles with pressure while delivering good results for longitude and latitude. Still, in all cases, we see that psit preserves the general structure, and only a very limited amount of false WCB activity is generated due to compression (e.g., the region between the Southeast Atlantic Ocean and the Davis Sea).
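
The selection criterion itself is straightforward to express; the sketch below flags a trajectory as WCB-like if its pressure decreases by at least 600 hPa relative to its start within the first 48 h, using hypothetical array names and synthetic data.

import numpy as np

def select_wcb(pressure, hours_per_step, ascent_hpa=600.0, window_h=48.0):
    """pressure: (n_traj, n_steps) in hPa. Returns a boolean mask of WCB trajectories."""
    n_window = int(window_h / hours_per_step) + 1        # time steps covering the first 48 h
    ascent = pressure[:, :1] - pressure[:, :n_window]    # positive values = ascent
    return np.any(ascent >= ascent_hpa, axis=1)

# Example: 3 trajectories with 6-hourly output over 168 h (29 time steps).
rng = np.random.default_rng(0)
p = 900.0 + np.cumsum(rng.normal(0.0, 20.0, (3, 29)), axis=1)
p[0, :9] = np.linspace(950.0, 300.0, 9)                  # force a strong ascent in trajectory 0
print(select_wcb(p, hours_per_step=6.0))                 # expected: [ True False False ]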

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f11

Figure 11Warm Conveyor Belt calculations on the uncompressed data and the data compressed with psit at different compression ratios. Each point in a plot represents one trajectory which is part of a Warm Conveyor Belt (rises by at least 600 hPa in the first 48 h). Additionally, the difference in terms of false positives and false negatives is shown.

3.4.2 The Fukushima accident

During the Fukushima Daiichi accident (IAEA2015) in 2011, radiation-contaminated material was released into the atmosphere. This leads to the question of where this radioactive material is transported and deposited (Wotawa2011; Chino et al.2011; Katata et al.2012; Yasunari et al.2011; Stohl et al.2012). In this experiment, we want to discuss the impact compression has on trajectories originating in the region of the Fukushima Daiichi power plant. We will not perform any detailed deposition analysis.

For this, we create 35 280 trajectories starting on 12 March 2011 at 00:00 in the box 140.74 to 141.3° E and 37.19 to 37.63° N over the pressure range of 764 to 996 hPa. The only information stored on the trajectories is their position, leading to a 59 MB file. For compression with psit, we choose JPEG 2000 with XYZ color and delta encoding. Note that here we compress trajectories that originate in a local region instead of a global one. We then compress the trajectories using psit, ZFP, and SZ3 with compression ratios around 6.5, 14, and 27. These compression ratios have been chosen to showcase both cases where compression works and cases where compression breaks down. We then plot the trajectories over time, as shown in Fig. 12. From the panels, we can see that for psit, ZFP, and SZ3, with increasing compression ratio, the results appear to get smoothed out and compression artifacts start to appear. We can also see that for psit the compression appears to perform worse, especially compared to Sect. 3.3. This is because the resulting grids have much smaller dimensions (48×37 grid points), leading to less data per grid, which results in poorer performance of the dense data compression algorithms. If we compare psit to the other two compression methods, we can see that for lower compression ratios they are very similar to one another, and only at the larger compression ratios does psit deliver better results. This is in line with previous observations. This demonstrates that, while psit is able to preserve the general structure and the regions that would be affected, it is best used in the high-ratio compression of files where many trajectories are present.

https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f12

Figure 12Comparison between trajectories starting at the site of the nuclear disaster at Fukushima for different compression ratios. On top is the uncompressed data; below it are three rows of compressed data, one for psit, one for ZFP, and one for SZ3.

3.5 Throughput and Memory Usage

Until now, we have evaluated the performance of the program from an error standpoint. Here, we focus on performance from a computational perspective, discussing both throughput and memory usage. For this, the tra_20000101_00 trajectory file is compressed in the same manner as presented in Sect. 3.2. This is performed on an AMD EPYC 7742 parallelized over 37 threads (for both compression and decompression), using a ramdisk to filter out the impact of a slow filesystem. We then calculate the throughput in MB s−1 for compression and decompression across the different compression ratios.

The results of the throughput experiments are presented in Table 4. The mean throughput of the compression stage is 65.67 MB s−1, while the mean throughput of decompression is 222.14 MB s−1. Decompression has a significantly larger throughput compared to compression because the expensive mapping stage only needs to be performed during compression. Another observation is that throughput increases as the compression factor gets larger. This is because a larger compression factor results in smaller file sizes, which, in turn, reduces the time needed to read and write data to disk. The memory usage for compression with factor 30 was 60 GB. Most of the memory consumption comes from solving the minimal weight full bipartite matching problem, which is parallelized over multiple workers. Therefore, one can trade memory usage against runtime by decreasing the number of parallel workers.

Table 4Throughput for compression and decompression when running psit parallelized over 37 workers using a ramdisk.


4 Discussion

4.1 Applicable range and limitations

The main structural limitation of the program is the fact that the input data trajectories need to be uniformly distributed. This limitation originates from the way the mapping procedure has been implemented and has been touched upon in Sect. 2.5. The optimal fix for this limitation would be if the minimal weight full bipartite matching problem could be solved on a global level, without having to restrict each trajectory to its closest 200 grid points. A workaround is to artificially increase the ratio of grid points to trajectories; in this way, a mapping can be created at the cost of increased grid size, and because much of the data in the grids is redundant, the compression algorithms might be able to compress these grids more effectively, leading to only a small increase in output size.

In addition to the structural limitations, there are also performance limitations, especially memory consumption and runtime. Here, the bottleneck is the solving of the minimal weight full bipartite matching problem in the mapping phase. The only way to address this would be to either use a different mapping method or find a more efficient implementation of such an algorithm. Additionally, all the images are currently saved individually, leading to the creation of potentially thousands of files, which can be slow on certain filesystems. Furthermore, the compression pipeline is currently written in Python and was developed in parallel with the exploration of methods, leading to sometimes non-optimal implementations.

Based on the above discussed limitations of the implementation, we argue that the applicable range of psit is in the compression of uniformly distributed trajectories in an HPC environment.

4.2 Future Work

Of course, one obvious area of future work would be fixing the current limitations of psit. This would include removing the limitation of a trajectory only being able to map to a set of its closest neighboring grid points and generally improving the overall computational performance of the pipeline in terms of runtime and memory usage. Fixing these limitations would be quite beneficial for psit, as it would allow it to be used in a wider range of applications.

The mapping method currently used by psit is based on graph theory, specifically the solution of a bipartite matching problem. However, there are likely other methods by which this could be done. For instance, instead of solving the problem directly in one step, an initial simple mapping (such as via nearest neighbors) could be created, which is then iteratively refined to fulfill the initial constraints, with the goal being to create a valid mapping that is also smooth. Further exploration of what “smooth” means could also be valuable. Currently, smoothness is achieved by using the trajectory positions at the first time step. What would happen if the mean position were used instead? Alternatively, could the smoothness of a grid be quantified to create a mapping that aims to optimize it? As the mapping phase is integral to the entire pipeline and currently presents the performance bottleneck, exploring these questions would be greatly beneficial.

The methods we use to describe compression performance are error metrics, comparisons with perturbed or bit-rounded wind fields, and real-world examples in the form of two case studies. However, there are some intricacies to consider. Error metrics provide a straightforward way to quantify the incurred error, but they often lack a strong connection to real-world applications. It can be challenging to predict the impact of compression on real-world research based solely on a simple error metric. Is an RMSE of 0.5 hPa in pressure acceptable, or should it be less than 0.3 hPa? Moreover, simple error metrics do not illustrate the relationship between different variables. How does the error in the longitude variable compare to the error in the latitude variable? Are they similar, or is there more error in one than the other? Does the error in the temperature variable depend on the pressure variable? A statistical analysis could examine correlations between various data variables, but this quickly becomes tedious while leaving the fundamental question unanswered of how much correlation is acceptable. Case studies, on the other hand, simulate scenarios in which the data might actually be used. We believe this approach establishes a better foundation for understanding the impact of compression. However, the challenge is that one can only analyze individual examples, and the generalizability of findings from one test to another remains an open question. While writing this paper, we spent considerable time discussing with experts in the field how such errors could be described and interpreted. From these discussions, we identified three major points: compression should not change the outcome of experiments, compression errors should be smaller than measurement errors, and compression should not influence the timing and occurrence of physical processes (e.g., cloud formation). But even these three points are not universally applicable: how does one compare measurement errors to compression errors, which physical processes are important enough to be conserved, and to what degree should they be conserved? We tried to address these challenges in this paper, but we think it would be beneficial and interesting to dedicate time to developing a benchmark that helps in understanding the impact of these errors. Such a benchmark could combine error metrics with a list of real-world applications, providing users with a general overview of how compression performs across different aspects.

5 Conclusions

The initial problem we faced and aimed to solve was that increasing computational power leads to increasing data volumes, making storage infeasible. Therefore, a method needs to be devised to reduce the data, with one approach being compression. While compression schemes exist for the Eulerian frame, no equivalent alternatives exist for the Lagrangian one. We therefore developed psit, a system to compress Lagrangian flows, to address this issue. In this paper, we presented how psit works and evaluated its performance using error metrics and case studies. In these, we demonstrated that in most cases, compression performance equivalent to or superior to ZFP and SZ3 can be achieved. We showed that the density of compressed trajectories (compression ratio 30 to 40) after 168 h behaves similarly to that of trajectories calculated from uncompressed but randomly perturbed wind fields (with a perturbation magnitude of 0.1 m s−1 in the horizontal and 6.67×10−3 Pa s−1 in the vertical component). Additionally, we analyzed the impact of compression on subsequent research through two case studies: warm conveyor belts and fallout prediction of the Fukushima accident. While psit imposes some limitations on the input data, namely requiring a uniform initial distribution, it can be used for general-purpose compression of Lagrangian weather data and can serve as a foundation for further research in this area.

Appendix A: Proof of Total Unimodularity

In this section, we provide a proof that the LP presented in the LP mapping of Sect. 2.2 is an integral LP. An LP of the form $\min\{c^{T}x \mid Ax \le b,\; x \ge 0\}$ is integral if the constraint matrix A is totally unimodular and the right-hand-side vector b is integral. In this mapping, the vector b consists solely of values equal to 1, making it integral. The only remaining step is to show that the matrix A is totally unimodular.

Before we begin the proof, we must discuss the structure of the matrix A. In the definition of the LP, it was stated that the vector x consists of the values $x_{t,p}$ with $t \in T$ and $p \in P$, and that there are two types of constraints:

(A1) $\sum_{t \in N^{-}(p)} x_{t,p} = 1 \quad \forall p \in P$

(A2) $\sum_{p \in N^{+}(t)} x_{t,p} \le 1 \quad \forall t \in T$

Note that, in contrast to the definition given in Sect. 2.2, here we also have the terms $N^{-}(p)$ and $N^{+}(t)$, which denote the possible matchings of trajectories to grid points, i.e., $N^{-}(p)$ contains all the trajectories which may map to a point p and $N^{+}(t)$ all the grid points to which a trajectory t may map. This ensures that the proof is also valid for the shortcuts we took during implementation. To transform these constraints into canonical form, we need to rewrite the equality constraint as a set of two inequalities. The relationship $a = b$ can be expressed as $a \le b$ and $-a \le -b$. Thus, the constraints become:

(A3) $\sum_{t \in N^{-}(p)} x_{t,p} \le 1 \quad \forall p \in P$

(A4) $\sum_{t \in N^{-}(p)} -x_{t,p} \le -1 \quad \forall p \in P$

(A5) $\sum_{p \in N^{+}(t)} x_{t,p} \le 1 \quad \forall t \in T$

Next, we need to express these constraints in matrix form. To do this, we must establish a representation for the vector x. Without loss of generality, we can assume that x is constructed by first looping over all trajectories and then looping over all grid points to which each trajectory may map. This means that blocks of $x_{t,p}$ with the same t and varying p are stacked on top of each other. For example, if we have three trajectories $t_1, t_2, t_3$ and four grid points $p_1, p_2, p_3, p_4$, the resulting vector x could look like this:

(A6) $x = \begin{pmatrix} x_{t_1,p_2} & x_{t_1,p_3} & x_{t_2,p_1} & x_{t_2,p_3} & x_{t_3,p_4} \end{pmatrix}^{T}$

With the shape of the vector x established, we can now construct the matrix A. This matrix consists of two parts. In the first part, called A1, we encode the first two constraints (Eqs. A3 and A4), and in the second part, called A2, we encode constraint Eq. (A5).

For the matrix A1, each row corresponds to a grid point. A row resulting from constraint Eq. (A3) consists of 1s and 0s, with at most one 1 per block of the same trajectory in x. The exact distribution of these 1s depends on the shape of x, but what is important is that no two rows have a 1 in the same column. This follows directly from the structure of x and the fact that each row corresponds to a grid point. Another important property is that when constraints Eqs. (A3) and (A4) are combined, there are always pairs of rows that are identical, except that the sign of one row is the opposite of the other. Using the same example as above, the matrix A1 will look like this:

(A7) $A_1 = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & -1 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & -1 \end{pmatrix}$

Here, the pairs of rows with opposite signs are clearly visible.

Matrix A2 results from constraint Eq. (A5). In this case, each row corresponds to a trajectory, and all columns are set to 1 where a corresponding $x_{t,p}$ variable exists for that specific trajectory. Due to the construction of the vector x, the different trajectories are grouped into blocks, and within each block all the grid points that the trajectory can map to are indicated. Consequently, each row contains a contiguous sequence of 1s corresponding to the $x_{t,p}$ values of the vector x with a specific t. This matrix has a staircase structure, ensuring that no two rows have a 1 in the same column. Combining the previous part of A with this new one, the full matrix can be written as follows:

(A8) $A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & -1 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & -1 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$

The last step is to prove that this matrix is totally unimodular. The definition of total unimodularity states that for every square submatrix S of A, its determinant must be either −1, 0, or 1. Proving this directly can be quite challenging. However, Ghouila-Houri has shown that a matrix is totally unimodular if and only if for every subset R of rows of the matrix, there exists a mapping $s: R \to \{+1, -1\}$ of signs such that, when the signed rows are summed, every element of the resulting vector is either −1, 0, or 1. Therefore, we must find such a mapping $s: R \to \{+1, -1\}$ to conclude the proof.

To define this mapping, we use the property that in both parts of the matrix, no two rows have a nonzero entry of the same sign in the same column. Initially, we assume that constraint Eq. (A4) does not exist, leading to the following definition of the mapping:

(A9) $\forall r \in R: \quad s(r) = \begin{cases} +1 & \text{if } r \in A_1 \\ -1 & \text{if } r \in A_2 \end{cases}$

We will first sum the vectors separately for the two parts, A1 and A2. Due to the previously mentioned property, the vector derived from A1 will contain only elements from {0,1}, while the vector from A2 will consist only of elements from {-1,0}. The sum of these two vectors will thus produce a vector with elements from {-1,0,1}, confirming that a valid mapping has been found.
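For illustration, consider the example matrix from Eq. (A8) and the subset R consisting of the Eq. (A3) row for grid point $p_3$ together with the two A2 rows for trajectories $t_1$ and $t_2$. With the signs from Eq. (A9), the signed sum of these rows is

$(+1)\,(0,1,0,1,0) + (-1)\,(1,1,0,0,0) + (-1)\,(0,0,1,1,0) = (-1,0,-1,0,0),$

and every entry of the result indeed lies in $\{-1,0,1\}$.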

Next, we will reintroduce constraint Eq. (A4) into the formulation. The mapping can be adjusted to accommodate this case by making a small extension: a row $r \in R$ resulting from Eq. (A4) is mapped to a −1 if and only if the same row with the opposite sign is not in R. In this way, two scenarios can occur: either both the positive and negative variants of a row are in R, in which case both receive a +1 and cancel out, or only one of them, either the negative or the positive variant, is in R. In the latter case, a −1 would be assigned to the negative variant, transforming the problem into the same form as if constraint Eq. (A4) did not exist at all. Therefore, for any arbitrary subset R of rows of A, a mapping $s: R \to \{+1, -1\}$ has been found that satisfies the Ghouila-Houri criterion, showing that the matrix is totally unimodular and thereby proving that the LP is integral.
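As an additional, purely illustrative numerical cross-check (not part of psit), the following short script verifies by brute force that the example matrix A from Eq. (A8) is totally unimodular, i.e., that every square submatrix has determinant −1, 0, or 1. For this small 11 × 5 matrix, all square submatrices can be enumerated directly.

from itertools import combinations

import numpy as np

# Example matrix A from Eq. (A8): eight A1 rows followed by three A2 rows.
A = np.array([
    [ 0,  0,  1,  0,  0],
    [ 0,  0, -1,  0,  0],
    [ 1,  0,  0,  0,  0],
    [-1,  0,  0,  0,  0],
    [ 0,  1,  0,  1,  0],
    [ 0, -1,  0, -1,  0],
    [ 0,  0,  0,  0,  1],
    [ 0,  0,  0,  0, -1],
    [ 1,  1,  0,  0,  0],
    [ 0,  0,  1,  1,  0],
    [ 0,  0,  0,  0,  1],
])

n_rows, n_cols = A.shape
for k in range(1, n_cols + 1):
    for rows in combinations(range(n_rows), k):
        for cols in combinations(range(n_cols), k):
            det = int(round(np.linalg.det(A[np.ix_(rows, cols)])))
            assert det in (-1, 0, 1), (rows, cols, det)
print("Every square submatrix has determinant -1, 0, or 1: A is totally unimodular.")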

Appendix B: Additional performance data of psit
https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f13

Figure B1. Comparison between psit, ZFP, and SZ3 for the tra_20200101_00 and tra_20200101_00_permuted files. The configuration of our program is described in Sect. 3.2, and the corresponding config file is given in Listing 1. The L1, RMSE, and L-infinity errors and the PSNR are compared to the achieved compression ratio for the central angle error (lon–lat) and the other data variables (pressure (p), potential temperature (TH), and potential vorticity (PV)). In the plots, the shaded area for ZFP and SZ3 corresponds to the range in compression performance between the two files. Note that psit performs identically on both files; therefore, only one line is plotted.


https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f14

Figure B2. Data-variable-normalized error comparison between psit, ZFP, and SZ3 for the tra_20200101_00 and tra_20200101_00_permuted files. The configuration of our program is described in Sect. 3.2, and the corresponding config file is given in Listing 1. The L1, RMSE, and L-infinity errors are normalized and compared to the achieved compression ratio for the central angle error (lon–lat) and the other data variables (pressure (p), potential temperature (TH), and potential vorticity (PV)). In the plots, the shaded area for ZFP and SZ3 corresponds to the range in compression performance between the two files. Note that psit performs identically on both files; therefore, only one line is plotted.


https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f15

Figure B3. Comparison between psit, ZFP, and SZ3 for the tra_20000101_00 and tra_20000101_00_permuted files. The configuration of our program is described in Sect. 3.2, and the corresponding config file is given in Listing 1. The L1, RMSE, and L-infinity errors and the PSNR are compared to the achieved compression ratio for the central angle error (lon–lat) and the other data variables (pressure (p), temperature (T), potential vorticity (PV), and specific humidity (Q)). In the plots, the shaded area for ZFP and SZ3 corresponds to the range in compression performance between the two files. Note that psit performs identically on both files; therefore, only one line is plotted.


https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f16

Figure B4. Data-variable-normalized error comparison between psit, ZFP, and SZ3 for the tra_20000101_00 and tra_20000101_00_permuted files. The configuration of our program is described in Sect. 3.2, and the corresponding config file is given in Listing 1. The L1, RMSE, and L-infinity errors are normalized and compared to the achieved compression ratio for the central angle error (lon–lat) and the other data variables (pressure (p), temperature (T), potential vorticity (PV), and specific humidity (Q)). In the plots, the shaded area for ZFP and SZ3 corresponds to the range in compression performance between the two files. Note that psit performs identically on both files; therefore, only one line is plotted.


https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f17

Figure B5. Area-normalized distribution of the error values for the tra_20000101_00_permuted dataset using the compression setup of Sect. 3.2 for psit with compression ratios of 2.5 and 15. The areas of the densities have been normalized, and the x axis is cropped to 2 standard deviations of the psit distribution for the 2.5 times compression case and 5 standard deviations for the 15 times compression case. The data variables are pressure (p), temperature (T), potential vorticity (PV), and specific humidity (Q).


https://gmd.copernicus.org/articles/19/3893/2026/gmd-19-3893-2026-f18

Figure B6. Warm conveyor belt calculations on the uncompressed data and on the data compressed with different compression ratios and compressors. Each point in a plot represents one trajectory that is part of a warm conveyor belt (ascending by at least 600 hPa in the first 48 h). Additionally, the differences in terms of false positives and false negatives are shown.

Table B1. Error values and compression ratios for psit on the tra_20200101_00 dataset.


Table B2. Error values and compression ratios of the ZFP baseline run on the tra_20200101_00 dataset. Note that for some different tolerances the compression ratios are the same; this is probably because of the grouping into “bit planes” performed by the embedded coding strategy of ZFP.


Table B3. Error values and compression ratios of the ZFP baseline run on the tra_20200101_00_permuted dataset. Note that for some different tolerances the compression ratios are the same; this is probably because of the grouping into “bit planes” performed by the embedded coding strategy of ZFP.


Table B4. Error values and compression ratios of the SZ3 baseline run on the tra_20200101_00 dataset.


Table B5. Error values and compression ratios of the SZ3 baseline run on the tra_20200101_00_permuted dataset.


Table B6. Error values and compression ratios for psit on the tra_20000101_00 dataset.


Table B7. Error values and compression ratios of the ZFP baseline run on the tra_20000101_00 dataset. Note that for some different tolerances the compression ratios are the same; this is probably because of the grouping into “bit planes” performed by the embedded coding strategy of ZFP.


Table B8. Error values and compression ratios of the ZFP baseline run on the tra_20000101_00_permuted dataset. Note that for some different tolerances the compression ratios are the same; this is probably because of the grouping into “bit planes” performed by the embedded coding strategy of ZFP.


Table B9. Error values and compression ratios of the SZ3 baseline run on the tra_20000101_00 dataset.


Table B10. Error values and compression ratios of the SZ3 baseline run on the tra_20000101_00_permuted dataset.


Table B11. Jensen–Shannon divergence metrics for different time steps for trajectories starting at 1 hPa.


Table B12. Jensen–Shannon divergence metrics for different time steps for trajectories starting at 500 hPa.


Table B13. Jensen–Shannon divergence metrics for different time steps for trajectories starting at 1000 hPa.


Appendix C: User Manual

In this section, a short user manual for psit is given; for a detailed description, please refer to Pietak (2025a). The psit compression pipeline has been implemented in Python, as this allowed for fast development and quick iteration on different ideas. The entire compression and decompression logic is implemented in a class called Psit, which exposes two functions: compress() and decompress(). The compress() function takes an xarray.Dataset (with additional configuration parameters) as input and compresses it to a file on disk. The decompress() function takes the location of such a compressed file and returns an xarray.Dataset with the decompressed trajectory data. The additional configuration parameters that can be passed to the compress() function are listed below, together with a brief description; a minimal usage sketch follows the list.

  • dataset. The xarray.Dataset we wish to compress.

  • filename. The file in which the compressed data should be stored.

  • crf. The compression factor.

  • exclude. List of data variables which should not be compressed.

  • method. The compression algorithm that should be used.

  • color_method. The color encoding method that should be used.

  • color_bits. The number of bits to use per color channel (when using color encoding).

  • delta_method. The delta encoding method that should be used.

  • local. Whether the trajectory starting positions are in a local domain rather than a global one.

  • bin. The number of pressure bins that should be used.

  • mapping_func. Which mapping method should be used.

  • num_workers. The number of parallel workers that should be used.

  • factor. The ratio between number of grid points and trajectories.

  • lon_name. Name of the longitude variable in the dataset.

  • lat_name. Name of the latitude variable in the dataset.

  • p_name. Name of the pressure variable in the dataset.

Note that some of these parameters, like crf or method, allow an individual setting for each of the data variables in the dataset (using a dictionary mapping from variable name to value).
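The following minimal sketch illustrates this interface. The file names, data-variable names, and option values (e.g., the method string) are assumptions for illustration and may differ from the values actually accepted by psit.

import xarray as xr

from psit import Psit  # assumed import path

ds = xr.open_dataset("tra_20200101_00.nc")  # trajectory dataset to compress

ps = Psit()
ps.compress(
    dataset=ds,
    filename="tra_20200101_00.psit",
    crf=30,              # target compression factor
    method="jpeg2000",   # compression algorithm for the mapped grids (illustrative value)
    num_workers=37,      # parallel workers for the mapping phase
    lon_name="lon",
    lat_name="lat",
    p_name="p",
)

ds_restored = ps.decompress("tra_20200101_00.psit")
print(ds_restored)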

In addition to the Psit class, a CLI wrapper called psitcli has been written. This CLI wrapper allows for the easy compression of netCDF (Unidata, 2024) files from the command line. The program can be run with the command python psitcli.py {compress/decompress/test} [options] in_file out_file. The first argument defines the mode to be used, which can be compress, decompress, or test. The first two perform compression and decompression, respectively, while the test mode performs a compression followed by a decompression and computes performance metrics such as the incurred L1 error. These additional metrics are stored as a Python pickle dump in a file called <out_file>_data.pick. The compress and test modes both take additional options, which mirror the ones accepted by the compress() function of the Psit class. Alternatively, these options can be defined via a YAML configuration file, which is then passed to psitcli. An example of such a configuration file is given in Listing 1.
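As a small illustration, the metrics written by the test mode could be inspected as follows; the exact structure and contents of the pickle file are an assumption based on the description above.

import pickle

with open("out_file_data.pick", "rb") as f:  # corresponds to <out_file>_data.pick
    metrics = pickle.load(f)

print(metrics)  # e.g. error metrics such as the L1 error per data variable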

Code availability

The version of psit referenced in this paper is available at https://doi.org/10.5281/zenodo.14888491 (Pietak, 2025a). An up-to-date version can be found on GitHub at https://github.com/apietak/psit (Pietak, 2025b).

Data availability

Instructions on how the data used in this paper can be obtained are available at https://doi.org/10.5281/zenodo.19887527 (Pietak, 2026).

Author contributions

AP, LH, and LF designed psit. AP implemented and ran the experiments using psit. MS and SS provided domain specific knowledge and general advice. TH supervised the project. AP prepared the manuscript with the help of all other co-authors.

Competing interests

The contact author has declared that none of the authors has any competing interests.

Disclaimer

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. The authors bear the ultimate responsibility for providing appropriate place names. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Acknowledgements

We would like to thank the Swiss National Supercomputing Centre (CSCS) for providing us with the hardware resources without which this project could not have been carried out.

AP would like to thank James Murdoch MacGregor (J. T. McIntosh) for writing the novelette “Humanoid Sacrifice” (McIntosh, 1964) and creating the name “psit”.

Review statement

This paper was edited by David Ham and reviewed by Robert Underwood and one anonymous referee.

References

Al-Salamee, B. A. and Al-Shammary, D.: Survey Analysis for Medical Image Compression Techniques, in: Communication and Intelligent Systems, edited by: Sharma, H., Gupta, M. K., Tomar, G. S., and Lipo, W., Springer Singapore, Singapore, 241–264, ISBN 978-981-16-1089-9, 2021. a

Bakels, L., Blaschek, M., Dütsch, M., Plach, A., Lechner, V., Brack, G., Haimberger, L., and Stohl, A.: LARA: a Lagrangian Reanalysis based on ERA5 spanning from 1940 to 2023, Earth Syst. Sci. Data, 17, 4569–4585, https://doi.org/10.5194/essd-17-4569-2025, 2025. a

Baker, A. H., Xu, H., Dennis, J. M., Levy, M. N., Nychka, D., Mickelson, S. A., Edwards, J., Vertenstein, M., and Wegener, A.: A methodology for evaluating the impact of data compression on climate simulation data, in: Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, HPDC '14, Association for Computing Machinery, New York, NY, USA, 203–214, ISBN 9781450327497, https://doi.org/10.1145/2600212.2600217, 2014. a

Baker, A. H., Hammerling, D. M., Mickelson, S. A., Xu, H., Stolpe, M. B., Naveau, P., Sanderson, B., Ebert-Uphoff, I., Samarasinghe, S., De Simone, F., Carbone, F., Gencarelli, C. N., Dennis, J. M., Kay, J. E., and Lindstrom, P.: Evaluating lossy data compression on climate simulation data within a large ensemble, Geosci. Model Dev., 9, 4381–4403, https://doi.org/10.5194/gmd-9-4381-2016, 2016. a

Ballester-Ripoll, R. and Pajarola, R.: Lossy volume compression using Tucker truncation and thresholding, The Visual Computer, 32, 1433–1446, https://doi.org/10.1007/s00371-015-1130-y, 2015. a

Ballester-Ripoll, R., Lindstrom, P., and Pajarola, R.: TTHRESH: Tensor Compression for Multidimensional Visual Data, IEEE T. Vis. Comput. Gr., 26, 2891–2903, 2019. a

Bertsimas, D. and Tsitsiklis, J.: Introduction to Linear Optimization, Athena Scientific, ISBN 1886529191, 1997. a

Browning, K. A., Hardman, M. E., Harrold, T. W., and Pardoe, C. W.: The structure of rainbands within a mid-latitude depression, Q. J. Roy. Meteor. Soc., 99, 215–231, https://doi.org/10.1002/qj.49709942002, 1973. a

Burkard, R., Dell'Amico, M., and Martello, S.: Assignment Problems, Society for Industrial and Applied Mathematics, https://doi.org/10.1137/1.9781611972238, 2012. a

Calhoun, D. A., Helzel, C., and LeVeque, R. J.: Logically Rectangular Grids and Finite Volume Methods for PDEs in Circular and Spherical Domains, SIAM Rev., 50, 723–752, https://doi.org/10.1137/060664094, 2008. a

Chiarot, G. and Silvestri, C.: Time Series Compression Survey, ACM Comput. Surv., 55, https://doi.org/10.1145/3560814, 2023. a

Chino, M., Nakayama, H., Nagai, H., Terada, H., Katata, G., and Yamazawa, H.: Preliminary estimation of release amounts of 131I and 137Cs accidentally discharged from the fukushima daiichi nuclear power plant into the atmosphere, J. Nucl. Sci. Technol., 48, 1129–1134, https://doi.org/10.1080/18811248.2011.9711799, 2011. a

Dey, D., Aldama Campino, A., and Döös, K.: Atmospheric water transport connectivity within and between ocean basins and land, Hydrol. Earth Syst. Sci., 27, 481–493, https://doi.org/10.5194/hess-27-481-2023, 2023. a

Di, S., Liu, J., Zhao, K., Liang, X., Underwood, R., Zhang, Z., Shah, M., Huang, Y., Huang, J., Yu, X., Ren, C., Guo, H., Wilkins, G., Tao, D., Tian, J., Jin, S., Jian, Z., Wang, D., Rahman, M. H., Zhang, B., Song, S., Calhoun, J., Li, G., Yoshii, K., Alharthi, K., and Cappello, F.: A Survey on Error-Bounded Lossy Compression for Scientific Datasets, ACM Comput. Surv., 57, https://doi.org/10.1145/3733104, 2025. a

Diffenderfer, J., Fox, A., Hittinger, J., Sanders, G., and Lindstrom, P.: Error Analysis of ZFP Compression for Floating-Point Data, SIAM J. Sci. Comput., 41, A1867–A1898, https://doi.org/10.1137/18M1168832, 2019. a

Grams, C. M., Wernli, H., Böttcher, M., Čampa, J., Corsmeier, U., Jones, S. C., Keller, J. H., Lenz, C.-J., and Wiegand, L.: The key role of diabatic processes in modifying the upper-tropospheric wave guide: a North Atlantic case-study, Q. J. Roy. Meteor. Soc., 137, 2174–2193, https://doi.org/10.1002/qj.891, 2011. a

Han, T., Chen, Z., Guo, S., Xu, W., and Bai, L.: CRA5: Extreme Compression of ERA5 for Portable Global Climate and Weather Research via an Efficient Variational Transformer, arXiv [preprint], https://doi.org/10.48550/arxiv.2405.03376, 8 May 2024. a

Hoefler, T., Stevens, B., Prein, A. F., Baehr, J., Schulthess, T., Stocker, T. F., Taylor, J., Klocke, D., Manninen, P., Forster, P. M., Kolling, T., Gruber, N., Anzt, H., Frauen, C., Ziemen, F., Klower, M., Kashinath, K., Schar, C., Fuhrer, O., and Lawrence, B. N.: Earth Virtualization Engines: A Technical Perspective, Comput. Sci. Eng., 25, 50–59, https://doi.org/10.1109/MCSE.2023.3311148, 2023. a

Hoefler, T., Calotoiu, A., Dipankar, A., Schulthess, T., Lapillonne, X., and Fuhrer, O.: Towards Specialized Supercomputers for Climate Sciences: Computational Requirements of the Icosahedral Nonhydrostatic Weather and Climate Model, arXiv [preprint], https://doi.org/10.48550/arXiv.2405.13043, 18 May 2024. a

Holton, J. R., Haynes, P. H., McIntyre, M. E., Douglass, A. R., Rood, R. B., and Pfister, L.: Stratosphere-troposphere exchange, Rev. Geophys., 33, 403–439, https://doi.org/10.1029/95RG02097, 1995. a

IAEA: The Fukushima Daiichi Accident, Non-serial Publications, International Atomic Energy Agency, Vienna, ISBN 978-92-0-107015-9, https://www.iaea.org/publications/10962/the-fukushima-daiichi-accident (last access: 9 April 2026), 2015. a

Joos, H. and Wernli, H.: Influence of microphysical processes on the potential vorticity development in a warm conveyor belt: a case-study with the limited-area model COSMO, Q. J. Roy. Meteor. Soc., 138, 407–418, https://doi.org/10.1002/qj.934, 2012. a

Katata, G., Ota, M., Terada, H., Chino, M., and Nagai, H.: Atmospheric discharge and dispersion of radionuclides during the Fukushima Dai-ichi Nuclear Power Plant accident. Part I: Source term estimation and local-scale atmospheric dispersion in early phase of the accident, J. Environ. Radioactiv., 109, 103–113, https://doi.org/10.1016/j.jenvrad.2012.02.006, 2012. a

Keune, J., Schumacher, D. L., and Miralles, D. G.: A unified framework to estimate the origins of atmospheric moisture and heat using Lagrangian models, Geosci. Model Dev., 15, 1875–1898, https://doi.org/10.5194/gmd-15-1875-2022, 2022. a

Liang, X., Di, S., Tao, D., Li, S., Li, S., Guo, H., Chen, Z., and Cappello, F.: Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets, in: 2018 IEEE International Conference on Big Data (Big Data), 438–447, https://doi.org/10.1109/BigData.2018.8622520, 2018. a

Liang, X., Di, S., Cappello, F., Raj, M., Liu, C., Ono, K., Chen, Z., Peterka, T., and Guo, H.: Toward Feature-Preserving Vector Field Compression, IEEE T. Vis. Comput. Gr., 29, 5434–5450, https://doi.org/10.1109/TVCG.2022.3214821, 2023a. a

Liang, X., Zhao, K., Di, S., Li, S., Underwood, R., Gok, A. M., Tian, J., Deng, J., Calhoun, J. C., Tao, D., Chen, Z., and Cappello, F.: SZ3: A Modular Framework for Composing Prediction-Based Error-Bounded Lossy Compressors, IEEE Transactions on Big Data, 9, 485–498, https://doi.org/10.1109/TBDATA.2022.3201176, 2023b. a

Lindstrom, P.: Fixed-Rate Compressed Floating-Point Arrays, IEEE T. Vis. Comput. Gr., 20, https://doi.org/10.1109/TVCG.2014.2346458, 2014. a

Madonna, E., Wernli, H., Joos, H., and Martius, O.: Warm Conveyor Belts in the ERA-Interim Dataset (1979–2010). Part I: Climatology and Potential Vorticity Evolution, J. Climate, 27, 3–26, https://doi.org/10.1175/JCLI-D-12-00720.1, 2014. a

Makris, A., Kontopoulos, I., Alimisis, P., and Tserpes, K.: A Comparison of Trajectory Compression Algorithms Over AIS Data, IEEE Access, 9, 92516–92530, https://doi.org/10.1109/ACCESS.2021.3092948, 2021. a, b

McIntosh, J. T.: Humanoid Sacrifice, The Magazine of Fantasy and Science Fiction, https://archive.org/details/Fantasy_Science_Fiction_v026n03_1964-03_PDF/page/n127/mode/2up (last access: 13 April 2026), 1964. a

Pérez-Muñuzuri, V., Eiras-Barca, J., and Garaboa-Paz, D.: Tagging moisture sources with Lagrangian and inertial tracers: application to intense atmospheric river events, Earth Syst. Dynam., 9, 785–795, https://doi.org/10.5194/esd-9-785-2018, 2018. a

Pfahl, S., Madonna, E., Boettcher, M., Joos, H., and Wernli, H.: Warm Conveyor Belts in the ERA-Interim Dataset (1979–2010). Part II: Moisture Origin and Relevance for Precipitation, J. Climate, 27, 27–40, https://doi.org/10.1175/JCLI-D-13-00223.1, 2014. a

Pietak, A.: apietak/psit: psit v1.0.0, Zenodo [code], https://doi.org/10.5281/zenodo.14888491, 2025a. a, b, c

Pietak, A.: psit, GitHub [code], https://github.com/apietak/psit (last access: 9 April 2026), 2025b. a

Pietak, A.: Data for “psit 1.0: a system to compress Lagrangian flows”, Zenodo [data set], https://doi.org/10.5281/zenodo.19887527, 2026. a

Poppick, A., Nardi, J., Feldman, N., Baker, A. H., Pinard, A., and Hammerling, D. M.: A statistical analysis of lossily compressed climate model data, Comput. Geosci., 145, 104599, https://doi.org/10.1016/j.cageo.2020.104599, 2020. a

Quach, M., Pang, J., Tian, D., Valenzise, G., and Dufaux, F.: Survey on Deep Learning-Based Point Cloud Compression, Frontiers in Signal Processing, 2, https://doi.org/10.3389/frsip.2022.846972, 2022. a

Ren, C., Liang, X., and Guo, H.: A Prediction-Traversal Approach for Compressing Scientific Data on Unstructured Meshes with Bounded Error, Comput. Graph. Forum, 43, e15097, https://doi.org/10.1111/cgf.15097, 2024. a

Rodwell, M., Forbes, R., and Wernli, H.: Why warm conveyor belts matter in NWP, ECMWF newsletter, https://doi.org/10.21957/mr20vg, 2018. a

Schemm, S., Wernli, H., and Papritz, L.: Warm conveyor belts in idealized moist baroclinic wave simulations, J. Atmos. Sci., 70, 627–652, https://doi.org/10.1175/JAS-D-12-0147.1, 2013. a

Schielicke, L. and Pfahl, S.: European heatwaves in present and future climate simulations: a Lagrangian analysis, Weather Clim. Dynam., 3, 1439–1459, https://doi.org/10.5194/wcd-3-1439-2022, 2022. a

Škerlak, B., Sprenger, M., and Wernli, H.: A global climatology of stratosphere–troposphere exchange using the ERA-Interim data set from 1979 to 2011, Atmos. Chem. Phys., 14, 913–937, https://doi.org/10.5194/acp-14-913-2014, 2014. a

Skodras, A., Christopoulos, C., and Ebrahimi, T.: The JPEG 2000 still image compression standard, IEEE Signal Proc. Mag., 18, 36–58, https://doi.org/10.1109/79.952804, 2001. a

Smith, A. R.: Color gamut transform pairs, SIGGRAPH Comput. Graph., 12, 12–19, https://doi.org/10.1145/965139.807361, 1978. a

Sprenger, M. and Wernli, H.: The LAGRANTO Lagrangian analysis tool – version 2.0, Geosci. Model Dev., 8, 2569–2586, https://doi.org/10.5194/gmd-8-2569-2015, 2015. a

Sprenger, M., Fragkoulidis, G., Binder, H., Croci-Maspoli, M., Graf, P., Grams, C. M., Knippertz, P., Madonna, E., Schemm, S., Škerlak, B., and Wernli, H.: Global Climatologies of Eulerian and Lagrangian Flow Features based on ERA-Interim, B. Am. Meteorol. Soc., 98, 1739–1748, https://doi.org/10.1175/BAMS-D-15-00299.1, 2017. a

Stoffels, R., Benedict, I., Papritz, L., Selten, F., and Weijenborg, C.: Precipitation, moisture sources and transport pathways associated with summertime North Atlantic deep cyclones, Weather Clim. Dynam., 6, 1743–1768, https://doi.org/10.5194/wcd-6-1743-2025, 2025. a

Stohl, A., Seibert, P., Wotawa, G., Arnold, D., Burkhart, J. F., Eckhardt, S., Tapia, C., Vargas, A., and Yasunari, T. J.: Xenon-133 and caesium-137 releases into the atmosphere from the Fukushima Dai-ichi nuclear power plant: determination of the source term, atmospheric dispersion, and deposition, Atmos. Chem. Phys., 12, 2313–2343, https://doi.org/10.5194/acp-12-2313-2012, 2012. a

Sullivan, G. J., Ohm, J.-R., Han, W.-J., and Wiegand, T.: Overview of the High Efficiency Video Coding (HEVC) Standard, IEEE T. Circ. Syst. Vid., 22, 1649–1668, https://doi.org/10.1109/TCSVT.2012.2221191, 2012. a

Tintó Prims, O., Redl, R., Rautenhaus, M., Selz, T., Matsunobu, T., Modali, K. R., and Craig, G.: The effect of lossy compression of numerical weather prediction data on data analysis: a case study using enstools-compression 2023.11, Geosci. Model Dev., 17, 8909–8925, https://doi.org/10.5194/gmd-17-8909-2024, 2024. a, b, c

Unidata: Network Common Data Form (netCDF) version 4.9.2, UniData [code], https://doi.org/10.5065/D6H70CW6, 2024. a

Wallace, G. K.: The JPEG still picture compression standard, Commun. ACM, 34, 30–44, https://doi.org/10.1145/103085.103089, 1991. a

Wendisch, M., Crewell, S., Ehrlich, A., Herber, A., Kirbus, B., Lüpkes, C., Mech, M., Abel, S. J., Akansu, E. F., Ament, F., Aubry, C., Becker, S., Borrmann, S., Bozem, H., Brückner, M., Clemen, H.-C., Dahlke, S., Dekoutsidis, G., Delanoë, J., De La Torre Castro, E., Dorff, H., Dupuy, R., Eppers, O., Ewald, F., George, G., Gorodetskaya, I. V., Grawe, S., Groß, S., Hartmann, J., Henning, S., Hirsch, L., Jäkel, E., Joppe, P., Jourdan, O., Jurányi, Z., Karalis, M., Kellermann, M., Klingebiel, M., Lonardi, M., Lucke, J., Luebke, A. E., Maahn, M., Maherndl, N., Maturilli, M., Mayer, B., Mayer, J., Mertes, S., Michaelis, J., Michalkov, M., Mioche, G., Moser, M., Müller, H., Neggers, R., Ori, D., Paul, D., Paulus, F. M., Pilz, C., Pithan, F., Pöhlker, M., Pörtge, V., Ringel, M., Risse, N., Roberts, G. C., Rosenburg, S., Röttenbacher, J., Rückert, J., Schäfer, M., Schaefer, J., Schemann, V., Schirmacher, I., Schmidt, J., Schmidt, S., Schneider, J., Schnitt, S., Schwarz, A., Siebert, H., Sodemann, H., Sperzel, T., Spreen, G., Stevens, B., Stratmann, F., Svensson, G., Tatzelt, C., Tuch, T., Vihma, T., Voigt, C., Volkmer, L., Walbröl, A., Weber, A., Wehner, B., Wetzel, B., Wirth, M., and Zinner, T.: Overview: quasi-Lagrangian observations of Arctic air mass transformations – introduction and initial results of the HALO–(𝒜𝒞)3 aircraft campaign, Atmos. Chem. Phys., 24, 8865–8892, https://doi.org/10.5194/acp-24-8865-2024, 2024.  a

Wernli, H. and Davies, H. C.: A Lagrangian-based analysis of extratropical cyclones. I: The method and some applications, Q. J. Roy. Meteor. Soc., 123, 467–489, https://doi.org/10.1002/qj.49712353811, 1997a. a

Wernli, H. and Davies, H. C.: A Lagrangian-based analysis of extratropical cyclones. I: The method and some applications, Q. J. Roy. Meteor. Soc., 123, 467–489, https://doi.org/10.1002/qj.49712353811, 1997b. a

Woodring, J., Mniszewski, S., Brislawn, C., DeMarle, D., and Ahrens, J.: Revisiting wavelet compression for large-scale climate data using JPEG 2000 and ensuring data precision, in: 2011 IEEE Symposium on Large Data Analysis and Visualization, 31–38, https://doi.org/10.1109/LDAV.2011.6092314, 2011. a

Wotawa, G.: The Fukushima Disaster Calls for a Global Open Data and Information Policy, GAIA – Ecological Perspectives for Science and Society, 20, 91–94, https://doi.org/10.14512/gaia.20.2.5, 2011. a

Wu, X., Di, S., Ren, C., Jiao, P., Xia, M., Wang, C., Guo, H., Liang, X., and Cappello, F.: Enabling Efficient Error-Controlled Lossy Compression for Unstructured Scientific Data, in: 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 370–382, https://doi.org/10.1109/IPDPS64566.2025.00040, 2025. a

Yasunari, T. J., Stohl, A., Hayano, R. S., Burkhart, J. F., Eckhardt, S., and Yasunari, T.: Cesium-137 deposition and contamination of Japanese soils due to the Fukushima nuclear accident, P. Natl. Acad. Sci. USA, 108, 19530–19534, https://doi.org/10.1073/pnas.1112058108, 2011. a

Zhao, K., Di, S., Dmitriev, M., Tonellot, T.-L. D., Chen, Z., and Cappello, F.: Optimizing Error-Bounded Lossy Compression for Scientific Data by Dynamic Spline Interpolation, in: 2021 IEEE 37th International Conference on Data Engineering (ICDE), 1643–1654, https://doi.org/10.1109/ICDE51399.2021.00145, 2021. a

Short summary
As meteorological models grow in complexity, the volume of output data increases, making compression increasingly desirable. However, no specialized methods currently exist for compressing data in the Lagrangian frame. To address this gap, we developed psit, a pipeline for the lossy compression of Lagrangian flow data. In most cases, psit achieves performance that is equivalent or superior to non-specialized alternatives, with compression errors behaving similarly to measurement inaccuracies.