Probability density functions (PDFs) describe how likely a random variable is to take on values in a given range. In geoscience, data distributions are often expressed by a parametric estimation of their PDF, such as a Gaussian distribution. At present there is growing attention towards the analysis of non-parametric estimation of PDFs, where no prior assumptions about the type of PDF are required. A common tool for such non-parametric estimation is a kernel density estimator (KDE). Existing KDEs are valuable but problematic because of the difficulty of objectively specifying optimal bandwidths for the individual kernels. In this study, we designed and developed a new implementation of a diffusion-based KDE as an open source Python tool to make diffusion-based KDE accessible for general use. Our new diffusion-based KDE provides (1) consistency at the boundaries, (2) better resolution of multimodal data, and (3) a family of KDEs with different smoothing intensities. We demonstrate our tool on artificial data with multiple and boundary-close modes and on real marine biogeochemical data, and compare our results against other popular KDE methods. We also provide an example of how our approach can be efficiently utilized for the derivation of plankton size spectra in ecological research. Our estimator is able to detect relevant multiple modes and it resolves modes that are located close to a boundary of the observed data interval. Furthermore, our approach produces a smooth graph that is robust to noise and outliers. The convergence rate is comparable to that of the Gaussian estimator, but with a generally smaller error. This is most notable for small data sets with up to around 5000 data points. We discuss the general applicability and advantages of such KDEs for data–model comparison in geoscience.

In geoscience, the application of numerical models has become an integral part of research. Given the complexity of some models, such as earth system models with their descriptions of detailed processes in the ocean, atmosphere, and land, a number of plausible model solutions may exist. Accordingly, there is a strong demand for the analysis of model simulations on various temporal and spatial scales and for the evaluation of these results against observational data. A viable evaluation procedure is to compare non-parametric probability density functions (PDFs) of the data with their simulated counterparts. Non-parametric here means that no assumptions are made regarding any particular (parametric) probability distribution, such as the normal distribution.

Some studies have already documented the advantage of analyzing changes in PDFs, for example, when results of climate models are evaluated on a regional scale and their sensitivities to uncertainties in model parameterizations and forcing are examined

The examination of the suitability of certain divergence functions for data–model assessment is only one aspect; another is the quality or representativeness of the estimated PDFs. Well-approximated PDFs have been used to benefit data analysis in the geosciences

Mathematically formulated, PDFs are integrable non-negative functions

The application of kernel density estimators (KDEs) has become a common approach for approximating PDFs in a

A KDE is based on a kernel function and a smoothing parameter. The kernel function is ideally chosen to be a PDF itself, usually unimodal and centered around zero

The reformulation of the most common Gaussian KDEs

In this study, we present a new, modified diffusion-based KDE with an accompanying Python package, diffKDE

This paper is structured as follows: first, we will briefly summarize the general concept of KDEs. Afterwards, our specific KDE approach will be introduced and described, as developed and implemented in our software package. We explain the two pilot estimation steps and the selection of the smoothing parameters. Then the performance of our refined estimator will be compared with other state-of-the-art KDEs while considering known distributions and real marine biogeochemical data. The real test data include carbon isotope ratios of particulate organic matter (

A kernel density estimator is a non-parametric statistical tool for the estimation of PDFs. In practice, diverse specifications of KDEs exist that may improve the performance with respect to individual needs. Before we explain our specifications of the diffusion-based KDE, we will provide basic background information about KDEs.
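As a point of reference for the background in this section, the following minimal sketch evaluates a standard Gaussian KDE with one fixed bandwidth on a grid. It is not part of the diffKDE package: the function name `gaussian_kde`, the sample values, and the bandwidth of 0.4 are assumptions chosen purely for illustration.

```python
import math


def gaussian_kde(data, grid, bandwidth):
    """Evaluate a fixed-bandwidth Gaussian KDE at every point of `grid`.

    Each data point contributes one Gaussian kernel centered at that
    point; the estimate is the average of all kernels.
    """
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        for x in grid
    ]


# Illustrative sample with two clusters (hypothetical values).
data = [1.0, 1.2, 1.9, 2.1, 2.3, 4.8, 5.0]

# Equidistant evaluation grid on [-5, 10], wide enough to hold all kernels.
grid = [-5.0 + 0.01 * i for i in range(1501)]

density = gaussian_kde(data, grid, bandwidth=0.4)

# Trapezoidal rule: the estimate should integrate to approximately one.
area = sum(
    0.5 * (density[i] + density[i + 1]) * (grid[i + 1] - grid[i])
    for i in range(len(grid) - 1)
)
```

Because the same bandwidth is applied to every kernel, a single value has to balance the resolution of narrow modes against the suppression of noise, which is the difficulty that motivates the diffusion-based approach.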

From now on, let

The most general form of a KDE approximates the true density

The sets

The parameter

If now

This result was already obtained by

In Eq. (

There exists a variety of available choices for the type of kernel function

A common choice for

The standard KDE from Eq. (

The diffusion-based KDE provides a different approach to Eq. (

This KDE solves Eq. (

The input data are treated as the initial value

An advantageous connection to Eq. (

This function solves Eq. (

Our implementation of the diffusion KDE is based on

The final iteration time

This specific type of KDE has several advantages. First of all, it naturally provides a sequence of estimates for different smoothing parameters

We now focus on the selection of the optimal squared bandwidth

We stressed in Eq. (

In the simplified setup with

Still, the smoothing parameter depends on the unknown density function

The possible difficulties in finding one single optimal bandwidth

A crude first estimate of the true density

Our new approach solves the diffusion equation in three stages, where the first two provide pilot estimation steps for the diffKDE. The three chosen bandwidths increase in complexity and accuracy over this iteration. This algorithm was implemented in Python 3.

For the optimal final iteration time

We begin to estimate

As seen in Eq. (

To calculate the final iteration time

The iqr is the interquartile range defined as

We approximate optimal final iteration time

For the boundary values we set the second derivative at the lower boundary to

We set the finite differences approximation from Eqs. (

The

Equation

We start with the description of the discretization of the spatial domain

Let

This implies

We approximate Eq. (

Analogously, we approximate Eqs. (

Finally, we derive from Eq. (

Now, we identify

By these calculations the solution of the partial differential equation from Eq. (

The time-stepping applied to solve the ordinary differential equation from Eqs. (

We use an implicit Euler method to approximate Eq. (

We chose the implicit Euler method here because it is A-stable, which keeps the solver stable for any time step size.
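The combination of equidistant finite differences in space and implicit Euler steps in time can be sketched as follows. This is a simplified illustration and not the diffKDE implementation: it assumes the plain heat equation with a constant diffusion coefficient of 1/2, reflecting (zero-flux) boundaries, and no pilot estimates or adaptive smoothing. All names (`solve_tridiagonal`, `diffusion_kde`), the sample values, and the final time of 0.16 are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal linear system with the Thomas algorithm.

    lower[0] and upper[-1] are not part of the matrix and are ignored.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    solution = [0.0] * n
    solution[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        solution[i] = d_prime[i] - c_prime[i] * solution[i + 1]
    return solution


def diffusion_kde(data, x_min, x_max, n_intervals=200, t_final=0.16, n_steps=50):
    """Smooth data by solving u_t = 0.5 * u_xx with implicit Euler steps."""
    dx = (x_max - x_min) / n_intervals
    grid = [x_min + i * dx for i in range(n_intervals + 1)]

    # Trapezoidal quadrature weights (half weight at both boundaries).
    weights = [dx] * (n_intervals + 1)
    weights[0] = weights[-1] = 0.5 * dx

    # Initial value: one narrow spike per data point on the nearest grid
    # node, scaled so that the trapezoidal integral equals one.
    u = [0.0] * (n_intervals + 1)
    for xi in data:
        j = min(max(round((xi - x_min) / dx), 0), n_intervals)
        u[j] += 1.0 / (len(data) * weights[j])

    # Implicit Euler: solve (I - dt * A) u_new = u_old in every step, where
    # A is the second-difference matrix with reflecting boundary rows.
    dt = t_final / n_steps
    r = 0.5 * dt / dx**2
    lower = [0.0] + [-r] * (n_intervals - 1) + [-2.0 * r]
    diag = [1.0 + 2.0 * r] * (n_intervals + 1)
    upper = [-2.0 * r] + [-r] * (n_intervals - 1) + [0.0]
    for _ in range(n_steps):
        u = solve_tridiagonal(lower, diag, upper, u)
    return grid, u, weights


# Illustrative sample (hypothetical values) on the interval [0, 6].
data = [1.0, 1.2, 1.9, 2.1, 2.3, 4.8, 5.0]
grid, density, weights = diffusion_kde(data, x_min=0.0, x_max=6.0)

# The reflecting boundary rows conserve the trapezoidal integral.
mass = sum(w * value for w, value in zip(weights, density))
```

In this simplified setting the system matrix is tridiagonal and diagonally dominant, so every implicit step costs only linear time in the number of grid points. For larger problems a sparse solver from an established numerical library would be the more practical choice.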

Equation (

Dirac sequence

The initial value in Eq. (

We assume

Then

Input variables. The only required input variable for the calculation of the diffKDE is a one-dimensional data set as an array-like type. All other variables are optional, with some prescribed defaults. Users can optionally set individual lower and upper bounds for the data evaluation as well as the number of spatial and temporal discretization intervals. Choosing the final iteration time individually allows the degree of smoothing to be set as needed.

Now, let

Hence Eq. (

The concept of the Dirac sequence also provides the justification to generally rely on the

The implementation was conducted in Python and its concept is shown in Algorithm

Finite-differences-based algorithm for the implementation of the diffusion KDE. Note: the routine solve

The spatial grid discretizing

The Dirac sequence

In the pilot estimation steps, we calculate the KDEs for

The final iteration time

The return value is a tuple providing the user the diffKDE and the spatial discretization as displayed in Table

Possible problems are caught in

Besides the standard use to calculate a diffKDE at an approximated optimal final iteration time for direct usage, we also included three possibilities to generate a direct visual output, one of them being interactive. Matplotlib

The function call

Pre-implemented direct visual output of the evolution process of the diffKDE. The input data are

The function call

The diffKDE and its pilot estimate

Different snapshots from the interactive visualization of the diffusion KDE generated from the artificial data set

The function call

In the following we document the performance of the diffKDE on artificial and real marine biogeochemical data. Whenever not stated otherwise, we used the default values of the input variables given in Table

Return values of the diffKDE. The return variable of the diffKDE is a vector. Its first entry is the diffKDE evaluated on the spatial grid. Its second entry is the spatial grid

For testing our implementation against a known true PDF, we first constructed a three-modal distribution. The objective was to assess the diffKDE's resolution and to exemplify the pre-implemented plot routines. The distribution was constructed from three Gaussian kernels centered around

As described in Sect.

First we demonstrate how to display the diffKDE's evolution. By calling the

Lastly, we illustrate example snapshots of the interactive option to investigate different smoothing stages of the diffKDE by the function. We chose simpler and smaller example data for this demonstration, because these are better suited for visualization of this tool's possibilities. The function

In this section we present results obtained by random samples of the trimodal distribution from Eq. (

We start with an individual selection of the approximation stages. One of the main benefits of the diffKDE over standard KDEs is that it naturally provides a family of approximations. This family can be observed by the function

Family of diffKDEs evaluated at different bandwidths. A data set of 50 random samples drawn as gray circles on the

In the following we only work with the default solution of the diffKDE at

Test cases with known distributions. Panels

Integrals of the KDEs displayed in Figs.

We use differently sized random samples of the known distribution from Eq. (

Lognormal test cases with different mean and variance parameters. Of each distribution

We refined the test cases from Fig.

Test cases with different sample sizes. All four plots show the diffKDE of random samples of the known trimodal distribution defined in Eq. (

Now we show the performance of the diffKDE on increasingly large data sets. We still use the trimodal distribution from Eq. (

Furthermore, we investigated the convergence of the diffKDE to the true distribution, again in comparison with the three other KDEs. The error between the respective KDE and the true distribution is calculated by the Wasserstein distance

The evolution of the errors of the diffKDE, the Gaussian KDE, the Epanechnikov KDE, and the Botev KDE are drawn on a log scale against the increasing sample size on the

Noised data experiments. A random sample of

Finally, we investigated the noise sensitivity of the diffKDE compared with the three other KDEs on data containing artificially introduced noise. We again used the trimodal distribution from Eq. (

The performance of the diffKDE is now illustrated with real data of (a) measurements of carbon isotopes

Both data sets were already analyzed using KDEs in their original publications

Comparison of KDE performance on marine biogeochemical field data. The

Comparison of KDE performance on

The

Another example demonstrates the performance of the diffKDE if applied to plankton size data

Comparison of KDE performance using monthly means (February, March, and April) of chlorophyll

Our last example refers to PDFs that reflect temporal changes (monthly means) of surface chlorophyll

In geoscientific research, the derivation and comparison of well-resolved PDFs can be useful, as demonstrated in our selected examples. Although the significance of resolving details in non-parametric PDFs remains unclear, having high-resolution PDFs available, as obtained with the diffKDE, is of immediate value and will likely guide further research. An obvious benefit of the diffKDE is its lesser dependence on the specification of a single, albeit optimal, bandwidth. Its application is likely more robust for the assessment of simulation results, either against data or results of other models (e.g., multimodel ensembles), which is particularly relevant for evaluations of future climate projections obtained with earth system models

Comparison of model and field data requires additional processing to account for spatial–temporal differences between collected samples and model resolution. Typically, simulation results are available at every single spatial grid point and in every time step. In contrast, field data are usually only sparsely available. Interpolating such sparse field data can introduce high uncertainty

The presented diffKDE provides a non-parametric approach to estimate PDFs with typical features of geoscientific data. Being able to resolve typical patterns, such as multiple or boundary-close modes, while being insensitive to noise and individual outliers makes the diffKDE a suitable tool for future work in the calibration and optimization of earth system models.

In this study we constructed and tested a kernel density estimator (KDE) of probability density functions (PDFs) that can be applied for analyzing geoscientific and ecological data. KDEs allow the investigation of data with respect to their probability distribution, and PDFs can be derived even for sparse data. To be well suited for geoscientific data, the KDE must work fast and reliably on differently sized data sets while revealing multimodal details as well as features near data boundaries. A KDE should not be overly sensitive to noise introduced by measurement errors or by numerical uncertainties. Such an estimator can be applied for direct data analyses or can be used to construct a target function for model assessment and calibration.

We presented a novel implementation of a KDE based on the diffusion heat process (diffKDE). This idea was originally proposed by

Finite differences build the fundamentals of our discretization. The spatial discretization comprises equidistant finite differences. The

Our diffKDE implementation includes pre-implemented default output options. The first option is the visualization of the diffusion time evolution showing the sequence of all solution steps from the initial values to the final diffKDE. This lets a user view the influence of individual data points and outlier accumulations on the final diffKDE and how this decreases over time. The second option is the visualization of the pilot estimate that is also included in the partial differential equation to introduce adaptive smoothing properties. This provides the user with easy insight into the adaptive smoothing as well as the lower boundary of structure resolution given by this parameter function. Finally, an interactive plot provides a simple opportunity to explore all of these time iterations and look even beyond the optimal bandwidth and see smoother estimates.

Our implementation is fast and reliable on differently sized and multimodal data sets. We tested the implementation for up to 10 million data points and obtained acceptably fast results. A comparison of the diffKDE on known distributions together with classically employed KDEs showed reliable and often superior performance. For comparison we chose a SciPy implementation

An assessment of the diffKDE on real marine biogeochemical field data in comparison with commonly employed KDEs reveals superior performance of the diffKDE. We used carbon isotope and plankton size spectra data and compared the diffKDE to the KDEs that were used to explore the data in the respective original data publications. On the carbon isotope data, we furthermore applied all previous KDEs for comparison. In both cases we were able to show that the diffKDE resolves relevant features of the data while not being sensitive to individual outliers or uncertainties (noise) in the data. We were able to obtain a reliable representation of the true data distribution that was better than those derived with the other KDEs.

In future studies the diffKDE may be used for the assessment, calibration, and optimization of marine biogeochemical and earth system models. A plot comprising PDFs of field data and simulation results may already provide visual insight into some shortcomings of the applied model. A target function can be constructed by adding a distance like the Wasserstein distance

The derivation of the optimal bandwidth choice for a KDE was already described in

In our context of working with the squared bandwidth

Here, we briefly give the proof of the integral property of the used Dirac sequence

Table

Error convergence of the observed KDEs in Fig.

The exact version of the diffKDE implementation

MTP structured the manuscript and developed the implementation of the diffusion-based kernel density estimator. VL conducted the comparison experiments with the plankton size spectra. MS conducted the comparison experiments with the remote sensing data. CJS edited the manuscript. MS and TS edited the manuscript and supported the development of the implementation of the diffusion-based kernel density estimator.

The contact author has declared that none of the authors has any competing interests.


The editor and the authors thank two anonymous reviewers for their comments on this article.

The first author is funded through the Helmholtz School for Marine Data Science (MarDATA), grant No. HIDSS-0005. The article processing charges for this open-access publication were covered by the GEOMAR Helmholtz Centre for Ocean Research Kiel.

This paper was edited by Travis O'Brien and reviewed by two anonymous referees.