A model's ability to reproduce the state of a simulated object, or of a particular feature or phenomenon, is always a subject of
discussion. Multidimensional model quality assessment is usually customized for the specific focus of the study and is often limited to a small number of
locations. In this paper, we propose a method that provides information on the accuracy of the model in general, while all dimensional information
for posterior analysis of the specific tasks is retained. The main goal of the method is to perform clustering of the multivariate model errors. The
clustering is done using the
Ocean general circulation models are valuable tools for hindcasting and forecasting ocean state. The values of the simulated fields depend on the
quality of the modelling products. Assessment of model quality is a basic step that is taken before the model results are used for evaluation of the
ocean state or other specific purposes. For instance, product quality assessment is routinely done for all products of the Monitoring Forecasting Centre within the Copernicus Marine Environment Monitoring Service (CMEMS; 2016) and the National Oceanic and Atmospheric Administration (NOAA;
Common statistical metrics for a single prognostic variable (e.g. bias, root mean square difference, correlation coefficient, standard deviations) are used to assess the model skills (Murphy et al., 1989; Murphy, 1995; Wȩglarczyk, 1998; Jolliff et al., 2009; Dybowski et al., 2019). Taylor diagrams (Taylor, 2001) or target diagrams (Jolliff et al., 2009) are usually implemented for compact visualization of the model performance statistics. Stow et al. (2009) studied 149 papers based on numerical modelling. They found that the majority (68 %) of the model validation works were based on visual comparison and on comparison of simple statistics such as bias and variance; 9 % of the works calculated the correlation coefficient, and roughly 11 % of the works implemented various cost–function techniques (e.g. Holt et al., 2005; Eilola et al., 2009).
Ocean general circulation model output consists of a set of variables in space and time, i.e. 4-dimensional fields (three spatial dimensions and time). Similarly, measurement data have a 4-dimensional distribution but are irregular in space and time. The amount of observational data has increased tremendously over the past decades.
Temperature and salinity are widely used state variables for the assessment of the accuracy of general circulation models. These variables “integrate” temporal and spatial dynamics of the circulation in the water basin that has been modelled. Temperature and salinity are usually measured simultaneously, have a 4-dimensional distribution, and form a major share of the data in the databases. The classical approach is that statistical metrics are calculated independently for each variable used for validation. Usually, time series data or profile data are extracted at a fixed location where the number of measurements is sufficiently large.
In these cases, the measurements at the locations which are seldom visited are not used for the validation, but these measurements can form a significant amount of the data in the databases. Also, the model performance statistics are calculated for preselected geographical areas whereby all data that fall into that area and time window are included. In that case, a single set of the model performance statistics characterizes the model performance in that area. Even if all available data with sufficient spatio-temporal coverage are used for multivariate comparison, the end result is a single metric or limited set of metrics that characterize the general quality of the model. Then, the same metrics of model goodness of fit are assigned to every grid point and time. The shortcoming of this approach is that the detailed spatial and temporal distribution of model errors is lost.
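The single-variable metrics named above can be illustrated with a short sketch. This is only a minimal example of the classical approach, assuming NumPy; the function name `skill_metrics` and the salinity values are hypothetical and are not taken from the study.

```python
import numpy as np

def skill_metrics(model, obs):
    """Bias, RMSD, Pearson correlation and standard deviations for one variable."""
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    diff = model - obs  # error defined as model minus observation
    return {
        "bias": float(diff.mean()),
        "rmsd": float(np.sqrt((diff ** 2).mean())),
        "corr": float(np.corrcoef(model, obs)[0, 1]),
        "std_model": float(model.std()),
        "std_obs": float(obs.std()),
    }

# Made-up salinity values, purely for illustration
obs = np.array([6.0, 7.0, 8.0, 9.0])
model = obs + 0.5  # model with a constant offset
metrics = skill_metrics(model, obs)
```

Note that the whole dataset collapses into one small dictionary, which is the loss of spatial and temporal detail discussed above.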
Ideally, researchers would like to know the model accuracy for the whole model domain and time period considered. Therefore, we suggest a new method based
on the machine learning
The intuitive prerequisite for using any clustering approach is that the dataset should have a natural cluster structure (Jain, 2010). Prior knowledge about model accuracy and the distribution of model errors in space and time is usually missing. If a large amount of data is available for comparison, then the distribution of model errors might not show visually identifiable clusters. If more than two variables are used for model quality assessment, then visualizing the errors in order to identify the clusters becomes even more complicated.
In this study, we will show that implementing the
Additionally, we implement the learning–predicting sequence in the form of clustering stability tests. The learning period consists of the model run for a certain period and error clustering. The learning period is for determining the number of clusters and the coordinates of the centroids. Based on the error clustering of the learning period, we can presume that a similar error distribution is valid for the forward model simulation results. During the predicting period, new available errors are added to the clusters. The coordinates of the centroids and other metrics are updated. In the operational applications, the value of this process lies in the fact that the exploitation of model simulation results can start before new validation is completed.
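The learning–predicting sequence can be sketched as follows. This is a hedged illustration, not the implementation used in the study: it assumes that new errors are assigned to the nearest learned centroid (Euclidean distance) and that each centroid is updated as a running mean. The function names `assign` and `predict_and_update` and all numbers are hypothetical.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid for every error point."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def predict_and_update(centroids, counts, new_points):
    """Add new errors to learned clusters and update the centroids.

    Assumption: each centroid is the mean of its members, so it can be
    updated as a running mean once new members arrive.
    """
    labels = assign(new_points, centroids)
    updated = centroids.copy()
    new_counts = counts.copy()
    for k in range(len(centroids)):
        members = new_points[labels == k]
        if len(members):
            total = counts[k] + len(members)
            updated[k] = (centroids[k] * counts[k] + members.sum(axis=0)) / total
            new_counts[k] = total
    return labels, updated, new_counts

# Learned state (made-up): two centroids, each built from two error pairs
learned = np.array([[0.0, 0.0], [3.0, 0.0]])
learned_counts = np.array([2, 2])
# Newly available error pairs from the predicting period (made-up)
new_errors = np.array([[0.3, 0.0], [3.0, 0.6]])
labels, updated, new_counts = predict_and_update(learned, learned_counts, new_errors)
```

With this design the learned clusters stay usable while validation data for the new period are still arriving; a full re-clustering could replace the running-mean update if the centroids drift strongly.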
We apply proposed
Spatial
The Baltic Sea (Fig. 1a) is a wide non-tidal estuary-type marginal sea with a longitudinal salinity between 0 and 20
Salinity fronts are formed in the straits that connect different sub-basins of the Baltic Sea: between the Kattegat and the southwestern Baltic Sea, the Gulf of Riga and the Baltic Proper, and the Gulf of Bothnia and the Baltic Proper. The Danish straits and the Kattegat are situated in a region with a very dynamic and strong front that separates the brackish Baltic Sea water and the saline North Sea water (Nielsen, 2005). The low-salinity Baltic Sea water is transported towards the North Sea in summer, whereas saline North Sea water flows into the Baltic Sea in winter (Mohrholz, 2018). A dynamic front is also present in the transition area between the northeastern Baltic Proper and the Gulf of Finland, although that is a wide and deep area.
The Baltic Sea is seasonally ice-covered. Inter-annually variable and dynamic ice coverage (Raudsepp et al., 2020) has a considerable effect on the evolution of the thermohaline fields in the Baltic Sea.
The General Estuarine Transport Model (GETM; Burchard and Bolding, 2002) is a numerical 3D circulation model initially developed for coastal and estuarine applications (Gräwe et al., 2015; Holtermann et al., 2014). The hindcast simulation of the general circulation of the Baltic Sea was carried out for the period of 1966–2006 (Maljutenko and Raudsepp, 2014, 2019). The model open boundary was located in the Kattegat, where sea-level elevation, temperature, and salinity were prescribed. The horizontal resolution of the model was set to 1 nmi (1852 m), which was consistent with the horizontal resolution of the digital bathymetry of the Baltic Sea (Seifert and Kayser, 1995). Vertically, 40 bottom-following adaptive layers were used, which resulted in a vertical resolution of less than 5 m.
The initial conditions of salinity and temperature were compiled using observation data from the Baltic Environmental Database (BED;
We use salinity and temperature measurements for the Baltic Sea from the EMODnet Chemistry database (SMHI, 2018). From the original dataset, we have
extracted 1 376 674 measurements, which met the following conditions: (1) time range of 1966–2005; (2) spatial range of the model domain, excluding
coastal observations, which fell outside the model grid; (3) the simultaneous existence of
The spatial and temporal distribution of the validation data is presented in Fig. 1. The spatial density of the data is visualized on the
25
Logarithmic distribution of the number of salinity and temperature error pairs (model minus observation) in the 2-dimensional error space
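A distribution of this kind can be produced along the following lines. The sketch assumes NumPy and uses synthetic values in place of the real co-located model and observation data; the function name `error_histogram` and the bin count are hypothetical choices.

```python
import numpy as np

def error_histogram(dS, dT, bins=50):
    """Log10 of the number of (salinity, temperature) error pairs per 2-D bin."""
    counts, s_edges, t_edges = np.histogram2d(dS, dT, bins=bins)
    with np.errstate(divide="ignore"):
        # Empty bins are set to NaN so that they can be left blank in a plot
        log_counts = np.where(counts > 0, np.log10(counts), np.nan)
    return log_counts, s_edges, t_edges

# Synthetic co-located values (assumption: model already interpolated to observations)
rng = np.random.default_rng(0)
n = 1000
obs_S = rng.uniform(0.0, 20.0, n)
obs_T = rng.uniform(0.0, 20.0, n)
model_S = obs_S + rng.normal(0.5, 1.0, n)
model_T = obs_T + rng.normal(0.0, 1.5, n)

dS = model_S - obs_S  # salinity error, model minus observation
dT = model_T - obs_T  # temperature error, model minus observation
log_counts, s_edges, t_edges = error_histogram(dS, dT, bins=40)
```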
The
The first step of the method is to choose the number of clusters and their initialization. For practical reasons (Hastie et al., 2009), a regular
pattern of initial centroids was chosen for this study (Fig. 2b), although we have also run the algorithm with randomly spaced initial clusters. When we start with
only one cluster, we can choose its location at {
In general, the errors retain their 4-dimensional structure, i.e. {
Each error pair belongs to a fixed cluster
For the spatial distribution of errors, we take the error pairs as independent of time and vertical coordinate, i.e.
{
For the vertical distribution of errors, we take error pairs as dependent only on the vertical coordinate
{
For the temporal distribution of errors, we take error pairs as dependent only on time {
There is no need to do normalization when we look at time series in a fixed spatial location or plot the Hovmöller diagram of error clusters.
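The normalization mentioned above can be sketched as a share calculation. Because the formulas in this section are incomplete here, the sketch assumes that the share of a cluster in a bin (a horizontal cell, depth layer, or time interval) is the number of error pairs of that cluster in the bin divided by the total number of error pairs in the bin. The function name `cluster_shares` and the labels are hypothetical.

```python
import numpy as np

def cluster_shares(labels, bin_index, n_clusters, n_bins):
    """Share of each cluster within each bin (rows: bins, columns: clusters)."""
    counts = np.zeros((n_bins, n_clusters))
    np.add.at(counts, (bin_index, labels), 1)  # count error pairs per (bin, cluster)
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Bins without any data get NaN instead of a misleading zero share
        shares = np.where(totals > 0, counts / totals, np.nan)
    return shares

# Made-up cluster labels and depth-bin indices for six error pairs
labels = np.array([0, 0, 1, 2, 0, 1])
depth_bin = np.array([0, 0, 0, 0, 1, 1])
shares = cluster_shares(labels, depth_bin, n_clusters=3, n_bins=3)
```

The same function applies to any one-dimensional binning, for example calendar month for the seasonal distribution or year for the long-term distribution.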
The coordinates of the centroids and the standard deviations of salinity and temperature errors within the clusters for a different set of predefined number of clusters,
The distribution of clusters in the error space for a different number of predefined clusters,
We start by clustering bulk data covering the entire modelling period and domain. The error representation does not provide a clear understanding of how
many clusters should be predefined or how the clusters will form. The initial location of the centroids is selected according to the scheme shown in
Fig. 2b. The coordinates of the centroid of one cluster (Fig. 3a) provide a model bias of 0.64
Sum of square distances (black bars) between the normalized pairs of error points and their designated centroids for different numbers of initial centroids. The first-order (red bars) and the second-order (blue bars) forward differences calculated from the sum of square distances.
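The quantities in this caption can be computed as in the sketch below. It is a minimal illustration that assumes a plain Lloyd-type (K-means-style) centroid clustering written in NumPy, which may differ from the algorithm used in the study. The `circle_init` layout is hypothetical and is not the scheme of Fig. 2b, the normalization by the standard deviation of each variable is an assumption, and the error pairs are synthetic.

```python
import numpy as np

def circle_init(k, radius=1.0):
    """Hypothetical regular initial pattern: k centroids evenly spaced on a circle."""
    if k == 1:
        return np.zeros((1, 2))
    angles = 2.0 * np.pi * np.arange(k) / k
    return radius * np.column_stack([np.cos(angles), np.sin(angles)])

def kmeans(points, init, n_iter=100):
    """Lloyd iterations; returns centroids, labels and the sum of square distances."""
    centroids = np.array(init, dtype=float)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = centroids.copy()
        for k in range(len(centroids)):
            if np.any(labels == k):  # an empty cluster keeps its position
                new[k] = points[labels == k].mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    ssd = d2[np.arange(len(points)), labels].sum()
    return centroids, labels, ssd

# Synthetic (salinity, temperature) error pairs with three groups
rng = np.random.default_rng(1)
errors = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
                    for c in ([0.0, 0.0], [3.0, 0.0], [0.0, 3.0])])
normalized = errors / errors.std(axis=0)  # assumed normalization per variable

ssd = np.array([kmeans(normalized, circle_init(k))[2] for k in range(1, 7)])
first_diff = np.diff(ssd)        # first-order forward differences
second_diff = np.diff(ssd, n=2)  # second-order forward differences
```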
Increasing the number of clusters results in the splitting of the error space into clusters with centroids close to the zero point (Fig. 3). A
representative structure of the error distribution emerges in the case of four clusters (Fig. 3d). We can confirm the choice of four clusters by
implementing cluster selection criteria. The distance between points and their designated centroids decreases exponentially as the number of
clusters increases (Fig. 4). The rate of distance reduction with the increasing number of clusters shows local minima at
The
The distribution of the error clusters for
Retrieving spatial coverage of
The vertical distribution of the error clusters confirms that the share of good match errors ranges between 0.5 and 0.9 of all data (Fig. 5e). In the
surface layer, salinity is either overestimated or underestimated in almost 50 % of cases. Comparison with the horizontal
distribution of errors suggests that a large part of these errors belongs to the Danish straits (Fig. 5b). The overestimated temperature has a
considerable share centred at a depth of 25
The decrease over time in the share of good match coincides with an increase in the shares of underestimated salinity and overestimated salinity
(Fig. 5c). Seasonally, overestimated salinity has a higher share in summer, while underestimated salinity has a higher share in winter
(Fig. 5d). Combining the horizontal (Fig. 5b) and seasonal (Fig. 5d) distributions of errors, we can conclude that the salinity is overestimated in the
Danish straits in summer and underestimated in winter. In addition, we note that the share of good match decreases and the share of
underestimated salinity increases abruptly at the end of the 1980s, when the number of measurements in the database becomes larger. The
overestimated temperature has an almost constant share of 0.1 in time (Fig. 5c). The elevated share of overestimated temperature errors in
summer confirms that the model overestimates the temperature in the seasonal thermocline (Fig. 5d). For comparison, we have provided a similar
analysis of the errors for
Hovmöller diagram of the distribution of error points of
We extract error profiles from Gotland Deep station BY15, which is widely used for the validation of the physical and biogeochemical models of the
Baltic Sea. In the upper layer of 60
Learning
As the first step, the whole 4-dimensional {
The average normalized distance of shifts of predicting centroids relative to learned centroids as a function of the share of the learning dataset. Averaging has been done from 30 trials. Different lines correspond to different numbers of initial clusters,
If the learning dataset makes up 10 %–95 % of the total dataset (
The total number and the spatio-temporal coverage of the comparison points (Fig. 1) indicate that the model performs well over the Baltic Sea and the
simulation period considered (Fig. 5). The share of model errors with a bias of {
In addition, we can highlight the areas where the model accuracy is lower and the dynamical features are not so well reproduced by the
model. Essentially, the seasonal thermocline and the permanent halocline are reproduced less accurately than the layers with small vertical gradients
of salinity and temperature. In the seasonal thermocline, the share of overestimated temperature reaches a peak of 0.25
(bias of 3.78
Model accuracy is relatively low in the Danish straits. The model has underestimated salinity in winter and overestimated salinity in summer
(bias of 3.44
Clustering of model errors could provide information about the accuracy of external fields that are used for the forcing and for the boundary
conditions of the model. The overestimated temperature in the river plume areas (Fig. 5b) may indicate a mismatch of river water temperature that
takes the value from a grid cell adjacent to the river mouth. Although the air–sea fluxes are correctly reproduced by the model, as indicated by
a good match at the surface (Fig. 5c), the following downward flux of heat could be too strong, as the share of overestimated temperature is
relatively high between the depth of 10–40
Ideally, researchers would like to know the model accuracy over the whole model domain and time period simulated. Commonly used methods provide a limited set of metrics (e.g. bias, standard deviation, root mean square error, correlation coefficient) for the assessment of the overall quality of the model. In this study, we have proposed a new method for the assessment of model skills. The method is based on clustering multivariate model errors, i.e. the differences between model values and the measured multivariate data. The main advantage of this method is the possibility to use clustered errors for the analysis of the spatio-temporal accuracy of the model.
The method was tested in the validation of the circulation model results of the 40-year period in the Baltic Sea. Temperature and salinity were used for validation because they are essential parameters of the physical model, and these data have been the most extensively measured in the Baltic Sea. This method enables us to use all available observations, with the only restriction being the need to measure multivariate data simultaneously. In model validation, the problem usually lies in the spatio-temporal distribution of measurement data over the 4-dimensional model domain. In our case, the measurement data were sufficient and had good spatial and temporal coverage. In total, we had more than 1 300 000 pairs of measured temperature and salinity values. In many cases, the available data must be reduced or homogenized before model errors are calculated and clustering is applied, so that simultaneous multivariate data are available. The number of measurements should be sufficiently large to determine stable clusters. In our case, about 100 000 randomly selected data pairs showed relatively stable centroids and stable estimates of the model accuracy.
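The dependence of centroid stability on the number of data pairs can be explored with a subsampling test such as the one sketched below. This is an illustrative assumption rather than the procedure of the study: it uses a plain Lloyd-type clustering in NumPy, synthetic error pairs, and hypothetical names (`lloyd`, `mean_centroid_shift`), and it measures the mean distance between the centroids found from a random subsample and those found from the full dataset.

```python
import numpy as np

def lloyd(points, init, n_iter=50):
    """Plain Lloyd iterations with a fixed number of steps; returns centroids."""
    c = np.array(init, dtype=float)
    for _ in range(n_iter):
        labels = ((points[:, None] - c[None]) ** 2).sum(axis=2).argmin(axis=1)
        for k in range(len(c)):
            if np.any(labels == k):
                c[k] = points[labels == k].mean(axis=0)
    return c

def mean_centroid_shift(points, init, fraction, n_trials, rng):
    """Average shift of subsample centroids relative to full-data centroids."""
    reference = lloyd(points, init)
    n_sub = max(len(init), int(fraction * len(points)))
    shifts = []
    for _ in range(n_trials):
        idx = rng.choice(len(points), size=n_sub, replace=False)
        c = lloyd(points[idx], init)
        # Distance from each reference centroid to the nearest subsample centroid
        d = np.sqrt(((reference[:, None] - c[None]) ** 2).sum(axis=2))
        shifts.append(d.min(axis=1).mean())
    return float(np.mean(shifts))

# Synthetic error pairs with two groups
rng = np.random.default_rng(0)
points = np.vstack([rng.normal([-2.0, 0.0], 0.5, (200, 2)),
                    rng.normal([2.0, 0.0], 0.5, (200, 2))])
init = np.array([[-1.0, 0.0], [1.0, 0.0]])

shift_small = mean_centroid_shift(points, init, 0.1, 10, rng)
shift_full = mean_centroid_shift(points, init, 1.0, 10, rng)
```

Plotting the shift against the subsample fraction indicates how many data pairs are needed before the centroids stop moving appreciably.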
We have applied the
The
The clustering was done for the entire Baltic Sea and the whole simulation period. For comparison, conventional model validation with station measurements of temperature and salinity is presented in Maljutenko and Raudsepp (2014, 2019). The analysis of clusters of errors at specific locations enables us to assess the quality of the model at these locations in the context of the overall quality of the model. Multivariate model quality assessment shows that if one parameter is well reproduced by the model while the other parameter is poorly reproduced at the same time, then the overall quality of the model might not be good.
In addition to model quality, error clustering can provide implicit information about the quality of prescribed input variables and forcing
fields. Error clustering has shown that the temperature of river runoff water could be overestimated. This is especially relevant in the case of
biogeochemical models, where discharges of different nutrients and other state variables, which have to be prescribed, are usually poorly known. There
are problems in the prescribed salinity of the inflowing North Sea water at the open boundary of the model in the Kattegat. In addition, these errors
are transported into the model domain of the southwestern Baltic Sea. However, atmospheric fields necessary for the calculation of the air–sea heat fluxes
do not produce significant errors.
The proposed method could be applied for the assessment of the quality of global ocean general circulation models. By the end of the year 2020, there
were approximately 3800 ARGO floats profiling the world ocean for salinity and temperature, with a spatial resolution of approximately one float for
every 3
The proposed method can be applied to different geoscientific models. The shortlist consists of biogeochemical models, atmospheric models, wave models, hydrological models, and geodynamic models. An application of the method for the assessment of a coupled physical and biogeochemical model of the Baltic Sea is presented in Kõuts et al. (2021). The method can be implemented in a multivariate high-dimensional error space as well as in a univariate error space. In addition to the validation of numerical models, the method can be used for the assessment of remote-sensing data and models.
The distribution of clusters in the error space for a different number of predefined clusters,
In the case of three clusters, the largest share of errors belongs to the cluster
In the case of five clusters, the clusters
The distribution of the error clusters for
The distribution of the error clusters for
The GETM model version 2.5 and GOTM model version 4.1 used in the current study are stored in the Zenodo repository “Source code for the GETM and GOTM software” (
UR provided the concept and methodology and wrote the manuscript. IM prepared the software, ran the models and processing tools, and carried out visualization and data curation. UR and IM conducted formal analysis.
The contact author has declared that neither they nor their co-authors have any competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We would like to thank Rivo Uiboupin and Jüri Elken for valuable comments. We very much appreciate the initiative of Ragini Kihlman, School of Computer Science and Electronic Engineering, University of Essex, in performing the tests with different clustering algorithms. Urmas Raudsepp gives special thanks to Meelimari Aljasmäe for support during the preparation of the manuscript.
This research has been supported by the European Regional Development Fund within the National Programme for Addressing Socio-Economic Challenges through R&D (grant no. RITA1/02-52-04).
This paper was edited by Olivier Marti and reviewed by two anonymous referees.