A method for assessment of the general circulation model quality using K-means clustering algorithm

Abstract. The model's ability to reproduce the state of the simulated object or particular feature or phenomenon is always a subject of discussion. Multidimensional model quality assessment is usually customized to the specific focus of the study and often to a limited number of locations. In this paper, we propose a method that provides information on the accuracy of the model in general, while all dimensional information for posterior analysis of the specific tasks is retained. The main goal of the method is to perform clustering of the multivariate model errors. The clustering is done using the K-means algorithm of unsupervised machine learning. In addition, the potential application of the K-means clustering of model errors for learning and predicting is shown. The method is tested on the 40-year simulation results of the general circulation model of the Baltic Sea. The model results are evaluated with the measurement data of temperature and salinity from more than one million casts by forming a two-dimensional error space and performing a clustering procedure in it. The optimal number of clusters that consist of four clusters was determined using the Elbow cluster selection criteria and based on the analysis of the different number of error clusters. In this particular model, the error cluster of good quality of the model with a bias of 0.4 °C (std = 0.8 °C) for temperature and 0.6 g kg−1 (std = 0.7 g kg−1) for salinity made up 57 % of all comparison data pairs. The prediction of centroids from a limited number of randomly selected data showed that the obtained centroids gained a stability of at least 100 000 error pairs in the learning dataset.



Introduction
Ocean general circulation models are valuable tools for hindcasting and forecasting ocean state. The values of the simulated fields depend on the quality of the modeling products. Assessment of model quality is a basic step that is taken before the model results are used for evaluation of the ocean state or used for other specific purposes. For instance, product quality 25 assessment is routinely done for all products of the Model Forecast Centers within Copernicus Marine Environment Monitoring Service (CMEMS, 2016) and National Oceanic and Atmospheric Administration (NOAA, https://www.esrl.noaa.gov/fiqas/, https://sats.nws.noaa.gov/~verification/; https://www.ncdc.noaa.gov/sotc/global/202101). Ocean general circulation model output consists of a set of variables in space and time, i.e., 4-dimensional fields. The classical approach is that statistical metrics are calculated independently for each variable used for validation. In addition, the 30 dimensionality of the output is frequently reduced by either doing certain averaging and/or selecting one or two dimensional subsets. Common statistical metrics for a single prognostic variable (e.g., bias, root mean square difference, correlation coefficient, standard deviations) are used to assess the model skills (Murphy et al., 1989;Murphy, 1995;Wȩglarczyk, 1998;Jolliff et al., 2009;Dybowski et al., 2019). Taylor diagrams (Taylor, 2001) or target diagrams (Joliff et al., 2009) are usually implemented for compact visualisation of the model performance statistics. Stow et al. (2009) studied 149 papers based on 35 numerical modeling. They found that the majority (68%) of the model validation works were based on visual comparison and comparing simple statistics such as bias and variance, 9% of the works calculated the correlation coefficient and roughly 11% of the works implemented various cost-function techniques (e.g., Holt et al. 2005;Eilola et al., 2009). Even if all available data with sufficient spatio-temporal coverage is used for multivariate comparison, the end result is a single metric https://doi.org/10.5194/gmd-2021-68 Preprint. Discussion started: 30 April 2021 c Author(s) 2021. CC BY 4.0 License. or limited set of metrics that, indeed, characterize the model general quality. The shortcoming of this approach is that 4-40 dimensional information embedded in the huge dataset used for the validation will be lost.
Temperature and salinity are widely used state variables for the assessment of the accuracy of general circulation models.
Ideally, researchers like to know the model accuracy for the whole model domain and time period considered. The amount of observational data has increased tremendously over the past decades. Temperature and salinity are usually measured simultaneously and form a major share of the data in the databases. 45 We suggest a new method that takes advantage of a large set of all available data and belongs to the category of multivariate comparison. The method is not limited to the set of two variables. The only requirement is that all variables should be simultaneously measured. Preprocessing can be done to make data simultaneous, i.e., averaging over some space and time.
The method is based on the machine learning K-means clustering algorithm (Jain, 2010). he intuitive prerequisite for using any clustering approach is that the dataset should have a natural cluster structure (Jain, 50 2010). A prior knowledge about the model accuracy and distribution of model errors in space and time is usually missing. If there is a large number of data for comparison, then distribution of the model errors might not show visually identified clusters. If more than two variables are used for the model quality assessment, then visualisation of the errors for the identification of the clusters becomes more complicated.
In this study, we will show that implementing the K-means clustering algorithm for the analysis of model temperature and 55 salinity errors provides meaningful information about the model accuracy. Clustering procedure using the K-means algorithm includes quantitative metrics for general assessment of the model performance. Posterior analysis of error clusters is an essential part of the proposed method and enables us to understand model data misfit and to explain the errors in relation to the dynamic features of the natural water basin under consideration.
The proposed clustering methods can also be used in learning-predicting sequence. The latter is important in the operational 60 use of the model. The learning period consists of the model run for a certain period and error clustering. The learning period is for determining the number of clusters and the coordinates of the centroids. Based on the learning period error clustering, we can presume that a similar error distribution is valid for the forward model simulation results. During the predicting period, new available errors are added to the clusters. The coordinates of the centroids and other metrics are updated. The value of this process lies in the fact that exploitation of the model simulation results can start before new validation is 65 completed. In this study, we implement the learning-predicting sequence in the form of clustering stability tests.
We apply proposed K-means clustering methods for the assessment of model quality of the circulation model of the Baltic Sea. The model has been used for the analysis of long-term water circulation in the Gulf of Finland (Maljutenko and Raudsepp, 2019). Conventional model validation with station measurements of temperature and salinity is presented in Raudsepp (2014, 2019

The Baltic Sea
The Baltic Sea (Fig. 1a) is a wide non-tidal estuary-type marginal sea with a longitudinal salinity between 0 and 20 g kg -1 (Leppäranta and Myrberg, 2009;Omstedt et al. 2014). The longitudinal salinity gradient is maintained by saline water inflows from the North Sea through Danish straits and freshwater input by rivers. Large volumes of saline water are 75 transported to the Baltic Sea by the Major Baltic Inflows (MBI) that occur seldom (Mohrholz, 2018). The other smaller inflows occur almost every winter (Mohrholz, 2018;Raudsepp et al., 2018). Inflowing saline water spreads downstream into the Baltic Sea along the cascade of deep basins: the Bornholm Basin, Gdansk Basin and the Eastern Gotland Basin. The saline water of the Gotland basin is pushed into the western Gotland Basin and the Gulf of Finland. During the MBIs, dense inflown water spreads along the bottom while other large volume inflows renew the halocline layer of the Baltic Sea. The 80 permanent halocline in the Baltic Sea is at a depth of 60-80 m (Väli et al., 2013). The Gulf of Bothnia and the Gulf of Riga do not have a permanent halocline (Raudsepp, 2001). The Gulf of Finland has a very dynamic halocline due to intensive estuarine circulation (Maljutenko and Raudsepp, 2019), occasional stratification collapses due to reverse estuarine circulation 2003) and winter mixing. Seasonal thermocline at a depth range of 10-30 meters starts to develop in spring, reaches its maximum strength in summer and erodes in autumn. In the gulf-type regions of freshwater 85 influence, like the Gulf of Finland (Maljutenko and Raudsepp, 2019) and the Gulf of Riga (Soosaar et al., 2014), seasonal thermocline coincides with seasonal halocline in spring and summer. During maximum river runoff in spring, the river bulge affects the salinity distribution in the coastal sea (Soosaar et al., 2016;Maljutenko and Raudsepp, 2019).
Salinity fronts are formed in the straits that connect different sub-basins of the Baltic Sea: between Kattegat and southwestern Baltic Sea, the Gulf of Riga and Baltic Proper, the Gulf of Bothnia and Baltic Proper. The Danish straits and 90 Kattegat are situated in a region with a very dynamic and strong front that separates the brackish Baltic sea water and the saline North Sea water (Nielsen, 2005). The Baltic Sea water of low salinity is transported towards the North Sea in summer, but saline water of the North Sea inflows to the Baltic Sea in winter (Mohrholz, 2018). A dynamic front is present in the transition area between the northeastern Baltic Proper and the Gulf of Finland, although that is a wide and deep area.
The Baltic Sea is seasonally ice-covered. Inter-annually variable and dynamic ice coverage (Raudsepp et al., 2020) has 95 considerable effect on the evolution of the thermohaline fields in the Baltic Sea.

Model simulation
The General Estuarine Transport Model (GETM; Burchard and Bolding, 2002) is a numerical 3D circulation model initially developed for coastal and estuarine applications (Gräwe et al., 2013;Holtermann et al., 2014). The hindcast simulation of the general circulation of the Baltic Sea was carried out for the period of 1966-2006(Maljutenko and Raudsepp, 2019. 105 Model open boundary was located in Kattegat where sea level elevation, temperature and salinity are prescribed. The model horizontal resolution was set to one nautical mile, which was consistent with the horizontal resolution of the digital bathymetry of the Baltic Sea (Seifert and Kayser, 1995). Vertically, 40 bottom-following adaptive layers were used, which resulted in a vertical resolution of less than 5 m.
The initial conditions of salinity and temperature were compiled using observation data from the Baltic Environmental 110 Database (BED; http://nest.su.se/bed) (Gustafsson and Medina, 2011;Wulff et al., 2013). Atmospheric forcing was prepared from the BaltAn65+ reanalysis dataset (Luhamaa et al., 2011). The heat fluxes are parameterized using bulk formulation (Kondo, 1975). Monthly river runoff data from the 37 largest rivers from the E-HYPE hydrology model (Donnelly et al., 2016) were used. We have stored daily mean values of temperature and salinity and used them for the analysis.

Dataset 115
We use salinity and temperature measurements for the Baltic Sea from the EMODnet Chemistry database (SMHI, 2018).
From the original dataset, we have extracted 1 376 674 measurements, which met the following conditions: 1) time range of 1966 -2005; 2) spatial range of the model domain, excluding coastal observations, which fell outside the model grid; 3) S and T values exist simultaneously; 4) S is in the range of 0 ... 35 g kg -1 ; 5) T is in the range of -2.5 ... 30 °C.
A preliminary check of the spatial and temporal distribution of the validation data is done. The spatial density of the data is 120 presented on the 25 km 2 grid (Fig. 1a). Spatially, there are only a few horizontal cells of 25 km 2 that do not have any measurements. Vertically, the number of measurements decreases monotonically from the surface to the bottom following the hypsographic curve of the Baltic Sea (Jakobsson et al., 2019) (Fig. 1d). The measurements at the standard depth stick out from the overall curve. Since the end of the 1980s, the number of monthly measurements increased continuously more than an order of magnitude compared to the preceding period ( Fig. 1b). Seasonally, the number of winter and early spring 125 measurements is smaller than the number of summer measurements (Fig. 1c), which is consistent with seasonal ice coverage of the Baltic Sea (Raudsepp et al., 2020). This complicates data collection.

K-means clustering
The K-means clustering algorithm is a widely used algorithm in unsupervised machine learning (Jain, 2010). We use a Kmeans clustering algorithm for the cluster analysis of temperature and salinity errors. In the current study, two dimensional 130 error space is defined from simultaneous salinity and temperature errors {dS,dT}∈R 2 , where dS≡(Smod -Sobs ) and dT≡(Tmod -Tobs). In general, the method can be extended to the n-dimensional error space. The distribution of the errors in the {dS,dT}∈R 2 error space is presented in Fig. 2a. Before calculating K-means, the error space has been normalized by the standard deviation of temperature and salinity errors.
The first step of the method is to determine the number of clusters. To maintain a deterministic structure of the cluster, a 135 regular pattern of initial centroids was chosen for this study (Fig. 2b). When we start with only one cluster, we can choose its location at {dS=-1,dT=-1}. Using two clusters means that we start with the locations corresponding to 1 and 2 marked on Fig. 2b. With the increasing number of clusters, we use corresponding initial locations of the clusters marked with numbers 1, 2, 3, etc. Any other more advanced methods for the selection of initial centroids (Celebi et al., 2013) could be implemented just as well. The squared Euclidean distance was used as the measure of the distance between data points and 140 the centroid coordinates of the cluster. The number of iterations was limited to 100, which ensured the convergence of the clustering algorithm. The disadvantage of the K-means clustering algorithm is the lack of a unique way for defining the optimal number of clusters. For the final selection of the number of clusters, we used the Elbow method (e.g., Bholowalia and Kumar, 2014;Yuan and Yang, 2009)  In general, the errors retain their 4-dimensional structure, i.e., {dS,dT} (t,x,y,z), while assigned to specific clusters. Any kind of analysis can be done using the clustered errors.

Normalization 155
Each error pair belongs to a fixed cluster k but retains their 4-dimensional structure, i.e. {dS,dT} k (t,x,y,z). For the visualization of the model accuracy, some reduction of dimensionality of the error pairs is needed.
For the spatial distribution of errors, we take the error pairs as independent of time and vertical coordinate, i.e., {dS,dT} k (x,y). For each horizontal grid cell (i,j) of 25 km 2 , we have a number of points (error pairs) , that belong to cluster k. The total number of points that belong to the grid cell is , = ∑ , , where K is the number of clusters. For the 160 normalization, we divide each , with , and plot the horizontal maps for each k. There is no need to do normalization when we look at time series in a fixed spatial location or plot the Hovmöller diagram of the error clusters. 170 3 Results

Clustering procedure
We start by clustering bulk data covering the entire modeling period and domain. Error representation does not provide a simple idea on how many clusters should be predefined or how the clusters will form. The initial location of the centroids is selected according to the scheme shown on Fig. 2b. The coordinates of the centroid of one cluster (Fig. 3a) provide a model 175 bias of 0.64 °C for temperature and 0.26 g kg -1 for salinity (Table 1). Corresponding standard deviations were 1.5 °C and 2.0 g kg -1 , respectively. The root-mean square difference was 1.67 °C for temperature and 2.04 g kg -1 for salinity. The corresponding linear correlation coefficients were 0.97 and 0.95, respectively.
Increasing the number of clusters results in splitting of the error space into clusters with centroids close to the zero point ( Fig. 3). A representative structure of distribution of the errors emerges in the case of four clusters (Fig. 3d). We can confirm 180 the choice of four clusters by implementing cluster selection criteria. The distance between points and designated centroids reduces exponentially with the increasing number of clusters (Fig, 4). The rate of distance reduction with increasing number of clusters shows local minima at K=4.
The K=4 clustering distributes 1 376 674 error data pairs into the following four clusters, each with N(k) = {263230, 196615, 134326, 782503} datapoints. Cluster k=1 characterizes the set of errors with the basic feature of "underestimated salinity" 185 (Table 1). This cluster is present already in the case of three clusters (Fig. 3c). Increasing the number clusters splits this cluster into two clusters (e.g., for K=9, it splits into clusters k=1,5). Cluster k=2 envelops the errors of "overestimated salinity". This cluster changes into cluster k=4 (K=5), then splits into two clusters (K=8) and three clusters (K=9). Cluster k=3 of "overestimated temperature" is established already in the case of three clusters. Increasing the total number of clusters does not result in a split of the cluster. However, the centroid shifts towards high temperature bias (Table 1). The cluster k=4 190 represents "good match" of the model and measurements. The bias is about 0.4 °C for temperature and 0.6 g kg -1 for salinity (Table 1) Table 1. The biases are marked with the center of the ellipsoid and the standard deviations with the major semi-axes. The error space has been zoomed in for better visualization of the clusters. The full range of error space and distribution of the clusters is shown in Fig. A1 in Appendix A.
https://doi.org/10.5194/gmd-2021-68 Preprint. Discussion started: 30 April 2021 c Author(s) 2021. CC BY 4.0 License. Table 1. The coordinates of the centroids and the standard deviations of salinity and temperature errors within the clusters for a 200 different set of predefined clusters, K=1-9. The numbers of the clusters and the colors in column k correspond to the numbers and colors of the clusters in Fig. 3. The brighter background colors of MEAN and STD columns correspond to parental and descendant clusters of the K=4 cluster distribution.

Analysis of the clusters
Retrieving spatial coverage of K=4 cluster errors shows that the model has "good match" in the whole model domain (Fig.  210   5b). The share of the other errors remains less than 0.3. The model "overestimates salinity", "underestimates salinity" and has "good match" at the Danish straits. "Underestimated salinity" errors have a share of about 0.2 in the deep basins of the Baltic proper, i.e., the Bornholm Basin, Gdansk Basin, eastern Gotland Basin, northern Baltic Proper, western Gotland Basin and western Gulf of Finland. Model "overestimates temperature" at the transition area between the northeastern Baltic proper and the Gulf of Finland, in some coastal locations and within the river plumes. The latter indicates that river water 215 temperature is overestimated in the present model implementation.
Vertical distribution of the error clusters confirms that the share of "good match" errors ranges between 0.5 and 0.9 of all data (Fig. 5e). In the surface layer, in almost 50% of cases we have "overestimated salinity" and "underestimate salinity". In comparison with horizontal distribution of errors, a large part of these errors probably belong to the Danish straits (Fig. 5b).
The "overestimated temperature" has a considerable share centered at a depth of 25 meters. The "underestimated salinity" 220 has a high share at the depth range of 60-100 m. The share of "underestimated salinity" once again increases in the deep layer of the Baltic Sea.

230
A decrease in time of a "good match" coincides with an increase of the share of "underestimated salinity" and "overestimated salinity" (Fig. 5c). Seasonally "overestimated salinity" has a higher share in summer, while "underestimated salinity" has a higher share in winter (Fig. 5d). Combining horizontal (Fig. 5b) and seasonal distribution of errors (Fig. 5d) we could conclude that the salinity is overestimated in the Danish straits in summer and underestimated in winter. In addition, we would like to note that the share of "good match" decreases and "underestimated salinity" increases abruptly at 235 the end of the 1980s, when the number of the measurements becomes larger in the database. The "overestimate temperature" has an almost constant share of 0.1 in time (Fig. 5c). The elevated share of "overestimated temperature" errors in summer confirms that the model overestimates the temperature in the seasonal thermocline (Fig. 5d). For comparison, we have provided a similar analysis of the errors for K=3 and K=5 in the Appendix B. We extract error profiles from Gotland Deep station BY15, which is widely used for the validation of the physical and 245 biogeochemical models of the Baltic Sea. In the upper layer of 60 m, the model has "good match" (Fig. 6a,b). There are isolated occasions of 10% in total, when the model "overestimates temperature" in the seasonal thermocline (Fig. 6b). At the https://doi.org/10.5194/gmd-2021-68 Preprint. Discussion started: 30 April 2021 c Author(s) 2021. CC BY 4.0 License. depth range 60-100 m, the share of model "underestimating salinity" increases. From a depth of 100 m, the proportion of the model that "underestimates salinity" gradually increases with depth. The Howmüller diagram shows that there are extended time periods when the model "underestimates salinity" (Fig. 6a). In the surface layer, the model has "good match", although 250 model salinity starts to deviate from the measurements starting from 1995 (Fig. 6c,e) . At the bottom, the model reproduces temperature very well at the end of 1970s and beginning of 1980s, but as salinity is underestimated, the errors belong to the cluster of "underestimated salinity" (Fig. 6d,f). In general, the model has "good match" in the water column from 1991 to 2003 (Fig 6a,f). Dynamically, this corresponds to the end of the stagnation period and recovery of the bottom salinity and strengthening of the permanent halocline. 255 Figure 7. Learning (a-c) and predicting (d-f) of the K=4 clusters. The learning and predicting datasets have a share of 1% (a) and 99% (c), 20% (b) and 80% (e), 99% (c) and 1% (f) of the full dataset, respectively. Blue crosses mark the location of initial centroids and blue lines connect initial and final locations (marked with numbered diamonds) of the centroids.

Learning of the clusters 260
As the first step, the whole 4-dimensional {dS,dT} dataset is divided randomly into two separate sets for learning and predicting. The dataset for the learning of the error clusters is initiated from a set of a different number of clusters according to initial distribution of the centroids shown on Fig. 2b. Resulting centroids of the learning dataset are then used to initiate https://doi.org/10.5194/gmd-2021-68 Preprint. Discussion started: 30 April 2021 c Author(s) 2021. CC BY 4.0 License. the centroids for the clustering of the predicting dataset. The mean length of shifts between learning and predicting centroids is used to evaluate the effect of dataset size on predicting the representative error clusters. We have used different learning 265 and predicting datasets with sizes ranging from a share of 10 -4 to 0.9999 of the total dataset of 1 376 674 error pairs. The average distances are calculated from 30 trials to have a statistical ensemble of randomly selected datasets. The learning and predicting procedure is illustrated in Fig. 7 for K=4.
If the learning dataset makes up 10-95 % of the total dataset (>100 000 comparison points), then the difference between the learned and predicted centroids does not change significantly (Fig. 8). The clustering of K=4 is most sensitive to the choice 270 of initial centroids. Therefore, the distance between learned and predicted centroids is larger compared to other choices of K.
Below 1% of the learning data size (<10 000 comparison points), the difference in distance between learned and predicted datasets is >0.03 normalized standard deviation. Thus, the size of the learning dataset is significant for predicting the error clusters. The rough estimate of the number of comparison points is about 100 000 for the current model, which ensures relatively stable centroids. 275

Interpretation of the clusters 280
Total number and the spatio-temporal coverage of the comparison points (Fig. 1) indicates that the model performs well over the Baltic Sea and the simulation period considered (Fig. 5). The share of model errors with a bias of {dS,dT}={0.44 g kg -1 ,0.57 °C} and with a standard deviation of {dS,dT}={0.69 g kg -1 ,0.81 °C} (Table 1) is between 0.5 and 0.9.
In addition, we can highlight the areas where the model accuracy is lower and the dynamical features are not so well reproduced by the model. Essentially, seasonal thermocline and permanent halocline are not reproduced by the model as well 285 as the layers with small vertical gradients of salinity and temperature. The model accuracy in reproducing seasonal thermocline has a peak share of "overestimated temperature" of 0.25 (bias of 3.78 °C and standard deviation of 1.73 °C) at a 25 m depth. The error share of 0.25 is observed in the layer of 60-90 meters, which corresponds to the depth range of the permanent halocline. The model "underestimates salinity" (bias of -1.96 g kg -1 and standard deviation of 1.63 g kg -1 ) there.
The model accuracy is relatively low in the Danish straits. The model has "underestimated salinity" in winter and 290 "overestimated salinity" in summer (bias of 3.44 g kg -1 and standard deviation of 1.59 g kg -1 ) there. The "underestimated salinity" errors in the deep basins of the Baltic Sea (Fig. 5b) are caused by the spreading of inflowing North Sea water downstream in the cascade of the deep basins. These inflows mainly take place in winter, while outflow of the Baltic Sea water dominates in summer.
Clustering of model errors could provide information about the accuracy of external fields that are used for the forcing and 295 for the boundary conditions of the model. The "overestimated temperature" at the river plume areas (Fig. 5b) may indicate a mismatch of river water temperature that takes the value from a grid cell adjacent to the river mouth. Although the air-sea fluxes are correctly reproduced by the model, as indicated by "good match" at the surface (Fig. 5c), the following downward flux of heat could be too strong, as the share of "overestimated temperature" is relatively high between the depth of 10-40 meters in summer (Fig 5c,d). 300

Summary
Ideally, researchers like to know the model accuracy over the whole model domain and time period simulated. Commonly used methods provide a limited set of metrics (e.g., bias, standard deviation, root mean square error, correlation coefficient) for the assessment of the model overall quality. In this study, we have proposed a new method for the assessment of the model skills. The aim of using the method is the clustering of multivariate model errors. Model errors consist of differences 305 between the model values and the measured multivariate data. The main advantage of this method is the possibility to use clustered errors for the analysis of the spatio-temporal accuracy of the model.
The method was tested in the validation of the circulation model results of the 40-year period in the Baltic Sea. Temperature and salinity were used for the validation, because they are essential parameters of the physical model and these data have been the most extensively measured in the Baltic Sea. This method enables us to use all available observations, with the only 310 restriction being that multivariate data has to be measured simultaneously. In model validation, the problem usually lies in https://doi.org/10.5194/gmd-2021-68 Preprint. Discussion started: 30 April 2021 c Author(s) 2021. CC BY 4.0 License. the spatio-temporal distribution of measurement data over the 4-dimensional model domain. In our case, the measurement data was sufficient and with good spatial and temporal coverage. In total, we had more than 1 300 000 pairs of measured temperature and salinity values. In many cases, reduction of available data or homogenization of the data is needed prior to the calculation of model errors, and clustering is applied to have simultaneous multivariate data. The number of 315 measurements should be sufficiently large to determine stable clusters. In our case, about 100 000 randomly selected data pairs ensured the stability of the clusters.
We have applied the K-means unsupervised machine learning algorithm for the assessment of the quality of general circulation models by clustering the temperature and salinity errors. The model output fields are 4-dimensional, and the 4dimensional distribution of the errors was retained after the clustering was completed. As a result, cluster numbers were 320 assigned to each error pair. In addition, the errors belonging to one cluster had their bias determined by the location of the centroid in the error space. Further on, common statistical metrics (e.g., standard deviation, root mean square error, correlation coefficient) can be calculated for each cluster and variable. In general, any other partitional clustering algorithm can be used instead of K-means for the clustering of multivariate model errors. We have implemented the K-means algorithm because of its simplicity and robustness. The outcome clusters have direct information of the model bias. The 325 output clusters can be used for calculation of classical statistical metrics. Resulting clusters contain information about common statistical metrics.
The K-means clustering algorithm has a well-known deficiency. There is no unique way to determine the number of clusters.
We used Elbow methods, which gave good results. The selection of four clusters was supported by the analysis of the error clusters in relation to the geographical distribution of the errors, the physical process and the features. The analysis showed 330 that the "underestimated salinity" cluster was mainly in the Danish straits, within the halocline layer and along the pathway of transport of saline water in the Baltic Sea. "Overestimated temperature" had a high share in the seasonal thermocline.
"Overestimated salinity" accounted for the model errors in the Danish straits. For confidence, the analysis was complemented with using three and five clusters. Thus, the analysis of the error clusters enables to shed light on the physical processes and features where model performance should be improved. 335 The clustering was done for the entire Baltic Sea and the whole simulation period. Analysis of clusters of errors at specific locations enables us to assess the model quality there in the context of the overall quality of the model. Multivariate model quality assessment shows that if one parameter is well reproduced by the model, but the other parameter is poorly reproduced at the same time, then the quality might not be good and vice versa. In addition to model quality, error clustering can provide implicit information about the quality of prescribed input variables 340 and forcing fields. Error clustering has shown that the temperature of river runoff water could be overestimated. This is especially relevant for the biogeochemical models, where discharges of different nutrients and other state variables, which have to be prescribed, are usually poorly known. There are problems in the prescribed salinity of the inflowing North Sea The proposed method could be applied for the assessment of the quality of global ocean general circulation models. By the end of the year 2020, there were approximately 3800 ARGO floats profiling the world ocean for salinity and temperature, with a spatial resolution of approximately 1 float for every 3 degrees of latitude and longitude. The annual total number of profiles added to the database is over 100 000, which takes the total available number of profiles to over 2 000 000 (Argo, 350 2020). This huge validation data set probably needs some computational solution, i.e., implementation of parallel computing or specific methods on how to deal with big data within the K-means clustering. In the context of operational oceanographic models, the model validation can be done in "real time" by implementing the learning-predicting sequence. The ARGO data, which are available within 24 hours of collection, could be added to the learned clusters for the updating of the coordinates of the centroids and statistical metrics. 355 The proposed method can be applied to different geoscientific models. The shortlist consists of biogeochemical models, atmospheric models, wave models, hydrological models, geodynamic models. The method can be implemented in a multivariate high-dimensional error space as well as in a univariate error space. In addition to the validation of numerical models, the method can be used for the assessment of remote sensing data and models.

Appendix B 365
In the case of three clusters, the largest share of errors belongs to the cluster k=2 with a bias of {dS,dT}={1.3 g kg -1 ,0.66 °C} and with a standard deviation of {dS,dT}={1.52 g kg -1 ,0.85 °C} (Fig. B1). This cluster provides the main contribution to the clusters of "good match" and "overestimated salinity" when a larger number of clusters is used. The share of the errors of this cluster is between 0.6 and 0.9. Cluster k=1 with a bias of {dS,dT}={-1.35 g kg -1 ,-0.51 °C} and with a standard deviation of {dS,dT}={1.57 g kg -1 ,0.99°C} is the cluster of "underestimated salinity", which retains these features throughout the 370 increasing of the number of clusters. Spatially, "underestimated salinity" has a significant share in the Danish straits and on the pathway of inflowing saline water through the deep basins of the Baltic Sea. Vertically, these errors have a large share of 0.5 in the layer of 60-110 m, which corresponds to the permanent halocline of the Baltic Sea, and below 200 m, which is the bottom layer of the Gotland Deep. The share of "underestimated salinity" is relatively high in the whole water column below the halocline. Seasonally, these errors are significant in winter, when saline water inflows through the Danish straits to the 375 Baltic Sea occur. Cluster 1 with a bias of {dS,dT}={-1.03 g kg -1 ,3.54 °C} and with a standard deviation of {dS,dT}={1.97 g kg -1 ,0.73 °C} has a steady share of errors of 0.1. The errors of "overestimated temperature" are significant in the depth range of 10-50 m and during summer. These errors account for the model accuracy in reproducing seasonal thermocline.
In the case of 5 clusters, the clusters k=2 with a bias of {dS,dT}={0.42 g kg -1 ,1.54 °C} and with a standard deviation of {dS,dT}={0.95 g kg -1 ,0.66 °C} and k=5 with a bias of {dS,dT}={0.3 g kg -1 ,-0.22 °C} and with a standard deviation of 380 {dS,dT}={0.72 g kg -1 ,0.77 °C} dominate over the others (Fig. B2). These clusters are formed as a split of the "good match" cluster with partial contribution from the "underestimated salinity" cluster and the "overestimated salinity" cluster of K=4.
https://doi.org/10.5194/gmd-2021-68 Preprint. Discussion started: 30 April 2021 c Author(s) 2021. CC BY 4.0 License. 390 salinity and temperature error pairs (model minus observation) in the 2-dimensional error space (a). Error bins have a resolution of 1 °C for temperature and 1 g kg -1 for salinity (a). The spatial (b), vertical (c), temporal (d) and seasonal (e) distribution of the share of error points belonging to the four different clusters (b). The share is calculated as explained in Section 2.4. The horizontal bins have a resolution of 25x25 km (b), vertical bins have a resolution of 5 m (c), temporal and seasonal bins have monthly resolution (d,e). The lines (d) have been smoothed using a running mean with a 12-point window size. Line colors correspond to 395 the colors of the clusters on (a). https://doi.org/10.5194/gmd-2021-68 Preprint. Discussion started: 30 April 2021 c Author(s) 2021. CC BY 4.0 License. Figure B2. The distribution of the error clusters for K=5 (a). The colormap shows logarithmic distribution of the number of salinity and temperature error pairs (model minus observation) in the 2-dimensional error space (a). Error bins have a resolution of 1 °C for temperature and 1 g kg -1 for salinity (a). The spatial (b), vertical (c), temporal (d) and seasonal (e) distribution of the 400 share of error points belonging to the four different clusters (b). The share is calculated as explained in Section 2.4. The horizontal bins have a resolution of 25x25 km (b), vertical bins have a resolution of 5 m (c), temporal and seasonal bins have monthly resolution (d,e). The lines (d) have been smoothed using a running mean with a 12-point window size. Line colors correspond to the colors of the clusters on (a).