CLASSIC v1.0: the open-source community successor to the Canadian Land Surface Scheme (CLASS) and the Canadian Terrestrial Ecosystem Model (CTEM) - Part 1: Model framework and site-level performance

Recent reports by the Global Carbon Project highlight large uncertainties around land surface processes such as land use change, the strength of CO2 fertilization, nutrient limitation and supply, and the response to variability in climate. Process-based land surface models are well-suited to address these complex and emerging global change problems, but will require extensive development and evaluation. The coupled Canadian Land Surface Scheme and Canadian Terrestrial Ecosystem Model (CLASS-CTEM) framework has been under continuous development by Environment and Climate Change Canada since 1987. As the open-source model of code development has revolutionized the software industry, scientific software is experiencing a similar evolution. Given the scale of the challenge facing land surface modellers, and the benefits of open-source, or community model, development, we have transitioned CLASS-CTEM from an internally developed model to an open-source community model, which we call the Canadian Land Surface Scheme including Biogeochemical Cycles (CLASSIC) v. 1.0. CLASSIC contains many technical features specifically designed to encourage community use, including software containerization for serial and parallel simulations, extensive benchmarking software and data (Automated Model Benchmarking; AMBER), self-documenting code, and community standard formats for model inputs and outputs, amongst others. Here we evaluate and benchmark CLASSIC against 31 FLUXNET sites where the model has been tailored to the site-level conditions and driven with observed meteorology. Future versions of CLASSIC

in from an external file, 7) containerization, and 8) extensive benchmarking of model state (via checksums) and performance.
Each of these features will be expanded upon in the following sections.
In Section 2 we describe the submodels, CLASS and CTEM, that form the scientific basis of CLASSIC, and the initial state of the model framework at the start of CLASSIC development. Section 3 details the model developments implemented, along with tools to help existing users migrate to the new model framework. Section 4 outlines our model evaluation and benchmarking approach, while Section 5 presents the present state of CLASSIC as evaluated by the AMBER protocol. Section 6 describes future technical and scientific directions for CLASSIC.

Model physics: CLASS
The Canadian Land Surface Scheme (CLASS) was initiated in 1987 to produce a 'second generation' land surface scheme, characterized by more soil moisture and thermal layers and a separate treatment of the vegetation canopy (Verseghy, 2000), for inclusion in the Canadian general circulation model (GCM), and has been under continual development since. The first publications introducing CLASS described the physics calculations for the movement of heat and water through the soil and snow layers (Verseghy, 1991), and the physics algorithms for energy and moisture fluxes within the vegetation canopy as radiation and precipitation cascade through it, with an explicit thermal separation of the vegetation from the underlying ground (Verseghy et al., 1993). Development of CLASS has been predominantly within Environment Canada, a federal department of the Government of Canada (later renamed Environment and Climate Change Canada; ECCC), with the exception of an organized community effort as part of the Canadian Climate Research Network (1994-1997; Verseghy, 2000) as well as ad-hoc collaborations with the broader research community. CLASS is presently at version 3.6.2 (Verseghy, 2017). While CLASSIC includes a biogeochemical component (CTEM; described below), it is possible to turn it off and run CLASSIC with specified structural vegetation attributes such as tree height, plant area index, etc., which is desirable in some situations. As a land surface scheme, CLASS simulates the fluxes of energy, momentum, and water between the atmosphere and land surface (Figure 1). CLASS operates on a sub-daily time step that varies depending on its application. Within the framework of the most recent version of the Canadian Earth System Model (CanESM5; Swart et al., 2019), it operates at a time step of 15 minutes, consistent with the atmospheric component. A time step of 30 minutes is used for offline, uncoupled simulations where it is driven with observed meteorological data.
Differing measurement heights for the observed temperature and wind speed are accounted for. Since forcing data typically do not separate precipitation into rainfall and snowfall components, three options are available in the CLASSIC modelling framework: 1) a strict threshold of 0 °C, below which all precipitation is considered snowfall; 2) a linear, gradual transition from rainfall to snowfall as temperature decreases from 2 °C to 0 °C; and 3) a non-linear transition from rainfall to snowfall as temperature decreases from 6 °C to 0 °C. Here, we use option 1.
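The three rain/snow partitioning options can be sketched as follows. This is an illustrative Python sketch, not CLASSIC's Fortran implementation; in particular, the exact curve of the non-linear option 3 is not given in the text, so a smoothstep placeholder is used there.

```python
def partition_precip(total_precip, air_temp_c, option=1):
    """Split total precipitation into (rainfall, snowfall).

    Sketch of the three options described in the text; CLASSIC's
    actual implementation may differ in detail.
    """
    if option == 1:
        # Strict threshold: all snow at or below 0 degC, all rain above.
        frac_rain = 1.0 if air_temp_c > 0.0 else 0.0
    elif option == 2:
        # Linear ramp from all snow at 0 degC to all rain at 2 degC.
        frac_rain = min(max(air_temp_c / 2.0, 0.0), 1.0)
    elif option == 3:
        # Non-linear transition between 0 degC (all snow) and 6 degC
        # (all rain); smoothstep used here as a placeholder curve.
        t = min(max(air_temp_c / 6.0, 0.0), 1.0)
        frac_rain = t * t * (3.0 - 2.0 * t)
    else:
        raise ValueError("unknown precipitation partitioning option")
    rain = total_precip * frac_rain
    return rain, total_precip - rain
```

For example, with option 2 a temperature of 1 °C splits the precipitation evenly between rain and snow.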
Vegetation characteristics are described by rooting depth, canopy mass, leaf area index (LAI), and vegetation height, which are then used in calculations of the transfers of energy, water, and momentum with the atmosphere. The physical land surface processes in CLASS are modelled using plant functional types (PFTs). Presently, global vegetation in CLASS is represented using four PFTs: needleleaf trees, broadleaf trees, crops, and grasses. The structural vegetation characteristics of these PFTs are modelled as a function of driving meteorological data and atmospheric CO2 concentration by CTEM, which handles the biogeochemical processes, as explained in the next section. The number of ground layers in CLASS can vary depending upon application. The standard offline model setup currently uses twenty ground layers, starting with 10 layers of 0.1 m thickness and gradually increasing to a 30 m thick layer, for a total ground depth of over 61 m, while three ground layers with thicknesses of 0.1, 0.25, and 3.75 m are used in CanESM5 (the standard model setup is provided for each of the 31 FLUXNET sites; please see Appendix A). Water fluxes are calculated for soil layers within the permeable soil depth of the ground column, but not for the underlying bedrock layers, whereas temperatures are calculated for both soil and bedrock layers (depicted in Fig. 1). Both water fluxes and ground temperatures are calculated each time step. The permeable soil depth varies geographically and is read in from the model initialization file. Also calculated each time step are the temperature, mass, albedo, and density of the single-layer snow pack (where it exists); the temperature of the vegetation canopy and its interception and storage of rain and snow; and the temperature and depth of ponded water on the soil surface. Each grid cell is independent, with no lateral transfers of heat or moisture between cells.
While runoff can be routed to estimate streamflow, such as in CanESM5 using the approach of Arora et al. (1999) or the MEC-Surface & Hydrology System (MESH) framework (Pietroniro et al., 2007), in both cases, once runoff leaves a grid cell, it does not modify the soil moisture of downstream grid cells. Application and evaluation of CLASS over the course of its development has been extensive, with dozens of publications using the model at point, regional, and global scales in both coupled and offline modes. CLASS has been applied in an offline context, i.e. forced with observed meteorology (e.g., Bailey et al., 2000; Kothavala et al., 2005; Roy et al., 2013; Bartlett et al., 2006; Brown et al., 2006; Wu et al., 2016; Verseghy and MacKay, 2017; Melton et al., 2019d), as the physical land surface component of regional climate models, e.g. CRCM (Ganji et al., 2015; Paquin and Sushama, 2014) and CanRCM (Scinocca et al., 2016), and integrated into each version of the Canadian Atmospheric Model (CanAM; von Salzen et al., 2013), Coupled Global Climate Model (CanCM), and Earth System Model (CanESM; Arora et al., 2011; Swart et al., 2019) since the early 1990s.

Model biogeochemistry: CTEM
Development of the Canadian Terrestrial Ecosystem Model (CTEM) began in the early 2000s at Environment Canada in response to the need for a land surface carbon cycle component for the CanESM. CTEM, which is presently at version 2.0, couples with CLASS through the exchange of information describing the state of the physical land surface and overlying vegetation. CLASS provides CTEM with physical land surface information including soil moisture, soil temperature, and net radiation. CTEM uses this information to simulate photosynthesis and prognostically calculate carbon in its three live vegetation components (leaves, stem, and roots) and two dead carbon pools (litter and soil). The carbon amounts in these five pools evolve prognostically, in the default CLASSIC configuration, for nine PFTs that map directly onto the four PFTs used by CLASS. Needleleaf trees are divided into deciduous and evergreen types; broadleaf trees are divided into cold deciduous, drought deciduous, and evergreen types; and crops and grasses are divided by their photosynthetic pathways into C3 and C4 versions. The finer distinction among the CTEM PFTs is required for modelling biogeochemical processes. For instance, simulating leaf phenology prognostically requires distinguishing between the deciduous and evergreen versions of broadleaf trees. However, once the LAI has been dynamically determined by CTEM, CLASS only needs to know that the PFT is a broadleaf tree, since the physics calculations do not require information about the underlying deciduous or evergreen nature of the leaves. The prognostic carbon masses of leaf, stem, and root simulated by CTEM are used to calculate the structural vegetation characteristics required by CLASS: rooting depth (using a dynamic root distribution; Arora and Boer, 2003), canopy mass, LAI, and vegetation height (Arora and Boer, 2005a).
In addition to these structural vegetation attributes, CTEM provides canopy conductance values to CLASS based on photosynthesis calculations performed at the CLASS time step, as explained in Arora and Boer (2003). The remainder of the biogeochemical processes simulated by CTEM (Figure 1) and the resulting vegetation dynamics operate on a daily time step. CTEM models all primary terrestrial ecosystem processes, including maintenance and growth respiration (Arora and Boer, 2005a); heterotrophic respiration (Melton et al., 2015); tissue turnover, allocation of carbon, and phenology (Arora and Boer, 2005a); disturbance (fire; Arora and Boer, 2005b; Arora and Melton, 2018); competition for space between PFTs (Arora and Boer, 2006; Melton and Arora, 2016); and land use change (Arora and Boer, 2010).
CTEM also dynamically calculates wetland extent, methane emissions from wetlands and fires, and methane uptake by soils (Curry, 2007), as described in Arora et al. (2018). To determine the wetland extent, as the liquid soil moisture in the top soil layer increases above latitudinally dependent threshold values, the wetland fraction increases linearly up to a maximum value equal to the flat fraction of the grid cell, i.e. the fraction with slopes less than 0.2%. Methane emissions from wetlands are calculated by scaling the heterotrophic respiration flux from the model's litter and soil carbon pools. A separate submodule simulates peatland-specific processes following Wu et al. (2016).
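The linear ramp of wetland fraction with soil moisture can be sketched as follows. This is a minimal illustration of the described behaviour; the parameter names are assumptions, and the actual latitudinally dependent thresholds live in CLASSIC's parameter files.

```python
def wetland_fraction(soil_moisture, theta_min, theta_max, flat_fraction):
    """Wetland fraction of a grid cell as a linear ramp.

    Rises from zero as top-layer liquid soil moisture exceeds a
    (latitudinally dependent) threshold theta_min, reaching its cap at
    theta_max; capped at the 'flat' fraction of the grid cell (the
    fraction with slopes < 0.2%). Illustrative sketch only.
    """
    if soil_moisture <= theta_min:
        return 0.0
    ramp = (soil_moisture - theta_min) / (theta_max - theta_min)
    return min(ramp, 1.0) * flat_fraction
```

For example, with thresholds 0.2 and 0.4 (volumetric) and a flat fraction of 0.3, a soil moisture of 0.3 yields a wetland fraction of 0.15.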
Both CLASS and CTEM can be run in a 'mosaic' configuration in which grid cells are divided into separate and independent tiles, each simulating its own energy and water balance. There is no lateral transfer of energy or water between tiles within a grid cell. As a result, the soil moisture in the permeable soil layers and the temperature of soil and bedrock layers in each tile evolve independently of each other, despite being driven with the same meteorological data. A grid cell may be divided into tiles using any criteria, e.g. each PFT on its own tile, different soil textures, or peatlands vs.
uplands. In contrast, with the 'composite' or single-tile approach, structural vegetation attributes (LAI, rooting depth, vegetation height, canopy mass, and albedo) are averaged over all PFTs for use in the energy and water balance calculations. Soil texture for permeable soil layers is also common to all PFTs within a grid cell, resulting in a single liquid and frozen soil moisture for each permeable soil layer, and a single temperature for each soil and bedrock layer of the grid cell. The effect of tiling on the basis of the fractional coverage of PFTs has been evaluated for estimation of the terrestrial carbon sink (Melton and Arora, 2014) and for competition between PFTs (Shrestha et al., 2016), and the use of soil texture clusters as a tiling criterion has been evaluated by Melton et al. (2017).
Terrestrial biogeochemical processes simulated by CTEM have been evaluated at scales from site-level to global (e.g. Peng et al., 2014; Melton and Arora, 2014; Melton et al., 2015; Melton and Arora, 2016). The fire subroutine has been extensively evaluated as part of the Fire Model Intercomparison Project (FireMIP; Hantson et al., 2016; Forkel et al., 2019), in addition to being used to estimate the carbon cycle implications of the reduction in global wildfire since the 1930s. The parameterization of competition between PFTs has been evaluated at the site level (Shrestha et al., 2016) as well as at the global scale. CLASS-CTEM has contributed to the Global Carbon Project's methane assessment by providing wetland methane emissions (Poulter et al., 2017). An assessment of natural methane emissions simulated by CTEM is presented in Arora et al. (2018), who use a one-box model of atmospheric methane together with prescribed anthropogenic and geological sources to reproduce atmospheric methane concentrations consistent with the observational record over the historical period. CLASS-CTEM has also contributed regularly to the Global Carbon Project's annual carbon budget analyses since 2016 (Le Quéré et al., 2016, 2018b). At the start of our project to transform CLASS-CTEM into CLASSIC, the model code base was not well-suited to modern computing and model development practices. CLASS-CTEM was written following the Fortran 77 standard; documentation was provided in stand-alone documents that often significantly lagged model development or scientific papers; code management did not use modern version control systems; model parameters such as the number of soil layers and PFTs were hard-coded into subroutines; and the offline framework (as opposed to the version coupled into the family of CanESM models) used fixed-format ASCII text files for model inputs and outputs. While these issues are not unique to CLASS-CTEM, they made model development and evaluation challenging.
For example, while ASCII inputs are reasonable for site-level simulations, they quickly become difficult to prepare and manage over large modelling domains, which entail handling thousands of ASCII files that are poorly suited to parallel computation. Additionally, as computer hardware or compiler versions change, model results become susceptible to reproducibility issues arising from the changing computational environment. While CLASS-CTEM has been used extensively by the Canadian research community, as evidenced by the publications cited in the previous section, the modelling framework was poorly designed for new users, who faced a steep learning curve in attempting to run and understand the model.
In our modernization and improvement of the CLASS-CTEM code base to develop the CLASSIC model, we have addressed these concerns and made several important changes to the model code, which we describe below.

Containerization
A virtual machine (VM) emulates a physical computer, allowing one operating system to run on hardware hosting another, e.g., the Linux operating system on a Windows machine. A software container is a form of operating-system virtualization: essentially a pared-down version of a virtual machine containing only the software required to complete the intended task.
Both options have been used to facilitate the running of complex models. The Weather Research and Forecasting model (WRF; Hacker et al., 2016) has been implemented in both VM and container configurations, while the United Kingdom Chemistry and Aerosols (UKCA) composition-climate model has been implemented within the Met Office VM (Abraham et al., 2018).
In choosing between a VM and a container to provide to users of CLASSIC, we chose a container due to its light computational overhead and its design geared to high performance computing (HPC) applications (Arango et al., 2017). The advantages of containers for scientific computing are numerous. First, as the environment within the container is consistent across computer systems, the reproducibility of the code is enhanced. Containers are encapsulated, portable environments that can contain both the software and data dependencies, along with the libraries needed by the software, ensuring all necessary environment variables are provided and consistent. Second, they prevent orphaning of old code due to environment changes. If a container image is retained, the container can recreate the same environment for the code each time it is run, providing a simple means to rerun old code without having to consider environment changes that would make running the code outside of the container onerous. Third, a container makes it quicker to get up and running with a model, as all dependencies are included in the container. Finally, many HPC centres allow the use of containers on their systems, reducing the need for users to reconfigure models for different server systems.
For CLASSIC, we use a Singularity container (https://www.sylabs.io/), which is portable to Linux, Mac, and Windows systems as well as to the HPC clusters for which Singularity was specifically designed (Kurtzer et al., 2017). Tests of Singularity's performance on HPC benchmarks have demonstrated a negligible performance overhead (Le and Paz, 2017). Singularity is already installed at many national, university, and government HPC centres, including Compute Canada, Argonne National Labs, Fermilab, and the National Supercomputer Centre (Sweden), amongst others. The CLASSIC container image is available from our Zenodo community page (https://zenodo.org/communities/classic/). Version 1.0 of our container contains all software libraries needed to run the model in serial or parallel (using MPI), as well as the capability to run our benchmarking software (described in Section 4.2). The general model workflow is outlined in Figure 2.

Code design and management
To permit flexibility in the code, CLASSIC's parameters are read in at run time from a Fortran namelist file. The use of a namelist permits rapid testing of model sensitivity to parameter values, as the model does not require recompilation between tests. Additionally, new PFTs can be more easily added to the code, as all parameters are located in the namelist file rather than distributed throughout the code. However, by design, new PFTs cannot be introduced into CLASSIC without due care. Within the code, case structures are used at each PFT-level branching calculation to determine the code branch a particular PFT is assigned to. These case structures have integrated error checks, whereby an unknown PFT causes the model to abort with a flag informing the user where in the code the unknown PFT was detected. This safety check encourages thoughtful introduction of new PFTs to the model.
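The guarded PFT dispatch pattern can be sketched in Python as follows. The PFT names and branch labels below are hypothetical illustrations, not CLASSIC's actual identifiers; in the model itself this is done with Fortran select-case blocks whose default branch aborts the run.

```python
def leaf_phenology_class(pft):
    """Dispatch on PFT with an integrated error check.

    Mimics CLASSIC's PFT-level case structures: an unknown PFT aborts
    with a message naming the routine where it was detected, so new
    PFTs cannot silently fall through a branching calculation.
    """
    if pft in ("needleleaf-evergreen", "broadleaf-evergreen"):
        return "evergreen"
    elif pft in ("needleleaf-deciduous", "broadleaf-cold-deciduous"):
        return "cold-deciduous"
    elif pft == "broadleaf-drought-deciduous":
        return "drought-deciduous"
    elif pft in ("c3-crop", "c4-crop", "c3-grass", "c4-grass"):
        return "herbaceous"
    else:
        # Analogue of the Fortran case-default error check.
        raise RuntimeError(f"unknown PFT '{pft}' in leaf_phenology_class")
```

Adding a new PFT then forces the developer to decide, branch by branch, which existing behaviour it inherits.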

In transitioning from CLASS-CTEM to CLASSIC, the code base was ported into the distributed version control system Git (https://git-scm.com/) and is distributed via the software development tool GitLab (https://gitlab.com/cccma/classic). The GitLab issues tracker and a Nabble forum (http://classic-message-board.158658.n8.nabble.com/) will be used to facilitate communication between CLASSIC users. The issues tracker can be used to submit bug reports, elaborate on new parameterizations, and discuss code changes, whereas the forum can be used to ask other users for advice and assistance. As a further aid to development, CLASSIC can output checksums to speed development of code changes that should not affect model outputs. The checksum subroutine computes content-based checksums (a summation of the set bits, i.e. the 1s in the binary representation) of several groups of variables after a run has completed.
Checksums are an imperfect means of ensuring no logical changes, as two numbers may have different binary representations with the same number of set bits. However, as the number of variables checked increases, it becomes highly unlikely to render a false positive. For example, for a data item of bit-length $n$ with $b$ set bits in its representation, the number of same-length data items with the same checksum is given by the binomial coefficient

$$\binom{n}{b}. \qquad (1)$$

If we assume the worst case, where $b = n/2$, then Equation (1) becomes

$$\binom{n}{n/2}. \qquad (2)$$

Dividing by the total number of possible values of an $n$-digit binary number, we find the probability of a false-positive checksum to be

$$\binom{n}{n/2} \Big/ 2^{n}. \qquad (3)$$

So, for example, if the checksum is computed using only three 32-bit numbers, the probability of a false positive is about $2.87 \times 10^{-7}$.
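A set-bit checksum and the worst-case collision probability above can be sketched in a few lines. This is an illustrative Python sketch (here over IEEE-754 double-precision values), not CLASSIC's Fortran implementation.

```python
import struct
from math import comb

def checksum(values):
    """Content-based checksum: total number of set bits ('flipped
    bits') across the binary representations of a group of variables.
    Here each value is viewed as an IEEE-754 double for illustration."""
    total = 0
    for v in values:
        bits = struct.unpack('<Q', struct.pack('<d', float(v)))[0]
        total += bin(bits).count('1')
    return total

def false_positive_prob(n):
    """Worst-case (b = n/2) probability that a random n-bit value has
    the same set-bit count as the stored one: C(n, n/2) / 2**n."""
    return comb(n, n // 2) / 2.0 ** n
```

For a 2-bit item the worst-case probability is C(2, 1)/4 = 0.5, and it shrinks quickly as n grows.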

Self-documenting Code
To modernize the CLASS-CTEM code base, it was first transformed from fixed form (Fortran 77) to a free-form structure (allowing coding constructs permitted by the Fortran 2008 standard). Best practices suggest that it is most desirable to embed documentation within the software, as this increases the likelihood that developers will update the documentation when they update the code (Wilson et al., 2014). Our model documentation was then incorporated directly into the code using the syntax of the Doxygen documentation generator (http://www.doxygen.nl/). Doxygen allows both the programmatic and scientific reasoning behind the code to be detailed within the generated manual, along with variable dictionaries. This ensures that the documentation for a particular model version is always included in the model source code distribution, and lowers the burden on developers to keep it current, as it can be edited as the source code is updated.

The full Doxygen-generated CLASSIC documentation for version 1.0 can be found in the Zenodo archive and, for the most recent CLASSIC release, at https://cccma.gitlab.io/classic. As Doxygen is included in the CLASSIC container, the user can run 'doxygen Doxyfile' to generate a local version of the documentation as required.

Coding standards
As part of the code modernization effort, we refactored the code to follow a recently developed CLASSIC coding convention. Coding conventions help to ensure code portability, readability, and maintainability. In adopting our coding convention, we developed a tool (a 'linter') to enforce code quality that can be used on legacy or new code, either to change the code to meet the coding standard or to flag suspect sections of the code for manual intervention. Our coding convention is available in the Supplementary Material and on the CLASSIC website. The linter, which is written in Python, is available in the CLASSIC code repository.
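To illustrate the kind of rule such a linter enforces, here is a toy Python check that flags uppercase Fortran keywords by line number. The specific rule is an illustrative assumption, not one of CLASSIC's actual conventions.

```python
import re

# Hypothetical rule for illustration: flag uppercase Fortran keywords
# so the linter can report them (or auto-lowercase them in fix mode).
FORTRAN_KEYWORDS = re.compile(r'\b(IF|THEN|ELSE|DO|END|SUBROUTINE|MODULE)\b')

def lint_uppercase_keywords(lines):
    """Return (line_number, keyword) pairs for each violation found."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for match in FORTRAN_KEYWORDS.finditer(line):
            findings.append((lineno, match.group(0)))
    return findings
```

A fix mode would then rewrite the flagged tokens in place, while ambiguous cases are left for manual intervention, as described above.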

Serial and parallel computations (MPI)
As CLASSIC can be run at the site level as well as over gridded domains, it is important to support both modes of operation while maintaining only one code base. The newly created CLASSIC driver allows compilation for serial or parallel processing of the model code. For site-level simulations, serial compilation and running of the model is sufficient, whereas, for speed, runs over gridded domains require parallel processing, which uses Message Passing Interface (MPI) directives within the CLASSIC code. Pre-compiler directives, used sparingly, allow the same code to serve both applications. The default compiler supplied within the CLASSIC container is the GNU compiler (gfortran for serial computation and mpif90 as a wrapper around gfortran for parallel computation), with additional Makefile scripts for Intel and Cray compilers.

Simulation domains
CLASSIC is designed to be easily run over simulation domains from site-level to global. To run the model at a point location, the model input files can be either site-level, containing point-scale information, or regional (two-dimensional fields). For regional domains, the model grid can be specified either with one-dimensional vectors or two-dimensional grids of longitudes and latitudes. The 1D case includes regular grids equally spaced in longitude and latitude as well as regular Gaussian grids. The 2D case is more general and can support any rectangular grid. The model call determines whether a simulation is run across a region or at a point. When using a regular grid, the model domain can be set by calling CLASSIC with a longitude/latitude box. When the input files are provided on an irregular grid, CLASSIC uses grid cell indices, instead of geographic coordinates, to delineate the simulation domain.

NetCDF input/outputs
While CLASS-CTEM used ASCII text files for model input/output (I/O), this method is cumbersome over gridded domains.
The CLASSIC framework uses netCDF files instead (https://www.unidata.ucar.edu/software/netcdf/). NetCDF is a machine-independent, self-describing data format that allows extensive metadata within the files and is the most common data format within the land surface modelling community. Scripts are provided in the CLASSIC code repository to convert legacy ASCII input files (meteorology and model initialization files) into the new netCDF format. Additionally, for site-level users, a Fortran tool is included to convert Fortran namelist files, which are easier to work with than the legacy ASCII files, into a netCDF file suitable for CLASSIC model initialization.

Output files (XML)
To best use netCDF as an output format requires writing the output file metadata at the time of the file's creation. To facilitate this, CLASSIC uses Extensible Markup Language (XML) files that are edited through a web interface, with the resulting files validated against an accompanying schema. The web interface allows a user to configure the output variables, as well as add new variables and edit metadata, using any JavaScript-compatible browser. Once the changes are complete, the user may download an updated version of the modified XML configuration file for use by CLASSIC. Where possible, CLASSIC output files are Climate and Forecast (CF)-compliant (http://cfconventions.org/) and use variable names consistent with the data request of the Coupled Model Intercomparison Project (CMIP6; http://clipc-services.ceda.ac.uk/dreq/mipVars.html).

Meteorological inputs
Meteorological inputs required by CLASSIC include downwelling shortwave and longwave radiation, surface precipitation rate, surface air pressure, specific humidity, wind speed, and air temperature, together with a reference height at which these quantities apply. Forcing provided at a coarser temporal resolution than the model physics timestep (such as six-hourly reanalysis) is disaggregated: state variables such as air temperature, wind speed, specific humidity, and air pressure are linearly interpolated. Longwave radiation is uniformly distributed across the reanalysis time period. Shortwave radiation is distributed diurnally using the day of year and the grid cell's latitude, with the maximum value occurring at solar noon. The total precipitation amount in a reanalysis time period determines the number of wet half-hours in that period, following Arora (1997). The time period's precipitation amount is then spread randomly, in a conservative manner, across the wet physics timesteps.
For example, if the reanalysis meteorology is provided in 6-hour time periods, the non-zero precipitation for each 6 hours is spread across the number of wet 30-minute time steps determined by the disaggregation scheme. If CLASSIC is being run with observed meteorology, such as from an eddy covariance tower, on a 15-30 minute timestep, the meteorology is used without modification.
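The conservative random spreading of a period's precipitation can be sketched as follows. This is an illustrative Python sketch: the number of wet substeps would come from the Arora (1997) scheme, and the equal split per wet step is an assumption made here to keep the example simple and conservative.

```python
import random

def disaggregate_precip(period_total, n_substeps, n_wet, rng=None):
    """Spread one forcing period's precipitation total across n_wet of
    its n_substeps physics timesteps (e.g. a 6-hour reanalysis period
    split into twelve 30-minute steps), conserving the total amount.
    """
    rng = rng or random.Random()
    series = [0.0] * n_substeps
    if period_total <= 0.0 or n_wet <= 0:
        return series
    # Randomly choose which substeps are wet, then split the total
    # equally among them so the period sum is exactly conserved.
    for i in rng.sample(range(n_substeps), n_wet):
        series[i] = period_total / n_wet
    return series
```

Each dry substep receives zero precipitation, and summing the series recovers the original period total.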
4 Running and Benchmarking CLASSIC

4.1 Quick-start tutorial to run and benchmark CLASSIC

Appendix A contains a tutorial that guides the reader through downloading, compiling, and running CLASSIC over a set of FLUXNET sites. FLUXNET is a global network of micrometeorological tower sites that uses the eddy covariance method to measure the exchanges of carbon dioxide, water vapour, and energy between terrestrial ecosystems and the atmosphere (https://fluxnet.ornl.gov). FLUXNET data have been used extensively to improve and evaluate the hydrology, energy, and carbon fluxes of land surface models, as well as model processes (e.g., Blyth et al., 2010; Stöckli et al., 2008; Melaas et al., 2013). The outputs of the model runs are then passed through a benchmarking system described below. All figures presented here, except the schematic figures (Figs. 1 and 2), are produced by the benchmarking software.

Automated Model Benchmarking (AMBER)
The increasing complexity of land surface models necessitates more advanced methods of model evaluation and assessment. The International Land Model Benchmarking (ILAMB) project has developed a framework that summarizes model performance across multiple statistical metrics using a dimensionless skill score system, implemented in an open-source benchmarking and diagnostics software tool for land surface model evaluation written in Python (Collier et al., 2018). The evaluation of CLASSIC presented in this study is based on ILAMB's statistical framework, which we implemented in a new R package referred to as the Automated Model Benchmarking (AMBER) package (Seiler, 2019). Developing AMBER allowed us to tailor the ILAMB approach to CLASSIC model outputs, providing: (i) seamless data ingestion that does not require pre-processing steps, (ii) the ability to evaluate CLASSIC in different simulation modes (i.e. global and regional simulations on differing grids, and site-level runs), and (iii) full control over how the statistical framework is implemented. AMBER evaluates CLASSIC against a variety of observation-based reference data, consisting of global-scale remote sensing products, eddy covariance flux tower data, and annual streamflow measurements at the mouths of major rivers. Site-level evaluations are based on eddy covariance flux tower data alone. All AMBER outputs from the site-level evaluation of CLASSIC v. 1.0 are in the model benchmarking archive (Table 2), with AMBER itself included in the CLASSIC software container.

Skill scores
In ILAMB (Collier et al., 2018) and AMBER, the performance of a model is expressed through scores that range from zero to one, where higher values imply better performance. These scores are computed for each variable in five steps: (1) computation of a statistical metric, (2) nondimensionalization, (3) conversion to the unit interval, (4) spatial integration, and (5) averaging of scores computed from different statistical metrics. The statistical metrics considered are the bias, root-mean-square error, phase shift, interannual variability, and spatial distribution. For example, the bias is calculated as

$$bias(\lambda, \phi) = \overline{v}_{mod}(\lambda, \phi) - \overline{v}_{ref}(\lambda, \phi),$$

where $\overline{v}_{mod}(\lambda, \phi)$ and $\overline{v}_{ref}(\lambda, \phi)$ are the mean values in time ($t$), over a specified period, of a variable $v$, which varies geographically as a function of longitude $\lambda$ and latitude $\phi$, for model and reference data, respectively. Nondimensionalization is achieved by dividing the absolute bias by the standard deviation of the reference data ($\sigma_{ref}$) over the same period:

$$\varepsilon_{bias}(\lambda, \phi) = |bias(\lambda, \phi)| / \sigma_{ref}(\lambda, \phi).$$

A bias score that ranges between zero and one is calculated next:

$$s_{bias}(\lambda, \phi) = e^{-\varepsilon_{bias}(\lambda, \phi)}.$$

The spatial integration of $s_{bias}$ leads to the scalar score $S_{bias}$. The spatially integrated score values for the root-mean-square error ($S_{rmse}$), phase shift ($S_{phase}$), interannual variability ($S_{iav}$), and spatial distribution ($S_{dist}$) are found in a similar way, although the actual calculation of the metrics is, of course, different, as shown in Table A1. To demonstrate the intermediate statistical metrics for the component scores, the components of $S_{rmse}$ are listed in Table A2. The score values based on these metrics are then combined to derive a single overall score for each output variable:

$$S_{overall} = \frac{S_{bias} + 2 S_{rmse} + S_{phase} + S_{iav} + S_{dist}}{1 + 2 + 1 + 1 + 1}.$$
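The per-location bias score and the weighted overall score can be sketched as follows. This is a minimal Python sketch of the ILAMB-style scoring described above (for a single site or grid cell, so the spatial integration step is omitted), not the AMBER R implementation.

```python
import math

def bias_score(v_mod_mean, v_ref_mean, sigma_ref):
    """ILAMB-style bias score for one site or grid cell.

    bias = mean(mod) - mean(ref); eps = |bias| / sigma_ref;
    score = exp(-eps), mapping [0, inf) onto (0, 1], so a score near
    one means the bias is small relative to observed variability.
    """
    eps = abs(v_mod_mean - v_ref_mean) / sigma_ref  # nondimensionalize
    return math.exp(-eps)                           # unit interval

def overall_score(s_bias, s_rmse, s_phase, s_iav, s_dist):
    """Weighted mean of the five component scores; RMSE counts double."""
    return (s_bias + 2.0 * s_rmse + s_phase + s_iav + s_dist) / 6.0
```

For instance, a model mean exceeding the reference mean by exactly one reference standard deviation yields a bias score of exp(-1) ≈ 0.37.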
S_rmse is assigned twice the weight of the other metrics since we consider it the most important.
Interpretation of benchmarking scores should be done carefully. While ILAMB uses 'stoplight' colours to distinguish, at least in a visual sense, different thresholds of model performance (e.g. see Figure 1 in Collier et al., 2018), we do not adopt that approach. The use of thresholds may be a useful visual tool when distinguishing the performance of several models (as in Figure 1 of Collier et al., 2018), or versions of the same model; however, here we are presenting the results of a single model, and thus any thresholds chosen would be arbitrary. Following Collier et al. (2018), we also '... do not view these aggregate absolute scores as a determinant of good or bad models. We envision the scores as a tool to more quickly identify relative differences among models and model versions which the scientist must then interpret'. We also agree with Collier et al. (2018) that the absolute value of a score is not particularly meaningful: a perfect score is not achievable because reference datasets have measurement error and uncertainty, datasets for the same variable can be inconsistent with one another, and a lower score may only highlight the need for an enhanced observational sampling effort, or the lack of an appropriate metric of model performance (Collier et al., 2018). Therefore, we do not use the model score as the basis to determine whether model performance is acceptable or deficient for different model outputs, but rather use these scores to define the initial model performance that future model developments will be compared against.

Evaluation of CLASSIC
We used three separate approaches to benchmark CLASSIC. First, we used FLUXNET sites (Pastorello et al., 2017) where CLASSIC was driven with observed meteorology and the model initialization file was set up to correspond to site conditions, i.e. vegetation composition and coverage, soil texture and permeable depth, based on publications describing the tower sites.
These 'site-level' simulations were performed for the 31 FLUXNET sites listed in Table 1. FLUXNET sites in our study have between 3 and 19 years of available meteorological data (Table 1). Since these available years of meteorological data may not be representative of the mean climate at a given site, and/or a given site may be recovering from events that occurred prior to the start of the EC tower installation, the comparison between observations and model simulated fluxes is somewhat confounded. Because the model is spun up to equilibrium, the modelled annual NEE always sums to zero, by construction, while we know that in the real world, at most sites, GPP is currently higher than RECO, making annual NEE positive with land acting as a sink of carbon in response to rising atmospheric CO2 concentrations. Past episodic extreme climatic events (e.g. drought or extreme precipitation) and disturbances that we are unable to take into account (e.g. prior timber harvest or tree planting at site CZ-BK1 (Sedlák et al., 2010), or fire at CG-Tch (Merbold et al., 2008)) further confound the comparison between model and observation-based fluxes. In an ideal world, we would have historical meteorological data and disturbance history available at each site. All components of the carbon budget - GPP, RECO, and NEE - are affected by these issues, but since NEE is a residual of GPP and RECO, the model to observations comparison is confounded most significantly for NEE.
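A toy arithmetic example (the numbers are hypothetical, chosen only for illustration) shows why NEE suffers most: as a small residual of two large fluxes, a modest error in GPP or RECO becomes a large relative error in NEE.

```python
# Hypothetical annual fluxes (kg C m-2 yr-1) at a productive forest site.
gpp, reco = 2.0, 1.8
nee = gpp - reco                              # small residual: 0.2

# Suppose the model errs by only 5 % on GPP and matches RECO exactly.
gpp_mod = 1.9                                 # 5 % low
nee_mod = gpp_mod - reco                      # 0.1

rel_err_gpp = abs(gpp_mod - gpp) / gpp        # about 0.05:  5 % error in GPP
rel_err_nee = abs(nee_mod - nee) / nee        # about 0.50: 50 % error in NEE
print(rel_err_gpp, rel_err_nee)
```

The tenfold amplification here is purely a consequence of differencing two large, nearly equal quantities.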

The other two benchmarking approaches drive CLASSIC with globally gridded data products of meteorological data, vegetation cover, soil permeable depth, and soil textures. In the first of these two approaches, CLASSIC is evaluated against globally gridded data sets for variables where observation-based data are available. The third and final benchmarking approach also uses gridded data products to drive CLASSIC, but the model outputs are evaluated against the entire FLUXNET2015 release of 212 sites by comparing the EC tower data against the model outputs from the CLASSIC grid cell that each tower lies within. The evaluation from these two approaches is presented in a companion paper that evaluates CLASSIC on a global scale (Seiler et al., 2019, in prep.).

Further performance metrics
We also present the statistical metrics of root mean squared error (RMSE) and coefficient of determination (both calculated using the scikit-learn Python package; Pedregosa et al., 2011). The coefficient of determination is calculated as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding observed value for a total of $n$ samples, and $\bar{y}$ is found via

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

While the coefficient of determination is commonly termed R2, in this formulation the best possible goodness of fit is 1.0, while the value can go negative (as can be seen from the equation above). The RMSE is calculated by

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.$$

5.2 Benchmarking observation-driven CLASSIC at FLUXNET sites
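The per-site RMSE and R2 values reported in this section follow the definitions above. As a concrete check, here is a minimal pure-Python version, equivalent to scikit-learn's r2_score and the square root of mean_squared_error; the function name is ours:

```python
import math

def r2_and_rmse(y_obs, y_pred):
    """Coefficient of determination and RMSE, written out explicitly."""
    n = len(y_obs)
    y_bar = sum(y_obs) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - y_bar) ** 2 for o in y_obs)
    r2 = 1.0 - ss_res / ss_tot        # unbounded below: can go negative
    rmse = math.sqrt(ss_res / n)
    return r2, rmse

# A perfect prediction gives R2 = 1 and RMSE = 0 ...
print(r2_and_rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # -> (1.0, 0.0)
# ... while a model worse than simply predicting the observed mean
# yields a negative R2:
print(r2_and_rmse([1.0, 2.0, 3.0], [3.0, 3.0, 3.0]))   # R2 = -1.5
```

The second example illustrates why negative values appear in the per-site tables: R2 compares the model against the trivial predictor that always returns the observed mean.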

Energy fluxes
The mean seasonal cycle of net radiation (RNS) simulated by CLASSIC compared against observation-based estimates from all FLUXNET sites is shown in Fig. 5. For each site, the plot title shows the site ID (Table 1).

The sensible heat fluxes (HFSS) simulated by CLASSIC were generally closer to the EC tower observations (Fig. 7) than the latent heat fluxes (HFLS), but with a slight overestimate compared to observations. The mean RMSE across sites was 24.7 W m−2 with a median R2 of 0.31 (see Supplementary Table S3 for per site values). The AMBER scores for HFSS were higher than for HFLS, with the lowest component score being S_rmse at 0.52 and an overall score of 0.66, equal to the HFLS overall score.

Carbon fluxes
Figure 8 compares the simulated mean seasonal cycle of GPP to FLUXNET observations. While we removed implausible values from the observations, i.e. negative GPP, we retained other data points that we suspect could be spurious (e.g. 2005 GPP at SD-Dem in Supplementary Fig. S16). Over all 31 sites, the CLASSIC outputs have an average RMSE of 0.07 kg C m−2 month−1 and a median R2 of 0.15 (see Supplementary Table S4 for per site values). The mid-latitude evergreen broadleaf forest (EBF) sites, such as FR-Pue in France, are for Quercus ilex (evergreen oak), which does not correspond well to the CLASSIC formulation of the evergreen broadleaf PFT, which is typically assumed to be a tropical PFT. CLASSIC generally has a small bias, simulating somewhat lower GPP for the EBF sites (excepting GH-Ank; Fig. 8).

At one site, post-drought, most of the native bunchgrasses were dead, along with high shrub mortality, leading to the establishment of an invasive grass (Scott et al., 2010). The site is also lightly to moderately grazed. Changes in vegetation such as this, along with unique drought conditions (which are only partly captured in the observation period), are difficult for CLASSIC to capture adequately if the goal is to evaluate a model across a large number of sites spanning the diversity of biomes globally. Other sites present similar difficulties in replicating the observed GPP due to complex histories or conditions. As a result, it is important to compare across several sites to reduce the importance of site-level conditions and to observe the model performance overall.

Seasonal plots of simulated ecosystem respiration (RECO) are compared against FLUXNET estimated values in Fig. 9.
CLASSIC generally has lower variability in RECO than suggested by EC-tower derived estimates, especially at low-latitude sites such as BR-Sa1, GF-Guy, MY-PSO, ZM-Mon, and SD-Dem. At evergreen needleleaf forest sites, the simulated RECO is generally higher than the FLUXNET values. This overestimate is consistent with high GPP values at these sites. Similarly, the low GPP simulated for the evergreen broadleaf forests yields a low-biased RECO. MY-PSO is a notable exception, with CLASSIC simulating smaller GPP but larger RECO than the EC-tower derived quantities. Also, while RU-SkP (DNF) was reasonably well simulated for GPP, CLASSIC simulates a much larger RECO than the EC-tower derived value. The mean RMSE across all sites is 0.06 kg C m−2 month−1 with a median R2 of 0.12 (see Supplementary Table S5 for per site values).
The AMBER scores for RECO are similar to the GPP scores with a slightly lower overall score (0.65) but a higher S rmse of 0.49.
NEE is directly observed at eddy-covariance towers. The comparison between model and observations for NEE is, however, confounded due to the issues discussed in Section 5. Due to these factors, the AMBER NEE scores are lower than all others.

This work also lays the groundwork for the incorporation of a N-cycle in CLASSIC that is presently in development. Other work in progress includes the incorporation of high-latitude shrub PFTs for both the physical and biogeochemical aspects of this growth form. Following on from recent work demonstrating CLASSIC's capable performance in simulating the physics of permafrost regions (Melton et al., 2019d), CLASSIC's bulk soil C pools will be replaced with an explicit tracking of soil C per soil layer that is better suited to simulate permafrost C dynamics. As part of this work, a C tracer is being incorporated to allow tracking of 14C or other isotopes through the model C pools. Planned technical developments include further modularization of the physics code, greater adoption of data structures to move away from lengthy argument lists, and modifications to allow the same code to be seamlessly used offline as well as in the CanESM framework. While future releases of CanESM will include CLASSIC, CanESM5 contains older model versions (CLASS v. 3.6.2 and CTEM v. 1.1). Benchmarking developments include adding peatland EC tower sites to allow evaluation of our peatland module (Wu et al., 2016) as well as sites for biomes that are presently poorly represented.

Conclusions
CLASSIC is the open source community successor to CLASS-CTEM. CLASS and CTEM have decades of development behind them and have been extensively evaluated and applied at scales from site-level, involving evaluation against data from EC towers, to coupling to an atmosphere model as part of CanESM. While the science within CLASS-CTEM has continued to advance in response to changing scientific questions and applications, the technical aspects of CLASS-CTEM have fallen behind. Our study details the transformation of CLASS-CTEM from a primarily internally developed model to one designed to encourage community use and development.

Code and data availability. The CLASSIC software container, CLASSIC site-level FLUXNET benchmarking data and AMBER reports, and all code both for CLASSIC v. 1.0 and for preparing the plots of model outputs as well as site-level data presented here are archived on the CLASSIC community Zenodo page (Melton et al., 2019a, b, c). See Table 2 for the location of all resources.

Figure 9. The mean seasonal cycle of RECO for 31 FLUXNET sites as simulated by CLASSIC. IGBP biome of each site is listed alongside its FLUXNET name (see Table 1). Shaded regions indicate the standard deviation across the sample years. See caption of Figure 5 for biome names.

This document will walk through the basic setup, compilation, and running of CLASSIC at FLUXNET sites.

Requirements
This guide assumes that the reader is using a Linux machine, and that they have Singularity and tar installed. The installation guide for Singularity can be found at https://sylabs.io/docs/#singularity. Tar can be installed from https://www.gnu.org/software/tar/. If using a Windows or Mac machine, see instructions on setting up Singularity at https://sylabs.io/guides/3.3/user-guide/installation.html#install-on-windows-or-mac.

Obtaining the Source Code
The first step is setting up a directory structure. For the sake of this guide, we will work out of a directory located at '/home/eccc'. So first we navigate to our chosen directory location:

cd /home/eccc

The CLASSIC source code is hosted on GitLab, and can be cloned with:

git clone https://gitlab.com/cccma/CLASSIC.git

If you are familiar with git's workflow and wish to contribute to the codebase in the future, then it is a good idea to first fork a copy to your own GitLab account. Once you have a fork, you can clone the repository with:

git clone https://gitlab.com/your_gitlab_username/CLASSIC.git

Once the cloning process is complete, there should be a directory titled CLASSIC located in your directory: /home/eccc/CLASSIC

Obtaining other necessary files
All other necessary files are found on Zenodo in the form of compressed packages. We will use the 'tar' command to unpack them. From the '/home/eccc' directory, we will decompress the tarballs and move the decompressed files to their correct locations.

The following list of commands will accomplish this. If you wish to deviate from the automated run scripts, the 'singularity exec' command used below will be of great use.
First, we'll make sure we have a clean working directory.
singularity exec CLASSIC_container.simg make mode=serial clean

We are now ready to compile the source code for serial simulation. This is done with the command:

singularity exec CLASSIC_container.simg make mode=serial

The compilation process can take several minutes. If you get errors, check that you have the latest stable version of CLASSIC pulled from the repository.
With the FLUXNET sites in place and CLASSIC compiled, we can now setup the job options file(s) to run over the FLUXNET sites. To do this, run the job options script by invoking:

tools/siteLevelFLUXNET/prep_jobopts.sh
This script puts a new job-options file in every FLUXNET site directory tailored to that particular site. The 'singularity exec' command is not necessary for this particular script.

Running CLASSIC over FLUXNET sites
This guide will cover two methods for running the CLASSIC binary on FLUXNET sites. They can be run individually, or through the batch-run script.

Running all sites with the batch script (recommended)
A script is provided in the CLASSIC repository to run all FLUXNET sites, provided the directory structure of this document is followed. The script will run for all sites it finds within the inputFiles/FLUXNETsites directory and is invoked by:

tools/siteLevelFLUXNET/run_sites.sh

To exclude a site from the batch run, its directory can be moved out of inputFiles/FLUXNETsites. This would simply move the site out of the way, and it could be placed back into the FLUXNETsites directory when needed.

Running individual sites (advanced)
Individual sites are run through the Singularity container, directly referencing the job options file for that particular site. The command takes the form:

singularity exec CLASSIC_container.simg bin/CLASSIC_serial <job_options_file> <longitude>/<latitude>

Since this is a site-level simulation, the shorthand 0/0 or 0/0/0/0 can be used in place of the actual longitude and latitude of the site. Piecing it all together, if we want to run the site AU-Tum, the command would be:

singularity exec CLASSIC_container.simg bin/CLASSIC_serial \
inputFiles/FLUXNETsites/AU-Tum/job_options_file.txt 0/0/0/0

If you decide to shell into the Singularity container (recommended if you're familiar with Singularity and terminal commands), then this becomes:

bin/CLASSIC_serial inputFiles/FLUXNETsites/AU-Tum/job_options_file.txt 0/0/0/0

More information on running CLASSIC can be found in the CLASSIC manual at https://cccma.gitlab.io/classic/index.html.

Processing output
Whether you ran multiple sites with the batch job, or a single site individually, the output can be found in 'outputFiles/FLUXNETsites/sitename/'. These outputs are in the form of netCDF files. More information on netCDF can be found at https://www.unidata.ucar.edu/software/netcdf/, but briefly, it is a machine-agnostic, array-oriented form of data storage.
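Conceptually, converting such output to CSV means flattening netCDF's column-oriented arrays into rows. A stdlib-only sketch of that reshaping (the variable names and values here are illustrative, not CLASSIC's actual output names):

```python
import csv
import io

# Hypothetical monthly series for one site, as they might be read out of a
# CLASSIC output netCDF file (names and values are illustrative only).
data = {
    "time": ["2004-01", "2004-02", "2004-03"],
    "gpp":  [0.01, 0.03, 0.07],   # kg C m-2 month-1
    "reco": [0.02, 0.03, 0.05],
}

# One netCDF-style column per variable becomes one CSV column;
# time steps become rows.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(data.keys())
writer.writerows(zip(*data.values()))
print(buf.getvalue())
```

For real CLASSIC outputs the provided conversion script (next section) does this for you; reading the netCDF files directly would additionally require a library such as netCDF4 or xarray.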
A script is provided to convert these data to CSV format, which may be easier to work with if you are unfamiliar with netCDF, as well as to generate plots of several output variables against observational values of those variables. The script is run with:

tools/siteLevelFLUXNET/process_outputs.sh

NOTE: if not all sites were run with the batch script, errors will be seen in the output of this script. This is expected, and does not mean that the processing has failed.
Once the output plots have been generated, the 'process_outputs' script will run the Automated Model Benchmarking (AMBER) tool. More information on AMBER can be found at https://cran.rstudio.com/web/packages/amber/amber.pdf.
After the script completes, PDF copies of the plots are found in 'outputFiles/plots', while AMBER results are in 'outputFiles/AMBER'.

Table A1. Equations used for computing spatially integrated scalar scores for each variable. The four steps refer to (1) the definition of the statistical metric, (2) nondimensionalization, (3) conversion to unit interval, and (4) spatial integration. The integration limits $t_0$ and $t_f$ are the initial and final time steps, respectively. The variables $c_{mod}$ and $c_{ref}$ refer to the monthly mean annual cycle of the model and reference data, respectively. crmse is the 'centralized RMSE', which is the RMSE of the anomalies; this ensures that the RMSE score excludes the bias, which is captured separately by the bias score.
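A minimal sketch of two of these metrics, using our own toy, list-based implementation rather than AMBER's R code: the centralized RMSE, and the spatial distribution score S_dist.

```python
import math

def crmse(v_mod, v_ref):
    """Centralized RMSE: the RMSE of the anomalies, so that the bias
    (which is scored separately) is excluded."""
    n = len(v_ref)
    m_mod = sum(v_mod) / n
    m_ref = sum(v_ref) / n
    return math.sqrt(sum(((a - m_mod) - (b - m_ref)) ** 2
                         for a, b in zip(v_mod, v_ref)) / n)

def s_dist(sigma_mod, sigma_ref, corr):
    """Spatial distribution score from the ratio of spatial standard
    deviations (sigma_mod/sigma_ref) and the spatial correlation corr."""
    sigma = sigma_mod / sigma_ref
    return 2.0 * (1.0 + corr) / (sigma + 1.0 / sigma) ** 2

# A constant offset is pure bias, so the centralized RMSE is zero:
print(crmse([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))   # -> 0.0
# Matched variability (sigma ratio 1) and perfect correlation score 1:
print(s_dist(1.0, 1.0, 1.0))                      # -> 1.0
```

The first example makes the caption's point concrete: a model that is everywhere offset by a constant has zero centralized RMSE, with the offset penalized only through the bias score.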

$S_{iav}$:
(1) $iav_{ref}(\lambda, \varphi) = \sqrt{\frac{1}{t_f - t_0} \int_{t_0}^{t_f} (v_{ref}(t, \lambda, \varphi) - c_{ref}(t, \lambda, \varphi))^2 \, dt}$, with $iav_{mod}$ defined analogously
(2) $\varepsilon_{iav}(\lambda, \varphi) = |iav_{mod} - iav_{ref}| / iav_{ref}$
(3) $s_{iav}(\lambda, \varphi) = e^{-\varepsilon_{iav}(\lambda, \varphi)}$
(4) spatial integration of $s_{iav}$ yields $S_{iav}$

$S_{dist}$:
(1) $\sigma = \sigma_{mod}/\sigma_{ref}$, and $R$, the spatial correlation coefficient of $v_{ref}(\lambda, \varphi)$ and $v_{mod}(\lambda, \varphi)$
(2), (3) not applicable
(4) $S_{dist} = 2(1 + R)/(\sigma + \frac{1}{\sigma})^2$

Table A2. Site-level statistical metrics for calculating $S_{rmse}$ at the 31 FLUXNET sites. crmse is the 'centralized RMSE', which is the RMSE of the anomalies; this ensures that the RMSE score excludes the bias, which is captured separately by the bias score.

Author contributions. The authors contributed to the planning and implementation of CLASSIC along with contributing text. C. S. wrote AMBER, provided AMBER outputs and wrote sections of the manuscript. E. W-C wrote the CLASSIC code around netCDF I/O, the MPI incorporation, the meteorological disaggregation, and the XML editor, and also contributed text to the manuscript. E. C. worked on the model compilation scripts, aspects of CLASSIC code, and its use on supercomputers. L. T. processed FLUXNET data and set up the model initialization files for the tower sites used in the site-level evaluation. M. F. wrote and applied the linter used to refactor the code, drafted the coding standards documents, developed the parallel container recipe, and wrote the plotting scripts. All authors contributed to the final manuscript.
Competing interests. The authors declare no competing interests.