The CF (Climate and Forecast) metadata conventions are designed to promote the creation, processing, and sharing of climate and forecasting data using Network Common Data Form (netCDF) files and libraries. The CF conventions provide a description of the physical meaning of data and of their spatial and temporal properties, but they depend on the netCDF file encoding which can currently only be fully understood and interpreted by someone familiar with the rules and relationships specified in the conventions documentation. To aid in development of CF-compliant software and to capture with a minimal set of elements all of the information contained in the CF conventions, we propose a formal data model for CF which is independent of netCDF and describes all possible CF-compliant data. Because such data will often be analysed and visualised using software based on other data models, we compare our CF data model with the ISO 19123 coverage model, the Open Geospatial Consortium CF netCDF standard, and the Unidata Common Data Model. To demonstrate that this CF data model can in fact be implemented, we present cf-python, a Python software library that conforms to the model and can manipulate any CF-compliant dataset.

Network Common Data Form (netCDF) supports a view of data as a
collection of self-describing, portable objects that can be accessed
through standardised software libraries. For climate scientists, as
well as others, it has become a popular way to create, access, and
share array-orientated scientific data

The CF (Climate and Forecast) metadata conventions

CF metadata are designed to be interpretable without reference to external tables, readable by humans, easily parsable by programs, and minimally redundant, which reduces the potential for inconsistencies. The development of CF began in 1999 and has proceeded incrementally, with new features added only when called for by common use cases, and with consideration of how each change might impact data producers. Archival of data is a major purpose of netCDF, for which reason backwards compatibility is an important consideration. So far, no backwards-incompatible change has been made to the CF conventions, meaning that a file written with a previous release of CF is still compliant with the most recent version.

In general, CF tells data producers how they can provide information they
think is important for understanding their data, but it mandates very little
metadata. Projects which recommend or require the use of the CF conventions
may of course impose additional requirements on data producers, as is done,
for instance, by the Coupled Model Intercomparison Project
(

In this paper, we present a data model which is based on the version of the CF
conventions (1.6) which was the latest release at the time of writing. (Since
then, 1.7 has been released; see Sect.

The netCDF interface that underlies CF has an explicit data model (the yellow
layer in Fig.

Our aim is to create an explicit data model for CF (the green layer in
Fig.

The benefits of having a CF data model. The CF-netCDF conventions (CN) rely on the netCDF data model (NC) and, at present, a software application is forced to make its own interpretation of the CF-netCDF conventions – an interpretation that is likely to be different from that of other applications. The aim of this paper is to propose a comprehensive CF data model that provides a consistent interpretation of the conventions, thereby facilitating compatibility across applications that adopt it. This by no means precludes a variety of software implementations, as there is still considerable flexibility in mapping data model elements onto the data objects needed for a particular application.

The primary requirement of a data model is that it should be able to describe all existing and conceivable CF-compliant datasets. If we have been successful, then software libraries that adopt our CF data model in constructing their internal data structures will be able to represent and manipulate any CF-compliant dataset.

For our data model, we define a minimal set of elements that are sufficient for accommodating all aspects of the CF conventions. We restrict the elements of the data model to those that are explicitly mentioned in CF, but the elements need not be irreducible: a single data model element may describe more than one CF entity. For example, in CF, coordinates and coordinate bounds are distinct entities, but coordinate bounds cannot exist without coordinates. Therefore, it makes sense in our data model to group them into a single element.

Similarly, while it is possible to introduce additional elements not presently needed or used by CF, we believe this would not be desirable because it would increase the likelihood of a data model becoming outdated or inconsistent with future versions of CF.

The CF data model should also be independent of the encoding, meaning that it should not be constrained by the parts of the CF conventions which describe explicitly how to store (i.e. encode) metadata in a netCDF file. The virtue of this is that, should netCDF ever fail to meet the community's needs, we shall already have laid the groundwork for applying CF to other file formats.

In Sect.

The existing CF conventions are for use with netCDF files following the
netCDF “classic” data model (the yellow layer in Fig.

The netCDF classic data model is described using Unified Modeling Language
(UML) in Fig.

The UML class associations used in this paper. See
Fig.

NetCDF classic files contain data in named variables, which can be single numbers (with no dimensions), one-dimensional arrays (vectors), or multi-dimensional arrays; the dimensions are declared by name in the file. Variables can be of integer, floating point, or character data types. Variables may have attributes of any data type attached. Attributes can have a single value or consist of a one-dimensional array. NetCDF files also have “global” file attributes which provide information about the dataset as a whole. NetCDF library software has functions to define dimensions, variables, and attributes, and to write and read data.
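The elements of the classic data model described above can be sketched as a few plain Python classes. This is purely illustrative: the class names (`Dimension`, `Variable`, `File`) are not the netCDF library API, and the values are invented for the example.

```python
from dataclasses import dataclass, field

# An illustrative sketch of the netCDF classic data model: named
# dimensions, variables with attached attributes, and "global" file
# attributes. Not the netCDF library API.

@dataclass
class Dimension:
    name: str
    size: int

@dataclass
class Variable:
    name: str
    dimensions: tuple                     # names of dimensions; empty for a scalar
    dtype: str                            # "int", "float", or "char"
    attributes: dict = field(default_factory=dict)
    data: list = field(default_factory=list)

@dataclass
class File:
    dimensions: dict = field(default_factory=dict)
    variables: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)  # "global" attributes

# Build a minimal file: one dimension and one variable with an attribute.
f = File()
f.dimensions["time"] = Dimension("time", 3)
f.variables["tas"] = Variable(
    "tas", ("time",), "float",
    attributes={"units": "K"},
    data=[271.2, 272.8, 274.1],
)
f.attributes["Conventions"] = "CF-1.6"
```

Note that, as the following section stresses, nothing in these structures assigns any meaning to the variables or attributes; the semantics are supplied entirely by conventions such as CF.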

Key components of the netCDF classic data model
(corresponding to the yellow “NC” layer in
Fig.

It is important to appreciate that netCDF itself has no other semantics; for example, while coordinates can be stored in variables and described by attributes, the meanings of these variables and attributes and relationships between them and the variables containing data are not defined by netCDF. NetCDF makes no prescriptions or restrictions regarding the type of metadata which may be stored in the simple data structures that it offers. This flexibility is intended to provide a scope for users and scientific disciplines to develop their own conventions for encoding semantics so that datasets are sufficiently described by those who create them and that they remain valid for those who store and use them. CF is an example of this.

The original classic netCDF data model has been “enhanced” with the addition of several new features, including the ability to organise variables in hierarchical groups. Here, we adopt only one of the new features: we regard the character string as a data type, whereas the classic model treats strings as arrays of individual characters. Logically, these treatments are equivalent, but because strings are easier to manipulate in software codes, it is very likely that they will become a part of CF in the future.

In this section, we briefly describe how the most important of the CF
conventions are encoded in netCDF files. We do not consider any conventions
accepted after version 1.6 (see Sect.

In order to reduce storage occupied by netCDF files, the CF conventions
provide for lossy packing of data values and lossless compression by
eliminating missing data values. Although practically valuable, these
mechanisms do not affect our conceptual data model and so we have chosen to
describe them in Appendix

Unidata provides a netCDF user guide (NUG)

When originally conceived, CF was an extension of the pre-existing COARDS
(Cooperative Ocean/Atmosphere Research Data Service) netCDF conventions
(

A CDL representation of the CF-netCDF file used for
examples in Sect.

The overarching purpose of the conventions is to provide conforming datasets
with sufficient metadata that they are self-describing, in the sense that
each variable in the file has an associated description of what it
represents, and that each value can be located (usually in space and time).
To meet this objective, we define a data variable

In CF-netCDF, the values and the description of

An example domain defined by three dimensions, one of which is single valued (height).

Within a CF-netCDF file, dimensions and coordinate variables may be used in
the definition of multiple domains, thus reducing redundancy. In our example
file,

NetCDF dimensions establish the size of the index space of data variables,
e.g. lines 3–5 in Fig.

In many cases, each dimension of a domain can be fully described by a single,
strictly monotonic coordinate variable (e.g. time, height, latitude,
longitude). For more complicated cases, however, such as parametric vertical
coordinates (e.g. dimensionless atmosphere sigma coordinates), CF provides a
way to record how to compute, from the original non-dimensional
coordinates, dimensional coordinates identifying the location
of the data in physical space (in the case of sigma, the air pressure). This
information is encoded with the
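As a concrete illustration of such a computation, the CF formula for the atmosphere sigma coordinate defines the pressure at each level as p(k) = ptop + sigma(k) * (ps − ptop). A minimal plain-Python sketch (the variable names and values are illustrative):

```python
# Computing dimensional pressure coordinates from dimensionless sigma
# coordinates, following the CF formula for atmosphere_sigma_coordinate:
#   p(k) = ptop + sigma(k) * (ps - ptop)

def sigma_to_pressure(sigma, ps, ptop):
    """Return the pressure (Pa) on each sigma level for one
    surface-pressure value ps and model-top pressure ptop."""
    return [ptop + s * (ps - ptop) for s in sigma]

# Three sigma levels, a surface pressure of 1000 hPa, model top at 10 hPa
levels = sigma_to_pressure(sigma=[0.2, 0.5, 0.9], ps=100000.0, ptop=1000.0)
```

In a real dataset, `ps` would itself be a multi-dimensional field, so the computed pressure varies with horizontal position and time as well as with level.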

CF also defines “auxiliary coordinate variables” to provide mandatory or
optional coordinate information which is additional or alternative to that
contained in the coordinate variables in the NUG sense. Auxiliary coordinate
variables can be string valued, may contain missing values, and are not
necessarily monotonic. For example, we might like to associate the
coordinates of a vertical axis with model level number as well as sigma
coordinate or to provide location information and station names for the
points in a time series (as in Fig.

An important and mandatory use of auxiliary coordinates is to supply latitude
and longitude locations of each point when the horizontal axes of a grid are
themselves not latitude and longitude (e.g. if they refer to a rotated North
Pole or are based on a map projection, as is the case for

Some axes have only a single coordinate value. Regrettably, single-valued
coordinates are often omitted from metadata, although they are very useful:
for example, the time of a field sampled at a single instant (say, 12:15 Z
on 14 July 2015), or the level of a single-level field, e.g. air
temperature at a height of 1.5 m (Fig.

Calendar time in CF (year, month, day, hour, minute, second) is encoded with
units “time unit since reference date–time” (e.g. line 9 in
Fig.
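The decoding of such time coordinates can be sketched with the standard library alone. Note that production CF software (e.g. the cftime package) must also handle the non-standard calendars that CF permits, such as 360-day model calendars, which this sketch deliberately does not.

```python
from datetime import datetime, timedelta

# Decoding CF time coordinates of the form "<unit> since <reference>",
# assuming the standard (Gregorian) calendar only.

_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}

def decode_time(value, units):
    """Convert a numeric time value and a CF units string to a datetime."""
    unit, _, reference = units.partition(" since ")
    origin = datetime.fromisoformat(reference)
    return origin + timedelta(seconds=value * _SECONDS[unit])

t = decode_time(14.5, "days since 2015-07-01")  # 14.5 days after the origin
```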

Auxiliary coordinate variables store “alternative” coordinates for dimensions.

For grid axes based on a map projection, two-dimensional auxiliary coordinate variables must be used to store longitude and latitude values for each location (latitude–longitude lines dashed; grid lines solid).

A “discrete axis” is one which is not associated with any “continuous”
coordinate or auxiliary coordinate variables. A variable is continuous along
an axis if it makes physical sense to interpolate along that axis between its
values. If that is not the case, then either there are no coordinate values or
the coordinate values are discrete indices, whose order may or may not be
meaningful. Consider, for example, an ensemble of model experiments, each of
which produces a data variable

An important use of discrete axes in CF is to store data from a collection of
“discrete sampling geometries” (DSGs) in a single data variable. In a DSG,
the data have a lower dimensionality than the space–time domain, because
they apply to a point or path within the domain. For example, a collection of
time series of surface air temperature at meteorological stations can be
stored in a two-dimensional data variable (Fig.

Many DSGs may be stored in one file, but they might have different coordinates, e.g. each time series might have its own set of sampling times (different days or hours of observation), or each profile might have its own set of vertical levels (e.g. air pressure reported by radiosondes). If each feature in a large collection is stored as an individual data variable with its own dimensions and coordinates, the file will be cumbersome. If they are combined into a single data variable, a one-dimensional coordinate variable would need to contain the union of all the coordinates (times, levels, etc.) required, and the dimension of the combined data variable might be much larger than needed for the available data, containing a lot of missing data elements.

As an alternative, CF provides three other methods for storing collections of
data on DSGs, all intended to allow data with different dimensionality to be
stored in a single data variable without wasting so much space. In the
“incomplete multi-dimensional array” representation, the dimension required
for the longest feature is used for all features, so that the shorter
features must be padded with missing values; this sacrifices storage space to
achieve simplicity for reading and writing. The “contiguous ragged array”
and “indexed ragged array” representations eliminate the need for padding
and thus reduce further the storage required, but they are more complex to pack
and unpack. In the former case, each feature in the collection occupies a
contiguous block, requiring the size of each feature to be known at the time
that it is created. In the latter case, the values of each feature in the
collection are interleaved. This representation can therefore be used for
real-time data streams that contain reports from many sources, with the data
being written as they arrive. The ragged array representations are described
in more detail in Appendix
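The two ragged representations can be sketched as unpacking routines in plain Python (the data, counts, and index values below are invented for the example):

```python
# Unpacking the two CF ragged-array representations into per-feature lists.

def unpack_contiguous(data, counts):
    """Contiguous ragged array: each feature occupies a contiguous block,
    so the size of each feature must be known when the array is created."""
    features, start = [], 0
    for n in counts:
        features.append(data[start:start + n])
        start += n
    return features

def unpack_indexed(data, index, n_features):
    """Indexed ragged array: element i of `index` names the feature that
    data[i] belongs to, so values from different features may be
    interleaved (e.g. written as they arrive in a real-time stream)."""
    features = [[] for _ in range(n_features)]
    for value, i in zip(data, index):
        features[i].append(value)
    return features

# Three features of lengths 2, 1, and 3 stored contiguously:
contig = unpack_contiguous([1, 2, 3, 4, 5, 6], counts=[2, 1, 3])
# Two interleaved features stored in indexed form:
ragged = unpack_indexed([1, 4, 2, 5, 6, 3], index=[0, 1, 0, 1, 1, 0], n_features=2)
```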

Because these storage methods were introduced (in CF version 1.6) at the same time as the recognition and definition of feature types, the two are often thought of as belonging together, but this causes confusion. The featureType is metadata, and it refers to the physical construction and interpretation of a DSG data variable. The three new storage mechanisms for DSGs do not involve any new or distinct physical concepts.

It is often necessary to know the extent of a cell as well as the grid point location, e.g. to calculate the area of a latitude–longitude box or the thickness of a vertical layer. If cell bounds are not provided, then there is no default assumption about cell sizes (an application might reasonably assume that grid points are at the centres of non-overlapping cells, but that is not required by CF).

CF provides a way to attach bounds variables to any variable containing
coordinate data. A bounds variable has an extra dimension to index the
vertices of the cells. The simplest case is shown for a one-dimensional
coordinate variable in Fig.
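For the one-dimensional case, the bounds array has shape (n, 2), the trailing dimension indexing the two cell vertices. A small sketch (with invented values) shows how cell sizes, and the contiguity that CF does not assume, can be derived from the bounds:

```python
# Cell bounds for a one-dimensional coordinate variable: an extra
# trailing dimension of size 2 indexes the lower and upper vertices.
# The coordinate and bounds values below are illustrative.

coords = [1.5, 4.5, 7.5]
bounds = [[0.0, 3.0], [3.0, 6.0], [6.0, 9.0]]   # shape (3, 2)

# Cell extents follow directly from the bounds:
widths = [upper - lower for lower, upper in bounds]

# CF does not assume neighbouring cells are contiguous or
# non-overlapping; with bounds present, this can be checked explicitly:
contiguous = all(bounds[i][1] == bounds[i + 1][0] for i in range(len(bounds) - 1))
```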

Some applications require information about the size, shape, or location of
the cells that cannot be deduced without specialist knowledge which is not
guaranteed to be available. For example, in computing the mean of several
cell values, it is often appropriate to “weight” the values by area, but
for some grids (such as some types of spherical geodesic grids) the cell
perimeter is not uniquely defined by its vertices and so the area cannot be
inferred from the available information. For this case, CF provides cell
measures variables which contain such information and are encoded as netCDF
variables which are referenced by the

A one-dimensional coordinate variable with grid points

CF describes variation within cells by use of “cell methods”. By default,
it is assumed that intensive quantities apply at grid points, e.g. temperature values apply at the spatial points and instants of time specified
by their coordinates, while extensive quantities apply to the entire
grid cell, e.g. a precipitation amount (kg m

By default, the method for horizontal cells is assumed to have been evaluated over the entire area of the cell. It is, however, possible to limit consideration to only a portion of a cell, e.g. to record that values apply only to the fractions of cells which are land (as opposed to sea).
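A cell methods attribute is a string of “name: method” entries. A minimal parser for the simplest form of such strings is sketched below; real cell_methods strings can also carry qualifiers such as “where land” or parenthesised comments, which this sketch deliberately ignores.

```python
import re

# A minimal parser for simple cell_methods strings of the form
# "name1: method1 name2: method2". Qualifiers ("where ...", intervals,
# comments) are not handled by this sketch.

def parse_cell_methods(s):
    """Return a list of (axis-name, method) pairs."""
    return re.findall(r"(\w+)\s*:\s*(\w+)", s)

methods = parse_cell_methods("area: mean time: maximum")
```

Because the entries are returned in order, a parser like this preserves the sequence of operations, which, as discussed below for the cell method construct, is significant when the methods do not commute.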

A further use of cell methods is to characterise climatological
statistics where a series of data points represent sets of
subintervals which are not contiguous. There are three kinds to
consider:

corresponding portions of the annual cycle in a set of years, e.g. decadal averages for January;

corresponding portions of a range of days, e.g. the average diurnal cycle in April 1997; and

both at once, e.g. the average winter daily minimum temperature from the years 1961 to 1990.

Cell methods are encoded in the

When metadata to describe the data depend on location within the domain,
they are stored in independent variables called ancillary data variables. For
example, each value of an array of instrument data may have associated
measures of uncertainty or of the status of the recording instrument. An
ancillary data variable is encoded as a netCDF variable that is referenced by
the

A range of attributes, introduced by the netCDF user guide or by CF, is available to provide metadata for interpreting the values of individual variables or for describing the dataset as a whole. In this section, we discuss the two most important of these.

CF requires all variables with values (data variables, coordinate variables,
etc.) to have units unless they contain dimensionless numbers or cell
boundary values. The units are specified by a string attribute (e.g. lines 9
and 55 of Fig.

For systematic identification of the physical quantity contained in
variables, CF defines a “standard name” string attribute (e.g. lines 28,
54 and 62 of Fig.

The relationships between CF-netCDF elements (corresponding
to the blue CN layer in Fig.

CF also upholds the use of the “long name” defined by the netCDF user
guide, but this is ad hoc. In contrast, the CF standard names are
consistently constructed and documented. As CF is applicable to many areas of
geoscience, the standard names have to be more self-explanatory and
informative than would suffice for any one area. For instance, there is no
name for plain “potential temperature”, since we have to distinguish air
potential temperature and sea water potential temperature. Standard names are
often longer than the terms familiarly used by experts in a particular
discipline, because they answer the question, “What does this mean?”,
rather than the question, “What do you call this?”. For example, the
quantity often called “precipitable water” by meteorologists has the
standard name of atmosphere_mass_content_of_water_vapor. Standard names
have a detailed description which further defines parts of the name; for
example, the description of the standard name land_ice_calving_rate notes
that “land ice” means glaciers, ice caps, and ice sheets resting on
bedrock, and the land ice calving rate is the rate at which ice is lost per unit area
through calving into the ocean. Each standard name also implies particular
physical dimensions (mass, length, time, and other dimensions corresponding to
SI base units, expressed as a “canonical unit”); for example, large-scale
rainfall amount (canonical unit kg m

Standard names have been defined for both more general and more specific quantities, for different applications, e.g. ocean_mixed_layer_thickness and ocean_mixed_layer_thickness_defined_by_temperature. Some standard names require the existence of additional metadata and/or constraints on the values of the variables with which they are associated. For example, the standard name of downwelling_radiance_per_unit_wavelength_in_air requires there to be a coordinate variable storing the radiation wavelength.

The CF conventions use size one or scalar coordinate variables
(Sect.

The elements of the CF-netCDF conventions, a brief
description of each, and the section in which it is described in
more detail. The relationships to netCDF entities are shown in
Fig.

The nine constructs of our CF data model (corresponding to
the green layer in Fig.

The aspects of the CF conventions discussed in Sect.

The constructs of our CF data model, a brief description of
each, and the section in which it is described in more detail. The
relationships between the constructs and CF-netCDF elements are
shown in Figs.

The field construct is central to our CF data model and includes all the
other constructs (Fig.

The field construct consists of a data array and the definition of its domain
(i.e.

The field construct also has optional properties to describe aspects of the
data that are independent of the domain. These correspond to some netCDF
attributes of variables (e.g. the units, long_name, and standard_name;
Sect.

The standard_name property (Sect.

A domain axis construct (Fig.

When a collection of DSG features has been combined in a data variable using
the incomplete multi-dimensional array or ragged representations to save space, the axis
size has to be inferred, but we regard this as an aspect of unpacking the
data, rather than its conceptual description. In practice, the unpacked data
array may be dominated by missing values (as could occur, for example, if all
features in a collection of time series had no common time coordinates), in
which case it may be preferable to view the collection as if each DSG feature
were a separate variable (Sect.

The relationship between domain axis, dimension coordinate,
and auxiliary coordinate constructs (Sect.

Coordinate constructs (Fig.

In both cases, the coordinate construct consists of a data array of the
coordinate values which spans a subset of the domain axis constructs, an
optional array of cell bounds recording the extents of each cell, and
properties to describe the coordinates (in the same sense as for the field
construct). An array of cell bounds spans the same domain axes as its
coordinate array, with the addition of an extra dimension whose size is that
of the number of vertices of each cell. This extra dimension does not
correspond to a domain axis construct since it does not relate to an
independent axis of the domain (for example, the

The dimension coordinate construct is able to unambiguously describe cell locations because a domain axis can be associated with at most one dimension coordinate construct, whose data array values must all be non-missing and strictly monotonically increasing or decreasing. They must also all be of the same numeric data type. If cell bounds are provided, then each cell must have exactly two vertices. CF-netCDF coordinate variables and numeric scalar coordinate variables correspond to dimension coordinate constructs.
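The validity conditions on a dimension coordinate construct's data array (no missing values; strictly monotonically increasing or decreasing) can be expressed as a simple check. This is an illustrative sketch, not cf-python's implementation:

```python
# A dimension coordinate construct requires coordinate values that are
# all non-missing and strictly monotonically increasing or decreasing.

def is_valid_dimension_coordinate(values):
    """Check the dimension coordinate validity conditions, with None
    standing in for a missing value."""
    if any(v is None for v in values):
        return False                       # missing values are not allowed
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing

ok = is_valid_dimension_coordinate([10.0, 20.0, 30.0])
bad = is_valid_dimension_coordinate([10.0, 10.0, 30.0])  # not strictly monotonic
```

Coordinate values failing this check (for example, the repeated value above) must instead be held in an auxiliary coordinate construct, as described next.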

Auxiliary coordinate constructs have to be used, instead of dimension coordinate constructs, when a single domain axis requires more than one set of coordinate values; when coordinate values are not numeric, are not strictly monotonic, or contain missing values; or when they vary along more than one domain axis construct simultaneously. CF-netCDF auxiliary coordinate variables and non-numeric scalar coordinate variables correspond to auxiliary coordinate constructs.

If a domain axis construct does not correspond to a continuous physical quantity, then it is not necessary for it to be associated with a dimension coordinate construct. For example, this is the case for an axis that runs over ocean basins or area types, or for a domain axis that indexes a time series at scattered points. In such cases, one-dimensional auxiliary coordinate constructs could be used to store coordinate values. These axes are discrete axes in CF-netCDF.

The domain may contain various coordinate systems, each of which is
constructed from a subset of the dimension and auxiliary coordinate
constructs. For example, the domain of a four-dimensional field construct may
contain horizontal (

A coordinate system of the field construct can be explicitly defined by a
coordinate reference construct (Fig.

The dimension coordinate and auxiliary coordinate constructs that define the coordinate system to which the coordinate reference construct applies. Note that the coordinate values are not relevant to the coordinate reference construct, only their properties.

A definition of a datum specifying the zeroes of the dimension and auxiliary coordinate constructs which define the coordinate system. The datum may be explicitly indicated via properties, or it may be implied by the metadata of the contained dimension and auxiliary coordinate constructs. Note that the datum may contain the definition of a geophysical surface which corresponds to the zero of a vertical coordinate construct, and this may be required for both horizontal and vertical coordinate systems.

A coordinate conversion, which defines a formula for converting coordinate values taken from the dimension or auxiliary coordinate constructs to a different coordinate system. A term of the conversion formula can be a scalar or vector parameter which does not depend on any domain axis constructs, may have units (such as a reference pressure value), or may be a descriptive string (such as the projection name “mercator”), or it can be a domain ancillary construct (such as one containing spatially varying orography data).

For

In some cases, the datum is not required as it is already described by the
dimension and auxiliary coordinate constructs. This is the case in CF for the
two-dimensional geographical latitude–longitude coordinate system based upon
a spherical Earth, which is assumed to have a datum at 0

In CF-netCDF, coordinate system information that is not found in coordinate or auxiliary coordinate variables is stored in a grid mapping variable or the formula_terms attribute of a coordinate variable, for horizontal or vertical coordinate variables, respectively. Although these two cases are arranged differently in CF-netCDF, each one contains, sometimes implicitly, a datum or a coordinate conversion formula (or both) and so may be mapped to a coordinate reference construct. A grid mapping name or the standard name of a parametric vertical coordinate corresponds to a string-valued scalar parameter of a coordinate conversion formula. A grid mapping parameter which has more than one value (as is possible with the “standard parallel” attribute) corresponds to a vector parameter of a coordinate conversion formula. A data variable referenced by a formula_terms attribute corresponds to the term of a coordinate conversion formula – either a domain ancillary construct or, if it is zero-dimensional, a scalar parameter.
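The formula_terms attribute pairs each term of a parametric vertical coordinate formula with the name of the netCDF variable holding its values, e.g. "sigma: lev ps: PS ptop: PTOP" (the variable names here are illustrative). A minimal sketch of decoding that mapping:

```python
# Parse a CF formula_terms attribute string, assuming the documented
# "term: variable" layout with whitespace between tokens.

def parse_formula_terms(s):
    """Return a mapping from formula term name to netCDF variable name."""
    tokens = s.split()
    # Tokens come in "term:" / "variable" pairs.
    return {tokens[i].rstrip(":"): tokens[i + 1] for i in range(0, len(tokens), 2)}

terms = parse_formula_terms("sigma: lev ps: PS ptop: PTOP")
```

Each named variable then corresponds, in data model terms, to a domain ancillary construct or, if zero-dimensional, to a scalar parameter of the coordinate conversion.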

A domain ancillary construct (Fig.

It also contains an optional array of cell bounds recording the extents of each cell (only applicable if the array contains coordinate data) and properties to describe the data (in the same sense as for the field construct). An array of cell bounds spans the same domain axes as the data array, with the addition of an extra dimension whose size is that of the number of vertices of each cell.

The relationship between coordinate reference and domain
ancillary constructs (Sect.

CF-netCDF variables named by the formula_terms attribute of a CF-netCDF coordinate variable correspond to domain ancillary constructs. These CF-netCDF variables may be coordinate, scalar coordinate, or auxiliary coordinate variables, or they may be data variables. For example, in a coordinate conversion for converting between ocean sigma and height coordinate systems, the value of the “depth” term for horizontally varying distance from ocean datum to sea floor would correspond to a domain ancillary construct. In the case of a named term being a type of coordinate variable, that variable will correspond to an independent domain ancillary construct in addition to the coordinate construct.

A cell measure (Fig.

The cell measure construct consists of a numeric array of the metric
data which span a subset of the domain axis constructs, and
properties to describe the data (in the same sense as for the field
construct). The properties must contain a “measure” property, which
indicates which metric of the space it supplies, e.g. cell horizontal
areas, and a units property consistent with the measure property,
e.g. m

The field ancillary construct (Fig.

The field ancillary construct consists of an array of the ancillary data, which is zero-dimensional or which depends on one or more of the domain axes, and properties to describe the data (in the same sense as for the field construct). It is assumed that the data do not depend on axes of the domain which are not spanned by the array, along which the values are implicitly propagated. CF-netCDF ancillary data variables correspond to field ancillary constructs. Note that a field ancillary construct is constrained by the domain definition of the parent field construct but does not contribute to the domain's definition, unlike, for instance, an auxiliary coordinate construct or domain ancillary construct.

The cell method constructs (Fig.

The field construct may contain an ordered sequence of cell method constructs describing multiple processes which have been applied to the data, e.g. a temporal maximum of the areal mean has two components – a mean and a maximum – each acting over different sets of axes. It is an ordered sequence because the methods specified are not necessarily commutative. There are properties to indicate climatological time processing, e.g. multiannual means of monthly maxima, in which case multiple cell method constructs need to be considered together to define a special interpretation of boundary coordinate array values. The cell_methods attribute of a CF-netCDF data variable corresponds to one or more cell method constructs.

The axes over which a cell method applies are either a subset of the domain axis constructs or a collection of strings which identify axes that are not part of the domain. The latter case is particularly useful when the coordinate range for an axis cannot be precisely defined, making it impossible to define a domain axis construct. For example, a climatological time mean might be based on data which are not available over the same time periods at every horizontal location – useful information can still be conveyed by recording the fact the data have been temporally averaged without specifying the range of times. The strings which identify such axes are well defined in that they must be standard names (e.g. time, longitude) or the special string “area”, indicating a combination of horizontal axes.

A data model does not exist on its own, and those exploiting it will need to
interpret it in the context of other data models with which they already
work, whether they are implicit or explicit (Sect.

Readers who are not familiar with other data models may wish to omit this
section on a first reading, as it is not required to understand the CF
conventions, the CF data model presented here, nor the software
implementation of Sect.

ISO 19123

Key concepts within the ISO 19123 view of coverages and the
associated coordinate systems:

An ISO 191xx coverage may be viewed as a function whose inputs are
spatiotemporal positions (the “domain”) which are related to outputs
comprising values of one or more geographical features (the “range” of the
coverage). Thus, a coverage is notionally a function over a domain which has a
range of values. Within ISO 19123, two types of coverage are defined: discrete
and continuous. The former is the most relevant here (see
Fig.

For point data, a discrete coverage is nearly identical to a CF field
construct (

Discrete coverages can themselves be further specialised according to how they sample the domain: with sets of points, a grid of points, or sets of curves, surfaces, or solids.

Of these, the most important (in terms of our CF data model) is the
DiscreteGridPointCoverage (Fig.

There is a clear correspondence between the CF dimension coordinate construct and an ISO coordinate reference system as used in a rectified grid, and between the underlying concepts of ISO parametric coordinate reference systems within a referenceable grid and a CF domain described using CF auxiliary coordinate constructs. Together, these correspondences support the identification of an ISO grid (which itself carries little information apart from a name and a list of axes) with the abstract notion of a CF domain described by CF coordinate reference constructs.

Even with an ISO rectified grid, which has the easiest correspondence with CF, there are subtle but important differences in the treatment of coordinates, probably the most important of which is that the CF equivalent of the ISO datum is often held in the standard name of the coordinate construct. For example, a CF coordinate construct with a standard name of height means the coordinate is with reference to the surface, i.e. the bottom of the atmosphere (distinct from other valid vertical coordinates such as height_above_reference_ellipsoid and height_above_sea_floor). These coordinates are all distinct geophysical quantities, with vertical datums of the surface, the reference ellipsoid, and the sea floor, respectively, though they all have the same canonical unit of measure (metres) and direction (values increase for locations further above the datum).

Where a more precise specification of the datum may be needed (for example, the figure of the reference ellipsoid or the reference point for a latitude–longitude coordinate system where it is not the default of the intersection of the Equator and the Greenwich meridian), it can be supplied by the coordinate reference construct, not the standard name. This CF separation of grid mapping datum from coordinates adds value because changing the datum does not alter the geophysical nature of the coordinate and its interpretation. The partitioning is suitable and convenient for many purposes of data analysis, in which coordinate constructs are processed independently, without the need for awareness of a full ISO coordinate reference system (CRS). It arises from the generality of CF, in which a wider variety of spatiotemporal coordinates is used than in geographic information systems (GIS), non-spatiotemporal coordinates are also needed, and the data often come from idealised worlds (such as climate models).

The relationship between ISO grid cells and footprints, and CF cells described by cell bounds.

Whether grid cells are described directly or implied via referenceable or
rectified grids, it is important to note that in the ISO world the cells lie
between the edges laid out by the coordinates, whereas the notion
of a cell in CF – defined by the cell measure construct
(Sect.

It might appear that some of the more complex geometries underlying CF fields
which are expressed on domains sampled using the CF DSG features would best be
mapped onto other specialisations of DiscreteCoverages – this is the
approach taken by

ISO 19156 Observations and Measurements

The Open Geospatial Consortium standard introduced above

It is not the complete CF version 1.6 (for example, it does not appear to include ancillary data variables).

Their model makes some elements of CF mandatory, in order to facilitate interoperability with the ISO 19123 coverage model, which is their target.

It is tied to the netCDF format.

It is constructed in order to map as closely as possible onto the ISO 19123 coverage model but without being faithful to CF; so, for example, it introduces the notion of a CF coordinate system including a notional HorizontalCRS, which is independent of explicitly identified horizontal and vertical coordinates (their Fig. 4). By contrast, we have only introduced new concepts as abstractions where they help interpret and use CF itself (again, for example, in our case the domain and abstract coordinate).

The Unidata Common Data Model (CDM;

a data access layer, which handles data reading and writing,
and merges the netCDF enhanced, OPeNDAP (Open-source Project for a
Network Data Access Protocol,

a coordinate system layer, which handles the coordinates of data arrays;

a feature type layer, which handles similar notions to those we express with CF field constructs and the CF sampling feature types; and

a mature Java-based implementation which reads, manipulates, and writes the CDM sampling features.

Key characteristics of the Unidata Common Data Model. There is a wider variety of fundamental data types than is supported by the netCDF classic data model; and the coordinate system includes the option of coordinate axes of specific types for use in the feature types, which limits the flexibility of the CDM data model.

The CDM data access layer has a broader scope than ours (being about more than
just netCDF). If we consider that most of the CF standard as expressed in our
data model is about handling coordinates, cells, and domains, then our CF
data model corresponds to the CDM coordinate system layer, with the various CF
feature types of our data model's field constructs corresponding to the CDM
feature type layer. The cf-python software (Sect.

The CDM data access layer handles more data types (Fig.

Within the coordinate system layer, there is much closer correspondence
between the CDM and our CF data model.

A CF dimension or auxiliary coordinate construct maps to a CDM CoordinateAxis.

The datum and coordinate conversion components of a CF coordinate reference construct are components of a CDM CoordinateTransform.

A CF coordinate reference construct maps to a CDM CoordinateSystem.

The relationship between features, feature types, and
feature collections in the Unidata Common Data Model. Only the
simplest feature types are shown; more complicated features are
built from these basic elements

Reading a file using cf-python.

A CDM CoordinateAxis may be subtyped into axes which can be specifically
exploited by the sampling feature types in the CDM feature type layer, where
there are significant differences from our CF data model. The CDM feature
type implementation is discussed in

A detailed inspection of a field object's metadata.

A key use of our data model is to enable the creation of wholly CF-compliant
software, i.e. software that can represent and manipulate any CF-compliant
dataset. Such software corresponds to one of the application boxes in
Fig.

cf-python implements the data model constructs and their relationships
exactly as shown in Figs.

create, delete, and modify a field object's data and metadata;

select and subspace field objects according to their metadata;

perform arithmetic, comparison, and other mathematical operations involving field objects;

collapse axes by statistical operations;

perform operations with date–time data;

regrid fields to new domains using the Earth System Modeling
Framework high-performance software infrastructure

visualise field objects by interfacing with the cf-plot
Python package, which is also open source and freely available
at

All of these operations are “metadata aware”, which means that
parameters needed for an operation need not be fully specified by the
user, provided that the field objects have sufficient metadata from
which to infer the parameters unambiguously. This is greatly
facilitated by having a data model, because all standardised metadata
are stored in a fully defined manner. In practice, a field object's
metadata may be incomplete, in which case the user should use the
cf-python API to supplement the metadata. For example, the cf-python
command

The cf-python class instances which correspond to CF data model constructs.

How cf-python implements the data model may be seen by using the library to
read the CF-netCDF file described in Fig.

one field object;

four domain axis objects and their sizes, including one which is implied by the “time” CF-netCDF scalar coordinate variable;

one cell method object indicating that each data array value is a time average constructed from daily samples;

one field ancillary object describing the uncertainty of the data array values;

four dimension coordinate objects, each one spanning a unique domain axis object;

two multi-dimensional auxiliary coordinate objects for true latitude and true longitude coordinates (as required by the CF conventions when the horizontal dimension coordinates are not canonical geographical latitudes and longitudes);

three domain ancillary objects utilised by the coordinate reference objects;

two coordinate reference objects: a vertical, atmosphere sigma coordinate system which references the domain ancillary objects and the vertical dimension coordinate object, and a horizontal Lambert conformal conic coordinate system which references the horizontal auxiliary and dimension coordinate objects; and

one cell measure object containing horizontal cell areas.

The one-to-one correspondence between the data model and cf-python's
interpretation of CF may also be demonstrated by inspecting the objects from
which field object

Whilst field object

Examples of cf-python field objects of medium and minimal complexity.

Any variable in a CF-netCDF file can always be viewed as a data variable in
addition to any metadata role it may have, simply by choosing to ignore any
other variables that may reference it. For example, a variable that is named
by the “coordinates” attribute of a data variable is always an auxiliary
coordinate variable (Sect.

When a CF-netCDF file is read, a decision must be taken as to which variables
are the data variables. By default, cf-python assumes that only unreferenced
variables are data variables that instantiate field objects (variables

An interesting situation arises if a netCDF file contains only CF-netCDF
coordinate variables and their associated dimensions. It may be natural to
assume that these coordinate variables define a single domain, but without
the explicit links provided by a data variable the existence of a domain
cannot be assumed by the software. These coordinate variables are not referenced
by a data variable but they are explicitly defined as “coordinates”. When
reading such a file, cf-python by default creates no field objects (because
only coordinates have been defined), and also no dimension coordinate objects
(as dimension coordinate constructs can only exist within a field construct;
see Sect.

A number of new features have recently been introduced in version 1.7 of the
CF conventions, published in September 2017, and there is no doubt that the
CF conventions will continue to evolve and meet the needs of the scientific
community for representing more types of data. Any CF data model will
therefore have to adapt to future enhancements. This is why our CF data model
was designed with a minimal number of simple constructs in mind
(Sect.

A CF data model could guide the development of CF by providing a framework for ensuring that proposed changes fit into CF in a logical, rather than just a pragmatic, way. A proposed enhancement would be assessed to see if its new features map onto the existing data model. If they do, then the enhancement may be incorporated in the CF conventions with no change to the existing data model.

As an example, it is instructive to consider one of the enhancements accepted
into CF at version 1.7, namely the ability to store a cell measure variable
(Sect.

If a change does not map onto the existing data model and cannot be modified to do so, the data model will need to be modified to accommodate the new features. This modification will be either backwards compatible or backwards incompatible. The former (preferable) case occurs if the data model may be extended or generalised in some way that allows the new features but does not affect its existing constructs and relationships. The latter case occurs if a change is required that would affect the interpretation of existing datasets and the design of software built around the data model. Much effort has already been put into avoiding backward incompatibilities in CF, so that older datasets are still parsable by newer software, but doing so is not a rule but a “best practice” that could be overridden if the community consensus were that the benefits in doing so outweighed any inconvenience.

It is the authors' intention to ensure that the cf-python software library is kept up to date with the latest version of CF conventions and the CF data model presented here. To facilitate this, work is underway to create a reference implementation of this data model, which in essence will be like cf-python but without any of its higher-level functionality (such as regridding methods). This reference implementation will then be imported back into cf-python to provide not only its data model representation but also a “read” method for mapping datasets onto field constructs and a “write” method for mapping field constructs to new netCDF files. The reference implementation will be easier to understand and maintain than cf-python, could be used by other Python packages for manipulating CF datasets, and also has the potential to be used as a test bed for proposed new features of CF. This reference implementation will be open source (like cf-python), so any interested parties may contribute to its maintenance and development.

In this paper, we have presented a formal data model for the CF conventions, identifying the fundamental elements of CF and showing how they relate to each other. We have described the CF conventions in terms of their relationship to the physical world (real or simulated) and in terms of their netCDF encoding, and these steps led to our identifying the elements which contribute to a CF data model. The CF conventions themselves have been influenced by their netCDF encoding, and therefore our CF data model is indirectly influenced by netCDF, although it aims to be independent of the encoding. We have discussed the relationships of our CF data model to other data models which address the problem of storing data and metadata, and we have presented a software implementation of this CF data model capable of manipulating any CF-compliant dataset. We have described possible ways in which this CF data model and the cf-python software library may evolve over time.

It is important to note that our CF data model is a description of what CF
is, rather than what it ought to be, either in our opinion or anyone else's.
We believe that there is little doubt that a CF data model is of considerable
value, and this has been recognised by the CF community, which highlights that a
CF data model will aid future developments in the CF conventions and make it
easier to create CF-compliant software
(

Creating an explicit data model before the CF conventions were written would arguably have been preferable. A data model created a priori increases the likelihood that the problem space (i.e. storing and manipulating data and metadata) is fully spanned and encourages coherent implementations, which could be file storage syntaxes or software codes, the latter being a stated goal of CF. For example, in CF-netCDF, horizontal and vertical coordinate reference systems are described with very different structures – the grid mapping variable and formula_terms attribute, respectively – a situation that would likely not have occurred if a comprehensive CF data model already existed. Writing a CF data model a posteriori clearly cannot bring about all of these benefits, as the coverage of the problem space and file storage syntax is a given, but it can still be of use to software implementations and future developments in the conventions.

We believe that the data model proposed here is a complete and correct description of CF, because we have yet to find a case for which our implementation in the cf-python library fails to represent or misrepresents a CF-compliant dataset. Moreover, the development of cf-python proves that it is possible to implement our CF data model. We consider that our CF data model is simpler and more flexible than other such models, because it defines a small number of general constructs rather than many specialised ones. While the latter approach is closer to an object-orientated software implementation, our aim is to describe CF in a way which is independent of any software.

If this CF data model were to be accepted by the community as a formal part of the CF conventions, then any future enhancements would have to be incorporated into the data model as part of the public discussion that leads to the acceptance of every enhancement. Version 1.7 of the CF conventions has recently been published and it is the authors' intention to review all of the new features for compatibility with this CF data model. As these enhancements have already been finalised, any conflict will necessarily force a change in the data model. Once up to date with version 1.7, the data model may then be considered in parallel with the discussions on enhancements for subsequent releases. Structural differences between different versions of the CF conventions would be plain to see if each release contains a data model, thus making it easier to write software that can cope with any backward incompatibilities that may have been introduced.

The code of cf-python is open source and freely
downloadable at

Throughout this paper, we rely on UML to construct diagrams that define the key relationships of the entities described in CF-netCDF files and in our data model. These diagrams show relationships between “classes” like those used in an object-orientated programming language or like data types in Fortran. The relationship of an “instance of a class” to its class is like that of a particular variable to its data type. A class is like a species of animal, and an instance of a class is like an individual animal. Classes can be included in other classes, just as components are included in the definitions of derived data types in Fortran, and organs make up the body of an animal.

For reference in interpreting our UML diagrams, we describe the subset of UML
used here. As depicted in Table

The UML diagram elements relating to netCDF (Sect.

A worked example demonstrating the subset of UML used in
this paper (see Table

An important goal of the CF conventions is that datasets should be efficient
to create, store, and subsequently read, where efficiency is a measure of the
time taken for software to carry out a task or the amount of computer data
storage required for a file. The conventions describe various techniques for
optimising these requirements. A CF-netCDF file that uses any of the
optimisation techniques can always be recast without them and still contain
exactly the same scientific information; therefore, the optimisation
mechanisms do not affect the data model described in
Sect.

These parts of the CF convention were devised because the netCDF
classic file format and API do not
offer any methods for compression. However, the netCDF-4 API supports
lossless compression of variables stored in files

Storage space in netCDF files may be reduced by packing, i.e. by altering
the data in a way that reduces their precision. Lossy compression may be
essential for the archiving of the huge volumes of data produced by modern
high-resolution models
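The packing mechanism can be sketched with NumPy. The sketch below (with hypothetical data values) follows the CF-netCDF convention, in which a reader recovers the unpacked values as `packed * scale_factor + add_offset`; some precision is lost in the rounding to integers.

```python
import numpy as np

# Pack 64-bit floats into 16-bit integers using the CF-netCDF
# scale_factor/add_offset convention: unpacked = packed*scale + offset.
data = np.array([285.13, 290.42, 301.76])            # e.g. temperatures in K

scale_factor = (data.max() - data.min()) / (2**16 - 1)
add_offset = data.min() + (2**16 // 2) * scale_factor

packed = np.round((data - add_offset) / scale_factor).astype(np.int16)
unpacked = packed * scale_factor + add_offset        # what a reader computes

# Lossy: values are recovered only to within the packing precision.
assert np.allclose(unpacked, data, atol=scale_factor)
```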

As well as external methods of compression applied to the file, CF supports saving space by identifying and omitting unwanted missing data. Such compression techniques store the data more efficiently and result in no loss of precision.

Compression by gathering combines axes of a multi-dimensional array into a
new, discrete axis (the “list” dimension) whilst omitting the missing
values, thus reducing the number of values that need to be stored. The
information needed to uncompress the data is stored in a separate variable
(the “list” variable) that contains the indices, within the uncompressed
array, of the values that have been retained. A list variable is encoded as
a coordinate variable that has a

A collection of DSG features may be stored using
the contiguous or indexed ragged array representation, which minimises the
amount of file storage required (Sect.

In the contiguous case, each feature in the collection occupies a contiguous
block, so this representation can be used only if the size of each feature is
known at the time that the file is created. It requires a “count” variable
that gives the size of each block and is encoded as a netCDF variable with a
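The contiguous ragged representation can be sketched in plain Python (a hypothetical illustration with made-up values): the count variable gives the length of each feature's block, from which the block boundaries follow by accumulation.

```python
# Hypothetical sketch of the contiguous ragged array representation:
# each feature occupies a contiguous block of the sample dimension,
# and the "count" variable records the size of each block.
sample = [10.0, 11.0, 12.0,  20.0,  30.0, 31.0]  # three features, end to end
count = [3, 1, 2]                                # the "count" variable

features, start = [], 0
for n in count:
    features.append(sample[start:start + n])     # one contiguous block each
    start += n

assert features == [[10.0, 11.0, 12.0], [20.0], [30.0, 31.0]]
```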

For indexed ragged arrays, the values of each feature in the collection are
interleaved along the sample dimension. The canonical use case for this
representation is the storage of real-time data streams that contain reports
from many sources; the data can be written as they arrive. It requires an
“index” variable that specifies the feature that each element of the sample
dimension belongs to and is encoded as a netCDF variable with an
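The indexed representation interleaves the features, and can be sketched similarly (a hypothetical illustration with made-up values): the index variable records, for each element of the sample dimension, the feature to which it belongs, so data can be appended in arrival order and regrouped later.

```python
# Hypothetical sketch of the indexed ragged array representation: reports
# from several features arrive interleaved, and the "index" variable says
# which feature each element of the sample dimension belongs to.
sample = [10.0, 20.0, 11.0, 30.0, 21.0, 12.0]    # written in arrival order
index = [0, 1, 0, 2, 1, 0]                       # the "index" variable

n_features = max(index) + 1
features = [[] for _ in range(n_features)]
for value, feature_id in zip(sample, index):
    features[feature_id].append(value)           # regroup by feature

assert features == [[10.0, 11.0, 12.0], [20.0, 21.0], [30.0]]
```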

It is also possible to combine contiguous and indexed ragged array representations, which is useful for cases such as writing real-time data streams that contain vertical profiles from many trajectories, arriving randomly, with the data for each entire profile written all at once.

The authors declare that they have no conflict of interest.

We would like to thank Mark Hedley, Antonio Cofiño, Martin Juckes, Alison Pamment, and Paulo Ceppi for comments that greatly improved the manuscript. We are also indebted to members of the CF community, whose considerable efforts ensure the continuing success of the CF conventions – in particular, those who took part in the data model discussions that took place on the CF mailing list.

The research leading to these results has received funding from the core budget of the UK National Centre for Atmospheric Science, the European Research Council, and the European Commission's Seventh Framework programme (from ERC project “Seachange”, number 247220; and FW7 project “IS-ENES2”, number 312979). Work by Karl E. Taylor was performed under the auspices of the US Department of Energy (USDOE) by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 with support from the Regional and Global Climate Modeling Program of the USDOE's Office of Science. Edited by: Steve Easterbrook Reviewed by: Venkatramani Balaji and Brian Eaton