The CMIP6 Data Request (version 01.00.31)

The data request of the Coupled Model Intercomparison Project Phase 6 (CMIP6) defines all the quantities from CMIP6 simulations that should be archived. This includes both quantities of general interest needed from most of the CMIP6-endorsed Model Intercomparison Projects (MIPs) and quantities that are more specialised and only of interest to a single endorsed MIP. The complexity of the data request has increased from the early days of model intercomparisons, as has the data volume. In contrast with CMIP5, CMIP6 requires distinct sets of highly tailored variables to be saved from each of the more than 200 experiments. This places new demands on the data request information base and leads to a new requirement for development of software that facilitates automated interrogation of the request and retrieval of its technical specifications. The building blocks and structure of the CMIP6 Data Request (DREQ), which have been constructed to meet these challenges, are described in this paper.

1 Introduction
Phase 6 of the Coupled Model Intercomparison Project (CMIP6) seeks to improve understanding of climate and climate change by encouraging climate research centres to perform a series of coordinated climate model experiments that produce a standardized set of output. Twenty-three independently led Model Intercomparison Projects (MIPs) have designed the experiments and have been endorsed for inclusion in CMIP6 (Eyring et al., 2016). An essential requirement of CMIP6 is that the thousands of diagnostics generated at each centre from hundreds of simulations should be produced and documented in a consistent manner to facilitate meaningful comparisons across models. Hence, for each experiment the MIPs have requested specific output to be archived and shared via the Earth System Grid Federation (ESGF), and the CMIP6 organizers have imposed requirements on file format and metadata.
The resulting collection of output variables (usually in a gridded form covering the globe and evolving in time) and the associated temporal and/or spatial constraints on them are referred to as the CMIP6 Data Request (DREQ). The modelling centres participating in CMIP6 are now archiving the requested model output and making it available for analysis. The DREQ definitions. The resulting variable definitions were subsequently aggregated into a consolidated structured document, which constitutes the DREQ and is the focus of this paper.
The challenge of the process arises from the scale and diversity of the subject matter. The 23 participating MIPs are all international consortia, some of them organized many years ago, others formed specifically for the CMIP6 exercise. The syntax of the technical requirements relies largely on the NetCDF Climate and Forecast Metadata (CF) Conventions 2 and builds on long-standing CMIP practice, but there were also new aspects of the technical requirements which developed dynamically over the planning stages for CMIP6 (see Balaji et al., 2018) as part of a new CMIP6 endorsement process.
Evolving requirements added complexity to the design and implementation of the DREQ. These requirements arose through interactions between the data request, the MIPs, the committees governing CMIP6, and other elements of the infrastructure described in Balaji et al. (2018), many of which were themselves evolving in response to the growth of CMIP6. These other activities and the linkages both supported and constrained the DREQ itself.

The Challenge of Scientific Complexity
The sophistication of climate models continues to increase (e.g. Hayhoe et al., 2017), driven by pressing societal challenges (Rockström et al., 2016). With the expanded scope of the intercomparison, and with the steadily increasing complexity of the Earth System Models, CMIP6 posed new challenges for the data request. Here we illustrate some of that complexity by considering the cryosphere, as depicted in Figure 1, and then consider how this sort of complexity plays out over the data request.
The models, and hence the variables described in the DREQ, distinguish between land ice formed on land from the consolidation of snow, and sea ice formed at sea by the freezing of sea water. They have different properties, both at the microscopic scale (land ice generally contains trapped air bubbles) and at the macroscale (sea ice is typically up to a few metres thick, land ice is often hundreds of metres thick). A few of the details shown in the figure are represented for the first time, or better represented, in some CMIP6 models. These include the representation of sea water extending under floating ice shelves, more detailed representation of snow on ice (with different model representations of snow on sea ice versus snow on land ice), more detailed representation of snow and other frozen precipitation, and both the representation of melt pools on sea ice and potential ice covering of those melt pools.
In the atmosphere, snow is made up of ice crystals and it is standard usage to consider "snow" as part of the atmospheric ice content. On the land surface, however, a snow-covered surface is generally understood to be distinct from an ice-covered surface. Hence, at the surface we have parameters for heat fluxes from snow to ice and rates of conversion from snow to ice (i.e. a mass flux from snow to ice). This distinction may sound obvious, but this subtle shift in the relationship between "snow" and "ice" occurring when the snow lands on the ground or on surface ice can cause confusion in technical terms.
In the CMIP5 climate simulations the boundary between land and sea was clearly defined and fixed in time, but, in at least some models, the CMIP6 ensemble introduces more complexity. For the first time, some models have a realistic simulation of floating ice shelves. These deep layers of ice form on land, but flow to cover large areas of ocean such as the Weddell Sea.
G5 contains a mix of ocean, sea ice and land. As models can now represent sea water extending under the ice sheet (A to A ) there will be a difference between the grounding line (A) and the boundary at the ocean-atmosphere interface (B). In CMIP6 diagnostics, the land surface is taken to extend to B, so that diagnostics such as the surface radiation balance are treated consistently across the ice sheet surface.
The extent of the ice shelves can also, in a small number of experiments and models, vary in time. This introduces a range of possible interpretations for the boundary between land and sea: the leading edge of the ice shelf, the grounding line underneath the ice shelf, or perhaps the line where mean-sea-level intersects the surface under the ice.
In the context of CMIP6, the earth surface modelling is mainly motivated by a desire to represent energy and material cycles that affect the climate. For these purposes it generally makes sense to ignore these distinctions between grounded ice sheets, floating ice shelves, and bare land masses. Hence, for the data request, most surface land diagnostics are expected to extend over all land ice, including floating ice shelves. However, for a range of specialist diagnostics requested by IS-  Table B1 for full names and citations for each endorsed MIP), there are more specific area types defined: e.g., grounded_ice_sheet and floating_ice_shelf.
The complexities that we see in the cryosphere apply right across the domain simulated by CMIP6. shrubs, trees, grass). Further, plant respiration is broken down into contributions from roots, stems and leaves. There are also a number of diagnostics associated with the carbon isotopes ¹³C and ¹⁴C.
Alongside the multiplicity of variables is a multiplicity of potential applications, not all of which require the highest possible output frequency. This is fortunate, as it would be completely infeasible to archive all variables at high frequency. However, it leads to the requirement of identifying, and specifying, output frequency requirements. In some cases output frequency can be reduced by carrying out processing within the simulation, so only condensed diagnostics are needed, and in others, snapshots are all that is required. In all cases, the output frequency is related to potential application objectives.
3 Animals digesting plant matter

General Approach
The DREQ is designed to support a wide range of users belonging to four broad categories: the MIP science teams, modelling centres (data providers), infrastructure providers, and data users.

The MIPs contributing to CMIP6 provide input into the DREQ but also use it to coordinate their requirements with other MIPs and to obtain quantitative estimates of the data volumes associated with their planned work.
The modelling centres have two independent uses of the DREQ: first as a planning tool and second as a specification for the generation of data. When used as a planning tool, it allows exploration of the consequences of various levels of commitment in terms of data volumes and numbers of variables. When a centre has begun generating data, the DREQ provides the specifications for each variable.
The main infrastructure providers who depend on the DREQ are the developers of the Climate Model Output Re-writer (CMOR) package, 4 those developing quality control software, and those doing planning for the Earth System Grid Federation (ESGF) data delivery services. 5 The relationship with the CMOR team is especially important as the DREQ and CMOR intersect in supporting the metadata specifications for CMIP6 output.

Users are mainly expected to use portal search interfaces (e.g. the ESGF search interface) to locate existing CMIP6 data, but, especially in early stages, may also rely on the DREQ to determine what data may eventually be found there.

Generic Requirements
The timetable for generation of the DREQ did not allow for a formal specification of technical requirements. The following list sets out the high level requirements that emerged from a range of informal discussions:

(a) Provide feedback to MIPs on feasibility of data requests, especially regarding estimated data volumes;
(b) Provide precise definitions and fully specified technical metadata for each parameter requested;
(c) Provide a programmable interface that supports automated processing of the DREQ;
(d) Support synergies between MIPs, maximising the re-use of specifications and of data.
Item (a) is extremely important because attempting to store all variables at high frequency for all experiments would be impractical, resulting in unmanageable data volumes. Data volume estimates provided through the DREQ can only be indicative because the actual volumes will be influenced by many choices taken by modelling groups during the implementation of the request, but these estimates have nevertheless provided a useful guide for resource planning. CMIP gains immense impact from the synergies of the many science teams working on overlapping science problems. Many of these requirements were already recognized in CMIP5; the major advance in CMIP6 was the ability to tailor data needs to each individual experiment and its scientific goals, and the introduction of a programmable interface supporting automated processing of the DREQ.
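To illustrate the kind of indicative volume estimate referred to in item (a), the following minimal sketch multiplies the grid size, number of levels, number of time samples and bytes per value. The function name, the grid dimensions and the assumption of uncompressed 4-byte values are hypothetical choices made for illustration; they are not taken from the DREQ software.

```python
def estimate_volume_bytes(nlon, nlat, nlev, samples_per_year, years, bytes_per_value=4):
    """Indicative uncompressed size of one variable from one simulation."""
    return nlon * nlat * nlev * samples_per_year * years * bytes_per_value

# Hypothetical 1-degree grid and a 165-year simulation (assumed values).
monthly_3d = estimate_volume_bytes(360, 180, 19, 12, 165)
three_hourly_2d = estimate_volume_bytes(360, 180, 1, 8 * 365, 165)

print(f"monthly 3-D field on 19 levels: {monthly_3d / 1e9:.1f} GB")
print(f"3-hourly 2-D field:             {three_hourly_2d / 1e9:.1f} GB")
```

Even in this simplified form the calculation shows why output frequency dominates the estimate: under these assumed dimensions, a single-level field saved every three hours is far larger than a monthly field on 19 levels.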

The intent of the DREQ is to provide all the information needed for a modelling group to archive variables of interest for subsequent analysis. In doing so, it must support the CMIP ethos of both facilitating intercomparison of an inclusive range of models and addressing significant new areas of climate science. It must also facilitate contributions from both well established and new participants.
In order to achieve this, CMIP6, following practice of earlier CMIP phases, allows participating institutions to be selective about the range of experiments they conduct and the diagnostics that they generate. This is facilitated by experiments defining various levels of priority for the variables requested. Hence, although the DREQ specifies all the variables requested for each experiment and ensures coherence in the data archive, it also allows some flexibility. Table 2 shows the choices available to data providers that determine the scope of their contribution to the archive. Despite the flexibility, there is a minimum requirement: when a modelling centre commits to participating in a MIP, it is expected to provide all the priority 1 variables needed to address at least one of the scientific objectives of that MIP.
This approach ensures that CMIP has a large and representative model ensemble, but it also means that users who would like to have all models running the same collection of experiments and producing the same set of variables will not find the consistency that they want. The data provided by some models will be more limited than for others.
To ensure some consistency across the CMIP archive, the DREQ is structured to provide a menu of choices defining blocks of variables with differing priorities and scientific objectives.

Structure
The data request contains an extensive range of specifications which define climate data products which will be held in the CMIP6 archive 6. The data products will, when generated in accordance with the full data format specifications 7, comply with the data model of the CF Conventions (Hassell et al., 2017). The data request on its own does not provide the full format specifications, but does provide enough information for each variable to allow the automated production of compliant data files. That is, where there are multiple options available in the format specifications, the data request determines which choices should be made for each variable.
In order to manage these specifications, which are aggregated across the many participating endorsed MIPs, the specifications themselves are required to fit within an information model, which we call the Data Request Information Model (DRIM). The nature of the process of establishing the CMIP6 Data Request has required that the DRIM itself evolve as information is gathered. In order to manage this process, the DRIM is constrained to stay within a pre-defined framework.
Figure 2 (continued):
• F1: A set of core attributes are used to define additional attributes (see also Table B2);
• F2: A simple python script is used to manage framework information;
• F3: A style sheet is used to map XML configuration information (C5) into a schema document (C2);
• F4: The python base class has dependencies on the core attributes built in.
• C1: An excel workbook, defining the attributes used in each section of DREQ;
• C2: The schema is expressed as an XSD document;
• C3: A sample XML document which complies with the schema is constructed. This allows verification of the logical consistency of the schema and facilitates construction of the full DREQ XML document;
• C4: A python class is defined for each section, combining the base class with configuration information;
• C5: The excel workbook (C1) is converted to a structured XML document for robust portability.
• P1: An XML document contains the aggregated information content;
• P2: A python API provides a programmable interface and command line options;
• P3: Web pages support browsing and searching.

Building blocks of the DREQ
The DREQ is constructed through 3 key sections, framework, configuration, and content, which are shown schematically in Figure 2. Taking these in reverse order, the content of the DREQ describes what is actually requested by including specific information about parameters and requirements in attributes of metadata records, such as the description of the baresoilFrac variable given in Table 3. Each of the attributes is assigned a value that may be a free text string or a link to another DREQ record. The content can be accessed via several different methods (section 5).
The configuration provides the full specification of the sections in the DREQ and the attributes carried by records in each section. For instance, records in the var section carry the attributes uid, label, title, sn (a link to a CF standard name), units, unid (an identifier for the units) 8, description, provmip (identifying the MIP responsible for initially defining the parameter), prov (a hint about the provenance) and procComment (processing guidance) 9.
The framework element defines how the configuration will be specified and provides some basic tools. It is designed to be flexible and provide some basic software functionality to support the development and use of the DREQ. It specifies that the content will consist of a collection of sections, each of which contains some header information and a list of data records. Each data record is a list of key-value pairs, with a specific set of keys defined for each section. Each key is, in turn, defined by a record, as explained further in section 4.2 below.
6 CMIP6 - Coupled Model Intercomparison Project Phase 6: pcmdi.llnl.gov/CMIP6.
7 See WIP position papers (WGCM Infrastructure Panel, 2019) and CMOR documentation (Nadeau et al., 2018).
8 The redundancy between "unid" and "units" has not yet been eliminated because in the absence of a fully developed suite of tools for managing linked content, such redundancy has some value. It allows easy reading of content (via the units value) as well as robust linking (via the unid attribute).
9 This attribute is not fully implemented in the existing DREQ.

Schema and Content Implementation
The reference document for the Data Request content is an XML document (Bray et al., 2008) conforming to an XML Schema Definition (XSD) document (Gao et al., 2008). The schema has been developed to satisfy the requirements that have emerged during the MIP endorsement process. The configuration-driven approach allows the Data Request Schema to be generated from a framework document, and the same framework document is used to generate Python classes for the Application Programming Interface (API).
The Request Document aims to be self descriptive: each record is defined by its attributes, and for each attribute there is a record defining its role and usage. The apparent circularity is resolved as shown in Table B2, where the description attribute of the record defining description defines itself. The framework also constrains the set of value types used to define attributes. Some of these are generic types, such as "integer" or "string", others are more specialised such as "integerList", for a list of integers. There are 29 sections in the DREQ, and the total number of attributes is 288. These are listed in a technical note 10. Full details are in the schema specification (Juckes, 2018a).
The DREQ is presented as a document of 33 sections, where each section has the following characteristics:
• The section is described by 8 attributes;
• Each section contains a list of records, each having a set of attributes;
• Each record attribute is defined by the properties listed in Table B2.
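The section structure listed above can be sketched as a small data structure. This is a minimal illustration only: the class names AttributeDef and Section, the add_record method and the sample uid are hypothetical and do not reproduce the DREQ framework code, while the keys uid, label and units follow the var section attributes named in the text.

```python
from dataclasses import dataclass, field

@dataclass
class AttributeDef:
    """A record that defines one attribute (key) used by records in a section."""
    label: str
    title: str
    value_type: str = "xs:string"

@dataclass
class Section:
    """A section: header information, attribute definitions and data records."""
    label: str
    title: str
    attributes: dict = field(default_factory=dict)  # key -> AttributeDef
    records: list = field(default_factory=list)     # each record is a dict of key-value pairs

    def add_record(self, **values):
        # Reject any key that has no defining record in this section.
        unknown = set(values) - set(self.attributes)
        if unknown:
            raise KeyError(f"undefined attributes in section {self.label}: {sorted(unknown)}")
        self.records.append(values)

var = Section(label="var", title="MIP Variable")
for key, title in [("uid", "Record identifier"), ("label", "Short name"), ("units", "Units of measure")]:
    var.attributes[key] = AttributeDef(label=key, title=title)

# Hypothetical record; the uid is a placeholder, not a real DREQ identifier.
var.add_record(uid="example-uid-0001", label="tas", units="K")
```

The point of the design is that a record can only carry keys that are themselves defined by a record, which is what allows the document to describe itself.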

[Figure: diagram of request sections. Surviving box labels: Endorsed MIP (Description); Request Link (grid option); Variable Group (Reusable aggregation).]

Core Request Elements
The core DREQ sections are shown in Figure 3. Starting at the bottom left, a MIP Variable defines a physical quantity. Each variable has a unique label, a title conforming to the style guide (Juckes, 2018b), a standard name from the CF conventions, and units of measure. The DREQ spans more than 1200 different MIP variables, ranging from surface temperature to the properties of aerosols, microscopic marine species and a range of land vegetation types.
Each MIP Variable may be used by multiple CMOR Variables, which specialize the definition of a quantity by specifying its output frequency, coordinates (e.g. should it be on model levels in the atmosphere or pressure levels?), masking (e.g., eliminating all data over oceans), and temporal and spatial processing (e.g. averaging or summing). For instance, the near-surface air temperature is a MIP variable, tas, used in 10 different CMOR variables that differ in frequency from sub-hourly to monthly and that cover different regions (e.g., global or Antarctica only). There are more than 2000 distinct CMOR variables in the DREQ.
Each MIP determines which CMOR variables are needed for its planned scientific work, and is asked to assign each variable a priority from 1 to 3, with 1 being the most important. The Request Variable section specifies variable priority on an experiment-by-experiment basis, leading to over 6000 distinct Request Variables.
The 3-level hierarchy of MIP variable, CMOR variable and Request Variable provides some flexibility to re-use concepts, improving consistency in the DREQ. The foundation is provided by standard names from the CF convention: 927 of these are used in the CMIP6 Data Request, and for 728 of these there is a unique associated MIP variable.
The Standard Name may be re-used multiple times: 145 standard names are used twice and 25 are used three times. The standard name re-used most often (33 times) is area_fraction, which is used to represent the proportion of a grid cell associated with a particular category of surface type. These different categories are represented in an ancillary variable with standard name area_type. In most cases, when a standard name is re-used there will be additional CF metadata specifying details which distinguish between the different variables, such as the area_type. There are a handful of cases, such as "Upwelling
The Request Variables differ from each other in terms of the MIP requesting the data and the priority which they attach to it. For instance, "Surface Downward Northward Wind Stress [tauv]" is requested at priority 1 by HighResMIP and DynVarMIP and at priority 3 by DCPP. If a modelling centre is aiming to support HighResMIP or DynVarMIP, they should treat this CMOR Variable as being at the higher priority.
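The rule for combining priorities across MIPs can be sketched in a few lines. The tuple layout and the function name effective_priority are hypothetical; the three entries simply restate the tauv example from the text.

```python
# Request records as (CMOR variable label, requesting MIP, priority).
requests = [
    ("tauv", "HighResMIP", 1),
    ("tauv", "DynVarMIP", 1),
    ("tauv", "DCPP", 3),
]

def effective_priority(variable, supported_mips, requests):
    """Return the highest priority (lowest number) among the supported MIPs, or None."""
    priorities = [p for (v, mip, p) in requests if v == variable and mip in supported_mips]
    return min(priorities) if priorities else None

print(effective_priority("tauv", {"HighResMIP", "DCPP"}, requests))
print(effective_priority("tauv", {"DCPP"}, requests))
```

A centre supporting only DCPP would see tauv at priority 3, whereas adding HighResMIP to the set of supported MIPs raises it to priority 1.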
When MIPs request data, they need to provide information about the experiments that the data is required from: we do not expect all defined variables to be provided from all experiments, as that would generate substantial volumes of unnecessary output.

The process of linking the 6423 Request Variables to the hundreds of experiments is structured by first aggregating the Request Variables into 272 variable groups. Modelling centres should be able to identify the scientific objectives being supported by the data they distribute. This is done through a Request Link record that associates a Variable Group with one or more Objectives and a collection of Request Items.
Each Request Item links to one or more Experiments and specifies the ensemble size and, optionally, a temporal sub-set of the experiment for the requested output from that Experiment.
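The way a Request Link ties a Variable Group to Objectives and Request Items can be sketched as follows. All labels, dictionary keys and the expand function are hypothetical stand-ins chosen for illustration; they are not the field names used in the DREQ itself.

```python
from itertools import product

# Hypothetical content: labels are illustrative, not real DREQ records.
variable_groups = {"grp-surface": ["tas", "pr", "tauv"]}
request_link = {
    "variable_group": "grp-surface",
    "objectives": ["obj-A", "obj-B"],
    "request_items": [
        {"experiments": ["historical"], "ensemble_size": 1, "time_slice": (1960, 2014)},
        {"experiments": ["piControl", "abrupt-4xCO2"], "ensemble_size": 1, "time_slice": None},
    ],
}

def expand(link, groups):
    """Yield the (variable, experiment, objective) combinations implied by one request link."""
    variables = groups[link["variable_group"]]
    experiments = [e for item in link["request_items"] for e in item["experiments"]]
    yield from product(variables, experiments, link["objectives"])

triads = list(expand(request_link, variable_groups))
print(len(triads))
```

One compact link record therefore stands in for every combination of its variables, experiments and objectives, which is the motivation for grouping.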

Simple lists
The sections denoted by orange chamfered shapes in Figure  senting precipitation falling on crops. The comment within the cell methods string is used to provide users with information about the related variable cropFrac, which gives the percentage of a grid cell area covered by crops.
The Time Slice section specifies the portions of each experiment for which output is required. A set of variables requested at 3-hourly intervals is, for instance, only required from the historical experiment for the period 1960 to 2014, rather than the whole simulation from 1850.

The Choices section lists situations in which the modelling centres must make a choice between variables. There are cases, for example, where a modelling centre can choose to report a variable as a climatology if in their model it is prescribed to be the same from year to year rather than allowed to evolve over time.
The  CMIP and unambiguously gain access to associated information, such as start and end dates and ensemble sizes. Such information is required to generate data volume estimates. There are a number of experiments for which requirements vary across different priority tiers (see Table 2). For example, the land-ssp126 experiment is requested for one ensemble member at Tier 1 and an additional two ensemble members at Tier 2.

Links and Aggregations
The DREQ can be thought of in terms of triads (or triples) linking variables, experiments and objectives. That is, whenever a variable is requested from an experiment, it is linked to one or more objectives. There are over 350,000 potential variable-experiment-objective triads in the CMIP6 Data Request, arising from various combinations of 2068 variables, 273 experiments and 93 objectives. These three-way links may be supplemented with additional information, such as specific sampling periods or a preferred spatial grid.
Less than 1% of the possible combinations are used, but this is still too many to manage individually, so rather than explicitly listing all these virtual triads, the Data Request organizes them in groups. This results in just 411 request links, with groups of variables needed to address one or more objectives linked to groups of experiments.
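The scale of these numbers can be checked with simple arithmetic. The sketch below assumes that "the possible combinations" refers to the full product of the three counts quoted in the text and uses 350,000 as a lower bound for the number of triads; both are assumptions about how the figures relate and are not stated explicitly in the source.

```python
n_variables, n_experiments, n_objectives = 2068, 273, 93
n_triads = 350_000        # "over 350,000" in the text, used here as a lower bound
n_request_links = 411

# Full product of the three counts (assumed meaning of "possible combinations").
all_combinations = n_variables * n_experiments * n_objectives
fraction = n_triads / all_combinations
triads_per_link = n_triads / n_request_links

print(f"all combinations:      {all_combinations:,}")
print(f"fraction represented:  {fraction:.2%}")
print(f"mean triads per link:  {triads_per_link:.0f}")
```

Under this reading, each request link stands in for several hundred triads on average, which indicates how much the grouping reduces the number of records to be managed.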

Additional implicit structure
Much of the DREQ structure is formalised by use of the XSD schema mechanism; however, there is a significant amount of
• a vertical coordinate (e.g. a variable describing a property of an atmospheric layer) required by a standard name is present;
• a cell methods string is consistent with the CF Conventions syntax rules;
• the spatial and temporal dimensions of a variable are consistent with the cell methods string (e.g. a time mean or maximum, specified in the cell methods string, requires a time dimension with a bounds attribute);
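The last of these checks can be sketched as below. This is a simplified illustration and not the validation code used for the DREQ: the function names are hypothetical, the parser handles only the basic "axis: method" form with non-nested parenthesised comments, and the check assumes that every time method other than point requires bounds.

```python
import re

def time_methods(cell_methods):
    """Return the statistical methods applied to the time axis."""
    # Remove parenthesised comments, which carry free text rather than methods.
    stripped = re.sub(r"\([^)]*\)", " ", cell_methods)
    methods, pending_axes = [], []
    for token in stripped.split():
        if token.endswith(":"):
            pending_axes.append(token[:-1])
        elif pending_axes:
            # The first plain word after one or more "axis:" tokens is the method.
            if "time" in pending_axes:
                methods.append(token)
            pending_axes = []
    return methods

def check_time_bounds(cell_methods, dimensions, has_time_bounds):
    """Return a list of problems; the list is empty if the variable is consistent."""
    problems = []
    needs_interval = any(m != "point" for m in time_methods(cell_methods))
    if needs_interval and "time" not in dimensions:
        problems.append("cell methods refer to time but there is no time dimension")
    elif needs_interval and not has_time_bounds:
        problems.append("time statistic requires a bounds attribute on the time coordinate")
    return problems

print(check_time_bounds("area: mean where crops time: mean", ("time", "lat", "lon"), False))
```

Rules of this kind sit outside what an XSD schema can express, because they relate the content of one string attribute to the structure described by other attributes.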

The CMIP5 request had 4 guide values for some diagnostics: minimum and maximum acceptable values and also minimum and maximum acceptable values of the global mean of the absolute value of the diagnostic. These ranges were not intended to provide any guide to physical realism, but rather to catch data processing errors such as sign errors that might arise from institutional sign conventions opposite to those of the DREQ or incorrect units (e.g. submitting data in degrees Centigrade with metadata units describing the data as Kelvin).
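A check against four guide values of this kind might look like the following sketch. The function and parameter names and the numerical limits are assumptions made for illustration, and the mean is taken without area weighting for simplicity, whereas a global mean would normally be area weighted.

```python
def check_guide_values(values, valid_min, valid_max, ok_min_mean_abs, ok_max_mean_abs):
    """Return warnings for data falling outside the four guide values."""
    warnings = []
    if min(values) < valid_min:
        warnings.append("value below minimum")
    if max(values) > valid_max:
        warnings.append("value above maximum")
    # Unweighted mean of absolute values (a simplification of a global mean).
    mean_abs = sum(abs(v) for v in values) / len(values)
    if not ok_min_mean_abs <= mean_abs <= ok_max_mean_abs:
        warnings.append("mean absolute value outside expected range")
    return warnings

# Temperatures written in degrees Celsius but labelled as kelvin (hypothetical limits).
celsius_as_kelvin = [-5.0, 12.0, 27.5]
print(check_guide_values(celsius_as_kelvin, 170.0, 350.0, 260.0, 300.0))
```

The test on the mean absolute value is what makes such a check sensitive to unit errors even when individual values happen to fall within the permitted range.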

With a wider range of diagnostics, for CMIP6, guide values are not always appropriate and/or available (e.g. for novel diagnostics). The DREQ supports a three-level indication of the robustness of any specified guide values, to avoid inappropriate warnings. As an example, an analysis carried out by Ruosteenoja et al. (2017) noted that while near-surface relative humidity values of 140% can, in principle, be realistic at a point in space and time, many of the high values in the CMIP5 archive, which represented time and grid cell averages, are likely to be caused by processing errors. Hence the upper limit is set at 100.001% and categorised as suggested, in contrast to the limit for sea ice extent that has a robust limit of 100.001%.
( False if, as is the case for many models, the flux is treated as being confined to the surface. The value of this parameter then determines whether a two or three dimensional variable should be archived to represent this flux. This feature was added in response to requests from modelling centres for a mechanism to improve automation.
Different MIPs have different requirements for data on pressure levels, such as a need for zonally averaged data on 39 levels or high frequency data on 3 pressure levels. In total there are 10 different pressure axes defined as part of level harmonisation in the DREQ (Figure 5). This harmonisation has a small cost in extra data production: for example, if one MIP is asking for a variable on 8 levels and a second MIP is asking for the same variable on 23 levels then both requests can be satisfied by providing the data on 23 levels. However, if the 23-level data is only requested for a short time period and the 8-level data is requested for the whole experiment, redundant data may be requested. This is not ideal, but it appears that the volumes of redundant data will not be excessive.
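The harmonisation rule can be sketched as choosing the smallest defined axis that contains every requested level. The axis names and level sets below are hypothetical and chosen only so that each smaller axis is a subset of the next; they are not the axes defined in the DREQ.

```python
# Hypothetical pressure axes (hPa); names and levels are illustrative only.
axes = {
    "axisA": {850, 500, 250},
    "axisB": {1000, 850, 700, 500, 250, 150, 100, 10},
    "axisC": {1000, 925, 850, 700, 600, 500, 400, 300, 250,
              200, 150, 100, 70, 50, 30, 20, 10},
}

def harmonised_axis(requested_axes, axes):
    """Pick the smallest defined axis that contains every requested level."""
    needed = set().union(*(axes[name] for name in requested_axes))
    candidates = [name for name, levels in axes.items() if needed <= levels]
    return min(candidates, key=lambda name: len(axes[name])) if candidates else None

print(harmonised_axis(["axisA", "axisB"], axes))
```

This works only when the defined axes are nested in this way; if no single axis covered all requested levels the function would return None, and separate output on each axis would be needed.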
Figure 5. The pressure levels used for atmospheric variables in the DREQ. The right hand column, headed "single", contains pressure levels used for single level variables. Other columns represent collections of levels used as a vertical axis for a range of requested parameters. Black rectangles indicate a level which occurs in only one column. The plev7c axis is a special case that is used specifically to match diagnostics from the International Satellite Cloud Climatology Project (ISCCP: WCRP, 1982) cloud simulator. The three black rectangles in the "single" column, at 220, 560 and 840 hPa, are also ISCCP levels.

Interfaces and Version Control of the Data Request
The DREQ content is provided as a version controlled XML document complying with the schema, but a range of interfaces are provided in order to make the contents more accessible. The use of XML documents ensures robust portability and allows users to import the DREQ into their own software environments.
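As an indication of how an XML document of this kind can be imported with standard tools, the following sketch reads a tiny stand-in document using the Python standard library. The element and attribute names in the sample are hypothetical and do not reproduce the DREQ schema.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in document; element and attribute names are hypothetical.
SAMPLE = """
<request version="01.00.31">
  <section label="var" title="MIP Variable">
    <item uid="example-0001" label="tas" units="K"/>
    <item uid="example-0002" label="pr" units="kg m-2 s-1"/>
  </section>
</request>
"""

root = ET.fromstring(SAMPLE)

# Collect the records of each section as plain dictionaries of key-value pairs.
records = {
    section.get("label"): [dict(item.attrib) for item in section.findall("item")]
    for section in root.findall("section")
}

for record in records["var"]:
    print(record["label"], record["units"])
```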
For users who do not wish to confront the details of the XML schema, alternative views are provided by the web site 12 and the python package dreqPy 13.
The website provides a complete view of the DREQ content in linked pages, and also a range of summary tables as spreadsheets. These include, for instance, lists of variables requested by each MIP for each experiment.
The python package provides both a command line and a programming interface. The python code is designed to be self descriptive. Every record, e.g. the specification of a variable, is represented by an instantiated class with an attribute for each property defined in the record. For example, if cmv is a CMOR variable record, cmv.valid_min will carry the value of the valid_min parameter for that record. The specification of the valid_min parameter is carried as an attribute in the parent class at cmv.__class__.valid_min, which is a similar instantiated class. For instance, cmv.__class__.valid_min.type has the value "xs:float", which gives the data type of the attribute (floating point), and cmv.__class__.valid_min.title gives a short description.
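The self-descriptive pattern described here can be imitated with a few lines of plain Python. The classes below are hypothetical stand-ins written for illustration and are not the dreqPy implementation; only the access pattern (cmv.valid_min for the value, cmv.__class__.valid_min for its specification) follows the text, and the titles are invented.

```python
class AttributeSpec:
    """Describes one attribute: its data type and a short title."""
    def __init__(self, data_type, title):
        self.type = data_type
        self.title = title

class CmorVariable:
    # Class-level attributes hold the specification of each property ...
    label = AttributeSpec("xs:string", "Short name of the variable")
    valid_min = AttributeSpec("xs:float", "Minimum expected value")

    def __init__(self, **values):
        # ... while instance attributes of the same name hold the record's values.
        for key, value in values.items():
            if not isinstance(getattr(type(self), key, None), AttributeSpec):
                raise AttributeError(f"{key} is not defined for {type(self).__name__}")
            setattr(self, key, value)

cmv = CmorVariable(label="tas", valid_min=170.0)
print(cmv.valid_min)                      # value carried by this record
print(cmv.__class__.valid_min.type)       # specification carried by the class
```

Because instance attributes shadow class attributes of the same name, the value and its specification can share one name without conflict.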

The DREQ was version controlled with a 3 element version number, such as 01.00.31, following Coghlan and Stufft (2013). From July 2017 onwards, updates were preceded by a beta release to allow for some error checking before moving to the full release. Beta versions are labelled by appending b1, b2 etc. to the version number. For minor technical fixes, such as problems which prevent the python software from working with specific versions of the python library, post versions are used, such as 01.00.31p3. The full version history can be viewed in the Python Package Index 14.
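Version labels of this form can be ordered with a short parser. The sketch assumes a label consists of three two-digit elements followed by an optional b (beta) or p (post) suffix with a number, for example 01.00.31b2 or 01.00.31p3; this pattern is inferred from the examples in the text and the function name is hypothetical.

```python
import re

# Assumed label format: NN.NN.NN with an optional bN (beta) or pN (post) suffix.
VERSION = re.compile(r"^(\d{2})\.(\d{2})\.(\d{2})(?:(b|p)(\d+))?$")

def parse_version(text):
    """Split a version label into a sortable tuple."""
    match = VERSION.match(text)
    if match is None:
        raise ValueError(f"unrecognised version label: {text!r}")
    major, minor, patch, kind, number = match.groups()
    # Betas sort before the full release, post versions after it.
    stage = {"b": -1, None: 0, "p": 1}[kind]
    return (int(major), int(minor), int(patch), stage, int(number or 0))

labels = ["01.00.31p3", "01.00.31", "01.00.31b2", "01.00.30"]
ordered = sorted(labels, key=parse_version)
print(ordered)
```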

6 Summary and Outlook
The CMIP6 data request, or DREQ, provides a consolidated specification of the data requirements of the 23 endorsed MIPs 15 participating in the CMIP6 process. In doing so, it supports those responsible for configuring simulation output, those developing software infrastructure, and those who are trying to anticipate what may be available before it appears in catalogues. The latter include both those responsible for storage systems, and potential data users.

The data request has a complex structure which arises from the inherent complexity of the problem: not only are there many more MIPs and experiments than in previous CMIP exercises, but not all modelling centres expect to address all the objectives of individual experiments, let alone all MIPs. This means that the request infrastructure has to handle varying aggregations across the over 350,000 potential combinations of variables, experiments, and objectives, and deliver the appropriate metadata information, lists, and summaries for the groupings which arise. In practice 411 groups are needed to serve the objectives which have been extracted from the experiment definitions.
12 w3id.org/cmip6dr/browse.html
13 Data Request Python API: proj.badc.rl.ac.uk/svn/exarch/CMIP6dreq/tags/latest/dreqPy/docs/dreqPy.pdf
14 pypi.org/project/dreqPy
15 Table B1 has 25 rows because it also includes "DECK" and "CMIP", which refer to activities that have a role analogous to MIPs in the DREQ: "DECK" specifies a collection of experiments and "CMIP" specifies a set of data requirements.
The design of the data request delivers a separation of concerns between a request framework, configuration which specifies the sections and attributes of the request, and the actual content. In each domain (framework, configuration, content) there are information components (schema, instances) and code to support the use of that information.

Resolving the original ambiguities and errors in the specifications of diagnostics has resulted in frequent updates to the DREQ documents that, although cleanly version controlled, caused significant delays and inconvenience for those attempting to begin simulations as the output configuration was changing. Most of these arose not from the data request machinery, but upstream in the definitions of the MIPs, experiments, and output requirements.
The formal schema developed for CMIP6 establishes a robust structure, but it has some clear limitations. There are a number of rules governing the content which are not captured by the schema, and arise from a semantic mismatch between the notion of a variable, and its implementation in the CF conventions for NetCDF. For example, certain cell methods strings, such as time: mean, require specific forms of dimensions or coordinates.
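A rule of this kind could, in principle, be expressed as a small consistency check applied alongside the schema. The sketch below is illustrative only: the rule table, the dimension labels time and time1, and the function name check_time_consistency are assumptions made for the example and are not taken from the DREQ schema.

```python
# Illustrative rule table (an assumption, not taken from the DREQ schema):
# each time-related cell method is paired with the time dimension it expects.
REQUIRED_TIME_DIMENSION = {
    "time: mean": "time",    # time-averaged values
    "time: point": "time1",  # instantaneous samples
}


def check_time_consistency(cell_methods, dimensions):
    """Return a list of problems found for one variable definition.

    cell_methods: the cell methods string of the variable.
    dimensions:   blank-separated dimension labels of the variable.
    """
    problems = []
    present = dimensions.split()
    for method, required in REQUIRED_TIME_DIMENSION.items():
        if method in cell_methods and required not in present:
            problems.append(f"'{method}' requires dimension '{required}'")
    return problems


print(check_time_consistency("area: time: mean", "longitude latitude time"))
print(check_time_consistency("area: mean time: point", "longitude latitude time"))
```

A check of this form sits outside the schema because it relates the values of two attributes to each other, rather than constraining each attribute separately.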
There are also issues around variable definitions, both in the data request, and in the conventions themselves. For example, variable names containing abbreviated references to parts of the variable definitions (e.g. "sw" for "shortwave", "lw" for "longwave") lead to both inconsistency and transcription errors. Similarly, some CF Standard Names encode information about the nature of physical quantities and the relationships between them. However, there are variations in the syntax (e.g. variables relating to nitrogen mass may contain either nitrogen_mass_content_of_ or _mass_content_of_nitrogen in the standard name) which obscure some of this rich information.
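The practical consequence of such syntax variations is that software has to recognise more than one pattern to recover the same information. The following minimal sketch handles the two nitrogen forms mentioned above. The standard names used in the example are invented for illustration and the function name nitrogen_pool is hypothetical.

```python
import re

# The two syntactic forms mentioned in the text; "pool" stands for whatever
# reservoir the nitrogen mass content refers to.
PREFIX_FORM = re.compile(r"^nitrogen_mass_content_of_(?P<pool>\w+)$")
SUFFIX_FORM = re.compile(r"^(?P<pool>\w+)_mass_content_of_nitrogen$")


def nitrogen_pool(standard_name):
    """Return the pool a nitrogen mass content refers to, or None."""
    for pattern in (PREFIX_FORM, SUFFIX_FORM):
        match = pattern.match(standard_name)
        if match:
            return match.group("pool")
    return None


# Invented names, used only to exercise both forms.
print(nitrogen_pool("nitrogen_mass_content_of_example_pool"))
print(nitrogen_pool("example_pool_mass_content_of_nitrogen"))
print(nitrogen_pool("air_temperature"))
```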

There are a number of areas where technical improvements can be made to support future CMIP activities and, potentially, related work outside CMIP.
As discussed in 4.2.3 above, there are a number of areas where the DREQ intersects with ES-DOC and CVs. There is room for closer semantic alignment, as well as some streamlining of information flow between the MIP teams and those developing the technical documents and infrastructure. Significant overlaps with ES-DOC occur in the definitions of experiments, potential model configurations, conditional variables, and objectives. Some further rationalisation of the interfaces between ES-DOC, the Data Request and the controlled vocabularies prior to new experiment and MIP design will aid all parties.
More use of re-usable and extensible lists is also anticipated. One obvious way forward would be to aim for future MIPs being able to exploit existing and re-usable variable lists, either as is, or with managed extensions.
The data request is complex, and establishing and upgrading the content of different components requires different communication approaches. This can be seen by comparing just two of the many components:
• The grids section defines some technical parameters used by community software tools. The priority here is to communicate clearly with the relatively small collection of software developers to ensure DREQ updates can be supported by the software to deliver the required outcomes in terms of data structures.
• The definition of parameters in the var section requires a discussion among a broad range of scientific experts to reach a consensus on terminology. The definitions in this section are intended to be used by multiple MIP teams, so they must be acceptable to experts in different areas.
Upgrades to these two components are in some senses orthogonal, impacting on different groups. Further partitioning of the data request to facilitate more transparent management of request upgrades would be desirable. Such partitioning may also address complexity in the data request itself, ideally allowing more agility in its specification and use.
2. Endorsed MIPs should be required, as part of endorsement, to identify a technical expert responsible for liaising with, and supporting the data request.
3. Clear documentation should be in place for these technical experts so that expectations are clear as to what is required.

Clear and consistent version information should be provided in the web interface.
These steps would significantly reduce bottlenecks in the preparation for future CMIP exercises, and minimise the burden on both the scientific leaders of the MIPs, and the modelling groups.
Juckes (2019) also covered some procedures which have already been implemented, including the publication of each new version of the request as a beta version to allow time for review so that changes made match the update intentions.

Ongoing Importance
The entire CMIP process is predicated on producing data for analysis, informing both science and policy. The central importance of a data request to those goals is obvious, but the underlying obstacles to the construction of a well defined request are often unclear. We cannot take it for granted that the goals of participating science teams will be met without detailed attention to output requirements, particularly when, as in CMIP, so much of the value arises from the interactions between MIPs.
This detailed attention is only going to become more important in the future as the diversity of the Earth system modelling community grows and pressure for efficient use of the computing resources needed to carry out advanced simulations and store output becomes greater. Getting output descriptions right will be crucial to delivering and evaluating scientific benefits, and to developing the necessary infrastructure.
The growing dependency on CMIP products by a broad sector of the research community and by national and international climate assessments, services and policy-making means that CMIP activities require substantial efforts in order to provide timely and quality controlled model output and analysis. Although CMIP has been extraordinarily successful and leverages a large investment from individual countries, there are aspects that are fragile or unsustainable due to a lack of sustained funding. The impressive CMIP impact is highly dependent on volunteer efforts of the research community and individual scientists who contribute to the underlying essential infrastructure.
CMIP has now reached a stage where certain components and activities require sustained institutional support if the programme as a whole is to meet the growing expectation to support climate services, policy, and decision-making. Of particular urgency is the systematic development of forcing scenarios that require institutionalized support so that quality controlled datasets and regular updates can be provided in a timely fashion. In addition, a more operational infrastructure needs to be put in place, so that core simulations that support national and international assessments can be regularly delivered. This includes the oversight, development and maintenance of the data requests, standards, documentation, and software capabilities that make possible this collaborative international enterprise.

A specific Resolution seeking the support of the World Meteorological Organization (WMO) for CMIP was presented and approved at the 18th World Meteorological Congress, held from 3 to 14 June 2019. The Resolution drew WMO Members' attention to the importance of CMIP and its critical role in supporting the global climate agenda. Members were requested to contribute institutional, technical and financial resources as necessary to ensure sustainable and robust delivery of CMIP and CORDEX (Coordinated Regional Climate Downscaling Experiment) climate change projections to the IPCC.

Code and data availability. The current version of the DREQ is available from the project website: w3id.org/cmip6dr under the MIT License (BSD). It is provided as a versioned XML document, which can be used directly or programmatically (both command line tools and a Python library are provided). The exact version of the DREQ discussed in this paper (01.00.31) is available from the Zenodo repository at 10.5281/zenodo.3361640. It is also available as a package from the Python Software Foundation at pypi.org/project/dreqPy/1.0.31/.
Author contributions. MJ led the development of the CMIP6 Data Request. KT has developed many of the underlying principles in the process of supporting CMIP5 and earlier phases of CMIP, and contributed substantially to the harmonisation and quality control of the CMIP6 Data Request. BL has contributed on the interface with ES-DOC and on the context of metadata for Earth System Models. MM and SS have provided input from the perspective of the operational climate modelling centres, and contributed significantly to the development of the request by being early adopters. AP is responsible for the procedures around the CF Convention. JP and PD contributed as data coordinators for two large sections of the request, for PaleoMIP and OMIP respectively. MR has helped to establish and maintain the governance framework which facilitated the development of the request.
The CMIP6 Data Request relies heavily on the Climate and Forecast Metadata Convention (CF). A number of modifications were required either to deal with new metadata structures or to clarify the interpretation of metadata constructs employed in the past. These were all discussed on the CF discussion forum maintained by the Lawrence Livermore National Laboratory 17.
The ticket numbers given below (#152 etc.) can be used to find the relevant discussions on that site.
Temporal averaging over a region specified by a time-varying mask poses particular challenges. A long discussion ("Time mean over area fractions which vary with time [#152]") established a clear protocol for expressing the concept using the cell_methods attribute, and clarified the usage of methods applying to multiple dimensions.

Under the CF Convention variables can refer to geographical regions either by using the name of a region from the approved list or by using an integer flag. Some wording in the conventions document was ambiguous about the validity of the latter approach: this has now been clarified to allow the use of flags ("Clarification of use of standard region names in region variables [#151]").
Many standard names state that additional information should be supplied in additional CF variable attributes, or impose requirements on the dimensions. Such rules are not currently checked by the CF checker, making their status in the convention ambiguous. The discussion "Requirements related to specific standard names [#153]" is still open, but has led to a proposal for a specific set of rules which have been applied to the Data Request in order to ensure reasonable completeness of metadata.
A "Clarification of Conventions attribute [#76]", which was proposed long ago, has been concluded. This allows the CF convention to be used in parallel with other compatible conventions. This is required for use with the UGRID convention in CMIP6.
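In practice this means that software reading the Conventions global attribute should expect a list of names rather than a single value. The sketch below assumes the usual NetCDF practice that names are separated by blanks, or by commas when a name itself contains blanks. The attribute value shown and the function names are illustrative assumptions, not values prescribed by the DREQ.

```python
def split_conventions(attribute):
    """Split a Conventions attribute into individual convention names."""
    # A comma-separated list is assumed when a comma is present;
    # otherwise the names are taken to be separated by blanks.
    if "," in attribute:
        return [name.strip() for name in attribute.split(",") if name.strip()]
    return attribute.split()


def declares_cf(attribute):
    """Return True if one of the listed conventions is a CF version."""
    return any(name.startswith("CF-") for name in split_conventions(attribute))


# Illustrative attribute value combining CF with a second convention.
conventions = "CF-1.7 UGRID-1.0"
print(split_conventions(conventions))
print(declares_cf(conventions))
```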
A long discussion on "Subconvention for associated files, proposed for use in CMIP6 [#145]" concluded by defining a subconvention which allows variables in other files to be referenced from the cell_measures attribute. This allows explicit referencing of grid cell areas and volumes. Such ancillary data should, according to earlier versions of the CF Conventions, be included in the same file as the variable, but was not included in CMIP5 files because it would, for some time varying ocean grids, substantially increase data volumes.
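The effect of the subconvention can be sketched as a simple check: every variable named in cell_measures should either be present in the file or be declared as external. The example below assumes that external names are listed, blank-separated, in a global attribute (external_variables in CF). The function name and the variable names used in the example are illustrative.

```python
def missing_cell_measures(cell_measures, file_variables, external_variables=""):
    """List measure variables neither in the file nor declared external.

    cell_measures:      string of the form "area: <name> volume: <name>".
    file_variables:     names of the variables present in the file.
    external_variables: blank-separated names declared as external.
    """
    tokens = cell_measures.split()
    # Tokens alternate between a measure label ("area:") and a variable name.
    referenced = [tokens[i + 1] for i in range(0, len(tokens) - 1, 2)]
    known = set(file_variables) | set(external_variables.split())
    return [name for name in referenced if name not in known]


# Illustrative case: the area variable is declared external, the volume is not.
print(missing_cell_measures("area: areacello volume: volcello",
                            ["thetao"], "areacello"))
```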

There is an open discussion on "Extension to external_variables Syntax for Masks and Area Fractions [#156]" which is exploring ways of making the link between masked variables and the relevant mask clearer. With the present convention it is possible to indicate that a variable is masked by, for instance, sea ice, but there is no mechanism for identifying the specific sea ice variable used. The discussion has not reached a conclusion, so the DREQ uses an ad-hoc syntax, placing the name of the masking variable in a comment string within the cell methods string.
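As an illustration of how such an ad-hoc comment could be read back by software, the following sketch extracts the masking variable from a cell methods string. The layout of the comment, "(comment: mask=<variable name>)", and the example string are assumptions made for this sketch and should be checked against the DREQ content before use.

```python
import re

# Assumed layout of the ad-hoc comment: "(comment: mask=<variable name>)".
MASK_COMMENT = re.compile(r"\(comment:\s*mask=(?P<mask>\w+)\)")


def masking_variable(cell_methods):
    """Return the name of the masking variable, or None if none is given."""
    match = MASK_COMMENT.search(cell_methods)
    return match.group("mask") if match else None


# Illustrative cell methods string for a quantity averaged over sea ice.
example = "area: time: mean where sea_ice (comment: mask=siconc)"
print(masking_variable(example))
```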

A2 Standard Names
A total of 552 new standard names were proposed for CMIP6, of which 349 were accepted. Names were rejected when existing terms, possibly in combination with area types and other metadata, could be used to meet the requirements. The new names make up 36% of the standard names used in the DREQ.
The terms span a broad range of scientific domains, with new properties of aerosols, radiation, the cryosphere (including ice shelves and dynamic floating ice sheets, sea ice, and a more detailed representation of snow packs), vegetation, atmospheric dynamics, and other aspects of the climate system.
Tables
B1 Experiment Collections
Table B1. Labels used for collections of experiments in the DREQ and the number of experiments and variables in each collection. NV: the number of variables requested by each MIP. "Experiments defined" refers to experiments that have been designed by that MIP. "Experiments used" refers to experiments that they are requesting data from (numbers entered in brackets). For example, SIMIP is a diagnostic MIP, which means that it has not defined any experiments but is requesting data from (i.e. "using") experiments defined by others.
Table B4. Table showing the data volumes requested, broken down in terms of the requesting MIPs (rows) and the experiments they request data from grouped according to the MIPs defining them. Units are terabytes (T), gigabytes (G) and megabytes (M). Data volumes are estimated for a nominal model with 1 degree resolution and 40 levels in the atmosphere and 0.5 degrees with 60 levels in the ocean. The second from last column gives the sum of the volumes in all other columns, and the final column gives the volume of data which is uniquely requested by the MIP associated with that row. The final row represents the aggregate volume of the combined request and, since the data requests from different MIPs overlap, this is less than the sum of the individual requests. ScenarioMIP is unusual in that it has not directly requested data: the role here is split between ScenarioMIP specifying experiments and VIACSAB requesting data to be used in the analysis of the impacts of projected climate change in different scenarios defined by ScenarioMIP.