Due to the proliferation of geophysical models, particularly climate models, the increasing resolution of their spatiotemporal estimates of Earth system processes, and the desire to easily share results with collaborators, there is a genuine need for tools to manage, aggregate, visualize, and share data sets. We present a new, web-based software tool – the Carbon Data Explorer – that provides these capabilities for gridded geophysical data sets. While originally developed for visualizing carbon flux, this tool can accommodate any time-varying, spatially explicit scientific data set, particularly NASA Earth system science level III products. In addition, the tool's open-source licensing and web presence facilitate distributed scientific visualization, comparison with other data sets and uncertainty estimates, and data publishing and distribution.
Today's scientific enterprise must consider the challenges and opportunities
associated with the growing scale of scientific observations, the need for
scalable analyses, and the benefits and obligations of sharing scientific
outputs. In climate models, in particular, a wealth of observations can be
generated or collected, but rich, collaborative insight requires additional
frameworks and software tools. Hence, there is a renewed emphasis in the
Earth system sciences on tools and best practices for the documentation and
sharing of analyses, metadata generation (e.g., Earth System Documentation,
ES-DOC), and scientific provenance (e.g., The Kepler Project).
In this paper, we describe a new, web-based framework for managing,
analyzing, and collaboratively visualizing Earth system science data sets: the
Carbon Data Explorer.
Commensurate with the growth of computing power, geophysical models are
producing data with increasingly fine spatial and/or temporal resolution.
The Carbon Data Explorer also adopts the data cube as a functional interface
for high-volume spatiotemporal data. A map view of a single point in time can
be visualized as slicing the data cube perpendicular to the time axis.
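To make the data cube metaphor concrete, the following minimal sketch (using NumPy; the array name and dimensions are illustrative and not part of the Carbon Data Explorer) shows how a map view and a time series correspond to slices of a three-dimensional array.

import numpy as np

# Axes ordered as (time, latitude, longitude); e.g., eight 3 h time steps on a
# 1-degree global grid
cube = np.zeros((8, 180, 360))

# A map view at a single point in time: a slice perpendicular to the time axis
map_view = cube[0, :, :]          # shape (180, 360)

# A time series for a single grid cell: a slice along the time axis
time_series = cube[:, 90, 180]    # shape (8,)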
A three-dimensional data cube in which spatial data of two
dimensions (e.g., latitude and longitude) are combined with a third dimension
of time. In this view, a horizontal slice perpendicular to the time axis corresponds to a map view at a single point in time.
While data cubes work well for storing scientific data offline, web browsers and web applications are designed to work largely with plain text documents (interpreted variously as HTML, XML, JavaScript, or other documents). Non-text formats can be downloaded directly from an online directory or through File Transfer Protocol (FTP). Indeed, many scientists, unable to procure or unaware of a more sophisticated solution, provide large collections of outputs directly through FTP – essentially a networked folder available to the public. Indexing, searching, or manipulating data must then be done offline.
As an alternative, open application programming interface (API) standards such as the Open Geospatial Consortium
(OGC) Web Map Service (WMS) allow two computers – a web browser and a remote
web server – to communicate about data through an agreed-upon protocol.
Thus, dissemination of scientific data on the web typically requires a
metadata-driven API or Resource Description Framework (RDF); these are implemented as a kind of text-based
communication protocol that describes (to a computer) where binary data can
be found and how they can be accessed. This enables web applications to
ultimately retrieve and display data in formats that are not native to the
web. However, these APIs incur considerable performance costs when online
analysis of data sets is required or when representations are generated
dynamically from incoming, real-time data streams.
The Carbon Data Explorer solves this problem by introducing a new API for
text-based representations of data cubes, thereby enabling easy integration
with and high performance in browser-based web applications while also
providing capabilities for dynamic querying, aggregation, differencing, and
anomaly calculations. This text-based representation is not only compatible
with web browsers, it allows for the data to be manipulated directly in the
website, providing asynchronous rapid filtering and aggregation. Only the OGC
Web Coverage Service (WCS), a protocol also based in a data cube metaphor,
allows for this level of interaction and online analysis of data
The adoption of web APIs for sharing data is further evidence of the
scientific community's desire to share results with a wider audience. In
addition, the ubiquity of social media is bringing online conversations about
science, albeit informal, and there are even emerging social networks
dedicated to scientific discourse and exchange (e.g., ResearchGate,
Academia.edu). This unprecedented interconnectivity is also motivated by best
practices in collaborative science. The next phase of the Coupled Model
Intercomparison Project, CMIP6, will for the first time allow “anyone at any
time [to] download model data for analysis”.
In response to this need, the Carbon Data Explorer allows data providers to
share scientific data sets, analyses, and visualizations directly on the web.
A data provider might be a modeler, the principal investigator of an
interdisciplinary research team, or a technician or information technology
(IT) professional embedded in a research team. NASA estimates that these
scientists and model developers spend more than 60 % of their time
preparing model inputs and model intercomparisons.
In its capacity as a data management and data access web server, the Carbon Data Explorer is similar to the THREDDS Data Server (TDS); its analytical capabilities make it similar to Ferret-THREDDS. The Carbon Data Explorer expands on both by providing an integrated front end for visualization and analysis. While the THREDDS Client Catalog requires data to be registered with XML descriptors, the Carbon Data Explorer Python API has a user-friendly command-line interface that allows for faster, repeatable, one-time registration of data without the need to open a text editor and write XML. In data interchange, it also replaces bulky XML with lightweight, more human-readable JavaScript Object Notation (JSON). In addition, it eschews the vulnerability-prone Java environment and Tomcat web server in favor of a lightweight, non-blocking web server in Node.js that can be hidden behind a proxy server such as Apache. This design choice trades off the protocol interoperability of TDS, which was not identified as a requirement by our user community, for the ease of development and deployment of web services with newer, JavaScript-based technologies on the server.
The scientific data sets supported by the Carbon Data Explorer include any
gridded or non-gridded time-varying, spatially explicit data that can be
decomposed into one variable at a time. The canonical example of a supported
data set is any NASA level III scientific data product, defined as “variables
mapped on uniform space–time grid scales”.
The Carbon Data Explorer shares similar aims with technologies such as NASA's World Wind virtual globe, Giovanni, and Mirador. Compared with World Wind, the Carbon Data Explorer provides access to analytical capabilities that would be awkward or impossible to reproduce in a virtual globe. Also, unlike World Wind, it requires neither a stand-alone installation nor a dependency library such as Java and runs in any web browser. While Mirador allows users to download spatially explicit scientific data sets from NASA missions, it has no analytical or visualization capabilities. The Carbon Data Explorer most closely resembles Giovanni in that both are web-based, map-centered viewers. While Giovanni provides more sophisticated analytical capabilities, the Carbon Data Explorer is designed to deliver results faster and allows for greater customization of the visualization and the querying of measurement values within the web client. In sum, the Carbon Data Explorer is intended for more rapid examination and comparison of climate model outputs by the modelers themselves.
In common with the Earth System Grid Federation
A unified modeling language (UML) deployment diagram for the Carbon Data Explorer (CDE), illustrating the configuration and connections between the components as currently deployed.
In the development and evaluation of the tool, we relied heavily on some
reference data sets exemplary of those we intend to support. These included a
1-degree-by-1-degree carbon flux estimate at 3 h time steps from the NASA
Carnegie Ames Stanford Approach (CASA) model run with Global Fire Emissions
Database (GFED) input data and 1-degree-by-1-degree
carbon concentration (XCO2) data.
The Carbon Data Explorer has three main components: a Python API for data management, a web server API, and a
client-side JavaScript web application (see the UML deployment diagram).
The Python programming language (version 2.7) was chosen as the framework for
data management, manipulation, and storage due to its high-level language
design, wide adoption in the scientific community, and available open-source
libraries. In particular, as many scientific products are stored as
HDF or early Matlab files, Python provides fast and
robust support for reading scientific data products through the NumPy and SciPy libraries.
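For illustration, the sketch below shows the kind of file access this enables; the file, variable, and dataset names are hypothetical, h5py is assumed as one common choice for HDF5 access, and the actual reading logic in the Carbon Data Explorer Python API may differ.

import numpy as np
import scipy.io
import h5py

# Legacy (pre-v7.3) Matlab files can be read with SciPy
mat = scipy.io.loadmat('casa_gfed_flux.mat')      # hypothetical file name
flux = np.asarray(mat['flux'])                    # hypothetical variable name

# HDF5 files (and Matlab v7.3 files, which use HDF5 internally) via h5py
with h5py.File('xco2.h5', 'r') as hdf:
    xco2 = hdf['XCO2'][:]                         # hypothetical dataset name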
The web server and web client are both implemented in JavaScript. This was a
strategic but also practical decision. JavaScript is fast and expressive. It
is also the de facto language of the web and the only language natively supported by every modern web browser.
Results of load testing in Apache JMeter; times are given in seconds to request completion. The off-network tests were performed over a wireless internet connection; the on-network tests were performed with a wired, direct network connection to the server.
Request URIs:
Performance testing of the Carbon Data Explorer was conducted using Apache
JMeter. For each of the requests listed in the table above, response times were recorded both on and off the host's network.
Open data APIs for science capitalize on storing and sharing text-based metadata associated with scientific data that are stored in a binary or hierarchical format. We took this a step further and designed a data model that is text-only; that is, the format of the data both on-disk and when transmitted over the web is plain text. Specifically, the data are stored and transmitted as JSON documents. These JSON documents are stored in a MongoDB database instance, which handles indexing and retrieval of plain-text representations.
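As an illustration only (the field names below are hypothetical and do not reflect the tool's actual document schema), a single time step might be stored as a JSON-like document and inserted with pymongo as follows.

from datetime import datetime
from pymongo import MongoClient

doc = {
    'timestamp': datetime(2004, 1, 18, 3, 0),   # time step this document represents
    'values': [0.12, 0.08, -0.03],              # flattened grid of measurement values
}

client = MongoClient('localhost', 27017)
client['example_db']['casa_gfed_2004'].insert_one(doc)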
MongoDB is one of several document-oriented databases capable of storing semi-structured data as key-value pairs. As the goal was to get the data on the web, we chose MongoDB for its transparent, text-based storage. Alternatives such as Apache Hadoop and Cassandra, while offering performance advantages, do not provide a clear pathway for rendering binary files as text. These alternatives may faithfully and rapidly operate on chunks of the data but would require that input binary files be split and transformed into some kind of operational format for handling in a map-reduce framework. As no obvious intermediate format was known at the time of development, we opted for a format that most closely resembled the output representation – the native, text-based representation required for the web browser – as the intermediate format to be stored and operated on in the database. This design choice trades off performance for flexibility and the operational demands of bringing data in binary files onto the web.
Array databases such as PostGIS, a spatial extension to the relational database management system (RDBMS) PostgreSQL, are another alternative to MongoDB that we considered. The use of an array database would have satisfied the need for an intermediate format – in this case, array stores – but for the purposes of enabling in-client manipulation of the data (e.g., querying measurement values, changing the stretch) this approach would have required the transformation of requested data to another format, likely text. Based on the authors' experience with PostGIS, there were also no clear performance advantages to array databases. Thus, a document-oriented database like MongoDB allows the latency associated with preparing scientific data for the web to be pushed offline, during initial registration and insertion of the data to the database. In addition, MongoDB features an aggregation pipeline, which allows us to make sophisticated queries such as net carbon flux over the last 16 days. The web server API, which facilitates connections to the MongoDB instance, contains libraries that enable further sophistication with queries, applying fast arithmetic operations for queries such as the difference between carbon concentration (in ppm) today and this day last year.
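A hedged sketch of such a query follows, in the spirit of the "net carbon flux over the last 16 days" example; the collection and field names match the hypothetical document sketched above rather than the tool's actual schema.

from datetime import datetime, timedelta
from pymongo import MongoClient

collection = MongoClient('localhost', 27017)['example_db']['casa_gfed_2004']
since = datetime(2004, 1, 18) - timedelta(days=16)

pipeline = [
    {'$match': {'timestamp': {'$gte': since}}},   # restrict to the last 16 days
    {'$unwind': '$values'},                       # one document per grid-cell value
    {'$group': {'_id': None, 'net_flux': {'$sum': '$values'}}},
]
net_flux = list(collection.aggregate(pipeline))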
Users can shuttle scientific data into and out of the MongoDB instance by directly interacting with the Carbon Data Explorer Python API classes or by using a set of accompanying command line tools designed to ease workflow. Command line tools are available for querying database contents as well as for loading, renaming, and removing data sets from the database. When loading a data set, its metadata must be specified either via command line argument or via an accompanying JSON file. Examples of required metadata parameters include column identifiers, grid resolution, units, starting timestamp, and time step length. These metadata parameters inform the correct methods for transforming and querying the data for use within the web server API. The metadata also encode population summary statistics, which are calculated by the Python API during insertion to MongoDB, to aid in visualization (e.g., calculating a stretch).
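For example, a metadata file along the following lines (the key names are illustrative; the Python API documentation defines the actual parameter names) could accompany the CASA-GFED data set described earlier.

import json

metadata = {
    'columns': ['lng', 'lat', 'value'],      # column identifiers
    'grid': {'x': 1.0, 'y': 1.0},            # grid resolution, in degrees
    'units': 'umol m-2 s-1',                 # measurement units (assumed here)
    'start': '2004-01-18T03:00:00',          # starting timestamp
    'step': 10800,                           # time step length, in seconds (3 h)
}

with open('casa_gfed_2004_metadata.json', 'w') as stream:
    json.dump(metadata, stream, indent=2)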
The transformation of data from binary or hierarchical flat files to a
database representation is facilitated by two Python classes, models and
mediators, which are loosely based on the transformation interface described
by
Transforming heterogeneous data to a uniform structure is a typically onerous
task. In developing the interface for storing data in MongoDB, we aimed for a
flexible system predicated on sensible defaults. The Model class of the
Python API defines how measurement values can be read from any file interface
available in Python. This flexibility was also driven by the historical
development of related software systems. For instance, we discovered that
Matlab has changed the format of its saved binary output files over the years
from a proprietary data structure to one that is compatible with HDF5.
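The following is a minimal sketch of this two-class pattern; the class names, method names, and storage details are illustrative rather than the actual Carbon Data Explorer interfaces.

import h5py

class ExampleModel(object):
    """Defines how measurement values are read from one kind of file interface."""

    def __init__(self, path, variable):
        self.path = path
        self.variable = variable

    def values(self):
        # Read and flatten the measurement values for a single time step
        with h5py.File(self.path, 'r') as hdf:
            return hdf[self.variable][:].flatten().tolist()

class ExampleMediator(object):
    """Transforms a model's values into documents for the database."""

    def __init__(self, collection):
        self.collection = collection  # e.g., a pymongo Collection

    def save(self, model, timestamp):
        self.collection.insert_one({
            'timestamp': timestamp,
            'values': model.values(),
        })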
Scientific data in the Carbon Data Explorer are conceived of as belonging to
a particular run of a scenario, i.e., a specific geophysical modeling
objective. Data cube(s) are stored as one or more scenarios. During data
insertion to MongoDB, the gridded and non-gridded data associated with a scenario are handled differently, as described below.
Non-gridded data are assigned arbitrary unique identifiers, making it
possible to have two pieces of non-gridded data that represent the same
instance in time (or span of time) associated with the same scenario. At the
present time, the Carbon Data Explorer supports only structured grids; that
is, the gridded data in a scenario must share the same uniform, rectangular
grid. Measurement values are stored and transmitted independent of the
spatial reference information, eliminating redundancies and allowing for
rapid retrieval and display on the web.
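Because the coordinates are not embedded with the values, any consumer of the data can rebuild the grid from the scenario metadata alone. A minimal sketch follows, assuming (for illustration only) a global 1-degree grid stored in row-major order.

import numpy as np

values = np.zeros(180 * 360)            # flattened measurement values, as transmitted
grid = values.reshape((180, 360))       # rows are latitudes, columns are longitudes

lats = np.arange(89.5, -90.0, -1.0)     # cell-center latitudes, north to south
lngs = np.arange(-179.5, 180.0, 1.0)    # cell-center longitudes, west to east

# Value at the grid cell centered on (42.5 N, 83.5 W)
value = grid[np.argmin(np.abs(lats - 42.5)), np.argmin(np.abs(lngs - (-83.5)))]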
The Carbon Data Explorer web server API is designed to work out-of-the-box so that data can be served and visualized with the web application on any web browser connected to the same local area network. That is, any user on the same network as the computer running the server can access the Carbon Data Explorer through its internet protocol (IP) address in their web browser. Data providers might choose to host the Carbon Data Explorer locally so as to keep their data private and collaborate internally. Deploying the server and web application on the public web is also easy, though it may require some familiarity with networking technology.
The web server makes data available as resources that are each associated
with a uniform resource identifier (URI). The model used for organizing these
resources in a single namespace (i.e., under a single host or domain name) is
the Representational State Transfer (REST) model.
As another example, a map of carbon flux on 18 January 2004 at 03:00 UTC
from the casa_gfed_2004 scenario can be obtained at
/scenarios/casa_gfed_2004/xy.json?time=2004-01-18T03:00, where xy refers to the x-y (spatial) plane of the data cube.
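The same resource can be retrieved programmatically; a brief sketch with the Python requests library is shown below (the host and port are placeholders).

import requests

response = requests.get(
    'http://localhost:8080/scenarios/casa_gfed_2004/xy.json',
    params={'time': '2004-01-18T03:00'})
body = response.json()   # a plain JSON document, ready for display or further analysis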
Entry points for the Carbon Data Explorer web server API.
Screenshot of the Carbon Data Explorer web browser application in the Single Map View mode.
These limited examples showcase only a small part of the functionality of the
web server's API (see the table of API entry points).
The RESTful design of the web server's API underscores an important point about having scientific data directly available in the user's web browser. We believe that scale changes, palette changes, and similar adjustments are the purview of the client application; because they are merely changes in the application's state, they should be performed asynchronously in the client without requiring interaction with the remote server. Keeping data on the server means that new representations must be generated even for relatively minor changes in application state, such as seamless rescaling of the visualization (e.g., changing the stretch on the fly). We have seen performance issues in comparable approaches to this problem, e.g., with WMS, where the data must be requested from the server again whenever the scaling changes. While similar tools such as Giovanni also enable seamless changes to visualization parameters, they do not allow for map-based querying of measurement values or simultaneous comparison of measurement values across data sets, as the Carbon Data Explorer does.
In the Carbon Data Explorer client application, a rich user interface provides two visualization modes: the Single Map View and the Coordinated View.
List of features (present when marked with an X) in the two visualization modes of the Carbon Data Explorer web browser application.
The default view in the Carbon Data Explorer client application is the
Single Map View, which displays a geographic view (an x-y slice of the data cube) for a single point in time.
The Single Map View allows the user to explore the data as in a geographic information system (GIS). Users can zoom into the map display, pan the map around, and query the value of a data point by hovering over it with the cursor. Non-gridded data can be plotted on top of gridded data and automatically share the same color scale. An optional border drawn around the non-gridded data points can help to distinguish them from the gridded data. This feature allows, for example, the direct comparison of gridded carbon concentration with bias-corrected retrievals from atmospheric sounding.
Data can be quickly aggregated in time or space from within the web
application. The temporal aggregation is handled by the MongoDB aggregation
pipeline, which facilitates very fast aggregation of multiple time steps.
Spatial filters can be drawn directly on the map interface or imported as
polygons defined using GeoJSON or well-known text (WKT), a human-readable
representation of geometry. Currently, only a single polygon can be used at a
time. Map data can also be differenced – one map can be subtracted from another (e.g., to compare two scenarios or two points in time).
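For reference, the two accepted spatial filter formats mentioned above might look like the following for an arbitrary rectangular region (coordinates chosen only for illustration).

# A GeoJSON polygon...
geojson_polygon = {
    'type': 'Polygon',
    'coordinates': [[[-93.0, 41.0], [-75.0, 41.0], [-75.0, 49.0],
                     [-93.0, 49.0], [-93.0, 41.0]]],
}

# ...and the equivalent well-known text (WKT) representation
wkt_polygon = 'POLYGON ((-93 41, -75 41, -75 49, -93 49, -93 41))'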
While in the Single Map View, the map can be animated in time, updating its
display as it steps through successive time points.
A line plot at the bottom of the map shows the global time series for the
currently viewed scenario by default; it is the aggregate mean value across the entire spatial domain at each time step.
The Coordinated View allows for comparison of multiple adjacent map views; it is essentially a grid of multiple Single Map View elements. These maps synchronize their extent whenever the user pans or zooms so that the same portion of the globe is displayed in each one. The user's cursor displays not just the value of a data point in one map but the value at those spatial coordinates in every map, facilitating pixel-to-pixel comparison across the maps. Up to nine maps can be viewed at once, which allows for nine different time points or nine different models to be viewed simultaneously.
A user's Map Settings, Symbology, and other global settings are stored in the web browser so that, upon closing the browser and returning to the web application later, the same color scale, map projection, and other settings are automatically applied. This allows users to customize their view of a data set and their workspace within the tool. All of these settings can also be encoded as a URI (or URL). This allows specific views of a data set to be bookmarked or shared with others over the web. With this feature, a user can apply a specific color scale, stretch (or threshold to highlight a particular anomaly), or an aggregate or differenced model result and then share a link that ensures that their team member will see the data exactly the same way. This is similar to the virtual variables of Ferret-THREDDS but provides not only access to an analysis but also access to a client for visualizing and interacting with that analysis. For offline storage and sharing of results, model visualizations and data slices can be exported as image files, CSVs (for non-gridded data), or as geospatial data (for gridded data) in the form of ESRI ASCII Grid files or GeoTIFFs; the latter two formats enable model results to be downloaded and opened in a desktop GIS like ArcGIS or QGIS.
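As an illustration of the simplest of these export formats, the sketch below writes a gridded result to an ESRI ASCII Grid file; the grid dimensions, extent, and file name are assumptions for the example, not the application's actual export code.

import numpy as np

grid = np.zeros((180, 360))              # e.g., a 1-degree global grid of flux values

header = '\n'.join([
    'ncols 360',
    'nrows 180',
    'xllcorner -180.0',
    'yllcorner -90.0',
    'cellsize 1.0',
    'NODATA_value -9999',
])

with open('casa_gfed_2004_flux.asc', 'w') as stream:
    stream.write(header + '\n')
    np.savetxt(stream, grid, fmt='%.4f')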
The Carbon Data Explorer is presented as a prototype for a comprehensive data management, analysis, visualization, and sharing framework for Earth system science data sets, particularly gridded spatiotemporal data sets (e.g., NASA level III data products). As with any design, there are inherent trade-offs. In prioritizing client-side queries and analysis for regional and global-scale climate data (e.g., 1-degree-by-1-degree), this design will not scale to higher-resolution data sets, particularly those derived from moderate-resolution satellite sensors (e.g., MODIS products at 1 km ground sample distance). Future iterations of the Carbon Data Explorer and similar software tools can meet the challenge of increasing spatial resolution by combining support for client-side vector features with conventional raster imagery services (e.g., Web Coverage Service, OPeNDAP, THREDDS). The authors hope that the Carbon Data Explorer serves as a model of best practices in client-side interaction, analysis, and visualization for such future integrations. To accommodate high-resolution data sets, later versions of the Carbon Data Explorer will need a radically redesigned back-end, incorporating the scalability of alternatives such as Apache Hadoop or Cassandra, along with the development of text-based HTTP response middleware. A future tool that makes such scalable back-ends available for rapid, client-side queries and analysis in a user-friendly web application, informed by visualization best practices (as with the Carbon Data Explorer), while accommodating high-resolution geophysical data, would be a significant advance.
The Carbon Data Explorer contains all the tools necessary for online scientific data analysis in one package, including a non-blocking web server, an extensible, lightweight API, and a user-friendly web application. The text-based JSON format for storage and data interchange is not only fundamentally compatible with web browsers but also allows scientific data to be manipulated in the web browser, providing asynchronous, rapid filtering and aggregation. In response to the new protocols of CMIP6, the Carbon Data Explorer provides a framework for the distributed analysis of climate model outputs. Analyses can effectively be bookmarked, with URIs serving as permanent links to a particular visualization and analysis of a data cube at a given point in time. The framework's open-source licensing and web integration enable the visualization and sharing of scientific data through either a secure network or a public portal. As a prototype, it is also hoped that the software's seamless, interactive visualization and comparison features will inspire the expansion of existing data management and data access frameworks such as TDS and Ferret-THREDDS to support richer, JavaScript-based visualization libraries, and that they will facilitate the future improvement of the Carbon Data Explorer and inspire similar and better tools for Earth system science.
The source code is available from GitHub.
The software described in this paper was developed by the Michigan Tech Research Institute (MTRI) in partnership with Anna Michalak at the Carnegie Institution for Science's Department of Global Ecology at Stanford University and with funding from NASA (Grant no. NNX12AB90G). The authors would also like to acknowledge the contributions of Nicholas Molen to the early development of the Carbon Data Explorer and of Reid Sawtell to the front-end web browser application. Special thanks go to Mae Qiu at Stanford University and Vineet Yadav at NASA Jet Propulsion Laboratory for their contributions to design and testing. The authors would also like to thank James Arnott and Scott Kalafatis at the School of Natural Resources and Environment at the University of Michigan for their reviews of an early draft of this paper. Edited by: J. Kala