In this paper, we present Par@Graph, a software toolbox to reconstruct and
analyze complex climate networks having a large number of nodes (up to at
least 10

Over the last decade, the techniques of complex network analysis have found
application in climate research. Many studies have focused on correlation patterns
in the atmospheric surface temperature

In most studies, the above so-called interaction networks were used. Here the observation
locations serve as nodes, and edges (links) are based on statistical measures of similarity,
for example, a correlation coefficient, between pairs of time series of climate variables at these
different locations. Given time series of climate data, represented
by an

On the other hand, analyzing the resulting network (graph) is non-trivial and
also computationally challenging. Considering a graph

Stanford Network Analysis Platform see

The most popular approach to tackle such computational challenges is to
exploit parallelism in both the construction and the analysis of those
massive graphs through the design of efficient algorithms for parallel
computing platforms. In this regard, some contributions have been made to the
development of algorithms that exploit parallel computing machines such as in
The Parallel BGL

Networkit see

Indeed, most researchers tend to develop their own tools to build correlation
matrices beforehand, and thereafter transform these matrices into
appropriate graph data structures that can be handled by the existing
graph analysis libraries. An exception is the software package
Pyunicorn

Pyunicorn see

The networks which so far have been handled in climate research applications
had only a limited (at most

In this paper, we introduce a complete toolbox, Par@Graph, designed for
parallel computing platforms, which is capable of preprocessing a large
number of climate time series and calculating pairwise statistical
measures, leading to the reconstruction of climate networks with a large number of nodes. In addition, Par@Graph
is provided with a set of high-performance network analysis algorithms for
symmetric multiprocessing machines (SMPs). It is also coupled to a
parallelized version of

The rest of the paper is organized as follows. In Sect. 2, we give an overview of the computational challenges associated with the reconstruction of climate networks and their analysis. In Sect. 3, we provide a description of the design of Par@Graph and its parallel algorithms for the reconstruction and analysis of climate networks from climate time series. In Sect. 4, we describe the application of the toolbox to data from a high-resolution ocean model including a performance and scaling analysis. Section 5 provides a summary and discussion of the results.

A common data set of climate observations or model results consists of
spatiotemporal grid points

To define a link between two nodes, both linear and nonlinear dependencies
can be considered. To measure linear correlations between the time series

In many climate applications, one is interested in propagating features, such
as that of ocean Rossby waves. Time-delayed (time-lagged) relationships that
exist between climate variables in different geographical locations have also
been addressed by the climate networks approach

Having derived the correlation matrix

Note that because

Many properties in climate networks have interesting physical interpretations and it is important to compute them efficiently. For later reference in Sects. 3 and 4, we list here the most important properties.

where

where

where

With the sequential algorithm proposed by Brandes (2001), it
can be computed in

All these quantities can be obtained using the

In practice, the reconstruction and analysis of climate networks are carried out as a sequence of separate tasks. First, the climate time series are preprocessed; then the correlation matrix is calculated; next, the network is constructed from either the correlation matrix or another graph data structure such as an adjacency matrix; and finally the network is analyzed using the selected graph algorithms library. In contrast to this sequence of separate computations, Par@Graph is designed to provide end-to-end support for the creation and analysis of climate networks by integrating parallel computing tools that perform all the processing involved efficiently, while also optimizing the required memory.
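For orientation, the four sequential steps listed above can be sketched serially with NumPy. This is only an illustrative sketch, not Par@Graph code: the function names `build_network` and `degree` are hypothetical, and applying the threshold to the absolute zero-lag Pearson correlation is an assumption made here for simplicity.

```python
import numpy as np

def build_network(series, threshold):
    """Return a weighted adjacency matrix from an (n_nodes, n_time) array.

    Assumption: a link exists where |Pearson correlation| > threshold.
    """
    # Step 1, preprocessing: standardize each node's time series.
    anomalies = series - series.mean(axis=1, keepdims=True)
    anomalies = anomalies / anomalies.std(axis=1, keepdims=True)
    # Step 2, correlation matrix (zero lag).
    corr = anomalies @ anomalies.T / series.shape[1]
    # Step 3, network construction: threshold and drop self-loops.
    adjacency = np.where(np.abs(corr) > threshold, corr, 0.0)
    np.fill_diagonal(adjacency, 0.0)
    return adjacency

def degree(adjacency):
    """Step 4, analysis: number of links attached to each node."""
    return np.count_nonzero(adjacency, axis=1)

t = np.arange(8, dtype=float)
series = np.vstack([t, 2.0 * t + 1.0, np.tile([1.0, -1.0], 4)])
adjacency = build_network(series, threshold=0.5)
print(degree(adjacency))
```

The dense `corr` array in this sketch grows quadratically with the number of nodes, which is the memory and compute bottleneck that motivates a parallel, end-to-end design.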

Par@Graph is composed of a set of coupled parallel tools designed to leverage the inherent hybrid parallelism of distributed-memory clusters of multi-core (SMP) machines, using the MPI and OpenMP standards. The tools are grouped into two major software modules, which we refer to as the Network Constructor and the Analysis Engine, together with additional interfacing tools and wrappers.

Provided a parallel machine of

This module carries out the calculation of the correlation matrix

The design of the constructor follows a master-worker parallel computing paradigm for distributed-memory clusters of SMPs. The calculation of the correlations between time series is distributed over the computing elements (workers), which form a ring topology of processes (Fig. 1) and communicate with each other using MPI.
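To illustrate how a ring of workers can cover all pairs of time series, the following serial sketch simulates one possible exchange scheme. It is an assumption, not the scheme of Algorithm 1: each worker keeps its own block of series and, at every step, correlates it with a block that has travelled one position further around the ring. The names `ring_schedule` and `blockwise_correlation` are hypothetical, and no MPI calls are made.

```python
import numpy as np

def ring_schedule(n_procs):
    """List (step, rank, visiting_block) so each block pair is handled once."""
    tasks = []
    for step in range(n_procs // 2 + 1):
        for rank in range(n_procs):
            visiting = (rank - step) % n_procs
            # On an even ring the half-way step would pair blocks twice;
            # only the first half of the ranks do the work.
            if n_procs % 2 == 0 and step == n_procs // 2 and rank >= n_procs // 2:
                continue
            tasks.append((step, rank, visiting))
    return tasks

def blockwise_correlation(series, n_procs):
    """Assemble the Pearson correlation matrix tile by tile (serial simulation)."""
    n_nodes, n_time = series.shape
    z = series - series.mean(axis=1, keepdims=True)
    z = z / z.std(axis=1, keepdims=True)
    blocks = np.array_split(np.arange(n_nodes), n_procs)
    corr = np.zeros((n_nodes, n_nodes))
    for _, rank, visiting in ring_schedule(n_procs):
        rows, cols = blocks[rank], blocks[visiting]
        tile = z[rows] @ z[cols].T / n_time
        corr[np.ix_(rows, cols)] = tile
        corr[np.ix_(cols, rows)] = tile.T  # the matrix is symmetric
    return corr
```

In this scheme roughly half a revolution of the ring is enough, because the correlation matrix is symmetric and each tile also provides its transpose.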

As soon as a process finds

The processing performed by each ring process is summarized in
Algorithm 1 below.

Note that only a subset

The process of constructing the network itself is performed progressively in
the event that the master (

With regard to overall performance, it is crucial not to overlook the I/O overhead, especially because the toolbox is intended to process large climate data sets. To that end, the Network Constructor is designed to perform multiple collective I/O operations at the same time (MPI-IO): each ring process reads its chunk of time series from a parallel file system simultaneously with the others. Furthermore, because the elements of those time series are neither read nor stored contiguously, another key to improving performance is to optimize memory access at each processor. This is achieved at each process by preprocessing tasks that include reordering the process's chunk of time series in order to reduce cache misses during calculation.
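The effect of the reordering step can be illustrated with a small NumPy sketch. It assumes, purely for illustration, that a chunk arrives in time-major order (time, latitude, longitude), so that the values of one node's time series are strided in memory; the function name `reorder_time_series` is hypothetical.

```python
import numpy as np

def reorder_time_series(field):
    """Convert a (n_time, n_lat, n_lon) chunk to a contiguous (n_nodes, n_time) array.

    After the copy, each node's time series occupies consecutive memory,
    so a correlation loop over time touches neighbouring addresses.
    """
    n_time = field.shape[0]
    flat = field.reshape(n_time, -1)      # one column per grid point
    return np.ascontiguousarray(flat.T)   # one contiguous row per grid point

chunk = np.arange(24, dtype=float).reshape(4, 2, 3)
nodes = reorder_time_series(chunk)
print(nodes.shape, nodes.flags["C_CONTIGUOUS"])
```

The copy is paid once per chunk, whereas the strided layout would incur cache misses in every pairwise correlation that touches the chunk.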

Once correlations and their coordinates are available at the master machine,
it consecutively runs graph algorithms to analyze the resulting network. The
parallel algorithms developed for network analysis are based on those in

With respect to the analyzing algorithms, a set of 20 of the core algorithms
of

Mean

Speedup ratio

In

For instance, in a global transitivity routine, by which the
network's average clustering coefficient is obtained, the value

The parallelized algorithms of

Additionally, special attention was given to the calculation of both the degree and strength centralities. The algorithms for both metrics were redesigned so that they are computed progressively while the network is being constructed. In other words, each time the master receives edges from one of the ring processes, these are added to the accumulated edge counts of their respective vertices. As soon as the last packet of edges is received by the master, these metrics are immediately available. Apart from saving time, a notable benefit of this approach is a significant reduction in memory requirements, because each time the master receives a new set of edges, the previous ones are released. This technique enables machines with only a few gigabytes of memory to compute degree and strength centralities for large-scale networks.
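A minimal sketch of such a progressive accumulation is given below. The class name `StreamingCentrality` and the packet layout (three parallel arrays of sources, targets and weights) are assumptions made for illustration; they are not taken from the Par@Graph source.

```python
import numpy as np

class StreamingCentrality:
    """Accumulate degree and strength while edge packets arrive."""

    def __init__(self, n_nodes):
        self.degree = np.zeros(n_nodes, dtype=np.int64)
        self.strength = np.zeros(n_nodes, dtype=np.float64)

    def add_packet(self, sources, targets, weights):
        """Fold one packet of undirected weighted edges into the totals."""
        sources = np.asarray(sources)
        targets = np.asarray(targets)
        weights = np.asarray(weights, dtype=np.float64)
        # np.add.at accumulates correctly when a vertex index repeats.
        np.add.at(self.degree, sources, 1)
        np.add.at(self.degree, targets, 1)
        np.add.at(self.strength, sources, weights)
        np.add.at(self.strength, targets, weights)
        # The packet can be discarded by the caller after this call.

acc = StreamingCentrality(n_nodes=4)
acc.add_packet([0, 0], [1, 2], [0.9, 0.8])
acc.add_packet([2], [3], [0.7])
print(acc.degree, acc.strength)
```

Only two arrays of length equal to the number of nodes are kept, so the memory footprint is independent of the number of edges.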

To meet a wider range of user requirements, Par@Graph is provided
with additional tools, including parallel collective
tools to write the resulting correlation or mutual information matrices, where
each ring process writes its calculated portion to a common file in a
parallel file system. There are also tools to read a matrix (in parallel) and
construct a graph directly from it, as well as tools to read and
write standard graph formats, including edge lists, adjacency lists and the
popular

Another key point is the flexible interface between the Network Constructor and the
Analysis Engine. That is, although the toolbox provides wrappers to the
parallelized

In this section, we will apply Par@Graph to reconstruct and analyze networks obtained
from high-resolution ocean model data. The motivation for performing these computations
is to understand coherence of the ocean circulation at different scales

The data used here are taken from simulations which were performed with the
Parallel Ocean Program (POP;

Correlation networks were built from 1 year (year 136 of the control run)
of the simulated global daily sea surface height (SSH) data. The seasonal
cycle was removed by subtracting, for each day of the year, its 5-day running
mean averaged over years 131 to 141. The mean and standard deviation of the
SSH for this year are plotted in Fig.

Performance of the parallel algorithms –

Two data sets have been used for network reconstruction, one with the actual
0.1

The results were computed on a bullx supercomputer

First experiments were performed to construct weighted Pearson correlation
networks from the 0.4

Different threshold values

The execution time falls nearly superlinearly with the number of processors
up to 100. Moreover, the performance becomes strongly superlinear for

The time taken by both the parallel reading and the reordering of time series is
roughly constant and negligible compared to the overall execution time,
regardless of the number of processors, as shown in Fig.

Results of the performance tests to determine the six network properties as
discussed in Sect. 2.2 are shown in Fig.

Although there are some differences in the performance gain in each of the
algorithms, a general improvement is achieved by our fine-grained parallel
implementation over the sequential

In some algorithms, such as the clustering coefficient, parallel performance seems more sensitive to the density of the network, whereas in others, such as the degree centrality, performance is unaffected. Although an evident performance gain is observed here, one has to remember that the performance of the vast majority of network analysis algorithms is highly dependent on the topology of the network itself, and thus a further study should be carried out to compare results for different types of networks.

Regarding memory requirements, we show in Table 2 a comparison of the
memory needed to represent an edge (for different types of networks) when using

A single edge's size in memory when using the indexed edge list used
in

Similar performance results were obtained for tests using much larger
correlation networks from the

Being able to reconstruct and analyze the large complex networks arising from the POP ocean model, we now briefly demonstrate the novel results one can obtain. One of the important questions in physical oceanography concerns the coherence of the global ocean circulation. In low-resolution (non-eddying) ocean models, the flows appear quite coherent, with near-steady currents filling the ocean basins. However, as soon as eddies are represented (when the grid spacing is smaller than the internal Rossby radius of deformation), a fast decorrelation is seen in the flow field.

The issue of coherence has for example been tackled by looking at the
eigenvalues of the transfer matrix

In Fig.

The precise physical interpretation of these metrics is outside the scope of
this paper as it requires a background in dynamical oceanography. However,
one can observe that the subtropical gyres

Up to now, the data sets (both observational and model based) used to reconstruct
and analyze climate networks have been relatively small due to computational limitations.
In this paper we presented the new parallel software toolbox Par@Graph
to construct and analyze large-scale complex networks. The software exposes
parallelism on distributed-memory computing platforms to enable the construction
of massive networks from a large number of time series based on the calculation of common
statistical similarity measures between them. Par@Graph is also provided with
a set of parallel graph algorithms to enable fast calculation of important properties of the
generated networks on SMPs. These include the betweenness, closeness,
eigenvector and degree centralities as well as the algorithms needed for the calculation of
transitivity, connected components, entropy and diameter. Additionally, a parallel implementation
of a community detection algorithm based on modularity optimization

The capabilities of Par@Graph were shown by using sea surface height data of
a strongly eddying global ocean model
(POP). The resulting networks had numbers of nodes ranging from

With regard to the challenging issue of the memory required to
compute such big networks, we showed that the presented toolbox notably
optimizes the usage of memory during the reconstruction of large-scale
networks by minimizing the accompanying data redundancy. Additionally, the
resulting networks themselves are markedly lighter in size compared to their
equivalents in

The availability of Par@Graph will allow one to address a new set of questions in climate
research, one of which, the coherence of the ocean circulation at different scales, was briefly discussed
in this paper. Apart from higher-resolution data sets of one observable, it will now also be possible
to deal with data sets of several variables and to more efficiently reconstruct and analyze
networks of networks

Par@Graph is not yet provided with a license. For the time being, the source code is available from the authors upon request. The authors will also provide support for the initial software installation and setup.

The authors would like to acknowledge the support of the LINC project (no. 289447) funded by the EC's Marie-Curie ITN program (FP7-PEOPLE-2011-ITN). The computations on the Cartesius machine were funded by the Exact Sciences division of the Netherlands Organization of Scientific Research under grant SH-284-14.

Edited by: D. Ham