Par@Graph – a parallel toolbox for the construction and analysis of large complex climate networks

. In this paper, we present Par@Graph, a software toolbox to reconstruct and analyze complex climate networks having a large number of nodes (up to at least 10 6 ) and edges (up to at least 10 12 ). The key innovation is an efﬁcient set of parallel software tools designed to leverage the inherited hybrid parallelism in distributed-memory clusters of multi-core machines. The performance of the toolbox is illustrated through networks derived from sea surface height (SSH) data of a global high-resolution ocean model. Less than 8 min are needed on 90 Intel Xeon E5-4650 processors to reconstruct a climate network including the preprocessing and the correlation of 3 × 10 5 SSH time series, resulting in a weighted graph with the same number of vertices and about 3 . 2 × 10 8 edges. In less than 14 min on 30 processors, the resulted graph’s degree centrality, strength, connected components, eigenvector centrality, entropy and clustering coefﬁcient metrics were obtained. These results indicate that a complete cycle to construct


Introduction
Over the last decade, the techniques of complex network analysis have found application in climate research.Many studies were focused on correlation patterns in the atmospheric surface temperature (Tsonis and Roebber, 2004;Tsonis et al., 2010;Donges et al., 2009bDonges et al., , a, 2011) ) and teleconnections (Tsonis et al., 2008).Up to now, the behavior of El Niño (Gozolchiani et al., 2008(Gozolchiani et al., , 2011;;Tsonis and Swanson, 2008;Yamasaki et al., 2008), the synchronization between different spatiotemporal climate variability patterns (Tsonis et al., 2007;Wyatt et al., 2011) and the connections between the sea surface temperature (SST) variability and the global mean temperature (Tantet and Dijkstra, 2014) have been investigated.In addition, network tools have also been used to detect the propagation of SST anomalies on multidecadal timescales (Feng and Dijkstra, 2014) and to develop early warning indicators of climate transitions (van der Mheen et al., 2013;Feng et al., 2014).
In most studies, the above so-called interaction networks were used.Here the observation locations serve as nodes and edges (links) are based on statistical measures of similarity, for example, a correlation coefficient, between pairwise time series of climate variables at these different locations.Given time series of climate data, represented by an N × M matrix, where N is the number of locations and M is the length of data attributes (daily or monthly values), one needs to calculate at least N 2 /2 correlation values.Such computations become challenging for large N; for example, with a network of 10 6 nodes, this would result in 5×10 11 calculations.A further challenge is the memory needed for such a computation.

To only keep the calculated correlation matrix in memory for
Published by Copernicus Publications on behalf of the European Geosciences Union.

3322
H. Ihshaish et al.: Par@Graph -a parallel toolbox for the construction further processing, about 3.7×103 GB of memory is required (consider 8 bytes of memory for each of the 5 × 10 11 matrix items), which is not available in the vast majority of current computing platforms.
On the other hand, analyzing the resulting network (graph) is non-trivial and also computationally challenging.Considering a graph G, with V vertices and E edges, a typical step in an algorithm to analyze G involves visiting each v ∈ V and its neighbors V ⊂ V (the set of vertices connected to v by an edge e ∈ E), then their consecutive neighbors, and so on.Processing such steps is normally done within a computational complexity on the order of | V | and/or | E | squared or cubed.For example the computation of the clustering coefficient, which measures the degree to which its vertices tend to cluster together, has a time complexity of O(| V | 3 ).In practice, there are various available software tools for graph analysis, some providing implementations of single-machine algorithms such as BGL (Boost Graph Library) (Siek et al., 2002), LEDA (Mehlhorn and Näher, 1995), NetworkX (Hagberg et al., 2008), SNAP1 and igraph (Csardi and Nepusz, 2006).However, the computation of a clustering coefficient for a network with | V |= 10 6 would be very challenging, if at all possible, with existing single-machine software.
The most popular approach to tackle such computational challenges is by exploiting parallelism for both the construction and the analysis of those massive graphs through the design of efficient algorithms for parallel computing platforms.In this regard, some contributions have been made to the development of algorithms that exploit parallel computing machines such as in The Parallel BGL (Gregor and Lumsdaine, 2005) and CGMgraph (Chan et al., 2005).However, due to structural irregularity and sparsity of real-world graphs, including those built from climate data, there are few parallel implementations that are efficient, scalable and can deliver high performance.Other factors which contribute to this inefficiency include a manifested irregularity of data dependencies in those graphs, as well as the poor locality of data, making graph exploration and analysis highly dominated by memory latency rather than processing speed (Lumsdaine et al., 2007).A recent intent with NetworKit2 has shown a remarkable step forward towards providing parallel software tools capable of analyzing large-scale networks.Yet as in most of the existing libraries, the processing and memory challenges involved in the construction of graphs with large |V | from statistical measures of time series, has not been addressed.
Indeed most researchers tend to develop their own tools to build correlation matrices beforehand, and thereafter they transform these matrices into appropriate graph data structures that can be handled by the existing libraries of graph analysis.An exception is the software package Pyunicorn 3 (Donges et al., 2013), developed at the Potsdam Institute for Climate Impact Research, that couples Python modules for numerical analysis with igraph.It can carry out both tasks; the construction of climate networks and the analysis of the resulted graphs.However, this software is bounded by the single-machine's memory and speed, making it impossible to reconstruct large-node climate networks and consequently, inappropriate to analyze them.
The networks which so far have been handled in climate research applications had only a limited (at most 10 4 ) number of nodes.As a consequence, coarse-resolution observational and model data have been used with a focus only on large-scale properties of the climate system.This system is, however, known for its multi-scale interactions and hence one would like to explore the interaction of processes over the different scales.Data are available through highresolution ocean-atmosphere-climate model simulations but they lead to networks with at least 10 5 nodes and hence they can neither be reconstructed, nor efficiently analyzed using currently available software.
In this paper, we introduce a complete toolbox Par@Graph designed for parallel computing platforms, which is capable of the preprocessing of large number of climate time series and the calculation of pairwise statistical measures, leading to the reconstruction of large-node climate networks.In addition, Par@Graph is provided with a set of high-performance network analyzing algorithms for symmetric multiprocessing machines (SMPs).It is also coupled to a parallelized version of igraph (Csardi and Nepusz, 2006) -a widely used graphanalysis library.The presented toolbox is provided with an easy-to-use and flexible interface which enables it to be easily coupled to any existing graph-analysis software.
The rest of the paper is organized as follows.In Sect.2, we give an overview of the computational challenges associated with the reconstruction of climate networks and their analysis.In Sect.3, we provide a description of the design of Par@Graph and its parallel algorithms for the reconstruction and analysis of climate networks from climate time series.In Sect.4, we describe the application of the toolbox to data from a high-resolution ocean model including a performance and scaling analysis.Section 5 provides a summary and discussion of the results.

Climate networks
A common data set of climate observations or model results consists of spatiotemporal grid points i, i = 1, . .., N at a given latitude and longitude, each having a time series of a state variable, for example, temperature, T i (t k ) of length L, with k = 1, . .., L. In order to reconstruct a climate network, some preprocessing tasks are required beforehand, including the selection of grid locations and calculation of anomalies (e.g., removal of a trend and/or a seasonal cycle that might produce strong autocorrelations between different locations).Having done this, each grid point is considered to be a node in the resulting network.

Network reconstruction
To define a link between two nodes, both linear and nonlinear dependencies can be considered.To measure linear correlations between the time series T i (t k ) and T j (t k ), the Pearson correlation coefficient R ij given by is widely used (Tsonis and Roebber, 2004).Alternatively, measures of nonlinear correlation can be used, such as the mutual information M ij , given by Here P i (T i ) is the probability density function (PDF) of time series T i , and P ij (T i , T j ) is the joint PDF for (T i , T j ).The issue whether R ij or M ij is better to quantify the statistical similarity between nodes i and j is discussed in (Donges et al., 2009a).Whatever the choice, however, a correlation matrix (C) of N × N elements is produced, where and N is again the number of grid points.In many climate applications, one is interested in propagating features, such as that of ocean Rossby waves.Timedelayed (time-lagged) relationships that exist between climate variables in different geographical locations have also been addressed by the climate networks approach (Gozolchiani et al., 2008;Berezin et al., 2012;Tirabassi and Masoller, 2013;Feng and Dijkstra, 2014;Tupikina et al., 2014).These are commonly measured by examining the correlation between the time series of two locations relatively shifted in time with respect to one another.Technically this can be done by defining a time-lag interval and computing the correlation measures between the shifted time series (Feng and Dijkstra, 2014).One can also define a time interval, say [t min , t max ], and then find the value of t in this interval where C ij (t) is maximal (Gozolchiani et al., 2008).
Having derived the correlation matrix C, a threshold τ is usually applied to define strong similarities between nodes as "links".The adjacency matrix A for the network is then found by where is the Heaviside function and δ is Kronecker delta.If correlation values are to be considered as weights for the resulted links, the elements of A after thresholding C with τ become (4) Note that because C is symmetric, the resulting network is always undirected when C is calculated at zero time lag.However, when time-lagged correlations are studied, directions are added to the links between nodes, reflecting the direction of the shifting of their corresponding time series.If only thresholding is applied, but the values of the correlation matrix are kept, a weighted network will result.

Network analysis
Many properties in climate networks have interesting physical interpretations and it is important to compute them efficiently.For later reference in Sects.3 and 4, we list here the most important properties.
-Degree centrality.The degree centrality k i of a node i refers to the number of its incident vertices, that is, where N (i) is the set of vertices adjacent to i.
-Strength centrality.For a weighted network, the strength centrality is given by the sum of the weights of the edges between the node and its incident vertices.
-Clustering coefficient.The Watts-Strogatz clustering coefficient C i measures the probability that two randomly chosen neighbors of a node are also neighbors (Watts and Strogatz, 1998).This metric is calculated for each i ∈ V by where k i is the number of neighbors of i and i is the number of connected pairs between all its neighbors.-Entropy.The Shannon entropy (H i ) of the incident edges' weights (Anand and Bianconi, 2009) is given for node i ∈ V by where k i is the (total) degree of node i and w ij is the weight of edge(s) between nodes i and j .The computation of the entropy for all nodes in a graph is obtained in -Eigenvector centrality.This centrality metric is commonly used to evaluate the influence of a vertex in a network qualitatively.Unlike degree centrality, which weights every edge equally, the eigenvector centrality assigns relative scores to all vertices in the network based on the concept that edges with high-scoring vertices contribute more to the score of the vertex in question than equal edges with low-scoring nodes.As a result, one would find that a vertex having a high degree does not necessarily imply a high eigenvector centrality, since its connectivity might be with less important vertices.Equally, a vertex with a high eigenvector centrality is not necessarily highly linked (the vertex might have few but important links).As defined in Bonacich (1972), let A = (A ij ) be the adjacency matrix of the graph G(V , E), the centrality score (x i ) of node i ∈ V is then found by where λ is a constant.Note that there could be many eigenvalues λ for which an eigenvector exists, however, the centrality score is determined by calculating the eigenvector corresponding to the largest positive eigenvalue of the adjacency matrix.This metric is obtained computationally in O(|V |+|E|) time and O(|V |) space.
-Betweenness centrality.This measure, indicated here by BC i is based on the shortest-path enumeration.It is considered one of the more commonly used metrics to quantify the relative importance of nodes in a graph (Freeman, 1977).To obtain this metric given a graph G(V , E), let σ st denote the number of shortest paths between the vertices s and t.When the count of those which pass through the node i is σ st (i), then the BC i is obtained by With the sequential algorithm which has been proposed in Brandes (2001), it can be computed in All these quantities can be obtained using the igraph library for relatively small-node networks.

Description of the toolbox
In practice, the reconstruction and analyses of climate networks are carried out through performing a set of separate Figure 1.Provided a parallel machine of p processors, p − 1 processes are initialized and assigned with equal blocks of time series, each block's set of time series are correlated, then these blocks are exchanged (p − 1)/2 times (half round of the ring) between processes to complete the all-to-all correlations between the whole set of time series.Conversely, p 0 (the master computing element) is initialized as a master process to gather the resulting calculations and perform the analysis tasks on the resulted network.tasks, progressively.First, the preprocessing of climate time series occurs, then the correlation matrix is calculated, followed by network construction from either the correlation matrix or another graph data structure like an adjacency matrix, and finally the network is analyzed using the selected graph algorithms library.Contrary to these sequence of computations, Par@Graph is designed to provide end-to-end support for the creation and analysis of climate networks by integrating parallel computing tools to perform all the involved processing efficiently, with attention at the same time to optimize required computing memory.
Par@Graph is composed of a set of coupled parallel tools designed to leverage the inherited hybrid parallelism in distributed-memory clusters of multi-core (SMPs) machines, using MPI/OpenMP standards.The provided tools are classified into two major software modules, which we refer to as the Network Constructor and the Analysis Engine, together with additional interfacing tools and wrappers.

Network Constructor
This module carries out the calculation of the correlation matrix C from the given time series.It also applies a userdefined threshold τ to generate the corresponding network adjacency matrix A. Then it proceeds to the transformation of the resulted matrix into a network data structure which will later be analyzed by the analysis module.
The design of the constructer follows a master-worker parallel computing paradigm for distributed-memory parallel clusters of SMPs.The calculation of the correlations between time series is distributed over the computing elements (workers), forming a ring topology of processes (Fig. 1), which communicate between each other using MPI standards.
As soon as a process finds C ij ≥ τ , then the pair (i, j ) is copied to a local process's buffer of a user-configurable size, and sends the iteratively filled buffer to the master p 0 , where the network is to be analyzed.Note that if the network is weighted, the value of C ij itself is also copied and sent to the master side by side with its pair of nodes i and j (and in like manner time-lag values).
A brief description of the processing associated with each ring process is described in Algorithm 1 below.MPI standards.As soon as a process finds Cij ⌧ , then the pair (i,j) is copied to a local process's buffer of a 180 user-configurable size, and sends the iteratively filled buffer to the master p0, where the network is to be analyzed.
Note that if the network is weighted, the value of Cij itself is also copied and sent to the master side by side with its pair of nodes i and j (and in like manner time-lag values).
A brief description of the processing associated to each ring process is described in Algorithm 1 below.Note that only a subset C of C, such that ∀ ∈ C, Cij ≥ τ , is sent progressively to the master computing element.This indeed means that the under-threshold values of C are discarded directly at each ring process.This reduces both the amount of data sent to the master element and the memory required there for the construction of the network.
The process of constructing the network itself is performed progressively in the event that the master (p 0 ) receives edges' coordinates (and attributes, e.g., weights/lags) from any ring process.Initially p 0 , having the number of data set grid points, constructs a completely unconnected network, that is, no edges between graph vertices.As soon as ring processes start sending edge coordinates to p 0 , these edges are added to the network straightaway.In the long run, constructing the network following this approach results in saving time, since the master is idle (except when receiving data from workers) during the ring processing iterations.And more importantly, because the coming edges are added directly to the graph data structure, memory usage is optimized at the master as data redundancy are markedly minimized.
With attention to the overall performance, it is crucial not to overlook the I/O overhead, especially because the toolbox is intended to be processing large climate data sets.To that end, the Network Constructor is designed to perform mul-tiple I/O collective operations at the same time (MPI-IO).In like manner, simultaneously, each ring process reads its chunk of time series from a parallel file system.Furthermore, owing to the fact that the elements of those time series are neither read nor stored contiguously, another key point in order to improve performance is to optimize memory access at each processor.This is provided at each process by performing preprocessing tasks that include the reordering of each process's chunk of time series, for the sake of reducing cache misses during calculation.

Analysis engine
Once correlations and their coordinates are available at the master machine, it consecutively runs graph algorithms to analyze the resulted network.The developed parallel algorithms for network analysis are based on those in igraph.
The intent here is that this design (coupled with the Network Constructor) will achieve three primary goals: (1) to construct the network rapidly, (2) to enable efficient and safe multi-threading of the core library algorithms and (3) to reduce memory usage for network representation.
With respect to the analyzing algorithms, a set of 20 of the core algorithms of igraph have been parallelized using POSIX threads and OpenMP directives.Generally speaking, the embedded routines of those algorithms -a sample pseudocode is shown below in (a) -could naively be parallelized by transforming their iterative instructions into parallel loops; see (b) pseudocode: This indeed means that the under-threshold values of C are discarded directly at each ring process.This reduc both the amount of data sent to the master element and the memory required there for the construction of network.

190
The process of constructing the network itself is performed progressively in the events that the master (p receives edges' coordinates (and attributes, e.g.weights/lags) from any ring process.Initially p 0 , having number of dataset grid points, constructs a completely unconnected network, i.e. no edges between graph vertic As soon as ring processes start sending edge coordinates to p 0 , these edges are added to the network straightaw In the long run, constructing the network following this approach results in saving time, since the master is i 195 (except when receiving data from workers) during the ring processing iterations.And more importantly, becau the coming edges are added directly to the graph data structure, memory usage is optimized at the master as d redundancy is markedly minimized.
With attention to the overall performance, it is crucial not to overlook the I/O overhead, especially because toolbox is intended to be processing large climate datasets.To that end, the Constructor is designed to perfo 200 multiple I/O collective operations at the same time (MPI-IO).In like manner, simultaneously, each ring proc reads its chunk of time series from a parallel file system.Furthermore, owing to the fact that the elements those time series are neither read nor stored contiguously, another key point in order to improve performance to optimize memory access at each processor.This is provided at each process by performing preprocessing tas that include the reordering of each process's chunk of time series, for the sake of reducing cache misses duri 205 calculation.

Analysis Engine
Once correlations and their coordinates are available at the master machine, it consecutively runs graph alg rithms to analyze the resulted network.The developed parallel algorithms for network analysis are based on tho in igraph.The intent here is that this design (coupled with the Network Constructor) will achieve three p 210 mary goals:-1) to construct the network rapidly, 2) to enable efficient and safe multithreading of the core libra algorithms and 3) to reduce memory usage for network representation.
With respect to the analyzing algorithms, a set of 20 of the core algorithms of igraph have been paralleliz using POSIX threads and OpenMP directives.Generally speaking, the embedded routines of those algorithms sample pseudocode is shown below in (a)), could naively be parallelized by transforming their iterative instru For instance, in a global transitivity routine, by which the network's average clustering coefficient is obtained, value result is scalar (average value), so that parallelism appears straightforward and safe multithreading could achieved by applying reduction binary operators over its parallelizable loop.Although this may be approacha 8 For instance, in a global transitivity routine, by which the network's average clustering coefficient is obtained, the value result is scalar (average value), so that parallelism appears straightforward and safe multi-threading could be achieved by applying reduction binary operators over its parallelizable loop.Although this may be approachable in similar cases, unfortunately in most routines result's value does not depend linearly on the iteration variable i in some arbitrary way (depending on the algorithm).This is added to the synchronization overhead which could be imposed in algorithms where dependent iterative operations are found, which need careful consideration to prevent conflicts commonly caused by the concurrent access to shared memory spaces.The parallelized algorithms of igraph are those mostly used to obtain important network metrics needed to evaluate structural (local and global) properties of graphs.Amongst them are the algorithms of the shortest paths, centrality measures (e.g., betweenness, closeness, eigenvector), transitivity and clustering coefficient, connected components, degree   bandwidth.The same technology is used to connect the nodes to a Lustre parallel file system of 48 OSTs each 285 with multiple disks.
First experiments were performed to construct weighted correlation networks from the 0.4 POP grid, having 300,842 grid points.Different edge densities (see Table (1 and strength centralities, entropy and diameter.A complete list of the parallelized routines and algorithms as well as the particular approach of parallelism for each would make this paper too technical and will be reported elsewhere.However, our approach to achieve efficient fine-grained parallelism for the targeted algorithms of igraph included major changes in their internal routines and the used data structures.For example, shared memory queues were added to achieve safe multithreading, loops' iterations were optimized to minimize synchronization costs, and iterative workload was accordingly designed to be scheduled dynamically amongst threads in order to improve load imbalance caused by the poor locality of data.Furthermore, the internal data structure of the graph itself was modified from indexed edge lists (supported by iterators and internal stacks) to graph adjacency lists which resulted in achieving significant reduction of memory requirements, especially in the case of sparse networks.Additionally, special attention was given to the calculation of both the degree and strength centralities.As such, both metrics' algorithms were redesigned to be computed progressively during the time the network is being constructed.In other words, each time the master receives edges from one of the ring processes, these are added to the accumulated count of the edges that corresponds to their relative vertices.As soon as the last packet of edges is received by the master, these metrics are instantly available.A notable benefit of this approach, of course apart from saving time, is the significant reduction of memory requirements, as each time the master receives a new set of edges, the previous ones are released.Such technique enables computing machines of rather few gigabytes of memory to process degree and strength centrality metrics for large-scale networks.

Interfaces and other features
In order to match a wider range of user requirements, Par@Graph is provided with all the necessary tools to do the job, including parallel collective tools to write the resulted correlation or mutual information matrices, where each ring the overall performance.The timing for both the parallel reading and the reordering of time series is comparatively constant and pointless compared to the overall execution time, regardless of the number of processors, as shown in Fig. 5.

325
Results of the performance tests to determine the six network properties as discussed in section 2.2 are shown in Fig. 7.Although there are some differences in the performance gain in each of the algorithms, a general improvement is achieved by our fine-grained parallel implementation over the sequential igraph algorithms.
In some algorithms, like the clustering coefficient, parallel performance seems more sensitive 330 to the density of the network, whereas in others like the degree centrality, performance remains intact.However, although an evident performance gain is observed here, one has to remember that the performance of the vast majority of network analyzing algorithms is highly dependent on the topology of the network itself, and thus, further study should be carried out to compare results for different types of networks.

335
In view of memory requirements, we show in Table 2 a comparison of the needed memory to represent an edge (for different types of networks) when using igraph's data structures with Par@Graph's adjacency list.Indeed, the presented networks constructed by our toolbox, as a result of changing the internal data structure, are at least 60% lighter in size compared to their size in memory when using the original data structures of igraph.

340
Similar performance results were obtained for tests using the much larger correlation networks 13 process writes its calculated portion to a common file in a parallel file system.This is added to other tools to read (in parallel) and also construct a graph directly from a matrix as well as tools to read and write standard graph formats, including edge lists, adjacency lists and the popular Pajek format which contains metadata added to an edge list.
Another key point is the flexible interface between the Network Constructor and the analysis engine.That is, although the toolbox provides wrappers to the parallelized igraph, those are quite flexible to be used with any other analysis library other than igraph, or any other user developed routines.Additionally, users are provided with a configuration input file where they can specify their experimental settings.These include the selection of the data grid (location coordinates), preprocessing parameters, the threshold (τ ), the type of the network (weighted, unweighted, directed, etc.), timelag intervals, whether to construct a network from time series, a matrix or another graph format.

Application and performance
In this section, we will apply Par@Graph to reconstruct and analyze networks obtained from high-resolution ocean model data.The motivation for performing these computations is to understand coherence of the ocean circulation at different scales (Froyland et al., 2014) .

The POP model data
The data used here are taken from simulations which were performed with the Parallel Ocean Program (POP; Dukowicz and Smith, 1994), developed at Los Alamos National Laboratory.This configuration has a nominal horizontal resolution of 0.1 • and is the same as that used by Maltrud et al. (2010).We note that this configuration has a tripolar grid layout, with poles in Canada and Russia, and the model has 42 non-equidistant z-levels, increasing in thickness from 10 m just below the upper boundary to 250 m just above the lower boundary at 6000 m depth.We use data from the control simulation of this model as described in Weijer et al. (2012), where the POP is forced with a repeated annual cycle from the (normal-year) Coordinated Ocean Reference Experiment (CORE4 ) forcing data set (Large and Yeager, 2004), with the 6-hourly forcing averaged to monthly.
Correlation networks were built from 1 year (year 136 of the control run) of the simulated global daily sea surface height (SSH) data.The seasonal cycle was removed by subtracting for each day of the year its 5 days running mean averaged over years 131 to 141.The mean and standard deviation of the SSH for this year are plotted in Fig. 2a and  b, respectively.Strong spatial and temporal variability can be observed in the region of the western boundary currents (e.g., the Gulf Stream in the Atlantic, the Kuroshio in the Pacific and the Agulhas Current in the Indian Ocean) and the Southern Ocean.
Two data sets have been used for network reconstruction, one with the actual 0.1 • horizontal resolution of the model, resulting in 4.7×10 6 grid points, and an interpolated one with a lower 0.4 • horizontal resolution resulting in 3 × 105 grid points.The latter data set has been used for the performance analysis in the next subsection.

Performance analysis
The results were computed on a bullx supercomputer 5 composed of multiple "fat" computing nodes of 4-socket bullx The speedup ratios correspond to the analysis of the networks 1-3 presented in Table 1.
R428 E3 each, having 8-core 2.7 GHz Intel Xeon E5-4650 (Sandy Bridge) CPUs, with a shared Intel smart cache of 20 MB at each socket, resulting in SMP nodes of 32 cores which share 256 GB of memory.The interconnection between those "fat" nodes is built on InfiniBand technology providing 56 Gbits s −1 of inter-node bandwidth.The same technology is used to connect the nodes to a Lustre parallel file system of 48 object storage targets (OSTs) each with multiple disks.First experiments were performed to construct weighted Pearson correlation networks from the 0.4 • POP grid, having 300842 grid points.Different edge densities (see Table 1) were obtained as a result of applying different threshold values τ for the link definition.The parallel speedup of the toolbox and the corresponding computational time are plotted in Fig. 3.
The execution time falls nearly super linearly with the number of processors up to 100.Moreover, the performance becomes strongly super linear for τ > 0.5 as the number of  processors increases.This super linearity is due to a reduction in cache misses at each processor's cache (note that 20 MB of cache are shared among each of the 8 cores) as less time series are needed to fit in those shared caches when more cores are implied.In a further analysis, we also observed that the reordering of input time series did improve the performance of the toolbox, mainly when the number of processors was less than 100.In the case of τ = 0.5, performance drops with larger system sizes.First thing to remember here is that regardless of the value of the applied τ , the all-to-all correlations are calculated amongst the ring processes.However, the only difference when different values of τ are applied is the amount of data (edges, weights/attributes) that are sent to the master processor, as they will be more when lower values of τ are applied.That is to say that in such cases, communication overhead hinders the overall performance.
The timing for both the parallel reading and the reordering of time series is comparatively constant and pointless compared to the overall execution time, regardless of the number of processors, as shown in Fig. 4.
Results of the performance tests to determine the six network properties as discussed in Sect.2.2 are shown in Fig. 5.
Although there are some differences in the performance gain in each of the algorithms, a general improvement is achieved by our fine-grained parallel implementation over the sequential igraph algorithms.
In some algorithms, like the clustering coefficient, parallel performance seems more sensitive to the density of the network, whereas in others, such as the degree centrality, performance remains intact.However, although an evident performance gain is observed here, one has to remember that the performance of the vast majority of network analyzing algorithms is highly dependent on the topology of the network itself, and thus a further study should be carried out to compare results for different types of networks.
In view of memory requirements, we show in Table 2 a comparison of the needed memory to represent an edge (for different types of networks) when using igraph's data structures with Par@Graph's adjacency list.Indeed, the presented networks constructed by our toolbox, as a result of changing the internal data structure, are at least 60 % lighter in size  compared to their size in memory when using the original data structures of igraph.Similar performance results were obtained for tests using much larger correlation networks from the 0.1 • POP grid, resulting in networks of 4.7 × 10 6 nodes and edges ranging from 1.5 × 10 10 to 1.4 × 10 12 for thresholds from 0.8 to 0.4, excluding, however, the performance for betweenness centrality and clustering coefficient algorithms for the network of 1.4 × 10 12 that have not been performed.In summary, it is possible to construct large-scale climate networks in quite reasonable times on modest parallel computing platforms.

Coherence of global sea level
Being able to reconstruct and analyze the large complex networks arising from the POP ocean model, we now shortly demonstrate the novel results one can obtain.One of the important questions in physical oceanography deals with the coherence of the global ocean circulation.In low-resolution (non-eddying) ocean models, the flows appear quite coherent with near-steady currents filling the ocean basins.However, as soon as eddies are represented (when the spatial resolution is smaller than the internal Rossby radius of deformation) a fast decorrelation is seen in the flow field.
The issue of coherence has for example been tackled by looking at the eigenvalues of the transfer matrix (Dellnitz et al., 2009;Froyland et al., 2014) but also complex networks are very suited to address this question (Tantet and Dijkstra, 2014).Preliminary results on some of the important properties (degree, clustering and betweenness) of the complex network derived from the SSH data of the 0.4 • POP simulation are shown in Fig. 6a-c.In all cases, a weighted, undirected network was constructed by using the Pearson correlation with zero lag and a threshold value τ = 0.5.
In Fig. 6d, the degree field for the network constructed with τ = 0.4 from the 0.1 • POP SSH data is shown.The overall features of the degree field for the 0.1 already found in the degree field for the 0.4 • POP data, but additional small-scale correlations can be distinguished.The precise physical interpretation of these metrics is outside the scope of this paper as it requires a background in dynamical oceanography.However, one can observe that the subtropical gyres (Dellnitz et al., 2009;Froyland et al., 2014) tend to have a large degree while the regions of the western boundary currents, near the Equator and Southern Ocean tend to have smaller degree.

Summary and conclusions
Up to now, the data sets (both observational and model based) used to reconstruct and analyze climate networks have been relatively small due to computational limitations.In this paper we presented the new parallel software toolbox Par@Graph to construct and analyze large-scale complex networks.The software exposes parallelism on distributedmemory computing platforms to enable the construction of massive networks from a large number of time series based on the calculation of common statistical similarity measures between them.Additionally, Par@Graph is provided with a set of parallel graph algorithms to enable fast calculation of important properties of the generated networks on SMPs.These include those of the betweenness, closeness, eigenvector and degree centralities as well as the algorithms needed for the calculation of transitivity, connected components, entropy and diameter.Additionally, a parallel implementation of a community detection algorithm based on modularity optimization (Blondel et al., 2008) is provided.
The capabilities of Par@Graph were shown by using sea surface height data of a strongly eddying global ocean model (POP).The resulting networks had number of nodes ranging from 3.0×10 5 to 4.7×10 6 , with the number of edges ranging from 3.2×10 8 to 1.4×10 12 .The performance of Par@Graph showed excellent parallel speedup in the construction of massive networks, especially when higher thresholds were applied.When lower values of τ were used, communication overhead was seen to decrease the performance.On the other hand, we observed a significant speed gain in the calculation of the discussed network characteristics which was obtained by our parallel implementation of igraph.
With regards to the challenging issue of memory requirements in order to compute such big networks, we showed that the presented toolbox notably optimizes the usage of memory during the reconstruction of large-scale networks by minimizing the accompanying data redundancy.Additionally, the resulted networks themselves are markedly lighter in size compared to their equivalents in igraph as a result of changing the data structures from indexed edge lists to adjacency lists.
The availability of Par@Graph will allow one to solve a new set of questions in climate research, one of which, the coherence of the ocean circulation at different scales, was shortly discussed in this paper.Apart from higher resolution data sets of one observable, it will now also be possible to deal with data sets of several variables and to more efficiently reconstruct and analyze networks of networks (Berezin et al., 2012).However, apart from climate research, Par@Graph will also be very useful for all fields of science where very large-scale networks are applied, and it is hoped that the toolbox will find its way into the complexity science community.

Fig. 1 :
Fig. 1: Provided a parallel machine of p processors, p 1 processes are initialized and assigned with equal blocks of time series, each block's set of time series are correlated, then these blocks are exchanged (p 1)/2 times (half round of the ring) between processes to complete the all-to-all correlations between the whole set of time series.Conversely, p0 (the master computing element) is initialized as a master process to gather the resulting calculations and perform the analysis tasks on the resulted network.

Fig. 3 :
Fig. 3: Mean (a) and standard-deviation (b) of the daily SSH (units in cm) for year 136 of the POP control run as in ?.
The results were computed on a bullx supercomputer 5 composed of multiple "fat" computing nodes of 4-socket bullx R428 E3 each, having 8-core 2.7 GHz Intel Xeon E5-4650 (Sandy Bridge) CPUs, with a shared Intel smart cache of 20 MB at each socket, resulting in SMP nodes of 32 cores which share 256 GB of memory.The interconnection between those "fat" nodes is built on InfiniBand 305 technology providing 56 Gbits/s of inter-node bandwidth.The same technology is used to connect the nodes to a Lustre parallel file system of 48 OSTs each with multiple disks.

Figure 2 .
Figure 2. Mean (a) and standard deviation (b) of the daily SSH (units in cm) for year 136 of the POP control run as in(Weijer et al., 2012).

Figure 3 .
Fig. 3: Speedup ratio (a) for the parallel construction of SSH climate networks the POP model data having 0.4 spatial resolution.The shown speedup also includes the parallel reading and reordering of the input time series.The corresponding execution times (in seconds) over different sizes of computing processors starting from 5 processors upwards is given in (b).

Fig. 5 :
Fig. 5: In (a) the overall runtime of the experiment in Fig. 4b (for ⌧ = 0.5) is shown in the upper curve and compared to the time for parallel reading and reordering of time series (the lower curve).The shaded area corresponds to real cpu time for the calculation of the correlation matrix and the communication between processors.Both times for parallel reading and reordering preprocessing tasks are shown respectively in (b).

Figure 4 .
Figure 4.In (a) the overall runtime of the experiment in Fig. 3 (for τ = 0.5) is shown in the upper curve and compared to the time for parallel reading and reordering of time series (the lower curve).The shaded area corresponds to real-cpu time for the calculation of the correlation matrix and the communication between processors.Both times for parallel reading and reordering preprocessing tasks are shown in (b).

Fig. 7 :Figure 5 .
Fig.7: Performance of the parallel algorithms running on a single SMP bullx node of 30 compute cores.The speedup ratios correspond to the analysis of the networks 1-3 presented in Table1.
Different threshold values τ used in the reconstruction of Pearson correlation networks from the 0.4 and 0.1 • POP data sets and corresponding number of network vertices and edges.

Figure 6 .
Fig. 9: (a) Degree, (b) Clustering and (c) Betweenness for the SSH POP data interpolated on the 0.4 grid and a threshold of ⌧ = 0.5.(d) Degree field for the 0.1 grid and a threshold ⌧ = 0.4; here the reconstructed network has 4.7 ⇥ 10 6 nodes and 1.4 ⇥ 10 12 edges.

Table 2 .
A single edge's size in memory when using the indexed edge list used in igraph compared to its corresponding size when applying Par@Graph.Additionally, a vertex in igraph is represented by 16 bytes in memory, whereas it needs only 4 bytes in Par@Graph.