pygeodyn 1.1.0: a Python package for geomagnetic data assimilation

. pygeodyn is a sequential geomagnetic data assimilation package written in Python. It gives access to the core surface dynamics, controlled by geomagnetic observations, by means of a stochastic model anchored to geodynamo simulation statistics. pygeodyn aims at giving access to a user-friendly and ﬂexible data assimilation algorithm. It is designed to be tunable by the community by different means: possibility to use embedded data and priors, or to supply custom ones; tunable parameters through conﬁguration ﬁles; adapted documentation for several user proﬁles. In addition, output ﬁles are directly 5 supported by the package webgeodyn that provides a set of visualisation tools to explore the results of computations.


Introduction
The magnetic field of the Earth is generated by motion of liquid metal in the outer core, a process called the "geodynamo".
To tackle this complex problem, direct numerical simulations (DNS) have been developed to model the coupling between the primitive equations for heat, momentum and induction in a rotating spherical shell.With the development of computing power, DNS capture more and more of the physics thought to be at work in the Earth core (degree of dipolarity, ratio of magnetic to kinetic energy, occurrence of torsional Alfvén waves, etc. see for instance Schaeffer et al., 2017).However, despite such advances, the geodynamo problem is so challenging that DNS are not capable of reproducing changes observed at interannual periods with modern data (e.g.Finlay et al., 2016).Simulating numerically dynamo action at Earth-like rotation rates indeed requires resolving time-scales 10 8 orders of magnitude apart (from a fraction of day to 10 kyrs) for N ≈ (10 6 ) 3 degrees of freedom.This requirement is unlikely to be satisfied in a nearby future with DNS, making the prediction of the geomagnetic field evolution an extremely challenging task.For these reasons, promising strategies involving large-eddy simulations (LES, see Aubert et al., 2017) are emerging, but these are currently unable to ingest recent geophysical records.
Many efforts were devoted to the improvement of observable geodynamo quantities: the magnetic field above the surface of the Earth and its rate of change with respect to time, the so-called secular variation (SV).The launch of low orbiting satellite missions (Ørsted, CHAMP, Swarm) dedicated to magnetic field measurements indeed presented a huge leap on the quality and coverage of measured data (see for instance Finlay et al., 2017).
In this context, the development of geomagnetic data assimilation (DA) algorithms is timely.DA consists in the estimation of a model state trajectory using (i) a numerical model that advects the state in time and (ii) measurements used to correct its trajectory.DA algorithms can be split in two main families: variational methods that imply to minimize a cost function (based on a least-squares approach); in a time-dependent problem, one ends up tuning the initial condition by means of adjoint equations.
statistical methods, which are applications of Bayes rule to obtain the most probable state model given observations and their uncertainties; it comes down to estimating a "Best Linear Unbiased Estimate" (BLUE) under Gaussian assumptions for the prior uncertainties and model distributions.
There are numerous variations around those methods (for reviews, see Kalnay, 2003;Evensen, 2009).Both types of algorithms are already commonplace in meteorology and oceanography, but have only been recently introduced in geomagnetism (for details, see Fournier et al., 2010;Gillet, 2019).
In this article, we present a Python package called pygeodyn devoted to geomagnetic DA based on a statistical method, namely an augmented state Kalman Filter (see Evensen, 2003).It uses a reduced numerical model of the core surface dynamics to alleviate the computation time inherent to DA algorithms.The reduced model is based on stochastic Auto-Regressive processes of order 1 (AR-1 processes).These are anchored to cross-covariances derived from three-dimensional numerical geodynamo simulations.We provide examples involving the coupled-earth (Aubert et al., 2013) and midpath (Aubert et al., 2017) dynamos.
The aim of pygeodyn is to provide the community with a tool that can easily be used or built upon.It is made to ease the updating of data and the integration of new numerical models, for instance to test them against geophysical data.This way, it can be compared with other existing DA algorithms (e.g.Bärenzung et al., 2018;Sanchez et al., 2019).
The manuscript is organized as follows: Section §2 presents the principles under which the pygeodyn package was developed.Section §3 is a technical description of the version 1.1.0 of the package (Huder et al., 2019a) that also gives the basic necessary scientific background (for details, see Barrois et al., 2017Barrois et al., , 2018;;Gillet et al., 2019).In §4 we give examples of the visualization interface webgeodyn to which is coupled the core surface DA tool pygeodyn.We discuss in §5 possible future developments and applications of this tool.

Principles
In order to support the use of DA in geomagnetism, the package is designed with the following characteristics in mind: Easy-to-use: -It is written in Python 3, now a widespread language in the scientific community thanks to the NumPy/Scipy suites.
-The installation procedure only requires Python with NumPy and a Fortran compiler (other dependencies are installed during the setup).
-An online documentation describes how to install, use and build upon the package.

Flexible:
-Algorithm parameters can be tuned through configuration files and command line arguments.
-Algorithms are designed to be modular, in order to allow the independent use of their composing steps.
-Extension of the features is eased by readable open-source code (following PEP8) that is documented inline and online with Sphinx. Reproducible/stable: -The source code is versioned on a Git repository, with tracking of bugs and development versions with clear release notes.
-Unitary and functional tests are routinely launched by continuous integration pipelines.Most of the tests use the Hypothesis library1 to cover a wide range of test cases.
-Logging of algorithm operations is done by default with the logging library.

Efficient:
-Direct integration of parallelisation is possible using Message Passing Interface (MPI) -Lengthy computations (such as Legendre polynomial evaluations) are performed by Fortran routines wrapped in Python.
Easy to post-process: -Output files are generated in HDF5 binary file format that is highly versatile (close to NumPy syntax) and more time-and size-efficient.
-The output format is directly supported by the visualization package webgeodyn for efficient exploration of the computed data (see Section §4).

User profiles
The package was designed for several user types: Standard user: the user will use the supplied DA algorithms with the supplied data.The algorithms can be modified through the configuration files so this requires almost no programming skill.
Advanced user: the user will use the supplied DA algorithms but wants to run their own data.In this case, the user needs to follow the documentation to implement the reading methods of the data2 .This requires a few Python programming skills and a basic knowledge of object-type structures.
Developer user: the user wants to design their own algorithm using the low-level functions implemented in the package.The how-to is also documented but it requires some experience in Python programming and object-type structures.
The documentation (available online3 ) was written with these categories of users in mind.First, installation instructions and a brief scientific overview of the algorithm are provided.Then, it explains how launch computations with a supplied script that takes care of looping through DA steps, logging, parallelisation and saving files.For more advanced uses, the documentation includes tutorials on how to set up new data types as input and how to use low-level DA steps on CoreState objects.The developer documentation, gathering all the documentation of the functions/objects implemented in pygeodyn, is also available online.
3 pygeodyn 1.1.0content 3.1 Model state DA algorithms are to be supplied in the forms of subpackages for pygeodyn.The intention is to have interchangeable algorithms and be able to easily expand the existing algorithms.In the version described in this article, we provide a subpackage augkf that implements algorithms based on an augmented state Kalman filter (AugKF) initiated by Barrois et al. (2017).The algorithm is based on the radial induction equation at the core surface that we write in the spherical harmonic spectral domain as: (1) Vector b (resp.u and ḃ) stores the (Schmidt semi-normalized) spherical harmonic coefficients of the magnetic field (resp.the core flow and the SV) up to a truncation degree L b (resp.L u and L sv ).The number of stored coefficients in those vectors are is the matrix of Gaunt-Elsasser integrals (Moon, 1979) of dimensions N sv × N u , depending on b.The vector e r stands for errors of representativeness (of dimension N sv ).This term accounts for both subgrid induction (arising due to the truncation of the fields) and magnetic diffusion.
Quantities b(t), u(t) and e r (t) describe the model state X(t) at a given epoch t on which algorithm steps act.
The model states are stored as a subclass of NumPy array called CoreState (implemented in corestate.py).This subclass is dedicated to the storage of spectral Gauss coefficients for b, u, e r and ḃ but can also include additional quantities if needed.Details on the CoreState are given in a dedicated section of the documentation4 .

Algorithm steps
The sequential DA algorithm is composed of two kinds of operations: Forecasts are performed every ∆t f .An ensemble of N e core states is time stepped between t and t + ∆t f .
Analyses are performed every ∆t a with ∆t a = n∆t f (analyses are performed every n forecasts).The ensemble of core states at t a is adjusted by performing BLUE using observations at t = t a .
These steps require spatial cross-covariances that are derived from geodynamo runs (referred to as priors, see §3.3.3).Realizations associated with those priors are noted b * , u * and e * r for respectively the magnetic field, the core flow and errors of representativeness.
From a technical point of view, algorithm steps take CoreState objects as inputs and return the CoreState resulting from the operations.Forecasts and analyses are handled by the Forecaster and Analyser modules that are implemented in the augkf subpackage according to the AugKF algorithm.Again, details on these steps are given in the relevant documentation section5 .

Forecast and AR(1) processes
The forecast step consists of time-stepping X(t) between two epochs.AR-1 processes built on geodynamo cross-covariances are used to forecast u(t) and e r (t).We write u(t) = u 0 + u (t), with u 0 the background flow (temporal average from the geodynamo run) -and similar notations for e r (t).Their numerical integration is based on an Euler-Maruyama scheme, which .
(2) D u is the drift matrix for u. ζ u is a Gaussian noise, uncorrelated in time and constructed such that spatial cross-covariances P uu = E u u T of u match those of the prior geodynamo samples u * .E(. . . ) stands for statistical expectation.Similar expressions and notations holds for e r .Note that u and e r are supposed independent, which is verified for numerical simulations.
Drift matrices are estimated with different manners depending on the characteristics of the considered geodynamo priors.In the case where the geodynamo series do not allow to derive meaningful temporal statistics (e.g.too few samples, or simulations parameters leading to relatively too slow Alfvén waves, see Schaeffer et al., 2017), the two drift matrices are simply diagonal, and controlled by a single free parameter (τ u for u and τ e for e r ): with I u (resp.I e ) the identity matrix of rank N u (resp.N e ).The drift matrices being diagonal, the process is hereafter referred to as diagonal AR-1.Barrois et al. (2017Barrois et al. ( , 2019) ) used such diagonal AR-1 processes, based on the coupled-earth dynamo simulation.
In the case where geophysically meaningful temporal statistics can be extracted from geodynamo samples, time crosscovariance matrices are derived according to a sampling time ∆t * .D u,e are then defined as (see Gillet et al., 2019, for details and an application to the mid-path dynamo): D u,e are now dense, hence processes using this expression are referred to as dense AR-1 processes.
The first step of the forecast is to compute u(t+∆t f ) and e r (t+∆t f ) using Eqs.
(2) (with matrices depending on the AR-1 process type).Then, the vector b(t + ∆t f ) is evaluated thanks to Eq. ( 1) by using u(t + ∆t f ), e r (t + ∆t f ) and b(t) with an explicit Euler scheme: This yields the forecast core state X f (t + ∆t f ).As this step is performed independently for every realization, realizations can be forecast in parallel.This is implemented in supplied algorithms with a MPI scheme.

Analysis
The analysis step takes as input the ensemble of forecast core states X f (t a ), the geodynamo statistics, plus main field and SV observations at t = t a together with their uncertainties.It is performed in two steps: (i) First, a BLUE of an ensemble of realizations of b is performed from magnetic field observations b o (t) and the ensemble of forecasts b f (t) using the forecast cross-covariance matrix for b.
(ii) Second, a BLUE of an ensemble of realizations of the augmented state z = [u T e T r ] T is performed from SV observations ḃo (t), the ensemble of analyzed main field from step (i), and the ensemble of forecasts for u f (t) and e f r (t), using a forecast cross-covariance matrix for z.

Command line arguments
Computations can be launched by running run_algo.pythat accepts several command line arguments.These arguments and their default value (taken if not supplied) are given in Table 1.The first group corresponds to the computation parameters, the only non-optional parameter being the path to the configuration file.The second group of parameters is linked to the output files: name of data files and logs.We stress the importance of the argument -m that fixes the ensemble size N e , meaning the number of realizations on which the Kalman filter will be performed.As N e forecasts are performed at each epoch, this value has an important impact on the computation time (see §3.4).It is advised to set it to at least 20 to get a converged measure of the dispersion within the ensemble of realizations.

Configuration file
The pygeodyn configuration file sets the values of quantities used in the algorithm (called parameters).This configuration file is a text-file containing three columns: one for the parameter name, one for the type and one for the parameter value.We refer to Table 2 for the list of parameters that can be set this way.The table is separated in six groups: 1. Number of coefficients to consider for the core state quantities and the Legendre polynomials that are used to evaluate the Gaunt-Elsasser integrals that enter A(b).
2. Time-spans: starting time t start , final time t end , and time intervals (in months) for forecasts ∆t f and analyses ∆t a .
3. Parameters of the AR-1 processes used in the forecasts.ar_type can be set to diag (in this case, τ u and τ e will be used as in Eq. ( 3)) or to dense (in this case, ∆t will be used to sample the prior data and compute drift matrices with Eqs. ( 5)).
4. Parameters for using a principal component analysis (PCA) for the core flow.By setting N pca , the algorithm will perform forecasts and analyses on the subset composed of the first N pca principal components describing the core flow (stored by decreasing explained variance), rather than on the entire core flow model.This is advised when using dense AR-1 processes (see Gillet et al., 2019).The normalization of the PCA can be modified by setting pca_norm to energy (so that the variance of each principal component is homogeneous to a core surface kinetic energy) or to None (PCA performed directly on the Schmidt semi-normalized core flow Gauss coefficients).
5. Initial conditions of the algorithm.By setting core_state_init to constant, all realizations of the initial core state will be equal to the average prior.If set to normal, realizations of the initial core state will be drawn according to a normal distribution centered on the dynamo prior average, within the dynamo prior cross-covariances (default behavior).
It is possible to set the initial core state to the core state from a previous computation by setting core_state_init to from_file.In this latter case, the full path of the hdf5 file of the previous computation and the date of the core state to use must be given (init_file and init_date).
6. Parameters describing the types of input data (priors and observations) that are presented in more details in the next section.

Priors
Priors are composed of a series of snapshot core states that are used to estimate the background states and the cross-covariance matrices.The mandatory priors are those for the magnetic field b, the core flow u and the subgrid errors e r from which the respective cross-covariance matrices P uu , P ee , etc., are derived.
The aforementioned snapshots currently come from geodynamo simulations, meaning that the covariance matrices for b, u, and e r will reflect the characteristics of the simulations.As a consequence, the forecasts will be done according to the statistics of the dynamo simulations.As examples, pygeodyn comes with two prior types derived from two simulations: -coupled_earth from Aubert et al. ( 2013) Technically, the two types are interchangeable.However, only the midpath prior type allows the use of dense AR-1 processes as it requires time-correlations that cannot be extracted from coupled_earth runs.

Observations
Observations are measurements of the magnetic field and of the SV at a set of dates.These observations are used in the analysis step to perform the BLUE of the core state (see §3.2.2). pygeodyn provides two types of observations: covobs: Gauss coefficients and associated uncertainties at a series of epochs (every 6 months from 1840 to 2015), from the COV-OBS.x1model derived by Gillet et al. (2015).
-go_vo: Ground-based observatory (GO) and virtual observatory (VO) data (B r , B θ , B φ ) and their associated uncertainties.VO gather in one location at satellite altitude observations recorded by the spacecrafts around this site.GO are provided every 4 months from March 1997 onward for ground-based series, and VO every 4 months from March 2000 onward for virtual observatories.The satellite data come from the CHAMP and Swarm missions.Both VO and GO are cleaned as much as possible from external sources (for details, see Barrois et al., 2018;Hammer, 2018).
In the code, observation data are to be supplied with the observation operator and errors in the form of a Observation object.This allows to have a consistent interface between spectral data (covobs) and data recorded in the direct space (go_vo).

Beyond the supplied data
For advanced users, pygeodyn provides the possibility to define custom prior and observations types by supplying new data reading methods in the dedicated pygeodyn modules.Defining a custom prior type allows to use custom geodynamo simulation data to compute covariance matrices that will be used in the forecasts and analyses steps.Similarly, a new observation type can be supplied with custom observation data that will be used to estimate the core state in the analysis step.
In other words, an advanced user can completely control the input data of the algorithm to test new magnetic observations and/or new numerical models, and derive new predictions from them.

Runtime scaling
To reduce computation time, supplied algorithms use MPI to perform forecasts ( §3.2.1) of different realizations in parallel.
Analysis steps are not implemented in parallel, as they require the whole ensemble of realizations.To assert the effect of this parallelisation, model computations were performed on a varying number of cores.The configuration for this benchmark reanalysis is the following: the AugKF algorithm with diagonal AR-1 processes, using m = 50 realizations over t end −t start = 60 years, ∆t f = 6 months and ∆t a = 12 months.
The results are displayed on Fig. 1 with runtimes varying between 1 and 6 hours.The power law fit appears to be close to 1/n (with n the number of cores), with an offset time of around 15 minutes that is probably associated with the 60 analysis steps whose duration do not depend upon the number of cores.Note that the computations remain tractable whatever the number of cores.A basic sequential computation (n = 1) for a re-analysis using 50 realizations over 100 yrs is performed in less than 6 hours, while using 32 cores will reduce it to half an hour.The source code of webgeodyn is hosted at its own Git repository7 .Being available on the Python package index, it can also be installed through the Python package installer pip.
webgeodyn implements a web application with several modes of representation to explore, display and diagnose the products of the re-analyses (Barrois et al., 2018).It is deployed at http://geodyn.univ-grenoble-alpes.fr but can also be used locally on any pygeodyn data, once installed.We illustrate here several possibilities offered by the version 0.6.0 of this tool (Huder et al., 2019b).

Mapping on Earth's globe projections
Quantities at a given time can be displayed at the core surface in the Earth's core surface tab.Two representations can be 10 used simultaneously: a streamdots/streamlines representation for the core flow components (orthoradial, azimuthal and norm) and a color plot for all quantities (core flow horizontal divergence and components included).A timeslider allows to change the epoch at which the quantities are evaluated.Figure 2 shows an example with magnetic field as color plot and norm of the core flow for the stream lines, for a re-analysis of VO and GO data using a diagonal AR-1 model.The plot is interactive with zooming, exporting (as pictures or animations) and display tuning features.

Time-series of harmonic coefficients
In the Spherical Harmonics tab, it is possible to look at the time evolution of a single spherical harmonic coefficient for a given quantity (core flow, magnetic field, SV), or of the length-of-day.Several models can be displayed at once for comparison.
Figure 3 shows the time evolution of one SV coefficient from a re-analysis of SV Gauss coefficient data using a dense AR-1 model.The interface gives the possibility to also represent the contribution from e r .It is possible to zoom on the plot and export it as a picture or raw CSV data.

Comparison with ground-based and virtual observatory data
Computed data can be easily compared with the geomagnetic observations used for the analysis in the Observatory data tab.
It allows to display the spatial components (radial, azimuthal and ortho-radial) of the magnetic field and its SV recorded by observatories.These can be either GO or VO data.Data can be displayed by clicking on a location on the globe and be compared with spectral model data predictions evaluated at the observatory location.Figure 4 shows an example of the SV at a ground-based site in South America.One can compare how predictions from a re-analysis, together with its associated uncertainties, follow geophysical data (black dots) -here the model by Barrois et al. (2019), which uses a diagonal AR-1 model from GO and VO series.It can also be used to compare predictions from several magnetic field models -here COV-OBS.x1, which is constrained by magnetic data only up to 2014.5.
On top of the three examples illustrated above, the package webgeodyn also gives the possibility to display and export Lowes-Mauersberger spatial spectra, or cross-sections at the core surface as a function of time and longitude (respectively latitude) for a given latitude (respectively longitude).

Conclusions
We presented the Python toolkit pygeodyn that allows to: calculate models of the flow at the core surface from SV Gauss coefficient data calculate models of the flow and the magnetic field at the core surface from measurements of the magnetic field and its SV above the Earth's surface represent and analyse the results via the web interface webgeodyn.
The underlying algorithm relies on AR-1 stochastic processes to advect the model in time.It is anchored to statistics (in space and optionally in time) from free runs of geodynamo models.It furthermore accounts for errors of representativeness due to the finite resolution of the magnetic and velocity fields.This Python tool has been designed with several purposes in mind, among which : test of the Earth-likeness of geodynamo models comparison with alternative geomagnetic DA algorithms production of magnetic models under some constraints from the core dynamics education of students on issues linked to core dynamics and geomagnetic inverse problem.

Figure 1 .
Figure 1.Evolution of the runtime with respect to the number of MPI processes (see text for details).Dots are the observed runtimes (in hours) and the dashed line is a fit of the form t = a • n b + c.

Figure 2 .
Figure2.Example of map for the magnetic field and the flow at the core surface, obtained using webgeodyn, for the model calculated byBarrois et al. (2019).Here the radial magnetic field (colorscale) and streamlines for the core flow (black lines, which thickness indicates the intensity) are evaluated in 2016 from VO derived from the Swarm data.

Figure 3 .
Figure 3.Time series for the SV spherical harmonic coefficient h 1 2 using webgeodyn.In red ('GHA19') the re-analysis by Gillet et al. (2019), obtained from the COV-OBS.x1observations (in black) by Gillet et al. (2015).The solid lines represent the ensemble average, and the shaded areas the ±1σ uncertainties.The dotted line gives the contribution from er.

Figure 4 .
Figure 4. Time series of the three components of the SV (in spherical coordinates) at the Kourou observatory (French Guyana, location in dark red on the globe on the left) using webgeodyn.The green line ('Barrois_VO_2018e') is from the core surface re-analysis by Barrois et al. (2019), using as observations GO and VO data, and a diagonal AR-1 model.For comparison are also shown (in orange) the COV-OBS.x1model predictions at this site.The solid lines are the ensemble mean, and the shaded areas represent the ±1σ uncertainties.

Table 2 .
Parameters available in a pygeodyn configuration file.