Volunteer or crowd computing is becoming increasingly popular for solving complex research problems from an increasingly diverse range of areas. The majority of these projects have been built using the Berkeley Open Infrastructure for Network Computing (BOINC) platform, which provides a range of different services to manage all computational aspects of a project. The BOINC system is ideal in those cases where not only does the research community involved need low-cost access to massive computing resources but also where there is a significant public interest in the research being done.
We discuss the way in which cloud services can help BOINC-based projects to deliver results in a fast, on-demand manner. This is difficult to achieve using volunteers; at the same time, using scalable cloud resources for short, on-demand projects can optimize the use of the available resources. We show how this design can be used as an efficient distributed computing platform within the cloud and outline new approaches that could open up new possibilities in this field, using Climateprediction.net (
Traditionally, climate models have been run using supercomputers because of their vast computational complexity and high cost. Since its early development, climate modelling has been an undertaking that has tested the limits of high-performance computing (HPC). The application of models to answer different types of questions has led to them being used in ways not originally foreseen. This is because, for some types of simulations, it can take several months to finish a modelling experiment given the scale of resources involved. One reason for treating climate modelling as a high-throughput computing (HTC) problem, as opposed to an HPC problem, is the application design model, in which there is a number (not usually greater than 20) of uncoupled, long-running tasks, each corresponding to a single climate simulation and its results.
The need to increase the total number of members in an ensemble of climate simulations, together with the need for increased computational power to better represent the physical and chemical processes being modelled, has been well understood for some decades in meteorological and climate research. Climate models make use of ensemble means to improve the accuracy of the results and quantify uncertainty, but the number of members in each ensemble tends to be small due to computational constraints. The overwhelming majority of research projects use ensembles that generally contain only a very small number of simulations, which has an obvious impact on the statistical uncertainty of the results.
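As a brief illustration of this constraint (a standard statistical argument, not specific to any particular model), the sampling uncertainty of an ensemble mean shrinks only with the square root of the ensemble size: for $N$ independent members with inter-member spread $\sigma$, the standard error of the ensemble mean is $\mathrm{SE}(\bar{x}) = \sigma/\sqrt{N}$, so halving the uncertainty requires roughly four times as many simulations, a cost that grows quickly on dedicated HPC resources.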
The Climateprediction.net project (CPDN) was created in 1999
CPDN has been running for more than 10 years and faces a number of evolving challenges, such as an increasing and variable need for new computational and storage resources; the limited processing power and memory of current volunteers' computers, which restrict the use of more complex models and higher resolutions; and the need to manage costs and budgeting (this is of particular interest for on-demand projects requested by external research collaborators and stakeholders).
To address these issues, we have explored the combination of many-task computing (MTC)/volunteer computing and cloud computing as a possible improvement of, or extension to, a real existing project. This kind of solution has previously been proposed for scientific purposes by
It is not the aim of this paper to describe the internals of BOINC; for a better understanding of the problem that we are trying to solve, we recommend reviewing previous work on the subject, such as
Here, we describe some of the problems that we intend to address, as well as
proposed implementations of possible solutions.
To run more complex and computationally more expensive versions of the model,
resources greater than those that can be provided by volunteer computers may be needed. One
solution is a re-engineering and deployment of the client side from a
volunteer computing architecture to an infrastructure as a service (IaaS)
based on cloud computing (e.g. Amazon Web Services, AWS). There is a growing need for an on-demand and more predictable return of
simulation results. A good example of this is urgent simulations for critical
events in real time (e.g. floods) where it is not possible to rely on
volunteers; instead, a widely available and massive scaling system is
preferable (like the one described here). The current architecture and infrastructure based on BOINC does not provide a solution that can be scaled up for this purpose. This is because the models run in a heterogeneous and decentralized environment (on a large and variable number of volunteers' computers with differing configurations), where their behaviour cannot be clearly anticipated or measured, and any control over the available resources is severely limited. A rationalization of the costs (and the establishment of useful metrics) is required, not just for internal control but also to provide monetary quotations to project partners and funding bodies; this led us to develop a control plane together with a front end to display statistics and metrics. Free software can be used in order to promote scientific reproducibility
Complete documentation of the process will allow knowledge to be transferred
or migrated easily to other systems
Furthermore, in this work, we wish to prove the feasibility of running complex
applications in this environment. We use weather@home
The example presented here is running CPDN in AWS. AWS is the largest infrastructure-as-a-service (IaaS) provider; it is very well documented and is, at present, the most suitable solution for the problem (with fewer limitations than other providers).
The first step was to benchmark different AWS EC2 instance
types
For benchmarking purposes, short 1-day climate simulations were run. The model used here is weather@home2, which consists of an atmosphere-only model (HadAM3P;
Figure
Figure
Based on the previous tests, new infrastructure was designed on the cloud
(Fig.
Proposed cloud infrastructure.
First of all, a template was created to allow automated instance creation, including instance selection, based on the benchmarks presented in Sect. ; base operating system installation (Amazon Linux (64 bit) was used for this work); storage definition (16 ); firewall configuration (inbound: only SSH (22) accepted; outbound: everything accepted); and installation and configuration of the BOINC client inside the template image, including its dependencies, such as 32-bit libraries (it is recommended to use the latest version from git). This was followed by instance post-installation configuration (contextualization); for example, in AWS this is achieved by creating a machine image (AMI) and adjusting it by selecting the appropriate options, such as the kernel image (AKI). Finally, an (optional) installation and configuration of the AWS EC2 command-line interface is performed.
This can be useful to debug or troubleshoot issues with the infrastructure.
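For illustration, the contextualization step (turning the configured template instance into a reusable machine image) can also be scripted. The following is a minimal sketch using boto3; the exact library version used in this work, the instance ID, and the image name are assumptions for illustration only.

# Sketch: create a reusable AMI from the configured template instance.
# Assumes AWS credentials are already configured; the instance ID,
# region, and image name below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")
response = ec2.create_image(
    InstanceId="i-0123456789abcdef0",   # the configured template instance
    Name="cpdn-boinc-template",
    Description="Amazon Linux (64 bit) with BOINC client for CPDN work units",
    NoReboot=True,                      # do not reboot the template instance
)
print("New AMI: " + response["ImageId"])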
Another problem to be solved is the need for decentralized, low-latency, worldwide-accessible storage for the output data (each simulation (36 000 work units) generates
Shared storage architecture.
Given these values, every work unit returns a result of
Having set up the computing and storage infrastructure, we still lack a control plane to provide a layer of abstraction and automation and to give the project more consistency. The aim of developing the central
control system (Fig.
The control plane is still in its early developmental stages (although it is designed to be cloud agnostic, so far only AWS has a connector and is supported), and further work will describe its improvements over time.
It consists of two main components: the back end, which provides the user with a RESTful API offering basic functionality for simulation information and management, with the intention of providing (even more) agnostic access to the cloud; and the front end, which makes communication with the API as intuitive and simple as possible.
The core component, the RESTful back end (using JavaScript object notation – JSON), provides
simple access and wraps common actions: start simulation with
Dashboard and metrics application architecture.
Dashboard.
Several experiments (using all the defined infrastructure) were carried out using standard work units developed by the climateprediction.net/weather@home project. We processed work units from two
main experiments: the weather@home UK floods
It has been successfully demonstrated that it is possible to run simulations of a climate model using infrastructure in the cloud; while this might not seem complex, to the best of our knowledge, it has never previously been tested. Such efficient use of MTC resources for scientific computing has previously facilitated real research in other areas
We have benchmarked a number of Amazon EC2 instance types running CPDN
work units. Prices for spot instances vary significantly over time and between instance types, but we estimate a price as low as USD 1.50 to run a 1-year
simulation based on the c4.large instance in the us-west-1 region in June
2016 (see Fig.
It is interesting to note that cloud services enable a given number of tasks to be completed, in some cases, 5 times faster than using the regular volunteer computing infrastructure. However, the financial cost can only be justified for critical cases where stakeholders can support it through a specific cost–benefit analysis. Nevertheless, academic institutions and other types of organizations can benefit from waivers to reduce the fees
Regarding our usage and solution for storage, S3 was a good fit for this work
(it comes out of the box with AWS and the pricing is convenient).
However, we would not suggest it as suitable for long-term archival of this output but instead suggest making use of community repositories where such data are curated (it should be noted, though, that CPDN produces output in the community standard NetCDF format). Also, we understand that even though the infrastructure described here covers a good number of use cases for different projects and experiments, other alternatives could be analysed: AWS Glacier is an interesting option to study in cases where long-term storage of data is needed for non-immediate access and at lower cost; also, the S3 file size is limited to 5
This research has also served as a basis for obtaining new research funding as part of climateprediction.net for state-of-the-art studies using cloud computing technologies. This project is based on demonstrated successes in the application of technologies and solutions of the type described here.
In summary, the achieved high-level objectives were to ensure that
the client side was successfully migrated to the cloud (EC2); the upload server capability was configured to be redirected to AWS S3 buckets; different simulations were successfully run over the new infrastructure; a control plane (including a dashboard: front end and back end) was developed, deployed, and
tested; and a comprehensive costing of the project and the simulation was obtained, together with metrics.
Future improvements should focus on adding more logic to the interaction with the clients' status (such as through remote procedure calls, RPCs), allowing more metrics to be pulled from them, and creating new software as a service (a SaaS layer). From the infrastructure point of view, two main improvements are possible: first, an automated probe/dummy execution will be needed to adjust the estimated price to a real one before each simulation; second, full migration of the server side into the cloud, allowing the costs of data transfer and latency to be dramatically reduced.
The new computing infrastructure was built on virtualized instances (AWS EC2). Amazon also provides auto-scaling groups, which allow the user to define policies to dynamically add or remove instances, triggered by a defined metric or alarm. As the purpose of this work is to rationalize the resources and to have full control over them (via the central system), including any load balancing or failover, this feature is not used on the cloud side; instead, this logic resides in the control-system node that serves as the back end for the dashboard.
After tasks have been set up on the server side and are ready to be sent to the clients (this can currently be checked at the public URL
The (project) administrator user configures and launches a new simulation
via the dashboard. The required number of instances are created based on a given template
that contains a parametrized image of GNU/Linux with a configured BOINC
client. Every instance connects to the server and fetches two tasks (one per CPU, as the instances used have two CPUs). When a task is processed, the data are returned to the server and also stored in shared storage so that they are accessible to a given set of authorized users. Once there are no more tasks available, the control node shuts down the instances.
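The following sketch illustrates this workflow from the control node's point of view. It is a simplified assumption of how the scheduler could drive EC2 with boto3, not the actual implementation of the control plane described below; the AMI ID, instance count, key name, and the has_pending_tasks() helper are hypothetical placeholders.

# Simplified sketch of the control-node workflow: launch worker instances
# from the template AMI, poll while tasks remain, then terminate the workers.
import time
import boto3

ec2 = boto3.resource("ec2", region_name="us-west-1")

def has_pending_tasks():
    # Placeholder: in practice this would query the CPDN/BOINC server
    # for unsent work units (e.g. via its public status page).
    raise NotImplementedError

workers = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",    # template AMI (hypothetical ID)
    InstanceType="c4.large",
    MinCount=10, MaxCount=10,           # requested number of instances
    KeyName="cpdn-keypair",
)

while has_pending_tasks():
    time.sleep(300)                     # polling interval ("pollingTime")

for instance in workers:
    instance.terminate()                # release resources once work is done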
Parameters for the template instances.
It should be noted that, at any point, the administrator will be able to see real-time data about the execution (metrics, costs, etc.) as well as change the running parameters and apply them to the infrastructure.
In order to be able to create a homogeneous infrastructure, the first step is to create an (EC2) instance that can be used as a template for the other instances.
The high-level steps to follow to get a template instance (with the
parameters defined in Table
Note that one should remember to create a new key pair (a public–private key pair used for password-less SSH access to the instances) and save it (it will be used for the central system), or use another one that already exists and is currently accessible. Because of the limited space in this article, long lines have been truncated (continued on a new line) with
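A key pair can be created through the web console or programmatically; the snippet below is a minimal boto3 sketch (the key name and output path are placeholders) that stores the private key for later password-less SSH access from the central system.

# Sketch: create an EC2 key pair and store the private key locally.
# Key name and output path are hypothetical placeholders.
import os
import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")
key = ec2.create_key_pair(KeyName="cpdn-keypair")

pem_path = os.path.expanduser("~/.ssh/cpdn-keypair.pem")
with open(pem_path, "w") as f:
    f.write(key["KeyMaterial"])         # private key, returned only at creation
os.chmod(pem_path, 0o600)               # SSH requires restrictive permissions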
Prerequisites include wget, unzip, and Python 2.7.x.
This step is optional but highly recommended, because it provides advanced control of the infrastructure through the shell. The following description applies to, and has been tested on, Ubuntu 14.04
First, create an “Access Key” (and secret and/or password) via the AWS web interface in the “Security Credentials” section. With these data, the “AWS_ACCESS_KEY” and “AWS_SECRET_KEY” variables should be exported/updated; please bear in mind that this mechanism will also be used for the dashboard/metrics application.
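As an illustration, the same exported variables can be picked up from Python when the dashboard/metrics application opens its connection to AWS. This is a hedged sketch only: the variable names follow the text above, while Boto also understands its own standard environment variables and configuration files.

# Sketch: build an EC2 client from the exported credentials.
import os
import boto3

ec2 = boto3.client(
    "ec2",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY"],
    aws_secret_access_key=os.environ["AWS_SECRET_KEY"],
    region_name="us-west-1",
)
regions = ec2.describe_regions()["Regions"]
print("Credentials accepted; %d regions visible" % len(regions))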
The project executes both 32- and 64-bit binaries for the simulation, so once the template instance is running, the required packages and dependencies need to be installed via
The version of BOINC used will be the latest from git
Once the BOINC client is installed, it must be configured so it will automatically run on every instance with the same parameters.
1. Create a new account in the project.
2. With the account created (or if already done), the client needs to be associated to the project by creating a configuration file with the user token.
3. Make BOINC start with the system (the ec2 user will be used because of permissions).
An essential piece of software, developed for this work, is the “simulation terminator”, which decides whether a node should shut itself down if no work units have been processed for a given amount of time (by default 6
This application will be provided upon request to the authors.
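Since the application itself is only available on request, the following is a minimal sketch of the idea behind it; the paths, the way progress is detected, and the timeout handling are illustrative assumptions.

# Sketch of the "simulation terminator": power the node off if no work
# unit has been processed for a given amount of time (default 6 h).
import os
import time
import subprocess

TIMEOUT = 6 * 3600                       # seconds without progress before shutdown
RESULTS_DIR = "/var/lib/boinc/projects"  # hypothetical location of client output

def seconds_since_last_result():
    newest = 0
    for root, _, files in os.walk(RESULTS_DIR):
        for name in files:
            newest = max(newest, os.path.getmtime(os.path.join(root, name)))
    return time.time() - newest if newest else float("inf")

while True:
    if seconds_since_last_result() > TIMEOUT:
        subprocess.call(["sudo", "poweroff"])  # the Reaper then destroys the instance
        break
    time.sleep(600)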
To install it (by default into /opt/climateprediction/), the following must be done:
When an instance is powered off, it will be terminated (destroyed) by the Reaper service that runs in the central control system.
Now that the template instance is ready (all the parameters have been configured and the BOINC client is ready to start processing tasks), the next stage is to contextualize it. This means that an OS image will be created from it, which gives our infrastructure the capacity to scale by creating new instances from this image. Unfortunately, this part is strongly tied to the cloud type, and although it can be replicated on another system, for now it will only work in this way for AWS.
Once a client has processed a work unit, the task (result) is created and sent to the defined upload server, which for CPDN is
1. Access the S3 service from the AWS dashboard.
2. Click on “Create bucket”; the name should be “CLIMATE_PREDICTION” and must be in the same region as the instances.
3. Activate (in the options) the HTTP/HTTPS server.
To secure the bucket, remember to modify the policy so that only allowed IP ranges can access it (in this case, only IP ranges from the instances and from the CPDN servers).
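A hedged sketch of such a policy, applied programmatically with boto3, is shown below; the bucket name follows the text above and the CIDR ranges are placeholders that would have to be replaced with the real instance and CPDN server ranges.

# Sketch: restrict bucket access to known IP ranges (placeholder CIDRs).
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": ["arn:aws:s3:::CLIMATE_PREDICTION",
                     "arn:aws:s3:::CLIMATE_PREDICTION/*"],
        "Condition": {"NotIpAddress": {"aws:SourceIp": [
            "203.0.113.0/24",    # placeholder: instance IP range
            "198.51.100.0/24",   # placeholder: CPDN server range
        ]}},
    }],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="CLIMATE_PREDICTION", Policy=json.dumps(policy))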
The back end of the central system consists of a RESTful (representational state transfer) API over Flask (a Python web microframework); a “simple scheduler” that runs in the background and ensures that the simulation is running with the given parameters (e.g. that all the required instances are up); and the “Reaper”, a subsystem of the simple scheduler that acts as a sort of garbage collector and terminates powered-off instances in order to release resources.
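As an illustration of the Reaper's role (a simplified assumption, not the actual code), it only needs to list instances in the stopped state and terminate them to release resources:

# Sketch of the "Reaper": terminate (destroy) instances that have powered
# themselves off, so that stopped workers do not linger and incur costs.
import boto3

ec2 = boto3.resource("ec2", region_name="us-west-1")
stopped = ec2.instances.filter(
    Filters=[{"Name": "instance-state-name", "Values": ["stopped"]}]
)
for instance in stopped:
    print("Terminating " + instance.id)
    instance.terminate()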
The back end can be reused and integrated into another system in order to provide full abstraction over the project. The available (HTTP) requests are
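Since the actual request list is not reproduced here, the following is a hypothetical minimal sketch of a Flask back end in the spirit described above; the route names and JSON fields are assumptions for illustration only.

# Minimal sketch of a Flask-based RESTful back end (hypothetical routes).
from flask import Flask, jsonify, request

app = Flask(__name__)
simulation = {"running": False, "instances": 0}

@app.route("/simulation", methods=["GET"])
def get_simulation():
    # Return the current simulation parameters and basic state.
    return jsonify(simulation)

@app.route("/simulation", methods=["POST"])
def start_or_edit_simulation():
    # Start (or edit) a simulation with the requested number of instances;
    # the simple scheduler running in the background acts on these values.
    simulation["instances"] = int(request.json.get("instances", 0))
    simulation["running"] = True
    return jsonify(simulation), 201

@app.route("/simulation", methods=["DELETE"])
def stop_simulation():
    # Force all the instances to terminate and reset the parameters.
    simulation.update(running=False, instances=0)
    return "", 204

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)   # interface and port used by the dashboard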
A simplistic (but functional) GUI (graphical user interface) has been designed to make the execution of the simulation on the cloud more understandable.
Two control actions are available:
“Start/edit simulation” sets the parameters (cloud type, number of instances,
etc.) of the simulation and runs it. “Stop simulation” forces all the instances to terminate.
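For instance, these two control actions map naturally onto HTTP calls against the back end; the endpoint and fields below are hypothetical and match the sketch given above.

# Example client-side calls corresponding to the two control actions.
import requests

API = "http://localhost:5000/simulation"
requests.post(API, json={"cloud": "aws", "instances": 10})   # start/edit simulation
requests.delete(API)                                         # stop simulation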
There are three default metrics (default time lapse: 6 ): active instances, the number of active instances; completed tasks, the number of work units successfully completed; and simulation cost, the accumulated cost for the simulation.
The applications are intended to run on any GNU/Linux system. The only requirements (apart from Python 2.7) are Flask and Boto, which can be easily installed on any GNU/Linux:
For this step, the file
Optionally, the configuration can be set manually by editing the file
“Config.cfg” (parameters in
Now that the central system has been installed and configured, it will listen and accept connections on any network interface (0.0.0.0) on port 5000 over HTTP, so it can be accessed via a web browser. Firefox or Chromium is recommended because of JavaScript compatibility.
When starting a simulation, the number of instances will be 0. This can be changed by clicking “Edit Simulation”, setting the number in the input box, and clicking on “Apply Changes”. Within a few minutes (defined in the configuration file, in the “pollingTime” variable), the system will start to deploy instances (workers).
If the number of instances needs to be adjusted while a simulation is running, the procedure is the same as launching a new simulation (“Edit Simulation”). Please be aware that if the number of instances is reduced, unfinished work units will be lost (the scheduler will stop and terminate them on a first-in, first-out, FIFO, basis).
To stop a simulation, click “Stop Simulation”. This will reduce the number of instances to 0, copy the database as “SIMULATION-TIMESTAMP” for further analysis, and reset all the parameters and metrics.
All the authors participated in the design of the experiments and the analysis of the results. Diego Montes implemented the full infrastructure experiments. Peter Uhe carried out the benchmarking. All authors participated in the writing of this paper.
The authors declare that they have no conflict of interest.
We thank Andy Bowery, Jonathan Miller, and Neil R. Massey for all their help and assistance with the internals and specifics of the CPDN BOINC implementation. We also thank B. N. Lawrence, C. Fernández, and A. Arribas for their comments, which have helped to improve this paper. The compute resources for this project were provided under the AWS Cloud Credits for Research Program. Edited by: S. Marras. Reviewed by: C. Fernandez Sanchez and A. Arribas.