Will the mountain come to Mohammad or will Mohammad go to the mountain?
When we consider whether clouds can provide a suitable platform for high performance computing (HPC), we always talk about how cloud computing needs to evolve to suit the needs of HPC – in other words, will the mountain come to Mohammad? But there are signs that there may also be movement in the other direction: transforming HPC so that it works better in the cloud paradigm. Mohammad may have to go.
Discussions around this issue typically focus on performance: how the existing cloud hardware and software has to change. But those are not the only issues. I recently listened to a talk given by a colleague from the Joint Lab for Petascale Computing, Franck Cappello, who considered an often overlooked aspect of HPC: fault management. As it turns out, the way fault tolerance is handled for HPC applications differs dramatically from other applications and can have enormous influence over both performance and cost.
HPC applications are typically single program multiple data (SPMD) — tightly coupled codes executing in lockstep on thousands or hundreds of thousands of processors. The assumption is that if just one node in the whole computation fails, the whole computation has to be repeated. To make such failures less catastrophic – potentially throwing out many weeks’ worth of computation — we use global restart based on checkpointing: application state is periodically saved, and when a failure occurs the application is restarted from the last checkpoint. How often do we checkpoint? The answer depends mostly on a quantity called mean time between failures (MTBF) – if your checkpointing interval is greater than the MTBF, you’d have to be lucky for your computation to make much progress. As architectures evolved to support computations running on ever more nodes, the probability that at least one of those nodes would fail during the computation increased, pushing MTBF down. To compensate, MTBF became an increasingly important factor in the design of both HPC hardware and the software that executes on it.
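To make the trade-off concrete, a classic rule of thumb (Young's approximation, not something from Franck's talk) relates the checkpoint interval to the checkpoint cost and the MTBF. A minimal sketch with made-up numbers:

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the optimal checkpoint interval:
    tau = sqrt(2 * C * M), where C is the time to write one checkpoint
    and M is the mean time between failures (both in seconds)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Illustrative numbers only: a 5-minute checkpoint on a system with a
# 24-hour MTBF suggests checkpointing roughly every 2 hours.
print(round(young_interval(300, 24 * 3600) / 3600, 1))  # 2.0
```

Note how the interval shrinks as the MTBF drops: more nodes means more failures, which means more frequent checkpoints and more overhead.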
Before we go on, let’s pause and reflect: when did we last even hear of the MTBF of a cloud? Or the MTBF of a virtual cluster deployed on that cloud, for that matter? Likely never, because so far these systems have tended to support applications that are more loosely coupled, where the failure of one component does not affect all.
But here is the issue: global restart is expensive. You spend a lot of time saving state, and occasionally you also have to read it and redo part of your calculation. This affects both the overall time of your computation (when your code finishes in practice) and its cost. In fact, Franck and his colleagues estimate that global restart can account for 20% of the total HPC computation cost, rising to as much as 50% in extreme cases – and this will of course go up as the MTBF goes down. In other words, if the MTBF of a virtual cluster is low — as it is likely to be — HPC on a cloud will not only suffer longer execution times but also become prohibitively expensive due to the more frequent need for restarts. These factors combined could easily keep HPC out of clouds no matter how good their benchmark results are.
But do we really need global restart if only one component fails? Franck and his colleagues investigated this question and found that in most cases we do not. They are now working on leveraging this finding: formulating protocols that log less data and restart fewer nodes, thus significantly reducing the cost of providing fault tolerance for SPMD-style applications. The MTBF of clouds, while still an important factor, may not be a deal-breaker after all.
It seems that the pay-per-use model of cloud computing sent us all on a global efficiency drive. Before it emerged, optimizing qualities such as fault-tolerance and the resulting power usage and cost was largely a global concern driven by the resource owner. The individual users had little incentive to optimize the cost of their specific run. For this reason, progress happened largely on a global scope, e.g., by driving architecture evolution. Pay-per-use changes this point of view: it now becomes important to individual users to ensure that their run costs as little as possible. It is therefore likely that the next wave of progress will arise out of optimizing individual runs.
It will be fascinating to watch as Mohammad and the mountain maneuver around each other during the next few years ;-).
You can find more information about this and related issues on the Joint Lab publications page.
Last week (Nov 13-19) was the annual Supercomputing (SC) conference. This year it was held in New Orleans, Louisiana. Cloud computing was featured by vendors and speakers throughout the conference. There were far too many cool products, talks, and papers to mention in a single post; however, a few highlights that we are thankful to have caught in person include:
- Two representatives from Platform Computing presented a large-scale cloud deployment being tested at the CERN laboratory in “Building the World’s Largest HPC Cloud.” CERN is testing Platform ISF to run scientific jobs in a virtualized environment. Results included reports of launching several thousand VMs and a comparison of image distribution techniques.
- In “Virtualization for HPC”, members of the academic (Ohio State University, ORNL) and industrial (VMware, Univa UD, Deopli) communities shared their vision of a future for virtualization technologies in HPC. Topics discussed included pro-active fault tolerance using migration, virtualized access to high-performance interconnects, and new hypervisor technologies designed for exascale computing.
- In “Low Latency, High Throughput, RDMA and the Cloud in Between” representatives from Mellanox, Dell, and AMD discussed the advantages of cloud computing and highlighted the importance of reducing latency and increasing throughput for scientific communities. RDMA over Converged Ethernet (RoCE) was emphasized as a specific effort toward reducing latency in virtualized environments.
- The work in “Elastic Cloud Caches for Accelerating Service-Oriented Computations” demonstrated a dynamic and fast memory-based cache using IaaS resources, specifically for a geoinformatics cyberinfrastructure. The system responds to changes in demand by dynamically adding or removing IaaS nodes from the cache.
In addition to some great cloud computing talks and sessions, cloud resources were also involved in a handful of demos and tutorials. In particular, Purdue demoed Springboard, a “hub” to work with NSF’s TeraGrid infrastructure. The hub provides a central point for researchers to collaborate and removes the need for researchers to rely strictly on the command line when interacting with the TeraGrid’s resources. Springboard also interfaces with the TeraGrid’s first cloud resource, Wispy, at Purdue. The National Center for Atmospheric Research (NCAR) and the University of Colorado at Boulder used 150 Amazon EC2 instances for the Linux Cluster Construction tutorial. The virtual machines were launched on-demand the morning of the tutorial. They provided participants with a realistic software environment for configuring and deploying a Linux cluster using a variety of open source tools such as OpenMPI, Torque, and Ganglia.
With cloud computing becoming ever more popular at SC, it would be cool to see an HPC challenge category for cloud computing, perhaps running on Amazon’s cluster compute resource, which just last week was officially included in the Top500 at number 231.
Amazon’s S3 is a great storage cloud service. It provides highly available access to highly reliable storage, at a price. It has emerged as the de facto standard for data storage clouds, for good reasons. Its REST API has allowed several third-party software vendors to build impressive clients, both GUI and command line.
However, it is closed source and unavailable to the numerous data centers actively used for science. Can private data cloud providers that already have substantial amounts of hardware let their users take advantage of these known interfaces and existing clients? Can the scientific community, which already has access to vast amounts of computing and storage power but not to expense accounts, take advantage of these data storage cloud innovations?
To address these questions the Nimbus project is presenting Cumulus at Supercomputing 2010. Cumulus is an S3-protocol-compliant data storage cloud service. It is not the only S3 look-alike, but it may well be the easiest to use. Additionally, Cumulus provides features uncommon among the others: it has a baked-in, backward-compatible, protocol-compliant feature that enforces user storage quotas. This feature allows private clouds to control fair sharing by means other than credit card charges. Cumulus lets cloud providers control their own ratio of cost (amount of redundant hardware) to availability and, more importantly, their own performance and locality trade-offs. In the accompanying graph we show how Cumulus measures up to popular data transfer protocols like GridFTP.
For more information, including more graphs like the one above, check out the Cumulus poster, and come talk to us as we present it in the Main Lobby at Supercomputing 2010 on Tuesday night between 5:15 and 7 pm.
I’ve been wanting to say a few words about sky computing for a while, and eventually iSGTW forestalled me with a very nice article on the topic. It describes cool work by Nimbus committer Pierre Riteau, who created a virtual cluster of over a thousand cores on resources leased from six Nimbus clouds: three provided by Grid’5000 and three by FutureGrid.
“Sky computing” was a name we coined back in 2008 to describe the idea of operating in a multi-cloud environment. It addresses issues of provider interoperability and comparison between providers (cloud markets), as well as the end-user concerns the original paper focused on: the abstractions and tools required to provide an integrated and secure environment over resources provisioned in multiple, potentially distributed clouds.
Pierre’s work pushes the limits of this concept in terms of scale: distributing more images to more distributed clouds for longer sustained leases obtained faster. He started out with a Grid’5000 Large Scale Deployment Challenge award earlier this year for managing deployments numbering in the hundreds, and by mid-summer was investigating the properties of distributed clusters of thousands of cores. I can’t wait to see what happens next ;-).
More information about Pierre’s sky computing ventures can be found in his TeraGrid 2010 poster and the recent ERCIM article.
Today we made open source Infrastructure-as-a-Service capabilities for science just a little bit better… The new Nimbus 2.6 gets your images to the nodes much faster and provides capabilities that allow administrators to easily and dynamically shift resources between their tried-and-true batch-scheduled cluster and a cloud, depending on where they are needed more. The Nimbus Context Broker gets shiny new REST-based interfaces, which open it up to new applications.
The highlight of the release is unquestionably the introduction of the fast-propagation tool LANTorrent. Deploying many identical virtual machine images is a very common pattern in scientific use of clouds: it occurs when researchers deploy a virtual cluster, augment an existing cluster by adding more worker nodes, or enlarge the set of platforms for a volunteer computing application.
Tests of LANTorrent on Magellan by John Bresnahan: distributing a thousand 1.2 GB images in 10 minutes.
LANTorrent leverages this pattern to optimize image distribution within your cloud. Built on the ideas underlying BitTorrent, LANTorrent uses streaming to propagate the VM image file to hundreds or thousands of nodes. Unlike copying, streaming does not wait for the file to fully arrive at its destination before forwarding it on – by the time the end of the file reaches the first node, its beginning may have been forwarded to many other nodes. The only limiting factor on how many streams can be “in flight” at any given time is the capacity of the LAN switch. And this is where the “LAN” part of LANTorrent comes in – it tries to optimize the number of concurrent transfers so as to saturate the switch without causing contention. In addition, LANTorrent detects all same-image transfers taking place within a time window – even if they are part of different requests – and includes them in existing streams. As shown in the enclosed graph, on ANL’s Magellan cloud we were able to deliver a thousand 1.2 GB VM images within 10 minutes. Moreover, the graph also shows that LANTorrent scales better than linearly (the red line in the graph) thanks to multiple-transfer detection.
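The streaming idea can be sketched in a few lines. This is a toy model of pipelined forwarding, not LANTorrent’s actual code: each hop passes every chunk downstream as soon as it arrives, instead of waiting for the whole image first.

```python
# Toy sketch of LANTorrent-style streaming (not the real implementation):
# every node forwards each chunk the moment it arrives, so downstream
# nodes start receiving before the source has finished sending.

def forward(upstream, store):
    """One 'node': keep a local copy of each chunk and immediately
    yield it to the next node in the chain."""
    for chunk in upstream:
        store.append(chunk)
        yield chunk

def stream_to_chain(source_chunks, n_nodes):
    """Stream an image through a chain of n nodes; return each
    node's reassembled copy."""
    stores = [[] for _ in range(n_nodes)]
    stream = iter(source_chunks)
    for store in stores:
        stream = forward(stream, store)
    for _ in stream:  # draining the last hop pulls chunks through all hops
        pass
    return [b"".join(s) for s in stores]

image = [b"chunk%d|" % i for i in range(5)]
copies = stream_to_chain(image, n_nodes=3)
print(copies[0] == copies[-1] == b"".join(image))  # True
```

In the real system the hops are network transfers and the pipeline depth is tuned to saturate the switch; the generator chain here just makes the pipelining visible.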
Another significant addition in this release is the dynamic cloud configuration management tool. Adding or removing nodes from a Nimbus cloud used to be complex: you had to take the cloud down, manually adjust various files, and then restart it. This new feature makes all that a thing of the past – now you simply boot the node you want to add (configured with Nimbus), run the nimbus-nodes program, and get back to enjoying your coffee. Most importantly, the adjustment takes place on the fly, while the cloud is running – no downtime and no inconvenience to your users. This dynamic adjustment paves the way for future improvements that will allow administrators to easily and dynamically move resources between, say, their PBS cluster and a cloud, as we plan to do in FutureGrid. A host of new upgrade tools accompanying this addition means that your Nimbus cloud is easier to manage than ever.
Finally, last but not least, the Nimbus Context Broker — successfully used to deploy turnkey virtual clusters for scientific production runs — has a new interface, based on HTTP/REST, in addition to WSRF. This makes it much easier to use the Context Broker in new integrations — in particular ones that rely on different languages – and has been a longstanding user request.
If you are looking for more information about the new features, take a look at the changelog – there are lots of new things to explore. And in case you are looking for something to do, we will be announcing feature lists for 2.7 soon ;-).
Right on the heels of Amazon’s groundbreaking news on the Cluster Compute instances a couple of days ago, comes this announcement about a partnership between CENIC, Pacific NorthWest GigaPoP (PNWGP), and Amazon: two 10 Gigabit per second (Gbps) connections to Amazon S3 and EC2. This connection will be available to CENIC and PNWGP member institutions (educational and research institutions on the West Coast and in the Pacific North-West) — among others, the many ocean scientists of the Ocean Observatory Initiative (OOI) who we are working with to develop cloud-based scientific infrastructure.
In other words, we can now not only use IaaS to lease supercomputers – we can also move data to those supercomputers fast. While these high-speed connections to Amazon are not yet generally available, it will be interesting to see what scientists will be able to do with them, what performance will be achievable in practice, and how it will change the scientific use of cloud computing. That’s two great developments for cloud computing for science in one week – it seems that the pace of progress on this front is accelerating ;-).
We all woke up to a game-changing announcement today: Amazon announced Cluster Compute instances designed to support the kinds of closely coupled workloads that high performance computing (HPC) relies on. The Cluster Compute instances consist of a pair of quad-core Intel “Nehalem” processors with 23 GB of RAM, and 1690 GB of local instance storage. But by far the best part of the offering is the 10 Gbps network that connects Cluster Compute instances — essential for HPC applications.
The real headline, though, is that for the first time ever a virtual cluster could be featured on the Top500 list. Amazon published the result of the High Performance Linpack benchmark on a virtual cluster made up of 880 Cluster Compute instances (7040 cores), measuring the overall performance at 41.82 TeraFLOPS. This would place a virtual cluster made of Cluster Compute instances in the 146th position on the Top500 list. For a sense of scale, the somewhat larger TACC Lonestar cluster, serving as a computational resource in the TeraGrid, currently occupies the 123rd position on that list.
How much does it all cost? A quick back-of-the-envelope calculation shows that at $1.60 per hour, the Cluster Compute on-demand instances cost about $14K per node per year. However, if you use reserved instances the price drops significantly. Based on 100% utilization of a 3-year reserved instance (which is more similar to buying a supercomputer for 3 years), you’d pay the equivalent of only $0.81 per instance-hour ($6,590 up front plus $0.56 per hour), in other words about $7K per node per year – and that’s all-inclusive, with no additional operating costs. This rough calculation does not include the cost of EBS and data transfer, which to some extent depends on the use of the cluster — still, something to keep in mind.
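The back-of-the-envelope numbers above can be reproduced directly from the prices quoted in the post:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

# Prices as quoted in this post (July 2010); they are not current rates.
on_demand = 1.60 * HOURS_PER_YEAR              # on-demand, per node-year
reserved_upfront = 6590 / 3                    # 3-year fee spread per year
reserved_hourly = 0.56 * HOURS_PER_YEAR        # running cost per year
reserved = reserved_upfront + reserved_hourly  # reserved, per node-year

effective_hourly = reserved / HOURS_PER_YEAR   # blended hourly rate
print(round(on_demand), round(reserved), round(effective_hourly, 2))
# 14016 7102 0.81
```

So the ~$14K and ~$7K per node per year figures, and the $0.81 effective hourly rate, all follow from the list prices under the 100%-utilization assumption.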
The issue of how exactly cloud computing differs from grid computing was responsible for much controversy over the last year. Here are my two cents on how Infrastructure-as-a-Service (IaaS) cloud computing and grid computing differ (also discussed in the Sky Computing paper).
At some level, both cloud computing and grid computing represent the idea of using remote resources. However, grid computing is built on the assumption that control over the manner in which resources are used stays with the site, reflecting local software and policy choices. These choices are not always useful to remote users who might need a different operating system, or login access instead of a batch scheduler interface to a site. Reconciling those choices between multiple groups of users proved to be complex, time-consuming, and expensive. Looking back, leaving complete control over the resources with the site was a pragmatic choice that enabled very fast adoption of a radically transformative technology. On the other hand, once the technology became successful, this factor made it difficult for it to scale to many user groups with different (and sometimes conflicting) requirements of what the resource should provide.
IaaS cloud computing represents a fundamental change of assumption: when a remote user “leases” a resource, control of that resource is turned over to that user. This change in assumption was enabled by the availability of a free and efficient virtualization technology: the Xen hypervisor. Before virtualization, turning control over to the user was fraught with danger: the user could easily subvert a site. But virtualization provides a way of isolating the leased resource from the site in a secure way that mitigates this danger. Virtual machines can be deployed very fast (on the order of milliseconds); when, in addition, the overhead and price associated with reliable virtualization technology went down, it suddenly became viable and cost-effective to use virtual machines to lease resources to remote users.
The ability to turn control of a remote resource over to the user makes it possible to develop tools, such as Nimbus, and provide services, such as Amazon’s EC2 or the Science Clouds, that allow users to carve out their own custom “site” from remote resources. At the same time, this change of assumption challenges established notions of what it means to be a site, as we continue to struggle with the new meaning and implications of domain names, site licenses, and established security practices.
Amazon AWS recently announced that EC2 instances can be configured to launch from EBS volumes instead of bundled disk images.
Science users launching heterogeneous clusters may be able to take advantage of this to streamline image bundling. Such clusters often share a base image layout. Because these AMIs can now reference any number of EBS volumes in their external description, including the root disk, you can customize each partition and “mix and match” root disks and partitions more easily to assemble a cohesive cluster. That is more convenient than maintaining such a partition organization separately and bundling images for each cluster node type, which is traditionally time consuming.
Another change is that the AMI can be larger than 10 GB (up to 1 TB) when launched in this manner: some clusters we have seen are pushing that limit even without any data sets!
There is an added cost involved in using EBS which must be taken into account: EBS is charged by both provisioned volume size and number of I/O operations, so this may not pay off in every case.
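As a rough illustration of that trade-off, here is a toy EBS cost model; the per-GB and per-I/O rates below are assumptions based on circa-2009 list prices, not current figures:

```python
# Toy EBS cost model. The rates are assumptions (circa-2009 list
# prices: $0.10 per GB-month, $0.10 per million I/O requests).
GB_MONTH_RATE = 0.10
IO_RATE_PER_MILLION = 0.10

def ebs_monthly_cost(volume_gb, million_ios):
    """Monthly cost of one volume: storage charge plus I/O charge."""
    return volume_gb * GB_MONTH_RATE + million_ios * IO_RATE_PER_MILLION

# A 100 GB root volume doing 50 million I/Os a month: storage dominates
# here, but an I/O-heavy workload can flip that balance.
print(round(ebs_monthly_cost(100, 50), 2))  # 15.0
```

For a lightly used boot volume the dollar amounts are modest, but the per-I/O charge is the term to watch for data-intensive cluster workloads.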
I recently attended an immensely interesting workshop on using cloud computing for systems biology computations, co-located with SC09. The agenda and the presentations are available online from the workshop pages and are well worth a look. Here are some impressions from the workshop.
The workshop began with a discussion of current challenges in the biosciences. One of the most compelling is personalized medicine, which helps physicians tailor treatments to individual patients based on feedback obtained at the genetic and molecular level. For example, knowledge of genetic variations can now help physicians better assess treatment risks, manage drug dosing, detect diseases at earlier stages, and optimize treatments such as breast cancer therapy. In his introductory talk, Eugene Kolker said that today there are already hundreds of patients treated based on information obtained from their genetic signatures as part of experimental programs. He also emphasized that the main obstacle to progress in this area is not obtaining the data but the response time and the ability to store, process, and analyze it to obtain the right information. And this brings us to cloud computing, the workshop’s “prime suspect” for processing, analyzing, and storing on demand.
Simon Twigger from the Medical College of Wisconsin made a very compelling case for why bioscientists need cloud computing. He based his case on the pipette analogy – a common tool in molecular biology and medicine, typically equipped with a disposable tip. The analogy was particularly apt, as probably 90% of the audience uses a pipette on a daily basis. Simon proposed the following: “Imagine that you are running your lab with only one pipette tip to share.” [Huge laughter from the audience.] He then went on to explain how this assumption would change the work pattern in his laboratory. First, everybody would have to wait in line to use the pipette tip. Because of this waiting, they would do a lot less work. They would also do only small-scale things, because the imaginary pipette tip is small (moving large quantities of liquid would take weeks!). They would do fewer things, because washing the pipette between uses is a pain. And finally, they would not try anything risky, because what if the pipette tip, e.g., becomes clogged? Having only one 16-node cluster for the lab, Simon explained, was exactly like having only one pipette tip: it was a bottleneck for the work in the lab. You queue your program and can’t make progress until the results become available. Because of that you do less work. Since the cluster is small, you try only small-scale things, as well as fewer types of things, because different types of things may require configuration changes. And the risky stuff you don’t do at all.
The panel in the afternoon presented some options for cloud computing for science. Kathy Yelick from LBNL and our own Pete Beckman described the recently funded DOE Magellan project, a research effort looking at how to build clouds for science. Afterwards, Owen White from the University of Maryland started a discussion on what makes cloud computing compelling for science. In addition to the issues brought up earlier by Simon, ease of use plays a very significant role. Owen described how his group tried to use the TeraGrid and found it too complex to use, both procedurally and technically – they were not able to overcome the entry barrier despite the significant resource incentive. The ease-of-use question has many aspects. Pete summed it up by saying that half the users tell him they want to develop their own VM images and half that they don’t. A rough show of hands showed that in that particular audience everybody thought that developing their own image was much simpler than adapting their application to an environment provided by somebody else (because that is effectively the alternative). This does raise an issue, however: for some people the need to develop their own image may be too high a barrier.
As if to address this issue, the panel was followed by a presentation from Sam Angiuoli from the University of Maryland. Sam described an appliance for automated analysis of sequence data developed for the bio community. A model seems to be emerging in which some users take the initiative to develop appliances as a service to their community. This is similar to, e.g., the high-energy physics CERNVM project, which provides images supporting all four LHC experiments.
The workshop was wrapped up by a talk from Deepak Singh from Amazon Web Services who described AWS capabilities but also the different ways in which various projects use them. It’s fun to see new potential for science emerge!