I recently attended an immensely interesting workshop on using cloud computing for systems biology computations. The workshop was co-held with SC09. The agenda and the presentations are available online from the workshop pages and are well worth a look. Here are some impressions from the workshop.
The workshop began with a discussion of current challenges in the biosciences. One of the most compelling is personalized medicine, which helps physicians tailor treatments to individual patients based on feedback at the genetic and molecular level. For example, knowledge of genetic variations can now help physicians better assess treatment risks, manage drug dosing, detect diseases at earlier stages, and optimize treatments such as breast cancer therapy. In his introductory talk, Eugene Kolker said that hundreds of patients are already being treated based on information obtained from their genetic signatures as part of experimental programs. He also emphasized that the main obstacle to progress in this area is not obtaining the data but the response time and the ability to store, process, and analyze it to extract the right information. And this brings us to cloud computing, in this workshop the “prime suspect” for on-demand processing, analysis, and storage.
Simon Twigger from the Medical College of Wisconsin made a very compelling case for why bioscientists need cloud computing. He based his case on the pipette analogy – a common tool in molecular biology and medicine, typically equipped with a disposable tip. The analogy was particularly apt, as probably 90% of the audience used a pipette daily. Simon proposed the following: “Imagine that you are running your lab with only one pipette tip to share.” [Huge laughter from the audience.] He then went on to explain how this assumption would change the work pattern in his laboratory. First, everybody would have to wait in line to use the pipette tip. Because of this waiting, they would do a lot less work. They would also do only small-scale things, because the imaginary pipette tip is small (moving large quantities of liquid would take weeks!). They would do fewer kinds of things, because washing the pipette between uses is a pain. And finally, they would not try anything risky – because what if the pipette tip becomes clogged? Having only one 16-node cluster for the lab, Simon explained, was exactly like having only one pipette tip – it was a bottleneck for the work in the lab. You queue your program and can’t make progress until the results become available. Because of that you do less work. Since the cluster is small, you also try only small-scale things – as well as fewer types of things, because different types of things may require configuration changes. And the risky stuff you don’t do at all.
The panel in the afternoon presented some options for cloud computing for science. Kathy Yelick from LBNL and our own Pete Beckman described the recently funded DOE Magellan project – a research project looking at how to build clouds for science. Afterwards, Owen White from the University of Maryland started a discussion on what makes cloud computing compelling to science. In addition to the issues brought up earlier by Simon, ease of use plays a very significant role. Owen described how his group tried to use the TeraGrid and found it too complex, both procedurally and technically – they were not able to overcome the entry barrier despite the significant resource incentive. The ease-of-use question has many aspects. Pete summed it up by saying that half the users tell him that they want to develop their own VM images and half that they don’t. A rough show of hands showed that in that particular audience everybody thought that developing their own image was much simpler than adapting their application to an environment provided by somebody else (which is effectively the alternative). This does raise an issue, however: for some people the need to develop their own image may be too high a barrier.
As if to address this issue, the panel was followed by a presentation from Sam Angiuoli from the University of Maryland. Sam described an appliance for automated analysis of sequence data developed for the bio community. A model seems to be emerging in which some users take the initiative to develop appliances as a service to their community. This is similar to, e.g., the high-energy physics CERNVM project, which provides images supporting all four LHC experiments.
The workshop was wrapped up by a talk from Deepak Singh of Amazon Web Services, who described not only AWS capabilities but also the different ways in which various projects use them. It’s fun to see new potential for science emerge!