Take Slide, for example: we have a solid amount of hardware, with hundreds of powerful machines constantly churning away on a number of tasks: serving web pages, providing back-end services, processing database requests, recording metrics, etc. If I start the work week needing a new pool of machines either set up or allocated for a particular task, I can usually have hardware provisioned and live by the end of the week (depending on my Scotch offering to the Operations team, I can get it as early as the next day). If I can have the real thing, I clearly have no need for cloud computing or virtualization.
That's what I thought, at least, until I started to think more about what would be required to get Slide closer to the lofty goal of continuous deployment. Since I was involved in pushing for and setting up our Hudson CI server, I constantly check on the performance of the system and help make sure jobs are chugging along as they should be; I've become the de facto Hudson janitor.
Our current continuous integration setup involves one four-core machine running three separate instances of our server software as different users, processing jobs throughout the day. One "job" typically consists of a full restart of the server software (Python) and a run of literally every test case in the suite (we walk the entire tree aggregating tests). On average, one job takes close to 15 minutes to complete and executes 400+ test cases (and growing). Fortunately, and unfortunately, our Hudson machine can no longer keep up with this load during the development peak in the middle of the day. This is where the "cloud" comes in.
We have a few options at this point:
- Set up one or more additional machines
- Rethink how we provision hardware for continuous integration
The fundamental problem with provisioning resources for continuous integration, at least at Slide, is that the requirements are bursty at best. We typically queue a job for a particular branch when a developer executes a git push (via the Hudson API and a post-receive hook). From around 9 p.m. until 9 a.m. we need only about two actual "executors" inside Hudson to handle the workload the night-owl developers tend to place on it. From 12 p.m. until 7 p.m., however, our needs fluctuate rapidly between 4 and 10 executors. To make matters worse, due to "natural traffic patterns" in how we work, mid-afternoon on Wednesday and Thursday requires even more resources as teams prepare releases and finish up milestones.
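For the curious, here is a rough sketch of what a post-receive hook that queues a Hudson job per pushed branch might look like. This is not our actual hook: the host name, the job name and the BRANCH build parameter are all made up for illustration, and it assumes the job is parameterized so that Hudson's remote API accepts a buildWithParameters request.

```python
import sys
import urllib.parse
import urllib.request

# Hypothetical values; the real Hudson host and job name are assumptions.
HUDSON_URL = "http://hudson.example.com"
JOB_NAME = "server-tests"


def pushed_branches(hook_input):
    """Yield branch names from post-receive input lines ("<old> <new> <ref>")."""
    for line in hook_input:
        parts = line.split()
        if len(parts) != 3:
            continue
        old_rev, new_rev, ref = parts
        # An all-zero new revision means the branch was deleted: nothing to test.
        if ref.startswith("refs/heads/") and set(new_rev) != {"0"}:
            yield ref[len("refs/heads/"):]


def build_url(branch):
    """Return the remote-API URL that queues a build for one branch."""
    # BRANCH is an assumed name for the job's build parameter.
    query = urllib.parse.urlencode({"BRANCH": branch})
    return "%s/job/%s/buildWithParameters?%s" % (HUDSON_URL, JOB_NAME, query)


def queue_builds(hook_input, post=urllib.request.urlopen):
    """Queue one build per pushed branch and return the URLs requested."""
    urls = [build_url(branch) for branch in pushed_branches(hook_input)]
    for url in urls:
        post(url, data=b"")  # a non-None body makes urlopen send a POST
    return urls
```

Installed as hooks/post-receive in the central repository, the script would end with a call to queue_builds(sys.stdin), since git feeds the hook one "old new ref" line per updated ref on standard input. The post argument exists only so the function can be exercised without a live Hudson server.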
There are really only two ways to solve the problem: build a continuous integration farm with full knowledge that its capacity will sit unused for large amounts of time, or look into "cloud computing" with service providers like Amazon EC2, which would allow Hudson slaves to be provisioned on demand. The maintainer of Hudson, Kohsuke Kawaguchi, has already started work on "cloud support" for Hudson via the EC2 plugin, which makes this a real possibility. (Note: using EC2 for this at Slide was Dave's idea, not mine :))
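To put a rough number on how much of a fixed farm would sit idle, here is a back-of-the-envelope calculation. Only the overnight figure and the 4-to-10 afternoon range come from the numbers above; the afternoon midpoint and the demand for the remaining hours are my own guesses, so treat the result as an illustration rather than a measurement.

```python
# (hours per day, executors needed) for each part of the day.
DEMAND = [
    (12, 2),  # 9 p.m. to 9 a.m.: about two executors
    (7, 7),   # 12 p.m. to 7 p.m.: assumed midpoint of the 4-10 range
    (5, 4),   # remaining hours: assumed, no figure is given for these
]


def executor_hours_needed(demand):
    """Total executor-hours of work per day for a demand profile."""
    return sum(hours * executors for hours, executors in demand)


def fixed_farm_utilization(demand, farm_size):
    """Fraction of a fixed farm's daily executor-hours actually used."""
    total_hours = sum(hours for hours, _ in demand)
    return executor_hours_needed(demand) / (farm_size * total_hours)


needed = executor_hours_needed(DEMAND)
utilization = fixed_farm_utilization(DEMAND, farm_size=10)
print("executor-hours needed per day: %d" % needed)
print("utilization of a 10-executor farm: %.1f%%" % (utilization * 100))
```

Under these assumptions, a farm sized for the 10-executor peak would be busy well under half the time, which is exactly the waste that provisioning on demand is supposed to avoid.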
Using Amazon EC2 isn't the only way to solve this "bursty" problem, however; we could just as easily solve it in house by provisioning Xen guests across a few machines. The downside of doing it yourself is the amount of time between when you know you need more capacity and when you can actually add that capacity to your own little "cloud". Considering that Amazon has an API not only for running instances but also for terminating them, there is a compelling reason to "outsource" the problem to Amazon's cloud.
I recommend following Kohsuke's development of the EC2 plugin for Hudson closely, as continuous integration and "the cloud" seem like a match made in heaven (alright, that pun was unnecessary, it sort of slipped out). At the end of the day, it comes down to a very fundamental business decision: which is more cost-effective, building my own farm of machines or using somebody else's?
(footnote: I'll post a summary of how and what we eventually do to solve this problem)