Thursday, June 11, 2009

clouds and peer-to-peer

We've been asked a few times about the relationship between clouds and peer-to-peer systems, and we wanted to take this opportunity to respond.

Definitions
We differentiate between peer-to-peer (p2p) techniques and p2p systems. The former refers to a set of techniques for building self-organizing distributed systems. These techniques are often useful in building datacenter-scale applications, including those hosted in the cloud. For instance, Amazon's Dynamo datastore relies on a structured peer-to-peer overlay, as do several other key-value stores.
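
As a deliberately simplified illustration of the structured-overlay idea, here is a minimal consistent-hashing sketch in Python; it is not Dynamo's implementation, just the core technique: nodes and keys are hashed onto a ring, each key is served by the first node clockwise from its position, and nodes can join or leave while moving only a small fraction of the keys.

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    """Map a node name or key to a point on the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Sorted (position, node) pairs form the ring.
        self._points = sorted((ring_position(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        """Return the node responsible for key: the first node clockwise."""
        idx = bisect.bisect(self._points, (ring_position(key), chr(0x10FFFF)))
        return self._points[idx % len(self._points)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))   # the peer that stores this key
```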

People often use "P2P" to refer to systems that use these techniques to organize large numbers of cooperating end hosts (peers), such as personal computers and set-top boxes. In these systems, most peers necessarily communicate using the Internet, rather than a local area network (LAN). To date, the most successful peer-to-peer applications have been file sharing (e.g., Napster, BitTorrent, eDonkey), communication (e.g., Skype), and embarrassingly parallel computations, such as SETI@home and other BOINC projects.

Limitations
The main appeal of p2p systems is that their resources are often "free", coming from individuals who volunteer their machines' CPUs, storage, and bandwidth. Offsetting this, we see two key limitations of p2p systems.

First, p2p systems lack a centralized administrative entity that owns and controls the peer resources. This makes it hard to ensure high levels of availability and performance. Users are free to disable the peer-to-peer application or reboot their machine, so a great degree of redundancy is required. This makes p2p systems a poor fit for applications requiring reliability, such as web hosting, or other sorts of server applications.

This decentralized control also limits trust. Users can inspect the memory and storage of a running application, meaning that applications cannot safely store confidential information unencrypted on peers. Nor can the application developer count on any particular quantity of resources being dedicated on a machine, or on any particular reliability of storage. These obstacles have made it difficult to monetize p2p services. It should come as no surprise that, so far, the most successful p2p applications have been free, with Skype being a notable exception.

Second, the connectivity between any two peers in the wide area is two or three orders of magnitude lower than between two nodes in a datacenter. Residential connectivity in the US is typically 1 Mbps or less, while in a datacenter a node can often push up to 1 Gbps. This makes p2p systems inappropriate for data-intensive applications (e.g., data mining, indexing, search), which account for a large chunk of the workload in today's datacenters.

Opportunities
Recently, there have been promising efforts to address some of the limitations of p2p systems by building hybrid systems. The most popular examples are data delivery systems, such as Pando and Abcast, where p2p systems are complemented by traditional Content Distribution Networks (CDNs). CDNs are used to ensure availability and performance when the data is not found at peers, and/or when peers do not have enough aggregate bandwidth to sustain the demand.

In another development, cable operators and video distributors have started to experiment with turning set-top boxes into peers. The advantage of set-top boxes is that, unlike personal computers, they are always on, and they can be managed remotely much more easily. Examples in this category are Vudu and the European NanoDataCenter effort. However, to date, the applications of choice in the context of these efforts have still remained file sharing and video delivery.

Datacenter clouds and p2p systems are not substitutes for each other. Widely distributed peers may have more aggregate resources, but they lack the reliability and high interconnection bandwidth offered by datacenters. As a result, cloud hosting and p2p systems complement each other. We expect that in the future more and more applications will span both the cloud and the edge. Examples of such applications are:

  • Data and video delivery. For highly popular content, p2p distribution can eliminate the network bottlenecks by pushing the distribution to the edge. As an example, consider a live event such as the presidential inauguration. With traditional CDNs, every viewer on a local area network would receive an independent stream, which could choke the incoming link. With p2p, only one viewer on the network needs to receive the stream; the stream can then be redistributed to other viewers using p2p techniques (see the back-of-the-envelope calculation after this list).
  • Distributed applications that require a high level of interactivity, such as massively multiplayer games, video conferences, and IP telephony. To minimize latency, in these applications peers communicate with each other directly, rather than through a central server.
  • Applications that require massive computation per user, such as video editing and real-time translation. Such applications may take advantage of the vast computational resources of the user's machine. Today, virtually every notebook and personal computer has a multi-core processor whose cores are mostly unused. Proposals such as Google's Native Client aim to tap into these resources.
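
Here is the back-of-the-envelope calculation behind the first bullet; the 2 Mbps stream rate and 50 viewers are illustrative assumptions, not measurements.

```python
# Load on a shared access link: traditional CDN vs. p2p redistribution.
STREAM_MBPS = 2      # assumed bitrate of one video stream
VIEWERS = 50         # assumed viewers behind the same access link

cdn_load = VIEWERS * STREAM_MBPS   # every viewer pulls an independent stream
p2p_load = 1 * STREAM_MBPS         # one copy crosses the link, then spreads locally

print(f"CDN: {cdn_load} Mbps on the incoming link")   # 100 Mbps
print(f"p2p: {p2p_load} Mbps on the incoming link")   # 2 Mbps
```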

Wednesday, May 6, 2009

Surge Computing/Hybrid Computing

In an earlier blog post [March 2, 2009], we discussed why private clouds enjoy only a small subset of the benefits of public clouds. If common APIs allowed the same application to transition between a private cloud and a public cloud, we believe application operators could enjoy the full benefits of cloud computing. We referred to this capability as "surge computing" in our Above the Clouds white paper.

Surge computing would allow developers to push just enough (possibly sanitized) data into the cloud to perform a computation and obtain an acceptable result, or seamlessly pull in resources from a public cloud when local capacity is temporarily exceeded. They could even use either the private cloud or the public cloud as a "spare" in the event that one cloud environment becomes unavailable or fails. 
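
As a concrete illustration of this decision, here is a minimal sketch of the placement logic a surge-computing scheduler might use; the classes, method names, and the sanitize step are hypothetical, not part of any existing tool.

```python
class PrivateCloud:
    def __init__(self, capacity):
        self.capacity = capacity
        self.running = 0

    def has_room(self):
        return self.running < self.capacity

    def submit(self, job):
        self.running += 1
        return f"private:{job}"

class PublicCloud:
    def submit(self, job):
        return f"public:{job}"        # pay-as-you-go instance

def sanitize(job):
    return f"{job}-sanitized"         # strip confidential fields before upload

def place_job(job, private, public):
    # Prefer local capacity; spill to the public cloud only when the
    # private cloud is temporarily full.
    if private.has_room():
        return private.submit(job)
    return public.submit(sanitize(job))

private, public = PrivateCloud(capacity=2), PublicCloud()
print([place_job(j, private, public) for j in ["job1", "job2", "job3"]])
# ['private:job1', 'private:job2', 'public:job3-sanitized']
```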

One early surge-computing tool available to SaaS developers is Eucalyptus, an open source reimplementation of the Amazon Web Services EC2 APIs. The Eucalyptus software was originally developed at UC Santa Barbara, and Eucalyptus Systems, the company commercializing it, recently raised $5.5M to provide consulting services and technical support for customers constructing private clouds. Canonical Ltd. announced that Eucalyptus will be the underlying technology used in the Ubuntu Enterprise Cloud, which previewed in the latest version of Ubuntu (9.04, released April 23, 2009). Finally, companies like RightScale have committed to allowing their customers to register their private Ubuntu Enterprise Clouds and have them managed alongside applications deployed on Amazon EC2 via a single interface (the RightScale dashboard).

Wednesday, April 29, 2009

cloud security

Security is one of the most often-cited objections to cloud computing; analysts and skeptical companies ask "who would trust their essential data 'out there' somewhere?" We didn't focus on security extensively in our paper, and we wanted to offer our analysis of what the major security concerns are with cloud computing, and what might be done about them. These are preliminary thoughts; we welcome comments and criticism. Security is not our primary area of interest, and we'd love to hear from people with operational experience.

The security issues involved in protecting clouds from outside threats are similar to those already facing large datacenters, except that responsibility is divided between the cloud user and the cloud operator. The cloud user is responsible for application-level security. The cloud provider is responsible for physical security, and likely for enforcing external firewall policies. Security for intermediate layers of the software stack is shared between the user and the operator; the lower the level of abstraction exposed to the user, the more responsibility goes with it. Amazon EC2 users have more responsibility for their security than do Azure users, who in turn have more responsibility than AppEngine customers. This user responsibility, in turn, can be outsourced to third parties who sell specialty security services. The homogeneity and standardized interfaces of platforms like EC2 make it possible for a company to offer, say, configuration management or firewall rule analysis as value-added services. Outsourced IT is familiar in the enterprise world; there is nothing intrinsically infeasible about trusting third parties with essential corporate infrastructure.

While cloud computing may make external-facing security easier, it does pose the new problem of internal-facing security. Cloud providers need to guard against theft or denial of service attacks by users. Users need to be protected against one another.

The primary security mechanism in today's clouds is virtualization. This is a powerful defense, and protects against most attempts by users to attack one another or the underlying cloud infrastructure. However, not all resources are virtualized and not all virtualization environments are bug-free. Virtualization software has been known to contain bugs that allow virtualized code to "break loose" to some extent. [1] Incorrect network virtualization may allow user code access to sensitive portions of the provider's infrastructure, or to the resources of other users. These challenges, though, are similar to those involved in managing large non-cloud datacenters, where different applications need to be protected from one another. Any large internet service will need to ensure that one buggy service doesn't take down the entire datacenter, and that a single security hole doesn't compromise everything else.

One last security concern is protecting the cloud user against the provider. The provider will by definition control the "bottom layer" of the software stack, which effectively circumvents most known security techniques. Absent radical changes in security technology, we expect that users will use contracts and courts, rather than clever security engineering, to guard against provider malfeasance. The one important exception is the risk of inadvertent data loss. It's hard to imagine Amazon spying on the contents of virtual machine memory; it's easy to imagine a hard disk being disposed of without being wiped, or a permissions bug making data visible improperly.

There's an obvious defense, namely user-level encryption of storage. This is already common for high-value data outside the cloud, and both tools and expertise are readily available. The catch is that key management is still challenging: users would need to be careful that the keys are never stored on permanent storage or handled improperly. Providers could make this simpler by exposing APIs for things like curtained memory or security-sensitive storage that should never be paged out.
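
As a rough sketch of what user-level encryption looks like in practice, the following assumes the third-party Python cryptography package; the put_object call is a hypothetical stand-in for whatever blob-store API the provider exposes, and the key must be managed entirely on the client side.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # keep this key out of the cloud
cipher = Fernet(key)

plaintext = b"records the provider should never be able to read"
ciphertext = cipher.encrypt(plaintext)

# put_object("bucket/records.enc", ciphertext)   # hypothetical blob-store call;
#                                                # the provider sees only ciphertext
assert cipher.decrypt(ciphertext) == plaintext   # decryption stays client-side
```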

[1] Indeed, even correct VM environments can allow the virtualized software to "escape" in the presence of hardware errors. See Sudhakar Govindavajhala and Andrew W. Appel, Using Memory Errors to Attack a Virtual Machine. 2003 IEEE Symposium on Security and Privacy, pp. 154-165, May 2003.

Monday, April 20, 2009

Cloud computing, law enforcement and business continuity

In our Above The Clouds white paper, we identified various obstacles to the growth of Cloud Computing including data confidentiality and auditability as well as business continuity in the event of an outage at the cloud vendor.

Recently, a colocation facility owned by Core IP Networks LLC was raided by the FBI and the entire datacenter was shut down. "Millions of dollars' worth" of computers, many owned by other companies colocated in the datacenter that had no connection to the companies being investigated by the FBI, were confiscated and those sites went offline. Some of the companies subsequently went out of business. Spreading one's cloud application over multiple physical datacenters may protect against natural disasters, but if those datacenters are all operated by a single provider or in a single jurisdiction, customers might still be exposed to other business continuity disruptions such as this one.

Core IP Networks' CEO, Matthew Simpson, posted a letter to inform customers of the situation as well as to voice concern over the unfairness of the FBI's operation to many of the innocent "bystander" customers who suffered service outages as a result. His letter concludes: "If you run a datacenter, please be aware that in our great country, the FBI can come into your place of business at any time and take whatever they want, with no reason." Indeed, noted technologist and technology blogger James Urquhart wonders whether the U.S. legal system will be a hindrance to cloud computing adoption.

The problem is hardly unique to the United States. The massive government-initiated shutdowns of Swedish ISPs used by the Pirate Bay, a group being investigated for trafficking in copyrighted digital media, similarly resulted in unexpected downtime for many companies that were unrelated to the Pirate Bay but had the misfortune to be housed in the same facility.

These incidents also illustrate what we called reputation fate sharing in the paper: the behavior of a single cloud customer can affect the reputation of other customers, perhaps to the extreme degree that computers belonging to innocent bystanders are seized.

Tuesday, March 17, 2009

Cloud computing in education

Berkeley's computer science program has a long tradition of integrating research into teaching, at all levels from undergraduate to PhD. Last year we piloted a successful SaaS project course using Ruby on Rails; this year we offered a more advanced version of the course that introduces students to the challenges of SaaS operations (scalability, availability, etc.) using cloud computing, with a generous donation of AWS credits from Amazon. I wrote a short article for Berkeley's IT newsletter on why we did this and what our experiences were. It turns out that besides being easy to administer, utility computing was a great fit for the bursty demand associated with a lab/project intensive course. Amazon will be expanding its support for cloud computing in education soon, and I'm sure we will be looking at moving other courses to cloud computing as well.

Monday, March 2, 2009

Is Everything Cloud Computing?

We waited for the dust to settle before commenting about reactions to our Berkeley View paper. One of the most frequent comments that we got from several bloggers was that our definition of Cloud Computing is too narrow, and does not include, for example, internal data centers. Granted, a lot of cloud computing ideas can make private data centers more efficient. However, as we explain below, there are two serious drawbacks to including internal datacenters in Cloud Computing.
  1. It will be very hard to come up with an easy-to-apply, widely-adopted definition that clearly demarks when an internal datacenter should be properly included in Cloud Computing and when it should not.
  2. Even if you overcome drawback 1, many of the generalizations that apply to our definition of Cloud Computing will be incorrect for a more inclusive definition of Cloud Computing. Such inconsistency is a reason some think the claims for Cloud Computing are just hype.
To come up with our definition of Cloud Computing, we spent six months reading white papers and blogs, arguing about this issue ourselves, and receiving comments on drafts of our paper from dozens of leaders in industry. Like others, we found a lot of imprecision and inconsistency as to what is and is not Cloud Computing. We think imprecise definitions cause the allergic reactions to "Cloud Computing" claims exhibited by prominent figures like Larry Ellison, who said, “we’ve redefined Cloud Computing to include everything that we already do.”

Hence, to be more precise, in Above the Clouds, we defined Cloud Computing as follows:

Cloud Computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services. The services themselves have long been referred to as Software as a Service (SaaS), so we use that term. The datacenter hardware and software is what we will call a Cloud. When a Cloud is made available in a pay-as-you-go manner to the public, we call it a Public Cloud; the service being sold is Utility Computing. … We use the term Private Cloud to refer to internal datacenters of a business or other organization that are not made available to the public. Thus, Cloud Computing is the sum of SaaS and Utility Computing, but does not normally include Private Clouds.
Regarding the first drawback to a more expansive definition, you can include nearly everything in IT involving server hardware or software, leaving yourself vulnerable to people like Larry Ellison calling you a marketing charlatan. Alternatively, you have to come up with further distinctions of which kinds of internal datacenters are Cloud Computing and which are not. Good luck coming up with a short, crisp definition and having it widely adopted.

Regarding the second drawback to including internal datacenters, an expanded definition will break most generalizations about Cloud Computing. Here are two examples. Virtually no internal datacenters are large enough to see the factors of 5 to 7 in cost advantages from the economies of scale enjoyed by huge datacenters, which, as we say in the report, we believe are a defining factor of Cloud Computing. In addition, many internal datacenters do not have the fine-grained accounting in place to tell users what resources they are using, which makes it hard to inspire the resource conservation encouraged by the pay-as-you-go billing model, another characteristic we identify as unique to Cloud Computing.

Of course, it is possible to run an internal datacenter using exactly the same APIs and policies of Public Clouds; presumably, that is how Amazon and Google got started in this business. Running an internal data center this way leads to some advantages of Cloud Computing, such as improved utilization and resource management, but not to all of the advantages we identified, such as high elasticity, pay-as-you-go billing, and economies of scale. As we say in the report, we also think that using cloud APIs in an internal data center will enable what we call Surge Computing: in times of heavy load, you outsource some tasks from an internal data center into a Cloud, thus mitigating the risk of under-provisioning.

Returning to the analogy we used in Above the Clouds, the hardware world has largely separated into semiconductor foundries like TSMC and fab-less chip design companies like NVIDIA. However, some larger companies do have internal fabs that they precisely match to the TSMC design rules so that they can do “Surge Chip Fabrication” at TSMC when chip demand exceeds their internal capacity. Indeed, “Surge Chip Fab” is a significant fraction of TSMC’s business.

Just as those in the hardware world do not consider companies that use Surge Chip Fab to be semiconductor foundries, for the two reasons above we do not recommend including in Cloud Computing any private datacenter that imitates some characteristics of Public Clouds. Surge Computing is a more accurate label for such datacenters, even if it doesn’t have the same sizzle as Cloud Computing.

Although our restricted definition may limit which products and services are labeled Cloud Computing, by being precise we aim to prevent allergic reactions and thereby enable a more meaningful and constructive discussion of the current state of Cloud Computing and its future.

Tuesday, February 24, 2009

Eucalyptus to be Included in Next Ubuntu Release

One of the obstacles we identified in our paper was API-based vendor lock-in: if you've written your application to run on a particular cloud, and especially if you have automated management and provisioning tools, moving to a different provider, with different APIs, can be a significant expense. Mark Shuttleworth recently announced on the Ubuntu Developers mailing list that the next release of Ubuntu, Karmic Koala, "aims to keep free software at the forefront of cloud computing by embracing the API's of Amazon EC2, and making it easy for anybody to set up their own cloud using entirely open tools." Specifically, they will be including support for Eucalyptus, the University of California, Santa Barbara project that allows operators to provide an EC2-compatible API to their own cluster. While there are a number of other efforts to create a standard, this is definitely a step towards the adoption of the EC2 API as a de facto standard. To the best of our knowledge, this makes EC2's API the first one with a second implementation -- an open source implementation, to boot.
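
To illustrate what API compatibility buys in practice, here is a rough sketch of pointing boto, a Python EC2 client library, at a private Eucalyptus front end instead of Amazon. The hostname and credentials are placeholders, and the port and path are the Eucalyptus defaults as we understand them; the exact connection options may vary across boto and Eucalyptus versions.

```python
import boto
from boto.ec2.regioninfo import RegionInfo

# Describe the private cloud's EC2-compatible endpoint.
euca = RegionInfo(name="eucalyptus", endpoint="cloud.example.internal")

conn = boto.connect_ec2(
    aws_access_key_id="YOUR-EUCA-ACCESS-KEY",
    aws_secret_access_key="YOUR-EUCA-SECRET-KEY",
    region=euca,
    is_secure=False,
    port=8773,                     # Eucalyptus' default front-end port
    path="/services/Eucalyptus",
)

# The same client calls work against Amazon EC2 or a Eucalyptus cluster.
print(conn.get_all_images())
```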

Monday, February 23, 2009

IBM Software Available Pay-As-You-Go on EC2

Just before we released our paper, IBM and Amazon made a very interesting announcement: IBM software will be available in the cloud with pay-as-you-go licensing. This makes IBM one of the first major enterprise software vendors to provide pay-as-you-go licensing in the cloud. (Another example is Red Hat, which provides supported versions of Red Hat Enterprise Linux, JBoss and other software for a monthly fee plus a by-the-hour fee). In our white paper, we had identified software licensing models as an obstacle to cloud computing, so we are excited to see this development.

The announcement highlights some of the benefits that a software vendor like IBM can get from cloud computing. IBM is letting software developers use their products in the cloud at no additional cost for development purposes (other than a small monthly fee), thus essentially providing a very easy-to-use trial version of the software: spinning up an AMI with, for example, DB2, is much faster and more convenient than working out a trial arrangement with a sales representative and installing the software. This is good for developers because they can focus on integration with the software, and it's good for IBM because it lets more people try their products. For companies that want to do more than development, the software can be used in the cloud with either an existing (longer-term) IBM billing plan or an hourly pay-as-you-go billing plan. Details of the latter are not available yet, but it will be interesting to see what the setup, monthly and hourly fees have to look like for a supported pay-as-you-go offering to make sense.

At Berkeley we believe that, at least in the short term, one of the biggest advantages of cloud computing is ease of experimentation. Before today, one could use a cloud service like EC2 to test out multiple operating systems, machine images with pre-configured open-source software stacks, and large-scale experiments (will my software scale to 100 nodes?). The availability of software like DB2, WebSphere sMash, etc. from IBM means even more prototyping and experimentation is possible without negotiating long-term contracts or having to go through a complicated setup process. This potential for prototyping and experimentation helps both software users and commercial software vendors.

Thursday, February 12, 2009

YouTube Discussion of the Paper

Here's a video of professors Armando Fox, Anthony Joseph, Randy Katz and David Patterson discussing some of the ideas in the paper:

Above the Clouds Released

We've just released our white paper: "Above the Clouds: A Berkeley View of Cloud Computing."

Executive summary:

Cloud Computing, the long-held dream of computing as a utility, has the potential to transform a large part of the IT industry, making software even more attractive as a service and shaping the way IT hardware is designed and purchased. Developers with innovative ideas for new Internet services no longer require the large capital outlays in hardware to deploy their service or the human expense to operate it. They need not be concerned about over-provisioning for a service whose popularity does not meet their predictions, thus wasting costly resources, or under-provisioning for one that becomes wildly popular, thus missing potential customers and revenue. Moreover, companies with large batch-oriented tasks can get results as quickly as their programs can scale, since using 1000 servers for one hour costs no more than using one server for 1000 hours. This elasticity of resources, without paying a premium for large scale, is unprecedented in the history of IT.

Cloud Computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services. The services themselves have long been referred to as Software as a Service (SaaS). The datacenter hardware and software is what we will call a Cloud. When a Cloud is made available in a pay-as-you-go manner to the general public, we call it a Public Cloud; the service being sold is Utility Computing. We use the term Private Cloud to refer to internal datacenters of a business or other organization, not made available to the general public. Thus, Cloud Computing is the sum of SaaS and Utility Computing, but does not include Private Clouds. People can be users or providers of SaaS, or users or providers of Utility Computing. We focus on SaaS Providers (Cloud Users) and Cloud Providers, which have received less attention than SaaS Users.

From a hardware point of view, three aspects are new in Cloud Computing:

  1. The illusion of infinite computing resources available on demand, thereby eliminating the need for Cloud Computing users to plan far ahead for provisioning.
  2. The elimination of an up-front commitment by Cloud users, thereby allowing companies to start small and increase hardware resources only when there is an increase in their needs.
  3. The ability to pay for use of computing resources on a short-term basis as needed (e.g., processors by the hour and storage by the day) and release them as needed, thereby rewarding conservation by letting machines and storage go when they are no longer useful.
We argue that the construction and operation of extremely large-scale, commodity-computer datacenters at low-cost locations was the key necessary enabler of Cloud Computing, for they uncovered the factors of 5 to 7 decrease in cost of electricity, network bandwidth, operations, software, and hardware available at these very large economies of scale. These factors, combined with statistical multiplexing to increase utilization compared to a private cloud, meant that cloud computing could offer services below the costs of a medium-sized datacenter and yet still make a good profit.

Any application needs a model of computation, a model of storage, and a model of communication. The statistical multiplexing necessary to achieve elasticity and the illusion of infinite capacity requires each of these resources to be virtualized to hide the implementation of how they are multiplexed and shared. Our view is that different utility computing offerings will be distinguished based on the level of abstraction presented to the programmer and the level of management of the resources.

Amazon EC2 is at one end of the spectrum. An EC2 instance looks much like physical hardware, and users can control nearly the entire software stack, from the kernel upwards. This low level makes it inherently difficult for Amazon to offer automatic scalability and failover, because the semantics associated with replication and other state management issues are highly application-dependent. At the other extreme of the spectrum are application domain-specific platforms such as Google AppEngine. AppEngine is targeted exclusively at traditional web applications, enforcing an application structure of clean separation between a stateless computation tier and a stateful storage tier. AppEngine's impressive automatic scaling and high-availability mechanisms, and the proprietary MegaStore data storage available to AppEngine applications, all rely on these constraints. Applications for Microsoft's Azure are written using the .NET libraries, and compiled to the Common Language Runtime, a language-independent managed environment. Thus, Azure is intermediate between application frameworks like AppEngine and hardware virtual machines like EC2.

When is Utility Computing preferable to running a Private Cloud? A first case is when demand for a service varies with time. Provisioning a data center for the peak load it must sustain a few days per month, for example, leads to underutilization at other times. Instead, Cloud Computing lets an organization pay by the hour for computing resources, potentially leading to cost savings even if the hourly rate to rent a machine from a cloud provider is higher than the rate to own one. A second case is when demand is unknown in advance. For example, a web startup will need to support a spike in demand when it becomes popular, followed potentially by a reduction once some of the visitors turn away. Finally, organizations that perform batch analytics can use the "cost associativity" of cloud computing to finish computations faster: using 1000 EC2 machines for 1 hour costs the same as using 1 machine for 1000 hours.

For the first case of a web business with varying demand over time and revenue proportional to user hours, we have captured the tradeoff in the equation below.
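
The tradeoff, reproduced here from the white paper in LaTeX notation, is:

\[
\mathit{UserHours}_{\mathrm{cloud}} \times (\mathit{revenue} - \mathit{Cost}_{\mathrm{cloud}}) \;\ge\; \mathit{UserHours}_{\mathrm{datacenter}} \times \left(\mathit{revenue} - \frac{\mathit{Cost}_{\mathrm{datacenter}}}{\mathit{Utilization}}\right)
\]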

The left-hand side multiplies the net revenue per user-hour by the number of user-hours, giving the expected profit from using Cloud Computing. The right-hand side performs the same calculation for a fixed-capacity datacenter by factoring in the average utilization, including nonpeak workloads, of the datacenter. Whichever side is greater represents the opportunity for higher profit.

The table below previews our ranked list of critical obstacles to growth of Cloud Computing; the full discussion is in Section 7 of our paper. The first three concern adoption, the next five affect growth, and the last two are policy and business obstacles. Each obstacle is paired with an opportunity, ranging from product development to research projects, which can overcome that obstacle.

We predict Cloud Computing will grow, so developers should take it into account. All levels of the software stack should aim for horizontal scalability across virtual machines over efficiency on a single VM. In addition:

  • Applications Software needs to scale down rapidly as well as scale up, which is a new requirement. Such software also needs a pay-for-use licensing model to match the needs of Cloud Computing.
  • Infrastructure Software needs to be aware that it is no longer running on bare metal but on VMs. Moreover, it needs to have billing built in from the beginning.
  • Hardware Systems should be designed at the scale of a container (at least a dozen racks), which will be the minimum purchase size. Cost of operation will match performance and cost of purchase in importance, rewarding energy proportionality by, for example, putting idle portions of the memory, disk, and network into low-power mode. Processors should work well with VMs, flash memory should be added to the memory hierarchy, and LAN switches and WAN routers must improve in bandwidth and cost.
Table: Quick Preview of Top 10 Obstacles to and Opportunities for Growth of Cloud Computing.
 # | Obstacle                               | Opportunity
 1 | Availability of Service                | Use Multiple Cloud Providers; Use Elasticity to Prevent DDOS
 2 | Data Lock-In                           | Standardize APIs; Compatible SW to enable Surge Computing
 3 | Data Confidentiality and Auditability  | Deploy Encryption, VLANs, Firewalls; Geographical Data Storage
 4 | Data Transfer Bottlenecks              | FedExing Disks; Data Backup/Archival; Higher BW Switches
 5 | Performance Unpredictability           | Improved VM Support; Flash Memory; Gang Schedule VMs
 6 | Scalable Storage                       | Invent Scalable Store
 7 | Bugs in Large Distributed Systems      | Invent Debugger that relies on Distributed VMs
 8 | Scaling Quickly                        | Invent Auto-Scaler that relies on ML; Snapshots for Conservation
 9 | Reputation Fate Sharing                | Offer reputation-guarding services like those for email
10 | Software Licensing                     | Pay-for-use licenses; Bulk use sales

Thursday, January 22, 2009

Welcome!

This is the first post of what is going to become a group blog about cloud computing, Internet datacenters, and related issues. The authors are members of the UC Berkeley RAD Lab.