Thursday, June 11, 2009

clouds and peer-to-peer

We've been asked a few times about the relationship between clouds and peer-to-peer systems, and we wanted to take this opportunity to respond.

We differentiate between peer-to-peer (p2p) techniques and p2p systems. The former refers to a set of techniques for building self-organizing distributed systems. These techniques are often useful in building datacenter-scale applications, including datacenter-scale applications that are hosted in the cloud. For instance, Amazon's Dynamo datastore relies on a structured peer-to-peer overlay, as do several other key-value stores.

People often use "P2P" to refer to systems that use these techniques to organize large numbers of cooperating end hosts (peers), such as personal computers and set-top boxes. In these systems, most peers necessarily communicate over the Internet rather than a local area network (LAN). To date, the most successful peer-to-peer applications have been file sharing (e.g., Napster, BitTorrent, eDonkey), communication (Skype), and embarrassingly parallel computations, such as SETI@home and BOINC projects.

The main appeal of p2p systems is that their resources are often "free", coming from individuals who volunteer their machines' CPUs, storage, and bandwidth. Offsetting this, we see two key limitations of p2p systems.

First, p2p systems lack a centralized administrative entity that owns and controls the peer resources. This makes it hard to ensure high levels of availability and performance. Users are free to disable the peer-to-peer application or reboot their machines at any time, so a great degree of redundancy is required. As a result, p2p systems are a poor fit for applications requiring reliability, such as web hosting or other sorts of server applications.
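To make the redundancy point concrete, here is a minimal sketch of how many replicas a service would need to reach a target availability. It assumes peers fail independently and that each peer is online with the same probability; both are simplifying assumptions of ours, and the function name and the example numbers are hypothetical, not measurements.

```python
def replicas_needed(peer_availability: float, target_availability: float) -> int:
    """Smallest n such that at least one of n replicas is online with
    probability >= target_availability.

    Assumes (hypothetically) that peers are independent and identical, so
    the chance that all n replicas are offline is (1 - p) ** n.
    """
    if not 0.0 < peer_availability < 1.0:
        raise ValueError("peer_availability must be strictly between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be strictly between 0 and 1")

    offline = 1.0 - peer_availability
    allowed_failure = 1.0 - target_availability
    n = 1
    while offline ** n > allowed_failure:
        n += 1
    return n


if __name__ == "__main__":
    # Illustrative values only: a home machine that is online 30% of the
    # time versus a managed server that is online 99% of the time.
    print(replicas_needed(0.30, 0.999))  # 20
    print(replicas_needed(0.99, 0.999))  # 2
```

Under these assumptions, the volunteer machines need ten times as many replicas as the managed servers to reach the same "three nines" target, and correlated failures (for example, many users shutting down at night) would only widen the gap.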

This decentralized control also limits trust. Users can inspect the memory and storage of a running application, meaning that applications cannot safely store confidential information unencrypted on peers. Nor can the application developer count on any particular quantity of resources being dedicated on a machine, or on any particular reliability of storage. These obstacles have made it difficult to monetize p2p services. It should come as no surprise that, so far, the most successful p2p applications have been free, with Skype being a notable exception.

Second, the bandwidth between any two peers in the wide area is two or three orders of magnitude lower than between two nodes in a datacenter. Residential connectivity in the US is typically 1Mbps or less, while in a datacenter a node can often push up to 1Gbps. This makes p2p systems inappropriate for data-intensive applications (e.g., data mining, indexing, search), which account for a large chunk of the workload in today's datacenters.
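A quick back-of-the-envelope calculation shows what this gap means in practice. The sketch below uses the 1Mbps and 1Gbps figures from the paragraph above; the 100 GB dataset size is a hypothetical value we picked for illustration, and protocol overhead is ignored.

```python
RESIDENTIAL_BPS = 1_000_000      # 1 Mbps, the residential figure quoted above
DATACENTER_BPS = 1_000_000_000   # 1 Gbps, the datacenter figure quoted above


def transfer_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Ideal time to move size_bytes over a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    dataset_bytes = 100 * 10**9  # hypothetical 100 GB input to a data-mining job

    slow = transfer_seconds(dataset_bytes, RESIDENTIAL_BPS)
    fast = transfer_seconds(dataset_bytes, DATACENTER_BPS)

    print(f"residential link: {slow / 86_400:.1f} days")
    print(f"datacenter link:  {fast / 60:.1f} minutes")
    print(f"ratio: {slow / fast:.0f}x")
```

With these numbers, shipping the dataset between two peers takes over nine days, versus about thirteen minutes between two datacenter nodes, which is the three-orders-of-magnitude difference at the upper end of the range above.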

Recently, there have been promising efforts to address some of the limitations of p2p systems by building hybrid systems. The most popular examples are data delivery systems, such as Pando and Abacast, where p2p systems are complemented by traditional content distribution networks (CDNs). CDNs are used to ensure availability and performance when the data is not found at peers, or when peers do not have enough aggregate bandwidth to sustain the demand.

In another development, cable operators and video distributors have started to experiment with turning set-top boxes into peers. The advantage of set-top boxes is that, unlike personal computers, they are always on, and they can be managed remotely much more easily. Examples in this category are Vudu and the European NanoDataCenter effort. However, to date, the applications of choice in these efforts have remained file sharing and video delivery.

Datacenter clouds and p2p systems are not substitutes for each other. Widely distributed peers may have more aggregate resources, but they lack the reliability and high interconnection bandwidth offered by datacenters. As a result, cloud hosting and p2p systems complement each other. We expect that in the future more and more applications will span both the cloud and the edge. Examples of such applications are:

  • Data and video delivery. For highly popular content, p2p distribution can eliminate network bottlenecks by pushing the distribution to the edge. As an example, consider a live event such as the presidential inauguration. With traditional CDNs, every viewer on a local area network would receive an independent stream, which could choke the incoming link. With p2p, only one viewer on the network needs to receive the stream; the stream can then be redistributed to other viewers using p2p techniques.
  • Distributed applications that require a high level of interactivity, such as massively multiplayer games, video conferences, and IP telephony. To minimize latency, peers in these applications communicate with each other directly, rather than through a central server.
  • Applications that require massive computation per user, such as video editing and real-time translation. Such applications may take advantage of the vast computational resources of the user's machine. Today, virtually every notebook and personal computer has a multi-core processor whose cores are mostly unused. Proposals such as Google's Native Client aim to tap into these resources.
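The live-event example in the first bullet can be sketched numerically. The model below is a deliberate simplification of ours: it assumes every viewer on the local network watches the same stream at the same bitrate, and that with p2p exactly one copy crosses the incoming link. The function name, the viewer count, the stream bitrate, and the link capacity are all hypothetical.

```python
def incoming_link_load_bps(viewers: int, stream_bps: int, use_p2p: bool) -> int:
    """Traffic on a site's incoming link for one live stream.

    Simplified model: with a traditional CDN each viewer pulls an
    independent copy from outside; with p2p one viewer pulls the stream
    and the rest receive it over the local network.
    """
    if viewers < 0:
        raise ValueError("viewers must be non-negative")
    if viewers == 0:
        return 0
    external_copies = 1 if use_p2p else viewers
    return external_copies * stream_bps


if __name__ == "__main__":
    link_capacity_bps = 100_000_000  # hypothetical 100 Mbps incoming link
    viewers = 200                    # hypothetical number of local viewers
    stream_bps = 1_000_000           # hypothetical 1 Mbps video stream

    for use_p2p in (False, True):
        load = incoming_link_load_bps(viewers, stream_bps, use_p2p)
        label = "p2p" if use_p2p else "CDN only"
        status = "over capacity" if load > link_capacity_bps else "fits"
        print(f"{label}: {load / 1e6:.0f} Mbps ({status})")
```

In this toy setting the CDN-only approach asks for twice what the incoming link can carry, while the p2p approach uses one percent of it and shifts the remaining traffic onto the local network.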