A Tale of Two ICPs

Jim Theodoras
[Image: two-way sign]

In the last few years, data center operators and Internet content providers (ICPs) have grown from being just footnotes on analysts’ charts to dominating transport bandwidth consumption. A pervasive question in the press this year has been how much bandwidth the new kings of consumption really need. One prognosticator will claim they need as much as they can get, and would buy a petabit per second if it were available today, while another industry sage proffers that a single terabit per second will suffice for the foreseeable future. So who’s right? It turns out, both.

To illustrate the point, let’s look at two ICPs in the exact same line of business, similar in size, global reach and number of subscribers. Both have hundreds of thousands of servers running in leaf-and-spine configurations. Computing clusters are formed by grouping these servers and the switches between them. These clusters used to be physical constructs, where one grouping of hardware made up the cluster. These days, however, clusters are defined virtually across a pool of hardware resources, and the virtual cluster and the physical hardware are independent of each other. The performance of the virtual cluster, though, will still depend upon the physical properties of the underlying hardware. For example, a cluster within a single physical rack will outperform a cluster distributed across different buildings, due to the added latency that distance always brings. Being virtual, the computing cluster can be as small or as large as needed, and is often “elastic”, shrinking and growing with the load.
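To put rough numbers on that latency penalty, here is a quick back-of-the-envelope sketch. The roughly 5 microseconds per kilometre figure for light in fibre is a standard rule of thumb; the distances are purely illustrative, not any particular operator’s layout.

```python
# Rule of thumb: light in optical fibre covers roughly 200,000 km/s,
# i.e. about 5 microseconds of one-way propagation delay per kilometre.
FIBER_DELAY_US_PER_KM = 5.0

def round_trip_us(distance_km: float) -> float:
    """Very rough round-trip propagation delay, ignoring switching,
    queuing and serialization delays."""
    return 2 * distance_km * FIBER_DELAY_US_PER_KM

# Purely illustrative distances
for label, km in [("same rack", 0.01),
                  ("same building", 0.2),
                  ("buildings 10 km apart", 10.0)]:
    print(f"{label:22} ~{round_trip_us(km):7.2f} us round trip")
```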

ICP #1

[Diagram: ICP #1 data center interconnect]

ICP #1 has defined its maximum cluster size as whatever fits within the walls of a single building. It uses replication and load balancing algorithms to distribute its work among its data centers globally, something that virtual machines are very good at. The maximum bandwidth it needs is the sum of the vertical (north-south) traffic in and out of each data center plus the load balancing traffic, which is a percentage of the total traffic bouncing around inside the data center. If I had to take a swing at the numbers, I would guess around 1Tbit/s of transport is needed today between data centers.
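Sketching that arithmetic in a few lines of Python (the individual traffic figures below are made-up placeholders, not ICP #1’s real numbers; only the roughly 1Tbit/s total echoes the guess above):

```python
# Hypothetical traffic figures, purely for illustration.
vertical_tbps = 0.8    # north-south traffic in and out of the data center
internal_tbps = 4.0    # total traffic bouncing around inside the data center
lb_fraction = 0.05     # share of internal traffic that is load-balancing chatter

# ICP #1: the interconnect only carries north-south plus load-balancing traffic.
icp1_dci_tbps = vertical_tbps + lb_fraction * internal_tbps
print(f"ICP #1 interconnect need: ~{icp1_dci_tbps:.1f} Tbit/s")  # ~1.0 Tbit/s
```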

ICP #2 

[Diagram: ICP #2 data center interconnect]

ICP #2 is very similar. The main difference is that it does not limit its cluster size to within the walls of its buildings. Instead, it tries to make its pool of geographically distributed resources behave as a single global compute cluster. Why? It turns out that running a large job on a single 10,000-server cluster is much more efficient than running 100 smaller jobs on 100-server clusters. That end goal significantly changes the math, to say the least. The maximum bandwidth needed is now the sum of the vertical traffic in and out, plus the load balancing traffic, PLUS all the horizontal (east-west) traffic flowing within the cluster, which now has to cross the data center interconnect.
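Extending the same back-of-the-envelope sketch, the horizontal term swamps everything else. The figures are again illustrative guesses, chosen only to show the shape of the math, not anyone’s measured traffic:

```python
# Same hypothetical north-south and load-balancing figures as before.
vertical_tbps = 0.8      # north-south traffic in and out of the data center
internal_tbps = 4.0      # total traffic inside the data center
lb_fraction = 0.05       # share of internal traffic that is load-balancing chatter
horizontal_tbps = 200.0  # east-west traffic inside the cluster, which now has to
                         # cross the interconnect (illustrative guess)

# ICP #2: the interconnect also carries the cluster's east-west traffic.
icp2_dci_tbps = vertical_tbps + lb_fraction * internal_tbps + horizontal_tbps
print(f"ICP #2 interconnect need: ~{icp2_dci_tbps:.0f} Tbit/s")  # hundreds of Tbit/s
```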

As mentioned earlier, when distributing a virtual compute cloud across separate equipment pools, overall performance will be impacted by physical constraints. Another way to look at the architecture is to think of the compute cloud as one large parallel computer. Amdahl’s law (sometimes called his “lesser known” law, or perhaps his “better known” one, depending on your field of study) states that in a balanced parallel computer, interconnect bandwidth must keep pace with compute. In his time, that meant 1Mbit/s of I/O for every 1MHz of compute; today that might be scaled up to 100Gbit/s of interconnect for every 100GHz of compute. Each compute slice might be 32-64 cores running at 2.5GHz, roughly 100GHz of compute and therefore roughly 100Gbit/s of I/O per server. Even a relatively small cluster of 10,000 servers would then need 1Pbit/s of interconnect bandwidth to keep from throttling the performance of the cluster. Suddenly the push for 400Gbit/s per WDM channel and beyond, and for hundreds of Tbit/s of data center interconnect (DCI), makes a bit more sense.
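That arithmetic fits in a few lines. The core count and clock rate are the rough figures cited above, and the 1 bit/s of I/O per Hz of compute ratio is the scaled-up rule of thumb:

```python
# Scaled-up Amdahl rule of thumb: about 1 bit/s of I/O per Hz of compute.
cores_per_server = 40    # somewhere in the 32-64 core range cited above
clock_ghz = 2.5          # per-core clock rate
servers = 10_000         # a "relatively small" cluster

compute_ghz = cores_per_server * clock_ghz                   # ~100 GHz of compute per server
io_gbps_per_server = compute_ghz                             # -> ~100 Gbit/s of I/O per server
cluster_io_pbps = servers * io_gbps_per_server / 1_000_000   # Gbit/s -> Pbit/s

print(f"~{io_gbps_per_server:.0f} Gbit/s per server, "
      f"~{cluster_io_pbps:.0f} Pbit/s for the cluster")
```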

It’s a misconception that the growth in cloud services is driving the recent boom in ICP bandwidth consumption; even the phenomenal growth in cloud traffic cannot account for the amount of bandwidth some ICPs are installing. The reality is that we’re seeing an architectural shift in the way data center compute is being orchestrated, and this is adding significantly to baseline cloud growth. Getting a global data center network to behave as a single global compute cloud requires an astronomical amount of interconnect bandwidth, and DCI vendors are more than happy to oblige.
