Having spent my career in the IT world, where I’ve participated in the design of high-level services, including CDN-related technologies, I now find myself helping to make a case for why network operators should consider CDNs as a core element of their network infrastructure. One of the factors that make this interesting is a difference in perspective between the IT view of the world and the operator world-view. I don’t claim to be unique in witnessing this clash of perspectives since it’s happening more generally as network operators consider adopting cloud technologies, but I would claim that CDNs are at the bleeding edge of traditional IT technology pushing deep into operator access networks.
In trying to put my finger on whether there’s something fundamentally different in how these two communities build and operate systems, I keep coming back to the following distinction. Network operators think in terms of appliances and devices, and how they can be architected to build a network, whereas the IT perspective decouples hardware and software (treating the former as commodity) and focuses on how the software can be architected to provide a global service. This device/appliance versus software/service distinction then permeates the language: one side talks about managing individual devices, the other about managing the service as a whole; one about appliance performance, the other about aggregate service performance; one about device reliability, the other about service-level reliability. Of course network operators also think about network-wide behavior, and IT people also think about per-server behavior, but their respective viewpoints start from opposite ends.
So does it matter, or is it a distinction without a difference? Here’s where I think it matters. If you are focused on appliances or devices, then you would naturally equate the appliance with the highest performance and highest reliability rating with the best-of-breed. On the flip side, if you are focused on service-wide behavior, then you equate best-of-breed with the software approach that scales to the best aggregate performance and offers the best service-level reliability, independent of how fast/reliable each individual server is. In fact, the more a software service is able to extract good performance and reliability out of slow/unreliable servers, the better.
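To make the reliability point concrete, here is a rough back-of-the-envelope sketch. The availability figures are illustrative assumptions, not measurements from any real appliance or server: a single highly reliable box is still a single point of failure, while a cluster of individually less reliable servers only suffers a total outage when every node fails at once.

```python
# Hedged sketch: the availability numbers below are assumptions chosen
# purely for illustration, not vendor specs or measured values.

appliance_availability = 0.999   # assumed: one high-end appliance
server_availability = 0.99       # assumed: one cheap commodity server
n = 4                            # servers in the cluster

# Single appliance: the site is down whenever the appliance is down.
p_site_down_appliance = 1 - appliance_availability   # ~0.001

# Four-server cluster: the site is only *fully* down when all four
# nodes fail together (assuming independent failures); a single node
# failure merely shaves off 1/N of the aggregate capacity.
p_site_down_cluster = (1 - server_availability) ** n  # ~1e-8

print(p_site_down_appliance, p_site_down_cluster)
```

Even with each commodity server assumed to be an order of magnitude less reliable than the appliance, the cluster’s total-outage probability comes out several orders of magnitude lower, which is the sense in which good software can extract reliability from unreliable parts.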
To see how this plays out in practice, consider the following simple scenario. Suppose you need to support 10Gbps of customer-facing content delivery bandwidth, and you have the option of either a single 10Gbps appliance or a cluster of four commodity servers running clustering software that delivers 10Gbps in aggregate. Which is the better choice? (To simplify the story, assume the same underlying processor is used in both options, and that the software throttles each box to 2.5Gbps in the latter case.) From a cost/performance perspective, both options offer the same performance, but the latter incurs a modest incremental cost for the extra hardware and the corresponding rack space. However, the cluster-based solution offers two distinct advantages: (1) it is easier to incrementally increase the aggregate capacity, since opening up a bandwidth throttle is easier than installing another appliance; and (2) a node failure results in a loss of only 1/Nth of the aggregate capacity rather than taking down the whole site. (To avoid the latter, many operators adopt a 1+1 redundancy strategy, installing two 10Gbps appliances and making the total cost of ownership substantially higher.)
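The arithmetic behind the second advantage, and the redundancy cost it implies, can be sketched in a few lines. The unit costs below are purely hypothetical placeholders (the source quotes no prices); the point is the ratio between the two strategies, not the absolute figures.

```python
# Hedged sketch: unit costs are made-up placeholders, not real prices.
APPLIANCE_COST = 10.0   # hypothetical cost of one 10Gbps appliance
SERVER_COST = 3.0       # hypothetical cost of one 2.5Gbps commodity server
N = 4                   # servers in the cluster

# Fraction of the 10Gbps site lost when a single box fails:
loss_on_failure_appliance = 1.0     # the whole site goes dark
loss_on_failure_cluster = 1.0 / N   # only 1/N = 25% of aggregate capacity

# Cost of surviving one failure at full capacity: the appliance route
# needs 1+1 redundancy (a second full appliance), while one way the
# cluster could hedge the same risk is a single spare server (N+1).
cost_appliance_1plus1 = 2 * APPLIANCE_COST    # 20.0 units
cost_cluster_nplus1 = (N + 1) * SERVER_COST   # 15.0 units

print(loss_on_failure_cluster, cost_appliance_1plus1, cost_cluster_nplus1)
```

With these placeholder numbers the cluster already costs a bit more than the single appliance (12 vs. 10 units), matching the “modest incremental cost” above, yet full failure protection is still cheaper than doubling up on appliances.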
Note that these advantages are amplified if the software is deployed on virtual machines rather than physical machines, as would be the case if the operator’s underlying infrastructure were cloud-based. Running on such a platform means faster provisioning and re-provisioning as workloads change, quicker time-to-market for new software services, and greater capacity to absorb the failure of individual hardware devices.