With an assortment of open source proxy caches readily available (e.g., NGINX, Varnish, ATS), it isn’t surprising to hear network operators ask about the value of commercial CDN solutions. After all, how hard can it be to build a CDN service by deploying a set of caches throughout your network? Is that any different than providing an IP packet service by deploying a set of routers throughout the network?
Setting aside the relative strengths and weaknesses of individual proxies, as well as the open source versus commercial product debate—and given the rate of feature requests we see at Verivue, the latter is no small consideration—this is a fair question. In what way is a CDN more than a distributed set of caches?
The obvious answer is that a CDN includes several other components: request routers that redirect user requests to the best cache, and a traffic analytics facility that aggregates, analyzes, archives, and visualizes the traffic data generated by individual caches. But there is a more important answer that can be summarized in one word: scalability. The key is to organize and manage a set of caches in a way that scales to form a content delivery service… a CDN. In other words, caches are a key building block, a brick so to speak, but a CDN is a house constructed from a large stack of bricks arranged according to a sound architectural design and engineering practices. Without the right design, you have a pile of bricks, not a house.
There are two dimensions to scalability. The first and most obvious is performance. If you have 1000 caching nodes that can each deliver, say, 10Gbps, then a properly designed CDN ought to be able to deliver 10Tbps of aggregate performance across a wide spectrum of workloads. The key challenge is distributing the workload over the available caches in a way that avoids hotspots that limit aggregate performance. On this point, there is a fundamental balancing act in selecting a cache that is simultaneously (a) close to the end user, (b) not overloaded, and (c) likely to have a copy of the desired object. In other words, the challenge is to balance network proximity, load balancing, and cache locality in the face of variable request workloads and object popularity distributions.
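To make the balancing act concrete, here is a purely illustrative sketch of how a request router might score candidate caches. The weighting scheme, normalization, and cache names are hypothetical, not drawn from any particular CDN; the point is only that proximity, load, and locality all enter the decision.

```python
# Hypothetical scoring of candidate caches by a request router.
# Weights and normalization are illustrative only.
def score(client_rtt_ms, load_fraction, has_object,
          w_proximity=1.0, w_load=1.0, w_locality=1.0):
    """Lower is better: near the client, lightly loaded, object already cached."""
    proximity_penalty = client_rtt_ms / 100.0      # normalize RTT to ~0..1
    load_penalty = load_fraction                   # 0.0 (idle) .. 1.0 (saturated)
    locality_penalty = 0.0 if has_object else 1.0  # a miss costs a parent/origin fetch
    return (w_proximity * proximity_penalty
            + w_load * load_penalty
            + w_locality * locality_penalty)

# (name, RTT to client in ms, current load, object cached?)
candidates = [
    ("cache-a", 12, 0.90, True),    # close, busy, has the object
    ("cache-b", 35, 0.20, False),   # farther, idle, would miss
]
best = min(candidates, key=lambda c: score(c[1], c[2], c[3]))
print(best[0])
```

Tilting the weights toward locality or toward load reproduces exactly the two failure modes described next.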
A superficially designed CDN will easily find itself in one of two sub-optimal situations: either a popular object (say, a hit movie) is delivered by too few caches while other caches sit idle, reducing the aggregate throughput of the CDN, or an unpopular object is unnecessarily replicated across multiple caches, displacing objects that would make more effective use of the cache space. The first situation results from favoring cache locality over a balanced load; the second results from favoring a balanced load over cache locality. And in either case, decisions are made with less-than-perfect information about the current state of the system.
Moreover, striking the right balance is not a matter of having a sufficiently clever algorithm—the underlying problem is NP-complete, that is, there is no known efficient algorithm. Instead, a robust design is a matter of decomposing the problem in just the right way. To see this, consider a critical subset of the larger problem: how to get scalable performance out of a cluster of proxy servers in a single site. One option is to put a load balancer in front of the cluster, but this makes little sense from a cost perspective (the load balancer is a disproportionately expensive element), not to mention that it ignores the role locality plays in cache effectiveness. A second option is to push the problem onto the Request Routing service, but this just moves the problem—it doesn’t solve it. Fortunately, there have been important advances in the design of scalable systems involving the use of consistent hashing to simultaneously distribute load evenly and retain favorable cache locality. The use of such mechanisms is considered an essential best practice in CDN design.
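As a minimal sketch of the idea (not any particular proxy's or vendor's implementation), rendezvous or highest-random-weight hashing is one common form of consistent hashing: every node can independently compute the same owner for a given URL, so requests for an object converge on one cache (preserving locality), distinct URLs spread roughly evenly across the cluster (balancing load), and adding or removing a cache only remaps the objects that cache owned. The cache names below are hypothetical.

```python
import hashlib

def rendezvous_select(url: str, caches: list[str]) -> str:
    """Pick the cache responsible for `url` using rendezvous (HRW) hashing.

    The same URL always maps to the same cache (locality), different URLs
    spread roughly evenly across caches (load), and removing a cache only
    remaps the URLs it owned (consistency).
    """
    def weight(cache: str) -> int:
        digest = hashlib.sha256(f"{cache}|{url}".encode()).hexdigest()
        return int(digest, 16)
    return max(caches, key=weight)

# Example: a hypothetical three-node cluster in a single site.
caches = ["cache-a", "cache-b", "cache-c"]
print(rendezvous_select("https://example.com/videos/title-123/seg42.ts", caches))
```

In a single-site cluster, each proxy can apply the same function to decide whether to serve a request locally or hand it to the owning peer, which sidesteps the need for an expensive load balancer in front of the cluster.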
The second dimension of scalability is operational overhead. Although harder to quantify, this overhead should not grow in proportion to the size of the system: managing 1000 cache nodes should not take 1000 (or even 100) times the effort of managing a single node. This means the configuration uploaded to each cache must be automatically generated from a single CDN-wide specification of how the CDN is to behave: how the individual caches are organized into a caching hierarchy, how request routing is mapped onto that hierarchy, what restrictions are imposed on when and where content can be delivered, how delivery is customized for different end users, and so on. It is simply not practical to treat each cache as an independent appliance, subject to the inevitable operator errors that occur when each appliance is managed as a distinct element.
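To illustrate the idea (with an entirely hypothetical specification format, field names, and cache names), each cache's configuration is rendered mechanically from one CDN-wide description of the caching hierarchy and delivery policy rather than edited by hand on each node:

```python
# Purely illustrative: derive per-cache configuration from one CDN-wide spec.
cdn_spec = {
    "tiers": {
        "edge": {"parent_tier": "regional"},
        "regional": {"parent_tier": None},
    },
    "caches": {
        "edge-nyc-1": {"tier": "edge", "region": "us-east"},
        "regional-us-east-1": {"tier": "regional", "region": "us-east"},
    },
    "delivery_rules": [{"content": "/premium/*", "geo_allow": ["US", "CA"]}],
}

def render_cache_config(name: str) -> dict:
    """Generate the configuration pushed to one cache from the CDN-wide spec."""
    cache = cdn_spec["caches"][name]
    parent_tier = cdn_spec["tiers"][cache["tier"]]["parent_tier"]
    parents = [n for n, c in cdn_spec["caches"].items()
               if c["tier"] == parent_tier and c["region"] == cache["region"]]
    return {
        "name": name,
        "parents": parents,                   # where to fetch on a cache miss
        "rules": cdn_spec["delivery_rules"],  # same delivery policy everywhere
    }

print(render_cache_config("edge-nyc-1"))
```

A change to the hierarchy or the delivery policy is then made once, in the spec, and propagated to every cache, rather than repeated (and mistyped) on each appliance.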
A comprehensive CDN-wide configuration, in turn, depends on a management interface that models the rich set of abstractions and the operational workflow of a full-featured CDN. This dimension is difficult to quantify, but people who have built and operated wide-area network services uniformly agree that the time spent incorporating a new feature into the configuration management system often dwarfs the time needed to implement the feature itself; without that investment, however, the system quickly becomes unmanageable.
The challenge of building distributed and parallel systems from component parts has always been how to make the system scalable. CDNs are no different.