Trends in Cloud Storage

Here’s a modest insight. When designing a cloud storage system, there is value in decoupling the system’s archival capacity (its ability to persistently store large volumes of data) from the system’s delivery capacity (its ability to deliver popular objects to a large and growing number of users). The archival half need not support scalable delivery performance, and likewise, the delivery half need not guarantee persistence.

In practical terms, this translates into an end-to-end storage solution that includes a high-capacity and highly resilient object store in the data center, augmented with caches throughout the network to take advantage of aggregated delivery bandwidth from edge sites. This is similar to what Amazon offers today: S3 implements a resilient object store in the data center, augmented with CloudFront to scale delivery through a distributed set of edge caches.

The object store runs in the data center, ingests data from some upstream source (e.g., video prepared using a Content Management System), and delivers it to users via edge caches. The ingest interface is push-based and likely includes one or more popular APIs (e.g., FTP, WebDAV, S3), while the delivery interface is pull-based and corresponds to HTTP GET requests from the CDN.
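The split between push-based ingest and pull-based delivery can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a description of any real product: the class names ObjectStore and EdgeCache, their methods, and the in-memory dictionaries are all hypothetical stand-ins for an S3-style upload API and for HTTP GET requests from the CDN.

```python
from typing import Dict


class ObjectStore:
    """Archival half (hypothetical): persists whole objects in the data center."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self.origin_fetches = 0  # number of GETs served by the data center

    def put(self, key: str, data: bytes) -> None:
        # Push-based ingest; stands in for an FTP, WebDAV, or S3-style upload.
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        # Pull-based delivery; stands in for an HTTP GET issued by the CDN.
        self.origin_fetches += 1
        return self._objects[key]


class EdgeCache:
    """Delivery half (hypothetical): caches objects, guarantees no persistence."""

    def __init__(self, origin: ObjectStore) -> None:
        self._origin = origin
        self._cache: Dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        # On a miss, pull the object from the origin once, then serve locally.
        if key not in self._cache:
            self._cache[key] = self._origin.get(key)
        return self._cache[key]

    def evict_all(self) -> None:
        # Losing the cache is harmless: the object store still holds the data.
        self._cache.clear()


if __name__ == "__main__":
    store = ObjectStore()
    store.put("video/intro.mp4", b"example bytes")
    edge = EdgeCache(store)
    for _ in range(1000):
        edge.get("video/intro.mp4")
    print(store.origin_fetches)
```

In this sketch the edge cache absorbs repeated requests for a popular object, while the object store alone is responsible for keeping the data. Evicting the cache costs one extra origin fetch but loses nothing.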

In past articles I have written extensively about how to architect a CDN that can be deployed throughout an operator network, claiming that a well-designed CDN should be agnostic as to the source of content. But it is increasingly the case that content delivered over a CDN is sourced from a data center as part of a cloud-based storage solution. This raises the question: is there anything we can learn by looking at storage from such an end-to-end perspective?

I see three points worth making, although by way of a disclaimer, I’m starting from the perspective of the CDN and looking back at what I’d like to see from a data-center-based object store. The way I see it, though, there’s more value in storing data if you have a good approach to distributing it to the users who want to access it.

First, it makes little sense to build an object store using traditional SAN or NAS technology. This is for two reasons. One has to do with providing the right level of abstraction. In this case, the CDN running at the network edge is perfectly capable of dealing with a large set of objects, meaning there is no value in managing those objects with full file system semantics (i.e., NAS is a bad fit). Similarly, the storage system needs to understand complete objects and not just blocks (i.e., SAN is not a good fit). The second reason is related to cost. It is simply more cost-effective to build a scalable object store from commodity components. This argument is well understood: it relies on achieving scalable performance and resiliency in software rather than in specialized hardware.

Second, a general-purpose CDN that is able to deliver a wide range of content—from software updates to video, from large files to small objects, from live (linear) streams to on-demand video, from over-the-top to managed video—should not be handicapped by an object store that isn’t equally flexible. In particular, it is important that the ingest function be low-latency and redundant, so it is possible to deliver both on-demand and live video. (Even live video needs to be staged through an object store to support time shifting.)

Third, it is not practical to achieve scalable delivery from a data center. Data centers typically provide massive internal bandwidth, making it possible to build scalable storage from commodity servers, but Internet-facing bandwidth is generally limited. This is just repeating the argument in favor of delivering content via a CDN—scalable delivery is best achieved from the edge.
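The bandwidth argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name origin_egress_gbps and the figures in the usage example are assumptions chosen for illustration, not measurements. It assumes that every request missing at the edge is served from the data center, so the Internet-facing bandwidth the data center needs is the total demand multiplied by the edge miss ratio.

```python
def origin_egress_gbps(total_demand_gbps: float, edge_hit_ratio: float) -> float:
    """Bandwidth the data center must serve, given the edge cache hit ratio.

    Assumes every edge miss is fetched from the data center (hypothetical model).
    """
    if not 0.0 <= edge_hit_ratio <= 1.0:
        raise ValueError("edge_hit_ratio must be between 0 and 1")
    return total_demand_gbps * (1.0 - edge_hit_ratio)


if __name__ == "__main__":
    # Hypothetical figures: 200 Gbps of user demand, 75% served from edge caches.
    print(origin_egress_gbps(200.0, 0.75))
```

Under this simple model, the higher the hit ratio at the edge, the less the limited Internet-facing link of the data center matters.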

About Larry Peterson

As Chief Scientist, Larry Peterson provides technical leadership and expertise for research and development projects. He is also the Robert E. Kahn Professor of Computer Science at Princeton University, where he served as Chairman of the Computer Science Department from 2003 to 2009. He also serves as Director of the PlanetLab Consortium, a collection of academic, industrial, and government institutions cooperating to design and evaluate next-generation network services and architectures. Larry has served as Editor-in-Chief of the ACM Transactions on Computer Systems, has been on the Editorial Board for the IEEE/ACM Transactions on Networking and the IEEE Journal on Selected Areas in Communications, and is the co-author of the best-selling networking textbook Computer Networks: A Systems Approach. He is a member of the National Academy of Engineering, a Fellow of the ACM and the IEEE, and the 2010 recipient of the IEEE Koji Kobayashi Computers and Communications Award. He received his Ph.D. degree from Purdue University in 1985.
