
Betting against Data Gravity: A Fool's Errand

Par Botes, VP AI Infrastructure, Pure Storage


Introduction

The tech industry seems to have a cyclical fascination with distributed file systems, content distribution, replica management, and global namespaces. These concepts, with varying names, periodically resurface as the "next big thing" in storage. Currently, the buzzword in some areas of systems architecture is “global namespace,” which is making the rounds in IT discussions again—especially in the context of AI—for its promise of seamless data access across geographies.

But is it truly a game-changer for AI, or just another hype cycle in the long history of distributed file systems and content distribution? For business leaders, understanding the evolution of these concepts, the technical realities, and what they mean for enterprise IT strategy is crucial for making strategic storage decisions.

What Is a Global Namespace? It Depends on Who You Ask

Some vendors define a global namespace as global reporting on the information and data types held across storage systems. Others define it as the ability to access data from multiple locations as if it were local. Still others argue that global namespaces essentially mean distributing data to endpoints, a concept previously known as content distribution.

Over the years, global namespaces have taken on many forms. To understand the history of this technology, let’s look at the grandfather of distributed namespaces: the Andrew File System. Developed at Carnegie Mellon University in the mid-1980s, this file system presented itself to clients as a local file system, even when data resided on a different continent.

It was a remarkable achievement for its time. The Andrew File System employed a sophisticated authentication scheme and a complex locking mechanism to ensure consistency and prevent conflicts arising from simultaneous modifications by multiple users. It was not particularly easy to set up or manage, but there were a handful of relatively large installations back in the day.

Data Gravity and the Evolution of the File System

Shortly after the dot-com bust, the Andrew File System's popularity began to decline. This wasn't due to any inherent flaws but rather a combination of increasing complexity and the rapid rise of thin clients. Analyzing data centrally became more efficient than transferring it across networks. Yes, data gravity reared its head as far back as 20 years ago. It turns out we cannot ship data to remote locations as fast as we can create it.

Additionally, new application frameworks made server-side rendering of user interfaces easy to develop, further reducing the need for data to leave centralized data centers.

In the early 2000s, Microsoft entered the distributed file system arena with its Distributed File System (DFS), which later evolved to include replication. Although DFS shipped as a mainstream offering, it wasn't as widely adopted as one would expect from a Microsoft product. Few applications leveraged its capabilities, and while some users undoubtedly appreciated it, DFS didn't significantly impact broader system architectures.

Then, around 15 years ago, stretch clustering emerged. Initially, this technique aimed to enhance availability by placing data stores at synchronous replication distances. However, some vendors offered active-active configurations, enabling data access from both sides. While stretch clustering still exists, its primary use case is now high availability. Outside of specific industrial applications, its popularity has waned despite some vendors' insistence otherwise.

“It's crucial to remember that data, especially large data sets, has gravity. While WANs are significantly faster than they were 20 years ago…data grows faster than Moore’s Law, and Moore’s Law grows much faster than WAN links.”

Object Stores Address the Issue of Scale

Parallel to these developments, object stores and the S3 protocol gained prominence. This approach eliminated the strict requirements of the POSIX standard for read and write behaviors, which was one of the main challenges in making distributed file systems work and scale. 

Object stores offered greater scalability both within and across data centers. The S3 protocol's inherent location independence simplified data access from clients. As more applications are written natively against S3 APIs, object storage has become a dominant force in the cloud and, increasingly, in enterprise data centers.
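As a rough illustration of that location independence, here is a minimal sketch using Python's boto3 client. The endpoint, bucket, and object key are hypothetical stand-ins; any S3-compatible store, in the cloud or on premises, could sit behind them.

# Minimal sketch: the client names data by bucket and key only and never needs to
# know which site, node, or disk actually holds the object. The endpoint, bucket,
# and key below are hypothetical.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.com")

response = s3.get_object(Bucket="training-data", Key="datasets/corpus-001.parquet")
payload = response["Body"].read()
print(f"Fetched {len(payload):,} bytes with no idea where the object physically lives")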

Recently, some vendors have attempted to redefine the concept of a global namespace. Some claim to offer one by presenting views of aggregated file system metadata; however, generating reports from a central index of filenames doesn't truly constitute a global namespace. Others are revisiting the idea of locking authorities and delegated locks (often referred to as leases) combined with various caching strategies. This approach is reminiscent of Lustre's innovations from 15 years ago, albeit with highly polished marketing and little else new under the sun.
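To make the delegated-lock idea concrete, here is a deliberately simplified sketch of a lease: a central authority grants a client the right to cache and modify a path for a bounded time. The names and timeout are illustrative only, not any vendor's actual protocol.

# Simplified sketch of a delegated lock ("lease"). Real implementations are far
# more involved; all names and timeouts here are illustrative.
import time
from dataclasses import dataclass

@dataclass
class Lease:
    path: str
    holder: str
    expires_at: float

    def is_valid(self) -> bool:
        return time.time() < self.expires_at

class LockAuthority:
    """Central locking authority that hands out time-bounded leases on paths."""

    def __init__(self, lease_seconds: float = 30.0):
        self.lease_seconds = lease_seconds
        self.leases: dict[str, Lease] = {}

    def acquire(self, path: str, client_id: str) -> Lease | None:
        current = self.leases.get(path)
        if current and current.is_valid() and current.holder != client_id:
            return None  # another site holds a live lease; wait or serve stale cache
        lease = Lease(path, client_id, time.time() + self.lease_seconds)
        self.leases[path] = lease
        return lease

# While its lease is valid, a client can serve reads and buffer writes from local
# cache without contacting the authority. Once the lease expires, it must revalidate,
# and that round trip is exactly where WAN latency reasserts itself.
authority = LockAuthority()
lease = authority.acquire("/datasets/model-weights.bin", client_id="site-eu-1")
print("lease granted" if lease else "must wait or read from cache")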

Whether this approach will move beyond niche use cases and gain widespread adoption in meaningful applications remains to be seen. I’m a bit skeptical; we still create data much faster than we can transfer it over wide-area network (WAN) links.

Data Gravity May Decide What’s Next

It's crucial to remember that data, especially large data sets, has gravity. While WANs are significantly faster than they were 20 years ago, data volumes have grown at an even faster pace. The rate of data growth far outstrips the expansion of network links. Data grows faster than Moore’s Law, and Moore’s Law grows much faster than WAN links. Even if network links could keep pace with Moore’s Law, at some point the CAP theorem becomes a limit to scale: balancing consistency, availability, and partition tolerance becomes very hard at scale and across inter-site data center links. Some very specific applications, typically read-only or in the content distribution niche, can handle this; most real applications can’t.
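A back-of-envelope calculation makes the gap concrete. The figures below are assumptions chosen for round numbers (a 1 PB data set and a dedicated 10 Gb/s inter-site link), not measurements of any particular environment.

# Back-of-envelope data gravity arithmetic. All figures are illustrative assumptions.
DATASET_BYTES = 1 * 10**15          # a 1 PB data set
WAN_BITS_PER_SECOND = 10 * 10**9    # a dedicated 10 Gb/s inter-site link
UTILIZATION = 0.8                   # realistic sustained utilization

effective_bytes_per_second = WAN_BITS_PER_SECOND * UTILIZATION / 8
transfer_days = DATASET_BYTES / effective_bytes_per_second / 86400
print(f"Moving 1 PB over a 10 Gb/s WAN takes roughly {transfer_days:.1f} days")
# Roughly 11.6 days -- if the data set grows by even a few terabytes a day, the
# remote copy is stale long before the transfer finishes.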

The challenge of moving data to data centers across distances invariably leads to a battle against network limitations. Physics imposes constraints that are difficult to overcome, regardless of what you call your data prefetching technique. While AFS-like global namespaces with old-school opportunistic locking semantics might be suitable for niche applications like data distribution, it's implausible that they will become the dominant computing paradigm.

This doesn’t mean I don't think highly of ideas like replication of data for protection or distributed erasure coding in availability zones for object storage resiliency. These ideas work well and belong in a special category of recovery techniques, but they aren’t exactly what people mean when they say global namespaces.

Content distribution itself is a niche use case, and with the increasing prevalence of dynamically generated content by GPUs, the role of storage in this context diminishes. It is true that GPUs are a scarce commodity and moving data to where a GPU is may be a required Band-Aid for supply chain constraints, but the GPU scarcity will be short-lived enough that it’s unlikely to lead to a meaningful and lasting architectural change. 

One of the most intriguing distributed systems with a compelling namespace in recent years is Google's Spanner. Google, having the advantage of building applications from scratch, addressed some of the most challenging problems in distributed storage. They recognized that many queries could be answered using older data. Consequently, they designed their storage system to be queryable at any point in time. This innovative approach allowed applications to determine whether to wait for the most current data or answer queries with older data. While this is a remarkable technique, only a few companies globally possess the resources to build and maintain such a system, since applications must be modified to interact with it; traditional read/write semantics don’t work in such a system.
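To illustrate the idea, here is a toy sketch of a versioned store in which every write carries a commit timestamp and a reader chooses which point in time to query. It is a conceptual illustration of the trade-off, not Spanner's actual API or consistency machinery.

class VersionedStore:
    """Toy multi-version store: reads can target any past commit timestamp."""

    def __init__(self):
        # key -> list of (commit_timestamp, value), appended in timestamp order
        self.versions: dict[str, list[tuple[float, str]]] = {}

    def write(self, key: str, value: str, commit_ts: float) -> None:
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read_at(self, key: str, as_of: float) -> str | None:
        """Return the newest value committed at or before `as_of`."""
        latest = None
        for ts, value in self.versions.get(key, []):
            if ts <= as_of:
                latest = value
            else:
                break
        return latest

store = VersionedStore()
store.write("region_sales", "v1", commit_ts=100.0)
store.write("region_sales", "v2", commit_ts=200.0)

# An application that tolerates slightly stale data reads at an older timestamp and
# never waits on cross-site replication; one that needs the freshest value reads "now".
print(store.read_at("region_sales", as_of=150.0))  # -> v1
print(store.read_at("region_sales", as_of=250.0))  # -> v2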

Moore's Law continues to drive advancements in GPUs, the most data-intensive processors available, and as a result, we're only at the beginning of exploring the possibilities of GPU computing in enterprise data centers. As GPUs become the dominant form of computing, we will either recompute data directly, or data gravity will pull compute to execute close to where the data lives.

Despite industry buzz around global namespaces, history has shown that data gravity, network limitations, and consistency challenges impose real barriers to widespread adoption. While niche applications may benefit, enterprise leaders should remain skeptical of solutions that claim to eliminate fundamental storage constraints. Instead, the real shift to watch is how compute resources move closer to data—whether through GPUs or architectural shifts that prioritize locality. In this era, just like in past eras, placing compute resources in proximity to storage will remain the dominant architecture—betting against data gravity is a fool's errand.
