
AI’s Challenge to Data, Storage, and the Computing Industry

Artificial intelligence has created a fundamental change in the nature and architectural importance of data. To deliver on AI’s promise, we’ll need to unlock new discoveries in the data we already have.


Introduction

By Par Botes, VP AI Infrastructure, Pure Storage

A generational change is hitting the multitrillion-dollar global enterprise computing industry. Almost no one is talking about it, and few are coming to grips with what it will mean. 

Artificial intelligence (AI) has created a fundamental change in the nature and architectural importance of data.

AI has captured the imagination of so many sectors. There are code-slinging IDE extensions, poetry-writing robots, dazzling and impossible images, elaborate videos, and music disconnected from reality. There are math-solving tools that look promising for finding proofs of problems that have long vexed the greatest minds, and language models that could reveal the deep language of biology.

No matter where you look, though, this is the change: Before, if there was a bad output in computing, you checked the code for bugs. Now, with AI, if there’s a bad output, you don’t check the code—you check the data. 

Data is both the source and the problem. How we depend on it will change the way we think about computers, programs, testing, and reliable execution.

The Source of Truth Is at the Center of the Process

This shift in data dependency implies a technological (and legal) sea change with few parallels in computer engineering history. Before, data was simply something used by code. Now, training data is a foundational source of truth for what the code will do. 

In current models for building AI systems, many different kinds of data need to be clearly identified and tracked, made auditable, and put into a repeatable format that can be analyzed at a fast, regular cadence. Each time a model is trained on new data, the insights drawn from that data directly influence the success of the AI effort.
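
To make the idea of identified, tracked, auditable data concrete, here is a minimal Python sketch of what one record in such a catalogue might look like. The field names and schema are illustrative assumptions, not any standard: the point is that each item carries its source, the labeling-guideline version, a content fingerprint, and an ingestion timestamp, so audits are repeatable.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class DatasetRecord:
    """One auditable, immutable record in a training-data catalogue.

    All field names here are illustrative, not a standard schema.
    """
    source: str          # where the data came from
    label_version: str   # which labeling guideline produced the labels
    content_sha256: str  # fingerprint of the underlying bytes
    ingested_at: str     # ISO-8601 timestamp, for repeatable audits

def make_record(source: str, label_version: str, payload: bytes) -> DatasetRecord:
    return DatasetRecord(
        source=source,
        label_version=label_version,
        content_sha256=hashlib.sha256(payload).hexdigest(),
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )

rec = make_record("lab-42/assay.csv", "v2.1", b"raw,measurements\n1,2\n")

# Identical bytes always hash to the same fingerprint, so any later
# change to the underlying data is detectable during an audit.
assert rec.content_sha256 == hashlib.sha256(b"raw,measurements\n1,2\n").hexdigest()
```

Because the record is frozen and fingerprinted, "did this training run use exactly this data?" becomes a checkable question rather than a matter of trust.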

The metadata for this data is also becoming disproportionately valuable.

A Generational Challenge

Because the meaning of a single piece of data shifts with the context in which it’s used, its identity must remain clear and stable across those shifting contexts. Great AI outcomes happen when data is handled with that kind of clarity and rigor.

Data workflows are normalizing and beginning to look like those used by regulated industries. One example is audited financial data, which is fixed and unchanging as of a set point in time, such as the end of a financial quarter. Another is the workflows for testing new drugs or food quality, where data is clearly derived, labeled according to commonly understood standards, and audited via artifacts and evidence so it can be used reliably.

Reliable and uniform data is why AlphaFold, Google DeepMind’s tool for determining protein structures, is one of the most successful AI projects around. Lab data from around the world is uniform, with every person involved agreeing on what definitions mean, so that the descriptions—sometimes called labels—are uniform. The goal is ambitious, but it’s also relatively narrow so the data doesn’t have to be repurposed from other sources to fit other needs.
Those examples are single uses, though. Data has to carry that same quality of identity and provenance across all other training and usage contexts. And while regulated financial data moves through a process that unfolds over months, in this era we’re talking about checking this kind of data and its interactions near instantaneously.

Regulatory agencies, financial departments, and specialized labs aren’t the norm. Most of the world’s digital information is created according to any number of standards, indexed in various ways, and stored in a multitude of formats. The majority of older data was created before auditable labels were a consideration. 

Some vendors think merely having storage for data and an index is the answer. That is a fallacy. Having a structured method to describe the data and track changes to the data and the index is both the problem and the value. 

That’s why this is a new, generational challenge to how we’ve previously thought about data.


A Metaphor: The Bank Heist and the Patent

To give you a better sense of the problem, consider the implications of a French bank heist that occurred in 1890.
A thief broke into the bank and went to work on the safe with a torch that used methane from the bank’s gas-powered lighting and liquid oxygen he’d brought along. After a couple of hours, he’d cut a 12x20-inch rectangle in the iron safe, only to find the safe was double-hulled. Without enough oxygen to cut a second hole, he took off.

Fast forward 20 years and a new way to cut iron was developed. The inventor realized that, along with heat, a more direct application of oxygen to the surface created an iron oxide akin to rust. A torch cutting through brittle rust along with iron is much faster, a breakthrough that inspired the developers to seek a patent for their innovation.

But was it really novel enough for a patent? Someone remembered the theft from 20 years earlier, suggesting that prior use of oxygen to break into a safe might preclude the patent. They found the damaged safe still in an evidence locker, saw no trace of rust, and the new method was granted a patent.

Think of that iron rectangle as a single data point. In 1890, the presence of rust was not yet relevant. That piece of metadata (“the condition of metal”) only mattered later when we understood the science, even if it had existed the whole time. 

That’s the new reality not for one piece of data, but for trillions, in repositories around the world. The ability to go back and reexamine data in a new context creates new insights and new value. AI accelerates this paradigm shift beyond anything we could imagine in the prior big data era.

Metadata: How We’ll Build the Future

The real promise of the Age of AI, beyond automation, is the new discoveries we’ll find in the data we already have. Practitioners will come to see the core need: creating and attaching more metadata, some of it not yet conceived, to an infinitude of existing data points.

There’s never been anything like this in the generations of enterprise technologies that came before. 

The need to revisit data and continuously extract more knowledge and insights to improve AI is unique to our time. This makes data lineage, tracking, and indexing (collectively, metadata management) grow in both value and scale. Metadata is no longer just a method to accelerate data lookup; it has become a true “master catalogue” of data.
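
One way to picture a metadata “master catalogue” is as a registry where every dataset records its lineage (which datasets it was derived from) and descriptive tags, so data can be rediscovered and reexamined in new contexts without rescanning the data itself. The following is a minimal sketch under assumed names, not any real catalogue product.

```python
class Catalogue:
    """Minimal sketch of a metadata 'master catalogue': datasets register
    their lineage (parents) and descriptive tags, so they can be found and
    reinterpreted later by metadata alone."""

    def __init__(self):
        self._entries = {}  # name -> {"parents": [...], "tags": {...}}

    def register(self, name, parents=(), **tags):
        self._entries[name] = {"parents": list(parents), "tags": tags}

    def lineage(self, name):
        """All ancestors of a dataset, walked transitively."""
        seen, stack = [], list(self._entries[name]["parents"])
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.append(parent)
                stack.extend(self._entries[parent]["parents"])
        return seen

    def find(self, **tags):
        """Look datasets up by their metadata, not by scanning the data."""
        return [n for n, e in self._entries.items()
                if all(e["tags"].get(k) == v for k, v in tags.items())]

cat = Catalogue()
cat.register("raw_scans", modality="image", labeled=False)
cat.register("labeled_scans", parents=["raw_scans"], modality="image", labeled=True)
cat.register("train_v1", parents=["labeled_scans"], split="train")

# Provenance of a training set is answerable from metadata alone.
assert cat.lineage("train_v1") == ["labeled_scans", "raw_scans"]
assert cat.find(labeled=True) == ["labeled_scans"]
```

The catalogue answers “where did this come from?” and “what else is labeled like this?” without touching the underlying bytes, which is exactly the lookup-to-lineage shift described above.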

Some may throw their hands up at the sheer scale of the problem, but the tech industry is full of talented people who delight in tough problems of scale. Disruptive thinkers always show up when the needs are the greatest.

Consider the unstructured data revolution that came with big data’s rise 20 years ago; it’s undergoing a dramatic evolution today. Instead of treating data as amorphous blobs, even highly unstructured storage systems have gained the ability to organize unstructured data into structured forms. The winning approach looks much like tabular formats with highly flexible transformations describing how data evolves and relates. The “lazy evaluation” strategy from programming, in which computations are deferred until their results are actually needed, is being mined for its applicability to keeping data reliable and standardized, according to need.
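
To make the lazy-evaluation idea concrete, here is a toy Python sketch (the class and method names are mine, invented for illustration): transformations on a column of data are recorded, not executed, and only run when a caller actually materializes the values.

```python
class LazyColumn:
    """Sketch of lazy evaluation over tabular data: transformations are
    recorded as pending operations and only executed on demand."""

    def __init__(self, values, ops=()):
        self._values = values
        self._ops = list(ops)  # pending transformations, not yet applied

    def map(self, fn):
        # Recording the operation is cheap; no data is touched yet.
        return LazyColumn(self._values, self._ops + [fn])

    def collect(self):
        # Only now do the recorded transformations actually run.
        out = self._values
        for fn in self._ops:
            out = [fn(v) for v in out]
        return out

raw = LazyColumn(["  Cat ", "DOG", " fish"])
normalized = raw.map(str.strip).map(str.lower)  # still no work performed

assert normalized.collect() == ["cat", "dog", "fish"]
assert raw.collect() == ["  Cat ", "DOG", " fish"]  # source left untouched
```

Because each `map` returns a new pipeline and the source is never mutated, the same raw data can be standardized differently for different consumers, which is the “reliable and standardized, according to need” property in miniature.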

We know Python is the universal language of AI, and one of the most widely used data structures in data science and AI is the Pandas DataFrame. I have followed the team at Pixeltable (of Parquet file format fame), who have looked at the data problem and made the dataframe a super-flexible data structure. I really like the door this opens: multimodal data sets that can be flexibly stored, transformed, and iterated on in dependable ways. The world needs more flexible methods for organizing and querying data at scale, and fast search through columns just won’t be enough.

In my work, I'm thinking about extending these types of concepts with even more transformations, lineage, and scale than we previously thought possible. At the core of it, what I truly like is how the data morphs depending on caller needs, decoupling its creation and administration from its use. Transforming data into new forms at access time drives developer productivity: an image stored in one format is transformed to JPEG upon access, if that is implicit in the needs of the code.
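
A minimal sketch of that access-time transformation idea, under assumed names: the store keeps one canonical format per object, and a registry of converters (stand-in lambdas here; a real system would plug in actual codecs) rewrites the bytes transparently when a caller requests a different format.

```python
# Converters are keyed by (stored_format, requested_format). The lambdas
# are stand-ins for real codecs (e.g. an actual PNG -> JPEG transcoder).
_converters = {
    ("png", "jpeg"): lambda b: b"JPEG<" + b + b">",
    ("csv", "json"): lambda b: b"JSON<" + b + b">",
}

class Store:
    """Sketch: storage keeps one canonical form per object; callers declare
    the form they need, and conversion happens at access time."""

    def __init__(self):
        self._objects = {}  # key -> (format, bytes)

    def put(self, key, fmt, data: bytes):
        self._objects[key] = (fmt, data)

    def get(self, key, want: str) -> bytes:
        fmt, data = self._objects[key]
        if fmt == want:
            return data                      # no conversion needed
        return _converters[(fmt, want)](data)  # transform on access

store = Store()
store.put("scan-001", "png", b"pngbytes")

assert store.get("scan-001", "png") == b"pngbytes"         # canonical form
assert store.get("scan-001", "jpeg") == b"JPEG<pngbytes>"  # converted on read
```

The caller never learns (or cares) what format the bytes are stored in, which is the decoupling of a data set's administration from its use.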

I’m following this project, and others like it, very closely. While solutions from the big data era do offer scale, they’ll need significant innovation to be anything like the model for future generations of data organization and storage.

To Succeed in AI, Stay Vigilant about Data

In an earlier post, I talked about the healthy rise of brute force computing and how new AI models change convention. I suspect the need for more computation will be as big a part of the future as will emerging storage capabilities like flexible data transformations, tracking, and indexing.

I leave you with this: Enterprises cannot lose focus on data availability and performance. These are table stakes for today and the future. AI introduces new demands for data freshness and data quality, which in turn create new needs for data representation, tracking, and indexing as these areas mature.

At Pure Storage, we’re laser-focused on our mission to deliver best-in-class products for the all-flash data center. We are well-placed to deliver and working hard on building visionary products that incorporate new concepts for data flows.

I personally love the fresh challenges and opportunities in AI and data set management, and I can’t wait to explore innovations in these new areas in future posts.

