The real promise of the Age of AI, beyond automation, is the new discoveries we’ll make in the data we already have. Practitioners will realize the core need to create and attach more metadata to existing data, including kinds of metadata not yet imagined, across an effectively unbounded number of data points.
There’s never been anything like this in the generations of enterprise technologies that came before.
The need to revisit data and continuously extract more knowledge and insights to improve AI is unique to our time. This makes data lineage, tracking, and indexing—the discipline of metadata management—grow in both value and scale. Metadata is no longer just a method to accelerate data lookup; it has become a true “master catalogue” of data.
Some may throw their hands up at the sheer scale of the problem, but the tech industry is full of talented people who delight in tough problems of scale. Disruptive thinkers always show up when the needs are the greatest.
Consider the unstructured data revolution that accompanied big data’s rise 20 years ago. It’s going through a dramatic evolution today. Instead of treating data as amorphous blobs, even highly unstructured storage systems have gained the ability to organize unstructured data into structured forms. The emerging winner looks much like tabular formats paired with highly flexible transformations that describe how data evolves and relates. The “lazy evaluation” strategy from programming languages, in which a computation is deferred until its result is actually needed, is being mined for its applicability to keeping data reliable and standardized on demand.
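To make the lazy-evaluation idea concrete, here is a minimal sketch in plain Python (the `LazyColumn` class and its transform are illustrative, not any particular library’s API): a column whose values are computed only when a caller actually reads them, then cached.

```python
class LazyColumn:
    """A column whose values are computed only on first access, then
    cached -- a minimal sketch of lazy evaluation applied to data."""

    def __init__(self, source, transform):
        self._source = source        # raw stored values
        self._transform = transform  # deferred per-value computation
        self._cache = {}             # results computed so far

    def __getitem__(self, i):
        if i not in self._cache:     # compute on demand, exactly once
            self._cache[i] = self._transform(self._source[i])
        return self._cache[i]

# Usage: the normalization runs only for the rows actually read.
raw = ["  Alice ", "  BOB", "carol  "]
names = LazyColumn(raw, lambda s: s.strip().title())
print(names[1])  # -> "Bob"; rows 0 and 2 remain uncomputed
```

The design choice is the point: the expensive work is described once, but paid for only where the data is actually used.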
We know Python is the universal language of AI, and one of the most widely used data structures in data science and AI is the pandas DataFrame. I have been following the team at Pixeltable (of Parquet file format fame), who have looked at the data problem and turned the dataframe into a far more flexible data structure. I really like the door this opens: multimodal data sets that can be flexibly stored, transformed, and iterated on in dependable ways. The world needs more flexible methods for organizing and querying data at scale, and fast search through columns alone won’t be enough.
In my work, I'm thinking about extending these kinds of concepts with even more transformations, lineage, and scale than we previously thought possible. At the core of it, what I truly like is how the data morphs depending on caller needs, decoupling its creation and administration from its use. Transforming data into new forms at access time drives developer productivity: a PNG image, say, is converted to JPEG upon access if that is implicit in the needs of the calling code.
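A rough sketch of that access-time decoupling, in pure Python (the `MediaStore` class, its format names, and the converter are all hypothetical stand-ins, not a real codec or storage API): each asset is stored once in its native format, and conversion happens only when a caller asks for something different.

```python
class MediaStore:
    """Sketch: store each asset once in its native format and transcode
    to the caller's requested format only at access time."""

    def __init__(self):
        self._assets = {}      # name -> (format, payload)
        self._converters = {}  # (src_format, dst_format) -> function

    def register(self, src, dst, fn):
        self._converters[(src, dst)] = fn

    def put(self, name, fmt, payload):
        self._assets[name] = (fmt, payload)

    def get(self, name, fmt):
        stored_fmt, payload = self._assets[name]
        if stored_fmt == fmt:
            return payload
        # Convert lazily, only when the caller needs another format.
        return self._converters[(stored_fmt, fmt)](payload)

store = MediaStore()
# Hypothetical converter: upper-casing stands in for real transcoding.
store.register("png", "jpeg", lambda data: data.upper())
store.put("logo", "png", "pixels")
print(store.get("logo", "jpeg"))  # -> "PIXELS"
```

The producer writes data once; every consumer declares the form it needs, and the store bridges the two, which is exactly the creation-versus-use decoupling described above.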
I’m following this project, and others like it, very closely. While solutions from the big data era do offer scale, they’ll need significant innovation to be anything like the model for future generations of data organization and storage.