00:01
Hi everybody, my name is Justin Emerson. I'm a Principal Technology Evangelist here at Pure Storage, and I have the distinct pleasure today of talking with somebody I've gotten the chance to work with for almost two years now. Thanks for being here. Thanks a lot, Justin, happy to be here. And you are the
00:25
RSC storage lead for Meta, correct? Yes, that's correct. So maybe give our viewers an idea: what is the Meta AI RSC? The AI Research SuperCluster. The Research SuperCluster is a purpose-built data center, built specifically for AI workloads, and in particular for AI research workloads.
00:49
Right. And that might sound like an odd distinction, right? But in companies like ours, research is often trying to stay a couple of years ahead of where the regular business is. And so part of it is that it's not only AI workloads, it's also somewhat unpredictable AI workloads.
01:09
And from an AI research standpoint, you publish a lot of papers and do a lot of pure research, as opposed to, say, applying it to a particular business problem. Right, exactly. Meta has a group that does pure research; a lot of it is open source, a lot of it makes its way into publications, and a lot of it also drives the research-to-production pipeline.
01:32
And so it's important for the researchers at Meta to have state-of-the-art technology for doing AI research in whatever field, ranging from things that have direct applicability to the kinds of work Meta does, all the way to what you would think of as purely academic AI. Interesting. So, you were the storage lead for the RSC. What kind of role did you play in its design,
02:00
development, and so on? So, the RSC was a new concept at Meta, in that it's not the first dedicated AI facility. In fact, Pure worked with us extensively on the first AI facility, which is known as the FAIR cluster; FAIR was the name of a research group for AI at
02:24
Meta. And the distinction here is that the AI Research SuperCluster is intended to be larger and also to work on more than just academic research. Right. So a lot of the training that goes on uses Meta's data. And what that involves is that instead of data sets that tend to be on the order of terabytes,
02:53
we're talking about data sets that could be on the order of petabytes, or tens of petabytes, or in theory even larger than that. So we're really talking about a couple of orders of magnitude more data. And using the latest technology, we're also talking about networks that are 10 times faster, and computation that's anywhere from 2 to 6 times faster. Right?
03:16
So on every dimension, the AI Research SuperCluster is several times faster than its predecessors. And it's one of the fastest supercomputers in the world, and it's growing. So what were the kinds of design constraints or challenges that drove the requirements for the RSC? The RSC was special in several ways.
03:42
One is that it's a one-of-a-kind data center, so there wasn't an existing pattern that we were building off of. Second, a lot of the technologies that went into the RSC were either the first of their kind, or the first of their use at Meta, or in some cases the largest of their kind for the vendors we were using. So in practically every dimension, the RSC
04:15
was new. But beyond that, the RSC was also built in an extremely short time period. As you mentioned, we've been working together for two years, and the RSC came together in less than that time window. Right, it was announced a couple of months ago. That's a remarkably short time for bringing up a new data center.
04:40
Right, for comparison's sake, some of the supercomputers that you'll see announced take five years from when the design wins occur to when they're actually doing whatever workload they were designed for. So less than two years is pretty remarkable. Yeah, it was definitely an accomplishment.
05:01
So you mentioned that one of the things about the RSC was a bunch of new technologies. One of those was AIRStore. What was AIRStore, how did you go about building it, and what was it for? AIRStore is our managed storage service that sits on top of the
05:22
storage devices that we buy from Pure. We needed a storage service that could deliver the raw performance of the RSC to any workload. Whereas in many data centers you'll have a collection of small jobs, each operating on its own portion of the data, we didn't want the RSC to have any limitations like that. So that,
05:50
if we had one job that wanted to use all of the compute and all of the storage in the RSC, we wanted to make that available. And that means petabytes of data; it means terabytes per second of delivered traffic. And this is not something that anybody offers as a single package. Right. And so AIRStore is a layer that
06:17
sits on top of other storage systems and harnesses them together to provide that kind of unified data delivery to any training job in the RSC. What were some of the key insights that shaped how AIRStore was designed? That's a good question. One of the main observations we
06:42
started from is that at this kind of scale, you're really catering to certain types of AI jobs, right? And you don't necessarily need the full flexibility that you would need for regular workloads, let's say database workloads or transaction processing, or even editing files. And so what AIRStore was designed to do was
07:09
convert a lot of what we call control-plane operations into data-plane operations. So we worked closely with our storage partners on the systems that we built, and said, hey, we want a storage system specifically tailored to this very narrow workload. Right. And we could take an arbitrary AI training
07:34
workload and transform it so that it had fewer control operations, larger data sizes, and long sustained transfers. And then we could ask our partners, can you optimize for this kind of workload? And luckily they were able to. So AIRStore is not only a managed storage service; it's a service that very narrowly
08:04
tries to provide a set of operations that can be optimized at all layers of the stack. So it's optimized in the storage devices, it's optimized in the caches, and it's optimized in the client libraries that the AI training jobs actually use. Well, that's a great segue to my next question, which is: how did Meta come to select Pure Storage for the AI RSC?
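The transformation described above, replacing many small control-plane operations (per-file opens and metadata lookups) with a few large sustained data-plane transfers, is often implemented by packing training samples into large immutable shard files. The sketch below is purely illustrative, not Meta's actual AIRStore code; the file layout and function names are assumptions.

```python
def pack_shard(samples, path):
    """Write many small samples back-to-back into one large shard file.

    Per-sample opens and stats (control-plane work) disappear: the
    storage system sees a single large sequential write instead.
    """
    index = []
    with open(path, "wb") as f:
        for sample in samples:
            index.append((f.tell(), len(sample)))  # (offset, length)
            f.write(sample)
    return index  # tiny sidecar index mapping samples to byte ranges


def read_shard(path, index):
    """Read the shard back as one large sustained transfer, then slice."""
    with open(path, "rb") as f:
        blob = f.read()
    return [blob[off:off + length] for off, length in index]
```

A training job streaming whole shards this way presents the storage layer with a narrow pattern of large reads, which is exactly the kind of workload a vendor can optimize for at every layer of the stack.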
08:29
So, as I mentioned before, we had a very short time frame. Traditionally, at a company like ours, there are internal storage systems, many of which have been published, that were developed over the course of many years, and they tend to be optimized for the workloads we have in the rest of the company. But we couldn't use those systems for our AI
08:54
workloads because they have very different properties. So in the span of 18 months, we had to select a new storage system for this project and figure out how to get it operational. We contacted a number of storage vendors, both disk and flash, to evaluate their highest-performance, highest-density offerings, and we went through a comparative evaluation, and Pure
09:23
Storage, especially the FlashBlade devices, did very well for our bulk storage needs. That's a great point: you originally were looking at disk-based solutions as well as flash-based solutions. What was the big thing that drove it toward flash as opposed to disk? Where did disk fall down? Traditionally, you know, when we started the project, if you told me that we'd be buying
09:51
the quantity of flash that we were buying, and that we're projecting to buy going forward, it wouldn't have made sense. Five years ago, there was such a gulf in pricing between disk and flash that for something of this scale, disk would have been the only option. Right. And
10:11
a couple of things happened in that time. One is that flash pricing got much more competitive compared to disk. Secondly, from a reliability standpoint, when you have this much storage, the operational cost of constantly going in and replacing disks is an issue that you have to take into account.
10:34
But then third, and this one was one of the larger determinants, was actually the power profile. We had a specific power budget, as all data centers do, and we wanted to make the most of it. Especially on the storage side, we wanted to do a very good job of using as little power as possible, because in an AI data center you want most of your power going to the GPU
11:03
servers, the actual machines doing the AI computation. So from a combination of performance, power, and cost, we ended up selecting Pure Storage. Really interesting. So power became such a determinant, right? Because the less power you use on storage, the more GPU nodes you can run,
11:23
which overall makes your jobs go faster. Yes, exactly. Meta has a reputation for designing very, very lean data centers, right? And this one is no exception. Especially with storage, we had a very tight power budget, specifically so we could drive as much power
11:43
to the GPU servers and pack as many GPU servers into the facility as we could. So what has the experience of working with Pure Storage been like so far? We're now more than a year into the actual implementation, I suppose. What's the experience been like? Well, it's definitely been exciting for us, in that we,
12:07
I think, are probably one of the more challenging customers in terms of what we expect our systems to do. But also, I think we're probably one of the more interesting customers, in that we can do things like provide you the exact workload models for a system that is only on the drawing board. And that was one of the things that we did:
12:34
we provided you a tester, and we said, well, we're not going to tell you what this thing does, but this is what we're optimizing for. Right. And I think that's probably a bit unusual for this industry. Right. And I would say: if only every customer were so helpful.
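A "tester" of the kind described here, a workload model shared with a vendor without revealing what the system actually does, can be as simple as a synthetic load generator that reproduces only the request mix and transfer sizes. The sketch below is a hypothetical illustration; the profile numbers are invented for the example, not Meta's real figures.

```python
import random

# Hypothetical workload profile: shape only, no real data or job logic.
PROFILE = {
    "read_fraction": 0.95,       # training is overwhelmingly reads
    "request_bytes": 4 * 2**20,  # large sequential 4 MiB transfers
    "target_gbps": 100,          # aggregate throughput to sustain
}


def synthetic_trace(profile, duration_s, seed=0):
    """Yield (op, size_bytes) pairs matching the profile's mix.

    A vendor can replay this against their system without knowing
    anything about the actual training workload behind it.
    """
    rng = random.Random(seed)
    bytes_per_s = profile["target_gbps"] * 1e9 / 8
    n_requests = int(duration_s * bytes_per_s / profile["request_bytes"])
    for _ in range(n_requests):
        op = "read" if rng.random() < profile["read_fraction"] else "write"
        yield (op, profile["request_bytes"])
```

Replaying such a trace lets both sides benchmark and optimize the exact pattern (large reads, high sustained throughput) that the real jobs will generate.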
12:54
And so this works to our mutual advantage, right? By providing you with the tester, you're able to optimize specifically for it, and for us that's great, right? Because the more throughput we can deliver from the same amount of power, from the same physical footprint,
13:14
the better it is for the researchers who are going to be using the system. So what are Meta's future plans for the RSC? I know you've talked publicly about it expanding in multiple phases, but how does the RSC fit into Meta's overall strategy around social media and the metaverse and so forth? So the RSC is growing. The current plans are for what we call
13:42
phase one of the RSC, and I think we've mentioned that this is designed to be an exabyte-scale system. And with the growth of rich media content, you know, photos and video, and the kinds of AI applications involved in understanding what's going on in content, understanding what people want to talk about and how they want to be able to talk
14:11
about what they've seen, or search for images by the content of the image, that combination will provide a lot of interesting opportunities going forward. So what was the most important thing you learned from building the RSC? In hindsight, what would you have done differently?
14:33
That's a very good question. I think the biggest lesson for me was: take every estimate of what you think you're going to need, in terms of people working on the project or time frames, and pad it by more than you expect. The amazing thing about the RSC, as you know, was that all of it was
15:02
built during the pandemic. And this affected us in a number of ways, one of which is that most of the people building the RSC never met each other until after the RSC was built. And this goes across the board, from internal teams to our partner teams to our collaborators.
15:25
And so this was challenging, partly because everything was done remotely. But even building the physical data center, they had to take into account how many people could be in a physical space at a given time, for things like putting the machines in, or attaching cables, or attaching power. So everything about the RSC was a challenge, not only because of the size and the scale
15:51
of it, but also because of the timeline and events beyond our control. So the one thing that I think you can confidently predict is that all of your predictions will go wrong. No plan survives first contact with the enemy, as Sun Tzu said, right? Yes, it was a very interesting experience at a very interesting time in the world,
16:17
but you know, we got it done. It was quite remarkable; it really came together. Well, it's a super amazing project. I can say personally it was fascinating to work on, and it was a pleasure working with you on it, too. So thank you very much for talking to us today. This has been super informative, and we
16:34
really appreciate it. Thanks a lot. It was great working with you as well, and to everybody at Pure: you have our appreciation. All right, thanks very much.