27:42 Webinar

Build an Open Data Lakehouse with High-performance Object Storage

A data lakehouse unites the strength of data lakes and data warehouses to support ETL, SQL, and machine learning workloads in a single system.
This webinar first aired on June 14, 2023
Transcript
00:01
All right. Hey, good afternoon, everyone. Thanks for coming to my session. My name is Fu Jian. I'm a principal field solution architect focused on data science at Pure Storage. Today we are going to talk about the data lakehouse, right?
00:17
Not data lake, not data warehouse: data lakehouse. OK. All right. Here is today's agenda. We will start with the evolution of data platform architecture. Then I will introduce the data lakehouse, what it does and how, and
00:36
why we are going to use high-performance object storage to build an open data lakehouse. We will conclude the session with two case studies. All right, here are the data trends and forces today. Everyone in the organization wants easy access to data to support many different workloads, including
01:00
dashboards, machine learning, and many data applications. And of course all these applications need data. So people want data lake modernization, they want more performance, they want cloud readiness. On the other hand, here are the challenges. Often we find that data is poorly managed, which makes it quite difficult
01:24
to explore. And as data keeps growing, the data platform becomes more and more complicated, and we all know a complicated data platform limits the value we can derive from the data. Because of the complexity, it also tends to become slower and slower and more expensive. So how do we address these challenges? Let's take a look, a kind of review, of the
01:52
evolution of data platform architecture. The first generation starts with the data warehouse. In the data warehouse architecture, we extract structured data from multiple data sources, then we transform the data, and finally load the transformed data into the data warehouse to support BI and reporting applications.
02:17
This process is called ETL: extract, transform, and load. About 20 years ago, the data lake concept, or data lake architecture, was introduced to do a similar thing at much bigger scale and for unstructured data. Now, the challenge with data lakes is that everything in a data lake is just a file or object. There's no optimization,
02:47
there's no index, and many other advanced features are not available. Because of that, data quality, governance, and performance are generally not that good. On the other hand, the data warehouse obviously has limited support for unstructured data; it is designed for structured data.
03:09
In general, data warehouse performance for SQL workloads is quite good, but for machine learning workloads it's not very good. Why? Because SQL over a single ODBC or JDBC connection is simply not efficient enough for machine learning. Machine learning needs direct access to the data in an open file format, at very, very high concurrency.
03:36
So, both the data lake and the data warehouse have their strengths and weaknesses. How about we put them together? This is the second generation: the two-tier data architecture. In this two-tier architecture, we still extract data, both structured and unstructured, from multiple data sources. In this case, the data first lands in the data
04:03
lake, and then we run a lot of ETL jobs to load the structured data into the data warehouse to support BI and reporting. On the other hand, machine learning workloads go directly to the data lake to access the unstructured data
04:22
in most cases. Well, it seems to work. However, these are still two separate systems, and that introduces its own challenges. Because there are so many ETL jobs between the two tiers, it becomes very, very difficult to manage those ETL pipelines
04:44
and therefore to ensure data quality and reliability. And those ETL jobs run periodically as batch jobs, so data staleness also increases. That's another challenge introduced by this two-tier architecture. So the question is: how do you enable easy access to your data, when data keeps
05:10
coming in, data volumes are exploding, there are always security and compliance risks around that data, and of course we always wish we had more budget for it. What can we do? Enter the data lakehouse. I have a data lake,
05:32
I have a data warehouse: boom, I have a lakehouse. All right. Again, a data lakehouse tries to unite the strengths of both the data lake and the data warehouse. But the key difference from the previous two-tier architecture is that it is a single system. It is a single system to support big data ETL, SQL, and machine learning workloads in a
06:00
single system. So it eliminates most of the challenges of the two-tier system. And here are the high-level key features of a data lakehouse. Data files are typically stored in open formats. It has built-in ACID transaction support, partition and schema evolution, and time
06:26
travel queries, all supported with near data warehouse performance. Now, what do all these terms mean, and why do they matter to you? We will discuss that in a minute. But for now, just keep in mind that all of these features are essential for building an open, flexible, and fast data platform.
06:53
So the question is: how can we build one, one that is open and fast, and specifically with high-performance object storage? And why do we want to use high-performance object storage for that? Let's discuss that. There are three key components in a data lakehouse architecture. Number one, we have the data lake as the foundation.
07:21
Number two, we add an advanced metadata layer on top of the data lake to enable a lot of advanced features. And number three, we use processing engines that understand the lakehouse specs. Let's work through these three points one by one. First, we have the data lake foundation. Since it is the foundation,
07:51
we want it to be as simple and as fast as possible. And we want to store all files or data in the data lake foundation in an open format like Parquet or ORC. Why do you want to do that? Because later we are going to use multiple different processing engines for different workloads against the same single copy of the data. That's why we want open data formats.
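To make the open-format foundation concrete, here is a minimal PySpark sketch (not from the webinar slides) that lands a small dataset in an S3 bucket as Parquet. The endpoint, credentials, bucket name, and column names are placeholders, and the hadoop-aws package is assumed to be on the Spark classpath.

    from pyspark.sql import SparkSession

    # Minimal sketch: land a small dataset in S3-compatible object storage as Parquet.
    # Endpoint, credentials, and bucket are placeholders; hadoop-aws must be on the classpath.
    spark = (
        SparkSession.builder
        .appName("parquet-on-s3-sketch")
        .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example.internal")
        .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
        .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )

    df = spark.createDataFrame(
        [(1, "alice", 42.0), (2, "bob", 17.5)],
        ["id", "name", "amount"],
    )

    # Open columnar format: any Parquet-aware engine can read this same copy later.
    df.write.mode("overwrite").parquet("s3a://demo-datalake/raw/transactions/")

Because the files are plain Parquet in an open bucket layout, Spark, a SQL engine, or a machine learning framework can all read the same copy later without an export step.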
08:20
All right. And then, why high-performance object storage for data lakes? Because the data lake, historically, has been very, very complex. This actually prevents a lot of organizations from driving value from a data lake. On the other hand, object storage is super
08:43
simple, highly scalable, and cloud-native. Object storage is fast and simple, yet it can be made very, very powerful by adding advanced metadata management on top of it. And we don't just want any object storage. We want high performance, we want multidimensional high-performance object storage
09:07
for this, because we are going to store a lot of data in the data lake, and of course we are going to run many different types of workloads on that data, including high-throughput big data ETL, low-latency dashboards and BI, and high-concurrency machine learning workloads, all against the same data in the data lake. That's why we need high-performance object storage for it.
09:34
Now, once we have the high-performance data lake, next we are going to add the metadata layer on top of it. The metadata layer enables a lot of advanced features for the data lake, including transaction management, versioning, governance, and advanced data structures for better performance.
09:59
And because we are going to store petabytes of data and billions of files and objects in the data lake, the metadata for that data can also become very, very big. So typically the metadata is stored in the same data lake as your actual data, again in an open format like JSON or Avro. Today, there are three popular
10:24
open source data lakehouse metadata libraries: Delta Lake, Apache Iceberg, and Apache Hudi. Here is an example of Delta Lake data and metadata stored in FlashBlade S3. As you can see, at the bottom we have some data files; these are your big data, the actual data. And then under the _delta_log directory we have
10:50
several JSON files; those are the metadata for your files. Now, the last step is to use a processing engine, or actually multiple processing engines, that understand that data and metadata. There are many choices today, both from the open source community and from commercial vendors, among which Apache Spark is the most popular open source engine for the
11:22
data lakehouse. Here we have examples of using Apache Spark to write and read Delta Lake files stored in FlashBlade. If you are familiar with Spark, you will see this is very similar to writing and reading a CSV file in S3. The only difference is that we set the format
11:44
to delta; that's all you need to know. The metadata library actually handles all the details for you under the hood. Now, every write is a transaction. Transactions are built in and supported by the data lakehouse. Here is another example, a row-level update in the data lakehouse using Spark.
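The code on the slides is not captured in the recording, so here is a hedged PySpark sketch of the same two ideas, writing and reading a Delta Lake table on S3 and doing a row-level update, assuming the delta-spark package is available and reusing placeholder paths and columns from the earlier sketch.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    # Sketch only: assumes delta-spark and hadoop-aws are on the classpath and that the
    # S3A endpoint/credentials are configured as in the earlier Parquet example.
    spark = (
        SparkSession.builder
        .appName("delta-on-s3-sketch")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "s3a://demo-datalake/lakehouse/transactions"  # placeholder bucket/prefix

    # Writing looks just like CSV or Parquet -- only the format string changes.
    df = spark.createDataFrame([(1, "alice", 42.0), (2, "bob", 17.5)],
                               ["id", "name", "amount"])
    df.write.format("delta").mode("overwrite").save(path)
    # On S3 this produces part-*.parquet data files plus _delta_log/*.json metadata.

    # Reading back: the metadata library resolves the Delta log under the hood.
    spark.read.format("delta").load(path).show()

    # Row-level update, executed as an ACID transaction on the lakehouse table.
    table = DeltaTable.forPath(spark, path)
    table.update(condition="id = 2", set={"amount": "19.9"})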
12:09
We cannot do this with traditional, classic data lake tables; it is only available with a lakehouse. Each transaction creates versions, table-level versions, in the lakehouse, and those versions are always available for query. On the top, this is an example of using Spark and its versioning API to read an older
12:37
version of the data from a lakehouse table. On the bottom, we have another example that lists the full history of the table versions, and you can always travel back and query any of those versions. This is called time travel queries.
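Continuing the same hedged sketch (same Spark session and placeholder path as above), the time travel reads look like this in Delta Lake:

    from delta.tables import DeltaTable

    # Read an older version of the table via Delta Lake's versionAsOf read option.
    old_df = spark.read.format("delta").option("versionAsOf", 0).load(path)
    old_df.show()

    # List the full commit history; every listed version can be queried ("time travel").
    history = DeltaTable.forPath(spark, path).history()
    history.select("version", "timestamp", "operation").show(truncate=False)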
13:03
As you can easily imagine, this feature can be very useful for things like audit, compliance, rollback, and reproducing queries at certain points in time. It's a very cool feature. Of course, we can also do the same thing with SQL using engines like Trino and many other lakehouse engines, just regular SQL as you can see:
13:35
SELECT and UPDATE. So if you are coming from a relational database background, you may say, what's the big deal, this is just a basic SQL feature, right? That is true. However, if you are like me, coming from a big data, data lake background, you will appreciate this feature a lot.
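As an illustration of the SELECT and UPDATE just mentioned, here is a hedged sketch using the Trino Python client; the host, catalog, schema, and table names are assumptions, and UPDATE support depends on the connector (Trino's Delta Lake connector supports it).

    import trino

    # Placeholder connection details; assumes a Trino cluster with a Delta Lake catalog named "delta".
    conn = trino.dbapi.connect(
        host="trino.example.internal",
        port=8080,
        user="analyst",
        catalog="delta",
        schema="lakehouse",
    )
    cur = conn.cursor()

    # Plain SQL against the lakehouse table.
    cur.execute("SELECT id, name, amount FROM transactions WHERE amount > 20")
    print(cur.fetchall())

    # Row-level UPDATE, handled as an ACID transaction by the lakehouse table format.
    cur.execute("UPDATE transactions SET amount = 19.9 WHERE id = 2")
    cur.fetchall()  # fetch to make sure the statement has finished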
14:01
Why? Because SQL UPDATE is not available with classic data-lake-backed tables. So for example, one day you find out there is some wrong data, a mistake, that for whatever reason got into your table, and you want to do an update to correct the data,
14:22
right? With a classic data-lake-backed table, you cannot do that easily, not even to update a single row in that table. You have to repopulate the whole partition or even the whole table, which is very slow and risky. You don't want to do that, right?
14:43
But this SQL UPDATE is built into the data lakehouse table formats, including Delta Lake. So it's very easy, it's instant; it's basically just a SQL feature here. However, I do want to highlight that even though this is supported in the data lakehouse, we should only use it infrequently, right?
15:10
Why? Because it is not designed for online transactional or very frequent transactional scenarios the way a relational database is; it is simply not designed for that. There are always trade-offs in this kind of big data system. So we should use it infrequently.
15:36
Now, let's review the options available for these three key components in a data lakehouse. For the data lake, we have high-performance object storage to support multiple workloads with performance. For the metadata layer, I would say go with either Delta Lake or Apache Iceberg; Hudi is falling a little bit behind these days.
16:03
And then for processing, again, many choices, either open source or from commercial vendors. Let's put them all together. This is how we enable easy access to your data. Since the data stored in the data lake is in open formats, and the
16:28
technologies we use, S3 and the other processing engines, are either open source or open standards, there's low vendor lock-in, and therefore it's inexpensive. And thanks to high-performance object storage like FlashBlade S3, we enable fast data exploration. It's also very simple to use and operate:
16:52
the compute layer and the storage layer are completely separated, so you can operate and upgrade them independently. That's very easy and simple. Let's go a step further and make this cloud ready. How can we do that?
17:13
Well, we already have the data stored in object storage, in S3. We all know object storage is cloud-native. The next thing we need to make this cloud ready is to run the processing engines in a Kubernetes cluster, because Kubernetes can run anywhere: on-prem, public cloud, or even hybrid.
17:40
Therefore, it makes this whole solution deployable anywhere, so it's cloud ready. Now, here is an example architecture for running an open data lakehouse with Pure Storage. I will go through the data flow with this. First, we have FlashBlade S3 as the high-performance object storage, and then your
18:10
data sources, multiple data sources, structured, unstructured, or semi-structured, it doesn't matter, land in a FlashBlade S3 bucket. From there we are going to use Spark to do the ETL, or even machine learning workloads, via the S3 API. And we are going to run the Spark jobs natively in a Kubernetes cluster with the Spark Operator,
18:42
which is open sourced by Google Cloud and available for everyone. You may also want to introduce a Jupyter notebook that gives your end users a very easy-to-use interface to those Spark jobs, and those Spark jobs will output their results, the ETL results, typically into another S3 bucket, using one of the
19:14
supported lakehouse specs; here we are using Delta Lake. It doesn't matter, you can go with Iceberg or Hudi. From there, we can use something like Trino or another SQL engine to provide query capability over the data in the lakehouse for BI, reporting, and dashboard use cases.
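Stepping back to the Spark jobs mentioned a moment ago: the Spark Operator is driven by SparkApplication custom resources. The webinar doesn't show one, but here is a hedged sketch of submitting one from Python with the Kubernetes client; the image, namespace, service account, and script path are placeholders, and the field names follow the operator's v1beta2 CRD (a plain YAML manifest is the more common way to do this).

    from kubernetes import client, config

    # Sketch only: submit a SparkApplication custom resource for the open source
    # Spark Operator (sparkoperator.k8s.io/v1beta2). All names below are placeholders.
    config.load_kube_config()

    spark_app = {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": "lakehouse-etl", "namespace": "spark"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": "example.registry/spark-delta:3.4",
            "mainApplicationFile": "s3a://demo-datalake/jobs/etl.py",
            "sparkVersion": "3.4.0",
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"instances": 4, "cores": 2, "memory": "4g"},
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="sparkoperator.k8s.io",
        version="v1beta2",
        namespace="spark",
        plural="sparkapplications",
        body=spark_app,
    )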
19:44
Internally, Trino requires a relational database like PostgreSQL to store its table schema information. A relational database like PostgreSQL, as we all know, requires block storage, and that's why we have FlashArray here to support it.
20:10
We are also going to run the PostgreSQL database in the same Kubernetes cluster, and of course we are going to use Portworx to provision and manage the persistent volume for the PostgreSQL container. And you don't have to stop there: you can even run your data applications in the same cluster to share the compute, actually to share the compute and the storage
20:37
resources. Again, this system is open, it's simple, it's fast. Now, with this, let me introduce two case studies and then summarize the session. The first one is KGI Asia. KGI Asia is a Hong Kong-based financial and securities company.
21:07
Like any financial firm, data is the most critical asset to the company. Built with Pure Storage FlashBlade and one of our data lakehouse partners, KGI Asia was able to consolidate multiple data silos into a single, fast, and simple data lakehouse system, therefore enabling fast data insights for the company.
21:40
The customer also liked the simplicity FlashBlade provides: they were able to consolidate these silos into the new system backed by FlashBlade within three months, which is very fast, much faster than the industry standard of six or even nine months. The second case study is a company based in Singapore.
22:11
This company provides cybersecurity services to government agencies across Asia. They had built their system using legacy Hadoop technology a couple of years ago, and they needed to do a tech refresh of the whole stack. Of course, they considered simply upgrading to the latest software stack.
22:40
That was the natural first consideration. However, they didn't find enough value in the expensive software upgrade, and they also needed advanced AI-powered capabilities to quickly identify new security threats.
23:07
So they came to Pure. What we offer is a simple and fast data platform built with open source and open technologies, leveraging the Pure portfolio including FlashBlade, FlashArray, and Portworx. Today, they are running the whole big data and AI pipeline together in a single Kubernetes cluster, which is very important for this
23:36
team, because they are such a small team. They also like the fact that, unlike some of the other competitors, we don't just sell them a box; we are more like a long-term partner to the customer. I personally spent a lot of time with the customer: I built maybe seven or eight demos and workshops with them and co-designed the system
24:01
with them. And I'm happy to see the value in that. The benefit is huge for the customer: today they can run a single pipeline from data ingestion to preparation, ETL, SQL, or even deep learning training in a single system, and it lowers TCO a lot for the customer because everything is now open,
24:27
right? They also like the fact that they can grow their big data system without added complexity, and they don't have to worry about data migration maybe every five years. If you go with another storage vendor, you have to think about that, right?
24:46
You don't have to do that, thanks to Evergreen. This is the high-level architecture of this customer's system, and as you can see, it's very similar to what we just talked about today. I do want to highlight that the customer went even a step further: they are also doing deep learning AI/ML using
25:09
TensorFlow on GPU servers, again in the same Kubernetes cluster, using the same FlashBlade data lake. And this is exactly why you should also use high-performance storage like FlashBlade for your data platform. Maybe today you are not running AI/ML deep learning, but eventually,
25:38
everyone wants to go down this path. Eventually you will consider onboarding AI/ML into your system, and when that happens, you don't want to build another data silo just for those applications; you want to leverage your existing data lake, your existing high-performance data lake.
26:07
So those are the case studies. Now, let's conclude the session with three key takeaways. Today we talked about the data lakehouse, which is a single system combining the strengths of both the data lake and the data warehouse to support big data ETL, SQL, and machine learning workloads, again, in a single system.
26:35
And we want to build an open data lakehouse: store the files in open formats, using either open source or open technologies, to avoid lock-in. Because later you are going to run multiple use cases on the data lake, and you are going to use different processing engines for them, you want to keep your platform open, right?
27:01
That also helps you save cost in the long run. And we talked about using high-performance object storage to build the data lake foundation, because that really gives us fast data exploration across all the different workloads. And because object storage is simple, it makes the whole system simple to use and
27:29
operate. So with that, I want to thank you for your time. Any questions from the audience can
  • Object Storage
  • Pure//Accelerate
  • SQL

This session covers current architectural challenges and demonstrates a flexible, on-prem, and hybrid-cloud lakehouse architecture, built with open source technologies on high-performance scale-out S3 storage for cost-efficient and fast data exploration.
