00:00
Financial institutions rely on AI to process massive amounts of data, extract insights, and drive informed decision making, but accuracy is critical. Misinformation and hallucinated data can erode trust in AI-generated analysis. This is where retrieval-augmented generation, or RAG, changes the game. RAG enhances AI models by integrating authoritative data,
00:24
ensuring responses are accurate, relevant, and trustworthy. In this video, we'll explore how it works and demonstrate its impact in financial services. AI is accelerating financial analysis, enabling faster data processing and deeper insights. In financial institutions, AI must handle both structured and unstructured data, such as SEC filings, earnings reports, and market trends, to deliver meaningful insights and drive smarter
00:57
decision making. However, accuracy remains a fundamental challenge. Financial data is complex, constantly changing, and often incomplete, making it difficult for AI models to generate reliable insights. RAG resolves this by grounding AI responses in real authoritative data,
01:18
ensuring financial professionals receive accurate, context-aware insights. By leveraging real-world data, RAG not only enhances AI-generated responses but also adapts to domain-specific knowledge, reducing the need for costly retraining while delivering relevant and reliable insights. And thanks to FlashBlade's high-speed data retrieval and seamless scalability,
01:42
AI-driven analysis is now faster and more efficient. Let me walk you through exactly how RAG works, including the key techniques behind prompt engineering. In this demo we're going to do a deep dive into how to build a financial services RAG model. Now, the first step is that a user will typically see a user interface like this, where
02:02
they're going to ask a question and get a model response back. Here we're going to ask a simple question about a specific company's debt disclosures from its latest filing. Notice that I'm not connecting to the SEC filings database, because we want to show that if there is no context, the model won't hallucinate any data.
02:22
It's very typical for LLMs to have guardrails around what they can show to the user. So when I run this query, it's fairly quick: it could not provide any information about this corporation's debt disclosures, and it says that if you have additional information, please provide it and it will help assist. What we're going to do next is ask
02:46
the exact same query, but this time connecting to that SEC filings database, which is running on KDB.AI's vector database and powered by a Pure Storage FlashBlade. This is going to take a little while to process, so while it's running, let's look at what's going on under the hood to get a better understanding of the entire workflow. We're going to be using Meta's latest
03:10
model, Llama 3, but first let's take a look at that workflow. We were just looking at the user interface, and in that user interface we typed in a query. What's happening under the hood when we connect it to that SEC filings database is that it's going to check whether there's an associated company name it can use to
03:31
narrow down the vector database search. If a company name is present, we're going to pass that query plus the company's name to the vector DB so that it can extract the most relevant information about that specific company's filings. If no company name is present, we're going to do a full search across all of the vectors we have in our vector database.
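(As a rough illustration of that routing step, here is a minimal Python sketch; extract_company_name and search_filings are hypothetical stand-ins for the LLM call and the KDB.AI query shown later in the demo.)

# Hypothetical routing step: narrow the vector search when a company name is found.
def retrieve_context(query, top_k=10):
    company = extract_company_name(query)   # small LLM call, shown later in the demo
    if company:
        # A company was mentioned: restrict the search to that company's filings.
        return search_filings(query, company=company, n=top_k)
    # No company detected: search across every vector in the database.
    return search_filings(query, company=None, n=top_k)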
03:51
Here we're using financial services data, all 10-K filings from 2009 to 2024. When we hit that KDB.AI vector database, we're going to get additional RAG context out: it's going to say, here are the most relevant vectors related to your query. We then pass those to a summarizer, basically another LLM that summarizes that output. The reason we want to do that is because
04:19
sometimes that output may not be in chronological order, so we're going to have it summarized first and put in chronological order so that the final stage can generate the proper response. Notice that the two steps in green here, the get-company-name and summarize
04:37
outputs, are actually optional. You don't have to do this in your LLM pipeline. You could go straight from the user interface to the vector DB query and put it through Llama 3 with RAG, but here we're doing additional steps to highlight what might actually be happening in production, because you don't want to just throw a bunch of stuff into that large language model.
04:53
You want to summarize it, clean things up, and put additional guardrails in place, so these steps are sort of acting like agents, in a sense. Lastly, that Llama 3 with RAG is going to get the full query plus the summary, and then it returns the response to the user interface.
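(That optional summarization step can be one more LLM call. A minimal sketch, assuming the call_llm helper defined later in the demo; the instruction wording here is paraphrased, not the demo's exact prompt.)

# Hypothetical summarization step: condense the retrieved chunks and put them in
# chronological order before they reach the final RAG prompt.
def summarize_context(chunks):
    messages = [
        {"role": "system", "content": "You are a helpful financial assistant."},
        {"role": "user", "content": (
            "Summarize the following SEC filing excerpts, keeping key figures "
            "and ordering the information chronologically:\n\n" + "\n\n".join(chunks)
        )},
    ]
    return call_llm(messages, max_new_tokens=512)   # call_llm is defined later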
05:11
Now, when it comes to the Python libraries that are used, we're predominantly looking at these five libraries here. These are the only libraries needed to run Llama 3 with 70 billion parameters and connect to the KDB.AI interface to interact with our vector database. There are other libraries we use, but this is just a summary to show you how much the LLM world has progressed
05:32
with tools like Hugging Face; there aren't a lot of libraries needed to build these pipelines out.
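(The narration doesn't list the five libraries by name, so treat this import block as a plausible guess rather than the demo's exact set.)

import torch                                            # tensor backend / GPU placement
import transformers                                     # Hugging Face pipeline for Llama 3
from sentence_transformers import SentenceTransformer   # embedding model
import kdbai_client as kdbai                            # KDB.AI vector database client
import pandas as pd                                     # light parsing of retrieved results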
05:50
Now we're going to define our LLM. Here, like I said, we're connecting to Llama 3's 70-billion-parameter model, which resides on Hugging Face. We're going to create that transformers pipeline, telling it that what we want to do here is text generation, along with some of the usual parameters, such as what precision we want for the weights. We're using float16, because float32 weights may take up too much space on your GPU and you won't be able to run any generations.
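(Defining the model typically looks something like this with the Hugging Face transformers pipeline; the model ID and device settings are assumptions based on the narration, i.e. Llama 3 70B Instruct in float16 on the A100s.)

import torch
import transformers

# Assumed Hugging Face model ID for Meta's Llama 3 70B instruct model.
model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

# Text-generation pipeline loaded in float16 so the weights fit in GPU memory;
# float32 roughly doubles the footprint and can exhaust the GPUs.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.float16},
    device_map="auto",
)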
06:07
And then here we're loading the embedding model. The embedding model is sort of the secret sauce of how you filter and separate all this information in a higher-dimensional space, so that queries about soccer are in one region, queries about football are in another, and queries about debt disclosures and 10-K filings might be in
06:31
yet another, so that when we ask a query it knows which region to search and can pull the relevant context out. For both of these models, we're going to be running on NVIDIA A100 GPUs.
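(The narration doesn't name the embedding model, so the model ID below is a placeholder; loading it onto a GPU with sentence-transformers might look like this.)

from sentence_transformers import SentenceTransformer

# Placeholder embedding model; the actual model used in the demo is not named.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

# Each query (or filing chunk) becomes a vector in the same embedding space,
# so semantically similar text ends up close together.
query_vector = embedder.encode("What are the company's debt disclosures?")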
06:49
And then comes the cool part, where we do a little bit of prompt engineering. In prompt engineering, and particularly for Llama 3, there are two roles we can define. There's the system role, which you can think of as the sort of hat you want your LLM to wear; for example, here we're asking it to be a helpful financial assistant. And then there's the user role, which supports the user's query
07:10
and tells the model what to do with it. Notice that in this company-name template the system role says your task is to return the name of the company being discussed, and to only provide the name of the company in your response. Then for the user role we're asking, what is the company name being discussed in the
07:27
statement, and then we pass the user's query into it and return that template to be passed on to the large language model with all of that information in it. We actually have several templates here: the summarize template, the embeddings template, and then the full Llama prompt that we pass the data through. These can vary in size depending on exactly what you're doing.
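(Put together, the company-name template described above might look like this in Llama 3's chat-message format; the wording is paraphrased from the narration.)

def company_name_messages(user_query):
    # System role: the "hat" the model wears, plus a narrowly defined task.
    system = ("You are a helpful financial assistant. Your task is to return the name "
              "of the company being discussed. Only provide the name of the company "
              "in your response.")
    # User role: frame the user's query for that task.
    user = f"What is the company name being discussed in the statement: {user_query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]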
07:47
Prompt engineering is a hot topic for improving large language models that we just don't have time to dive into here. Now, how do we get our data out of the KDB.AI vector database? Here we're going to be using their Python API, and if there is a company name present, we can filter to those results before
08:09
doing the vector search. So we say, hey, filter to this company name and then do the vector search. If not, like I said, we grab the top 10 nearest neighbors from our vector database. Then we do a little bit of parsing so that we can pass the results to the model quite nicely.
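(A rough sketch of that retrieval step, assuming the kdbai_client Python API; the session setup, table name, column names, and filter syntax are assumptions, so check them against your KDB.AI version.)

import kdbai_client as kdbai

# Assumed endpoint and table name, for illustration only.
session = kdbai.Session(endpoint="http://localhost:8082")
table = session.table("sec_10k_filings")

def search_filings(query, company=None, n=10):
    vector = embedder.encode(query).tolist()
    # Filter to one company's filings when a name was extracted; the filter
    # syntax here is an assumption about the client API.
    filters = [("like", "company", company)] if company else None
    results = table.search(vectors=[vector], n=n, filter=filters)
    # Assumes the first result set is a DataFrame of nearest-neighbor rows
    # with a "text" column holding the filing chunks.
    return results[0]["text"].tolist()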
08:25
Next is a helper function to call the LLM, one we're going to use in a little bit. This is how you would typically call the Llama 3 large language model. Here we're basically saying, look, if it's being asked to do a company
08:43
query, what hat should it be wearing? Should it be the company hat, the summarize hat, or just the general text-generation hat? If it's the company search, we say give me only 16 tokens back, because we want to keep the name short and sweet so we can do a quick vector search or a quick filter on it.
09:02
For the summarize case we want it to give us some information back but not be too lengthy, and then obviously we pass those through each of the templates we defined before. Then what we're doing here is basically building the prompt messages for the LLM: what message are we giving it, along with things like the tokenizer and these terminators, which mark the ends of sequences and tell it what to do with
09:23
tokens it does not recognize, so that we don't get errors. Then for our output we're saying we want the temperature to be 0.6: the closer it is to 1, the more random the responses are, and the closer it is to 0, the more static or cold they are, so to speak. There are other parameters we could pass here to really fine-tune the output of the model, and then we return that output.
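(A sketch of that helper, following Meta's published generation recipe for Llama 3, i.e. the <|eot_id|> terminator, sampling, and a 0.6 temperature; the per-role token budgets are paraphrased from the narration.)

def call_llm(messages, max_new_tokens=256):
    # Llama 3 uses <|eot_id|> as an end-of-turn marker in addition to the EOS token.
    tokenizer = pipeline.tokenizer
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]
    outputs = pipeline(
        messages,
        max_new_tokens=max_new_tokens,   # e.g. 16 for the company-name "hat"
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,   # closer to 1 = more random, closer to 0 = more deterministic
        top_p=0.9,
    )
    # The pipeline returns the conversation; the last message is the model's reply.
    return outputs[0]["generated_text"][-1]["content"]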
09:43
We're going to be using this helper function in the generate-response function for our query. In generate-response we're saying, if context is included, then let's make sure we get the company name out and then get the vectors related to that company name, if one is present. If not,
10:03
we just look through all the vectors that are related to that query. Then we summarize those responses and get the final response: basically, we pass that context plus the final message here.
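(Tying the pieces together, the generate-response flow might look like this; every helper below comes from the earlier sketches and is hypothetical rather than the demo's exact code.)

def generate_response(query, include_context=True):
    context = ""
    if include_context:
        # Extract a company name (if any), then retrieve the matching filing chunks.
        company = call_llm(company_name_messages(query), max_new_tokens=16).strip()
        chunks = search_filings(query, company=company or None, n=10)
        # Summarize and chronologically order the retrieved chunks.
        context = summarize_context(chunks)
    # Final RAG prompt: the user's query plus the summarized context.
    messages = [
        {"role": "system", "content": "You are a helpful financial assistant."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    return call_llm(messages, max_new_tokens=512)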
10:21
Now, as I mentioned, we can look at what the model is giving us. If you noticed before, the answer was short and sweet: I can't answer that question for you because I'm not provided context. But here we're actually looking at several financial metrics for the corporation: we see the total debt, long-term debt, short-term debt, interest rates, and all the other things that were in those filings. Then it gives us a summary of what it
10:40
thinks the company's financial position is, given the information that was passed in. This could be very useful for a financial analyst who is researching a company and just needs to look something up quickly, and what better way to do so than with an LLM grounded by RAG, both of them powered by Pure Storage FlashBlade.
11:01
With RAG we can guide AI to deliver reliable, context-aware financial insights. By retrieving data from authoritative sources like SEC filings, we reduce hallucinations and provide more precise answers. Behind the scenes, Pure Storage FlashBlade plays a crucial role in enabling this AI-powered workflow. Its speed and scalability allow financial
11:24
institutions to analyze large data sets in real time, unlocking deeper insights without performance slowdowns. We hope this session gave you a clear understanding of how RAG and financial AI models work together. By leveraging authoritative data and cutting-edge infrastructure, financial institutions can extract meaningful insights with confidence.
11:44
If you'd like to learn more, reach out to us at Pure Storage, and if you want to see more ways Pure Storage is helping organizations work smarter, check out Pure 360. It's your go-to hub for quick overviews, expert walkthroughs, and interactive demos, all designed to simplify your infrastructure and help you achieve more.