Confidential Computing and Private LLMs

Introduction

Today I'm talking about quite a pressing issue. Privacy and generative AI are topics that come across my LinkedIn feed pretty much on a daily basis.

Everyone's concerned about how generative AI is affecting privacy and data ownership. And honestly, what is the meaning of data anymore when data is not really representative of real facts but can be generated at will by these algorithms?

Ensuring Trust in AI with Privacy-Enhancing Technologies

What we do at nCloud is ensure that companies can trust their AI through the use of novel privacy-enhancing technologies. We use something called confidential computing to secure the entire deployment of the model and any other peripheral components.

And I'm going to demo that to you today in real time.

nCloud's Background and Achievements

Our company, nCloud, was founded in 2019, and we recently got an Innovate UK government grant to build out this platform and make it available throughout the UK for small and medium-sized businesses. Additionally, we recently graduated from Accenture's FinTech Innovation Lab, and we've talked to a number of leading institutions in the city to validate the problem of privacy when using GenAI.

I do not need to belabor the fact that privacy is a huge issue, especially for regulated firms like financial institutions and their vendors. Even fintechs and other SaaS vendors that are trying to provide AI services as part of their general offering face a huge trust gap: how can they assure their customers that the AI they provide is actually secure, and that it is not using user data and prompts for training and fine-tuning the models?

Challenges with Current AI Services

The popular services available today, like Azure OpenAI, GCP Vertex AI, and AWS Bedrock, tend to filter every prompt and every output that comes out of their models. These are black-box services that generally restrict the choice of model, the architecture behind the scenes, and even the region where such services can be deployed. It's almost like being handcuffed in exchange for convenience.

The other part is that you have to trust the vendor's software supply chain. You don't really know what inferencing capabilities and RAG architecture are running behind the scenes. Even though these companies are audited, this is a mystery to most companies.

And the final thing: with incoming regulation like the EU AI Act and DORA, providing trust to your customers in your AI services is going to be a regulatory requirement. The EU AI Act specifically states that state-of-the-art privacy techniques must be used with inferencing and training data.

Solutions for Data Privacy and IP Retention

Now, how can this be solved today? There are a number of techniques to address the data privacy and IP retention issue, and a lot of the solutions focus on taking open source software and layering traditional data privacy controls on top, like data encryption, access control, and auditing.

But recently, Apple came out with the next standard for privacy when it comes to AI. They talk about an end-to-end encrypted architecture where the data is encrypted right from the client device all the way to the point of compute.

That's what we at nCloud have also built, coincidentally, but our architecture relies on open source components, and unlike Apple, we're not limited to their silicon.

Confidential Computing: A Deep Dive

Confidential computing is a bit of a mystery term, so I want to spend a minute on it. What it allows you to do is run processes on specialized hardware, which could be CPUs or GPUs, and the process itself runs in encrypted memory.

This means that when the process is executed, the system admins of the cloud provider or any software vendor do not have access to your data.

You can, in addition to protecting the data in memory, verify the state of the entire deployment through something called an attestation report. This is a report of the entire boot state of that virtual machine on that hardware, meaning that you have absolute assurance that the hardware and the software have not been tampered with.
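
To make that a bit more concrete, here is a rough sketch, in TypeScript, of the kind of information such a report carries; the shape and field names below are illustrative only, not a real cloud or vendor schema.

```typescript
// Illustrative only: field names are hypothetical, not an actual cloud or vendor API.
// An attestation report binds the running workload to measured hardware and boot state.
interface AttestationReport {
  hardwarePlatform: string;    // e.g. a confidential-computing-capable CPU such as AMD EPYC
  region: string;              // where the instance is deployed
  instanceId: string;          // identity of the dedicated virtual machine
  bootMeasurements: string[];  // hashes covering the entire boot state of the VM
  workloadImageDigest: string; // hash of the software (container) running on the VM
  signature: string;           // signed so the client can check it really came from the hardware/provider
}
```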

What we do at nCloud is enable the deployment of GenAI and LLM ops on this specialized hardware. Our first principle is security: we combine confidential computing with role-based access control.

We're also flexible: we can be deployed across a variety of available hardware, like AMD EPYC and Intel Xeon CPUs and NVIDIA H100 GPUs. And we use state-of-the-art inferencing techniques, including graph- and vector-based RAG.

Demo of nCloud's Solution

I'm going to jump into the demo very shortly. And you're going to see this thing in action.

But I want to explain a bit about how it's going to work. I know this is a bit technical, so feel free to reach out to me afterwards and ask more questions.

But what you're going to see today is this: I'm going to encrypt my prompts in my browser using an asymmetric encryption key pair. Now, that key pair basically holds the secret; if it gets into the wrong hands, my data can be decrypted.
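
As a rough illustration (and not our exact client code), generating such a key pair in the browser with the standard Web Crypto API could look like the sketch below; the key size and usages are assumptions for the example, and the actual prompt and file contents are protected with symmetric keys that this pair wraps, as I'll explain with envelope encryption shortly.

```typescript
// Sketch only: generate the client's asymmetric key pair with the Web Crypto API.
// The public half wraps per-file symmetric keys; the private half is the secret that is
// only ever released to an attested inferencing server.
async function generateClientKeyPair(): Promise<CryptoKeyPair> {
  return crypto.subtle.generateKey(
    {
      name: "RSA-OAEP",
      modulusLength: 4096,                      // assumed key size for illustration
      publicExponent: new Uint8Array([1, 0, 1]),
      hash: "SHA-256",
    },
    true,                                       // extractable, so the key can be backed up or released
    ["encrypt", "decrypt", "wrapKey", "unwrapKey"]
  );
}
```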

What we've created is an authenticated key release mechanism. We provision the entire AI inferencing stack on specialized confidential computing hardware, currently on Google Cloud using an AMD EPYC CPU. We are able to have the entire RAG inferencing pipeline validated, and an attestation report is sent to the browser. The browser validates it against a predefined set of policies.

It trusts that this is coming from valid hardware, and that the software running on the hardware is not corrupted. On the basis of that report, it releases the encryption keys to that inferencing server. The inferencing server decrypts the inputs and is able to run the entire inferencing process in encrypted memory. That's why it is end-to-end encrypted.
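
A minimal sketch of the client side of that flow is below; the endpoint paths, policy fields, and function names are hypothetical placeholders, not our actual API.

```typescript
// Hedged sketch of an authenticated key release flow from the browser's point of view.
// Endpoints and field names are hypothetical placeholders.
interface ReleasePolicy {
  expectedImageDigest: string; // known-good hash of the inferencing server's container
  expectedRegion: string;      // region the confidential VM must be running in
}

async function releaseKeyIfAttested(
  serverUrl: string,
  wrappedKey: ArrayBuffer,
  policy: ReleasePolicy
): Promise<void> {
  // 1. Ask the inferencing server for its attestation report.
  const report = await (await fetch(`${serverUrl}/attestation`)).json();

  // 2. Validate the report against client-side policy before trusting the server.
  //    (A fuller check over the boot measurements is sketched later in the walkthrough.)
  if (
    report.workloadImageDigest !== policy.expectedImageDigest ||
    report.region !== policy.expectedRegion
  ) {
    throw new Error("Attestation policy check failed; keys will not be released");
  }

  // 3. Only then hand the wrapped encryption key to the server, which can decrypt inputs
  //    and run the entire inferencing process in encrypted memory.
  await fetch(`${serverUrl}/release-key`, { method: "POST", body: wrappedKey });
}
```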

Demonstration Walkthrough

Without further ado, let me jump into a demonstration.

So you will see this is the UI. I have a number of files that I've uploaded already.

I want to start off by showing you the asymmetric key pair that I mentioned earlier. This asymmetric key pair is sort of like my identity. It's mapped to my auth identity, so this is uniquely my signature on the system.

And this is the encryption key that's going to be released to the inferencing server once the attestation is verified. What I can do is upload any number of files from my local machine.

Whenever a file gets uploaded, it's first encrypted on the client side using an AES-256 key. The association of that data encryption key with the file is sent over to our storage layer, and that metadata itself is encrypted with the key encryption key, which is the asymmetric key pair.

This is a technique called envelope encryption. It's quite common among major cloud providers.
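
As a sketch of how that can look on the client, assuming standard Web Crypto calls and a key encryption key like the pair generated earlier (the details of our storage layer are omitted):

```typescript
// Sketch of envelope encryption: a fresh AES-256 data encryption key (DEK) per file,
// wrapped by the client's key encryption key (KEK). Storage-layer details are omitted.
async function encryptFileForUpload(fileBytes: ArrayBuffer, kekPublic: CryptoKey) {
  // 1. Generate a fresh AES-256 data encryption key for this file.
  const dek = await crypto.subtle.generateKey(
    { name: "AES-GCM", length: 256 },
    true,
    ["encrypt", "decrypt"]
  );

  // 2. Encrypt the file contents on the client side with the DEK.
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const ciphertext = await crypto.subtle.encrypt({ name: "AES-GCM", iv }, dek, fileBytes);

  // 3. Wrap the DEK with the KEK (the asymmetric pair's public half), so only a server
  //    that has been released the private key can ever unwrap it.
  const wrappedDek = await crypto.subtle.wrapKey("raw", dek, kekPublic, { name: "RSA-OAEP" });

  // The ciphertext plus the wrapped DEK (the encrypted metadata) go to the storage layer.
  return { ciphertext, iv, wrappedDek };
}
```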

Now, what I can do is just select from a number of different data sources. I can map my Google Drive; I can map other popular data sources.

Now, I want to focus a bit here on that attestation report. This is truly where our innovation lies.

Right now, I am communicating with a dedicated confidential-computing-enabled inferencing server running on Google Cloud, and I have proof of it. This attestation report gives me an integrity event, published by Google's and AMD's APIs, which says that the hardware it's running on has Secure Encrypted Virtualization enabled, meaning that it's confidential-computing-enabled.

The other thing I want to point out is that we know exactly where this instance is deployed: we know the region and we know the instance ID. Additionally, the last two components are very crucial: we know the entire boot state of the VM, and whether there was any corruption when that dedicated virtual machine booted up.

So our client-side policies validate against those boot state values. And finally, you can see here that this is a hash of the entire inferencing server's Docker container, meaning that even if one file changed on that inferencing stack, this policy validation would fail and the key release wouldn't happen.

What we validate here is the entire RAG inferencing pipeline, including the LLM, the vector database, and any embeddings model. So anything related to the inferencing is validated.
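
A sketch of what such a client-side policy check might look like is below; it reuses the illustrative attestation report shape from earlier, and the field names remain assumptions rather than a real vendor schema.

```typescript
// Sketch only: client-side validation of an attestation report against a pinned policy.
interface AttestationPolicy {
  expectedBootMeasurements: string[]; // known-good hashes for each stage of the VM's boot chain
  expectedImageDigest: string;        // hash of the approved inferencing container (LLM, vector DB, embeddings)
}

function verifyAttestation(
  report: { bootMeasurements: string[]; workloadImageDigest: string },
  policy: AttestationPolicy
): boolean {
  // Every boot-stage measurement must match its expected value...
  const bootOk =
    report.bootMeasurements.length === policy.expectedBootMeasurements.length &&
    report.bootMeasurements.every((m, i) => m === policy.expectedBootMeasurements[i]);

  // ...and the container digest must match too, so a single changed file in the stack
  // fails the check and the key release never happens.
  return bootOk && report.workloadImageDigest === policy.expectedImageDigest;
}
```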

Finally, this is a very familiar interface that most people know of. What happens here is I establish a direct WebSocket connection with that inferencing server based on a predetermined port, and I'm able to relay my queries.
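
A simplified sketch of that relay is below; the endpoint, port, and message framing are illustrative, since the queries themselves travel as ciphertext.

```typescript
// Sketch only: relay an encrypted query to the inferencing server over a WebSocket.
// The URL/port and message framing here are illustrative, not the actual protocol.
function relayEncryptedQuery(serverUrl: string, ciphertext: ArrayBuffer): Promise<string> {
  return new Promise((resolve, reject) => {
    const socket = new WebSocket(serverUrl); // e.g. wss://<inference-host>:<predetermined-port>
    socket.binaryType = "arraybuffer";

    socket.onopen = () => socket.send(ciphertext); // send the encrypted prompt
    socket.onmessage = (event) => {
      // The model's output comes back over the same connection.
      resolve(typeof event.data === "string" ? event.data : new TextDecoder().decode(event.data));
      socket.close();
    };
    socket.onerror = () => reject(new Error("WebSocket connection to the inferencing server failed"));
  });
}
```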

Now, one thing I want to point out is that the performance isn't great, and that is a known limitation. Today, this is running purely on a CPU.

We do not have access to confidential GPUs at scale today, because the only confidential GPU is the NVIDIA H100, which, as people familiar with them will know, costs a fortune. That is going to change soon, and we hope that the same set of capabilities can map over to confidential GPUs, giving us lower latency and even letting us run complex models like the 70-billion-parameter variants of Llama.

Conclusion

So yeah, that is the demo in a nutshell. And I'd be happy to take any questions.

I know it's a complex piece of software. So if you have any questions, any doubts, I'm here the whole evening and happy to take them now.
