From Search to Inference: The Rise of the Token Economy and Private AI

Introduction

The data explosion and why it’s hard to manage

In the last 30 years we have generated an enormous amount of data.

Every day we create more than 2.5 million terabytes of data.

And by the end of 2026 we will have 180 zettabytes.

That means 180 billion terabytes. It is a huge amount of data, and it is difficult to manage, because people like me have been producing data every day since the late 1990s. It is hard to interact with our personal data because it sits on different devices and in different tools: when we look for something, a document for example, we have to use the search engine of WhatsApp, the search engine of our email, or the search engine of our PC, each separately.

Where all this data lives: data centers

And where is all this data located?

Inside data centers.

There are more than 10,000 data centers in the world, and the United States is the nation that hosts the most of them.

From search to inference

The solution to this data-management problem is inference. Our vision is that we shift from search to inference.

The technology and the user experience will stay the same: a text input.

A brief history of generative AI and the rise of tokens

So before introducing the concept of tokens and the token industry, I would like to summarize the history of generative AI, starting in the early 2010s, when Google acquired DeepMind; Google researchers went on to release the transformer technology behind large language models.

OpenAI was founded in 2015, building on this technology, and in 2020 it released GPT-3, the first large language model that worked well enough to introduce to the market.

But in the last three years we have seen the growth of a new industry: the token industry.

In 2022 there was no LLM on the market yet, so zero tokens per month were being generated.

A few years later, today, more than 50 trillion tokens are generated per month.

What tokens, vectors, and transformers actually do

What does that mean?

Here is an example to understand what tokens are.

Each large language model has a different way to tokenize our prompt.

For example, take a simple prompt. When you are using a prompt and you press Enter on your keyboard, the tokenizer splits the sentence into tokens; in this case GPT-5 generates four tokens for the sentence. Each token becomes numbers, a vector of numbers.

We give these vectors to the LLM. The LLM runs its computation, gives back numbers, and those numbers are translated back into words.
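The round trip just described, words to token ids to vectors and back to words, can be sketched in a few lines. This is a toy illustration only: the vocabulary and embedding values below are invented, and real models such as GPT-5 use byte-pair encoding over a vocabulary of tens of thousands of tokens, not a whitespace split.

```python
# Toy sketch of tokenize -> embed -> detokenize. NOT a real tokenizer:
# the vocabulary and embedding table are made up for demonstration.

VOCAB = {"the": 0, "sky": 1, "is": 2, "blue": 3}
INVERSE = {i: w for w, i in VOCAB.items()}

# A tiny embedding table: one 3-dimensional vector per token id.
EMBEDDINGS = [
    [0.1, 0.0, 0.2],   # "the"
    [0.7, 0.3, 0.9],   # "sky"
    [0.2, 0.1, 0.0],   # "is"
    [0.8, 0.4, 0.9],   # "blue"
]

def tokenize(prompt: str) -> list[int]:
    """Split on whitespace and map each word to its token id."""
    return [VOCAB[w] for w in prompt.lower().split()]

def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up the vector for each token id; these vectors are what the model computes on."""
    return [EMBEDDINGS[i] for i in token_ids]

def detokenize(token_ids: list[int]) -> str:
    """Map token ids back to words."""
    return " ".join(INVERSE[i] for i in token_ids)

ids = tokenize("The sky is blue")   # four tokens, like the GPT-5 example
vectors = embed(ids)
print(ids)                # [0, 1, 2, 3]
print(detokenize(ids))    # the sky is blue
```

The model never sees the words themselves, only the vectors; the translation back to words is the final lookup step.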

So, transformers are AI models that analyze vast amounts of text to understand context and relationships, enabling them to generate accurate, human-like responses.

In these videos, these cool videos, you can see how a transformer works, predicting each word from the words that came before it.

Vectors are numerical representations that capture the meaning of words, phrases, or data in a high-dimensional space.

Under the hood it is very complicated, but for us, using generative AI is very simple: there is a text input, and we just write down our prompt.
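What "capturing meaning in a high-dimensional space" looks like in practice: words with related meanings get vectors that point in similar directions, which is measured with cosine similarity. The three-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for three words.
king = [0.9, 0.8, 0.1]
queen = [0.88, 0.82, 0.15]
banana = [0.1, 0.05, 0.9]

print(cosine_similarity(king, queen))   # close to 1.0: related meanings
print(cosine_similarity(king, banana))  # much lower: unrelated meanings
```

Searching by meaning instead of by exact keywords is exactly this comparison, run over vectors for all of your documents.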

Inference becomes the new computing market

When we prompt these systems, we generate tokens. Generating tokens means compute: calculation using GPUs. And when we use GPUs this way, we are doing inference.

The CEO of NVIDIA has said that in the next few years we will shift from the training industry to the inference industry: once we finish creating and training models, we will do more inference than training. And yet no chipset specific to inference exists; the same GPU chipsets are used for both training and inference. NVIDIA has announced the Rubin and Feynman AI chips for 2027 and 2028, and Jensen Huang says the inference market will keep growing over the next few years.

Google, of course, is doing the same with TPUs rather than GPUs. Remember that NVIDIA built only GPUs, graphics processing units, not chips specific to AI; this is Jensen Huang's great fortune, because the same chip works for gaming and for artificial intelligence. Google, for its part, has come out with its first TPU for the age of inference.

Both are saying that the market of the future will be inference.

The privacy challenge with closed AI providers

But there is a big problem: if I want to connect my whole knowledge base, WhatsApp, my email, my storage, to my inference, and I rely on providers such as OpenAI or Anthropic that use private, closed large language models, I am giving away my personal data. There is only one solution: you have to use a private inference platform.

So now I want to talk about training, fine-tuning, and inference.

Training vs fine-tuning vs inference

For training, you need a lot of GPUs. If you want to create a large language model from scratch, you must train it using deep learning technologies: collect a huge amount of data and run deep learning over it to produce the model. That takes weeks or months of those GPUs running at maximum energy consumption.

Fine-tuning is when you inject your own dataset into a foundation model to obtain a vertical model specialized in your data and your knowledge.

Inference is the interaction with large language models, prompting.
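In practice, prompting a private inference platform usually means sending an HTTP request to a server you control, typically one exposing an OpenAI-compatible chat endpoint (servers such as vLLM and Ollama offer this). A minimal sketch; the URL, port, and model name below are assumptions to adapt to your own setup.

```python
import json
import urllib.request

# Assumed local, OpenAI-compatible inference server (e.g. vLLM or Ollama).
# Adjust the URL and model name for your own deployment.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "qwen2.5"  # hypothetical local model name

def build_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat-completion payload.

    Because the server runs on your own hardware, the prompt
    never leaves your infrastructure.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str) -> str:
    """POST the prompt to the local server and return the model's reply."""
    payload = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        ENDPOINT, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Requires a running local server:
# print(ask("Summarize my notes from last week."))
```

The privacy point of the previous section is visible right in the code: the request goes to localhost, not to a third-party API.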

Why high-quality training data is running out—and why open source matters

According to research by CB Insights, we are running out of high-quality data to train large language models. The research estimates that by the end of this year we will have used up all the high-quality data available for training: public data on the internet and open source content.

So OpenAI, Anthropic, and Gemini need more data: our data.

How many open source models are available right now? More than 2.5 million, which you can download from Hugging Face and run on your own server.

Energy, infrastructure, and private GPT in practice

Before my last slide, I want to show you how prompts behave in terms of energy consumption, just as a curiosity.

I take this simple prompt, put it into my platform, and ask my inference platform to run it on the Qwen open source model, comparing it with Mistral at the same time.

Mistral is on the right; it is a little slower, but it works very well. Qwen is on the left, and it is very fast. And what happens on my server?

I can see that Mistral is the orange GPU, and the consumption for this prompt is about 300 watts; the blue one is Qwen, at about 100 watts for the same prompt. During the inference time, a few seconds, the GPUs compute and then give back the vectors with the response.

This is what inference looks like, and it is different from training: training runs for weeks or months at maximum power, while inference starts computing when I press Enter on my keyboard and then stops; start and stop, start and stop. This means the infrastructure is very different from the infrastructure for training models, and we need to build our own European infrastructure for it, because it is very difficult to find data centers in Europe that meet the requirements for AI and inference.

This is an example of a private GPT, because it runs on a private server, and you get the same results as GPT-4. And to finish my presentation:
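The per-GPU wattage figures in this comparison can be read live from the hardware. A minimal sketch, assuming an NVIDIA card and the `pynvml` bindings to NVIDIA's NVML library; without a GPU the NVML calls will raise an error, which is why the call at the bottom is left commented out.

```python
# Read per-GPU power draw, as in the 300 W vs 100 W comparison above.
# Assumes an NVIDIA GPU and the pynvml package (pip install nvidia-ml-py).

def milliwatts_to_watts(mw: int) -> float:
    """NVML reports power usage in milliwatts; convert to watts."""
    return mw / 1000.0

def report_gpu_power() -> None:
    """Print the current power draw of every GPU on this machine."""
    import pynvml
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            watts = milliwatts_to_watts(pynvml.nvmlDeviceGetPowerUsage(handle))
            print(f"GPU {i}: {watts:.0f} W")
    finally:
        pynvml.nvmlShutdown()

# Requires an NVIDIA GPU with NVML available:
# report_gpu_power()
```

Sampling this in a loop while a prompt runs is enough to see the start-and-stop pattern of inference, as opposed to the sustained maximum draw of training.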

Conclusion: privacy, autonomy, and digital sovereignty

Here are three key points. Privacy is not negotiable. Digital sovereignty isn't just about data ownership; of course we are the owners of our data, but it is about autonomy. If you are not autonomous in managing your data, you are not independent, and you are vulnerable, and not only in the EU.

Thank you for your attention.
