In the last 30 years we have generated a huge amount of data.
Every day we create more than 2.5 million terabytes,
and by the end of 2026 we will have 180 zettabytes.
That means 180 billion terabytes. It's a huge amount of data, and it's difficult
to manage, because we have been producing data every day since the late 90s, like me.
It's difficult to interact with our personal data because we have different
devices and different tools, so when we are looking for something, a message or a document,
we have to go to the search engine of WhatsApp, or the search engine of our
email, or the search engine of our PC, and it's hard to manage all these kinds
of data.
And where is this data located?
Inside data centers.
There are more than
10,000 data centers in the world, and the United States is the country with
the most of them.
The solution to this problem of managing our data is
inference.
So our vision is that we shift from search to inference.
The technology and the user experience will stay the same:
a text input.
So before introducing the concept of tokens
and the token industry,
I would like to summarize the history of generative AI,
starting from 2012, when Google created Google DeepMind,
the group that released the transformer technology
behind large language models.
Only three years later OpenAI was born, using this technology, and seven years after that
they released GPT-3, the first large language model that worked well enough to bring to market.
But in the last three years we have seen the growth of a new industry: the token industry.
In 2022 there were no LLMs on the market, so zero tokens per month were being generated.
Just a few years later, today, more than 50 trillion tokens are generated per month.
What does that mean?
Here is a simple example to understand what tokens are.
Each large language model has a different way of tokenizing a prompt.
For example, take this simple prompt. When you write a prompt and you
press Enter on your keyboard, the transformer turns it into tokens; in this
case, GPT-5 generates four tokens for the sentence. Generating tokens means generating
numbers, which become vectors of numbers.
We give these vectors to the LLM;
the LLM runs its computation, gives back numbers, and those numbers are translated back into words.
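As a rough illustration, here is a minimal sketch of tokenization using OpenAI's open-source tiktoken library. The talk mentions GPT-5; the cl100k_base encoding used below is an older GPT-4-era encoding standing in for it, and the sample prompt is made up.

```python
# A minimal tokenization sketch with tiktoken (pip install tiktoken).
# cl100k_base is a GPT-4-era encoding; each model family ships its own,
# which is why different models tokenize the same prompt differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Hello, how are you?"
token_ids = enc.encode(prompt)               # the prompt becomes a list of integers
print(token_ids)                             # one integer per token
print(len(token_ids), "tokens")
print([enc.decode([t]) for t in token_ids])  # the text piece behind each id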
So, transformers are AI models that analyze vast amounts of text
to understand context and relationships,
enabling them to generate accurate, human-like responses.
In these videos, these cool videos,
you can see how a transformer works,
predicting each word after the ones before it.
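To make "predicting the next word" concrete, here is a minimal greedy next-token loop, a sketch using Hugging Face transformers; GPT-2 and the example prompt are assumptions chosen only because the model is small, not what the talk uses.

```python
# A sketch of next-word prediction: repeatedly ask the model for the
# most likely next token and append it. GPT-2 is an illustrative choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):
        logits = model(ids).logits            # a score for every vocabulary token
        next_id = logits[0, -1].argmax()      # greedy: pick the most likely one
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))               # prompt plus 5 predicted tokens
```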
Vectors are numerical representations
that capture the meaning of words, phrases, or data
in a high-dimensional space.
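A minimal sketch of what "meaning in a high-dimensional space" looks like in practice: words become vectors, and similar meanings point in similar directions. The three-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
# Toy word vectors: similar meanings have similar directions.
import numpy as np

embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),   # made-up 3-D vectors for illustration
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means same direction (close in meaning), near 0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```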
So under the hood it's very complicated, but for us, using generative AI is very simple: there is a text input, and we just write down our prompt.
When we do prompting on our systems, we generate tokens.
Generating tokens means compute: calculation using GPUs.
When we are using GPUs, we are doing inference.
The CEO of NVIDIA said that in the next few years
we will shift from the training industry to the inference industry,
because once we finish training models, we start to do more
inference than training. And yet a chipset specific to inference
doesn't exist: today the same GPU chipsets are used
for both training and inference. So NVIDIA has announced the Rubin and Feynman AI
chips for 2027 and 2028, with Jensen Huang saying the inference market will grow over the next five
years.
Google, of course,
is doing the same with TPUs rather than GPUs.
Remember that NVIDIA built only
GPUs, graphics processing units,
not designed specifically for AI; this is the big fortune
of Jensen Huang, because the same chip works for gaming
and for artificial intelligence. Google, too,
has come out with its first TPU for the age of inference.
They are both saying
that the market of the future will be inference.
But there is a big problem:
if I want to connect my whole knowledge base, WhatsApp, my email, my
storage, to my inference, and I'm using OpenAI, Anthropic, etc.,
I'm relying on
providers that run private, closed large language models, and I
am giving away my personal data. There is only one solution:
you have to use a private inference platform. So now I want to talk about
training, fine-tuning, and inference. For training, you need a lot of GPUs.
If you want to create a large language
model from scratch, you must do training using deep learning technologies:
you collect a huge amount of data
and apply deep learning
to generate a large language model.
You need weeks or months of these GPUs
running at maximum energy consumption.
Fine-tuning is when you want to inject your dataset
into a foundation model to get a vertical model
specialized in your data and your knowledge.
Inference is the interaction with large language models: prompting.
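As a sketch of how "injecting your dataset into a foundation model" typically looks today, here is a minimal LoRA fine-tuning setup with Hugging Face transformers and peft; the base model name and hyperparameters are illustrative assumptions, not the speaker's actual stack.

```python
# A minimal LoRA fine-tuning sketch (pip install transformers peft).
# The base model and hyperparameters below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B-Instruct"   # small open model, assumed for the example
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains a few small adapter matrices instead of all the weights,
# which is why fine-tuning fits on far less hardware than full training.
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()    # a tiny fraction of the foundation model

# ...from here, train on your vertical dataset (e.g. with transformers.Trainer)
```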
According to this research by CB Insights, we are running out of high-quality data to train large language models.
The research estimates that by the end of this year we will have used up all the high-quality
data available for training: public data on the internet and open-source content.
So OpenAI, Anthropic, and Gemini need more data: our data.
How many open-source models are available right now?
More than 2.5 million open-source models
that you can download from Hugging Face
and run on your own server.
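For example, here is a minimal sketch of pulling an open model from Hugging Face and running it locally with the transformers library; the specific model name below is an assumption, chosen only for its small size.

```python
# A sketch of running a downloaded open model on your own server.
# The model name is an illustrative assumption; any text-generation
# model among the 2.5M+ on Hugging Face works the same way.
from transformers import pipeline

generate = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
out = generate("Why does private inference matter?", max_new_tokens=60)
print(out[0]["generated_text"])   # the model runs locally; no data leaves
```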
And before my last slide,
I want to show you how prompts work
in terms of energy consumption, just as a curiosity.
So I take
this simple prompt, put it here
on my platform, and ask
my inference platform using the Qwen
open-source model, comparing it with
Mistral at the same time.
Mistral is on the right; it's a little bit slower, but
it works very well.
Qwen is on the left, and it's very fast. And what happens on my
server?
I can see that Mistral is the orange GPU, and its consumption for this
prompt is 300 watts; the blue one is Qwen, at 100 watts for the same prompt. This
is the inference time: for a few seconds the GPUs compute and then give
back the vectors with the response.
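For the curious, those per-GPU power readings can be sampled with NVIDIA's NVML bindings; a minimal sketch, assuming the pynvml package is installed and that the GPU of interest is at index 0.

```python
# Sampling live GPU power draw with NVML (pip install nvidia-ml-py).
# Which index maps to the Mistral or Qwen GPU depends on your server;
# index 0 here is an assumption.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Sample once per second while a prompt is served; you would see the
# reading jump during inference and drop back when computation stops.
for _ in range(10):
    milliwatts = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in mW
    print(f"{milliwatts / 1000:.0f} W")
    time.sleep(1)

pynvml.nvmlShutdown()
```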
When we are doing inference, we do things like this. It's different from training: training
runs for weeks or months at maximum power, while inference, with one Enter on my
keyboard, starts the computation and then stops; start and stop, start and stop. This
means the infrastructure is very different from training infrastructure, and
we need to build our own European infrastructure for this, because it's
very difficult to find a data center in Europe that meets the requirements for
AI and inference. This is an example of a private GPT, because it runs on a
private server and you get the same results as GPT-4. And to finish my presentation,
here are three key points. Privacy is not negotiable. Digital sovereignty isn't
just about data ownership, because of course we are the owners of our data; it's about autonomy.
If you are not autonomous in managing your data, you are not independent, and you are vulnerable,
and not only in the EU.
Thank you for your attention.