Solving Sudoku using Transformers!

Introduction: Why Solve Sudoku With Transformers?

My name is Paolo. I'm a regular so yeah Usually when someone is missing a pick some my sort of research toys.

So today It's kind of like between theory and practice So this is like just like a project I've been doing and the idea is to solve Sudoku

but with transformers, right like no no the actual cartoons that that we've been grown up to, but the actual transformers from LLM technology.

So how many people are Sudoku experts or solvers? Does anyone practice Sudoku? Wow, it's not many

people. Yeah, it's on LinkedIn, isn't it? It's annoying, that one.

Your colleague has beaten you so yeah I wasn't big on Sudoku but I kind of learned to play it why it's it's it's a single player right you don't beat opponents but it's kind of like crosswords right and yeah I've been playing it for a few years so yeah

what's Sudoku you know how do we solve it as humans and you know how can you have a machine to solve it and like is he even worth solving it like you know Why why do we care and most importantly can transformers do this right because everybody is talking about AGI

So like can a transformer like Optimus Prime so that and yeah, show me the code, right? I always keep code

A Quick History of Sudoku

So Sudoku was actually Was it's like people think it's kind of new but the I mean the modern version of Sudoku was invented in 1979

But you know the example from the 18th century Originally it was a Swiss guy, I didn't know this, I was just searching like five minutes ago, which created something called Latin Squares, like, I guess it was a reference to the kind of Roman, one of the previous Roman games, so it was 1780s.

And then there was this American guy that was like an architect, and he had a hobby of doing like puzzle, made it this, you know, sort of modify that with a three by three sub -grid rule.

And then the recent one, well, the most recent one was sort of developed in Japan in April 1984, which is why it has this kind of weird Japanese name.

So, Suji wo Dukowa, I can't even pronounce that, which translates roughly to the digit must remain single, which makes sense.

And so why it's kind of interesting, I mean, like, well, you know, like you would think, well, it's like, I don't know, it's a child game, right? It's just to waste some of your time.

Why Sudoku Matters in Computer Science

But actually, if you look at computer science, it's an NP -complete problem.

I don't know how many people have done computer science theory, but for a problem that is NP -complete, there's two features. One is it's NP, so in non -deterministic polynomial time.

So essentially, if I give you a Sudoku grid, it's very easy and fast, relatively fast, so you can do it in polynomial time to see if the solution is correct.

So like checking the answer is easy, but at the same time, to actually solve it, it's NP -hard. So it's as hard as the hardest problem in the NP class.

So it's, yeah, it's like, it's very hard to solve, right? I mean, from a computer perspective. And so you have exponential complexity. So of course you know this is like people have solved this so I'm not claiming that

Classic Deterministic Solvers

I've done something like amazing so there are several ways that you do it like probably if you go into an interview and actually I was asked once many many years ago to solve Sudoku and I did it with the first approach which essentially it's backtracking so what

what you do, somebody give you a Sudoku puzzle, missing numbers, you start to fill them, sort of randomly, right, so you start from one to nine, and you start to fill it, and it's like, oh, I'm stuck,

you know, like, that doesn't match, you go back, right? So it's brute forcing, right? Time complexity is n by m, where n is the size of the Sudoku, and m is the clues,

you know, how many things are missing, but you always find the solution the the second way which is really interesting is constrained programming I don't know many people have studied that but the idea is it's very similar how the would you solve it in your mind like most people use this kind of approach where they they look at

the cells have less less numbers they start to put some numbers there and then they check horizontally and vertically to kind of see if the other ones are matching but yeah it's still kind of time complexity is exponential but you can prune these branches and the space complexity is

and at the power of two so like that's why it's kind of hard to keep i mean for a human because you know you you need to keep sort of a lot in your uh your memory when you want to solve that um and the third solution was from donald nutt which is still alive yeah it's 80 uh 88 he was

in the news recently for something else and he came up with this solution called dancing links uh like it's a very elegant approach like you know you should go and search it online and this is really fast right um you have time complexity c by n this is a constant and then space complexity uh it's cubic but you don't care because you know memory is not a big issue um so i did though you

You know, I've implemented all those three and yes, you can find solutions, but then

Probabilistic Alternatives

you can also do something more fancy, more probabilistic.

So you can do simulated annealing where essentially you, it's like a metal. So you start from high noise, like a diffusion process. You start from a very high chaotic status and then you cool down the temperature until you hopefully find that solution.

Of course, there is no guaranteed solution and has a complexity based on number of iterations,

But you can also do Monte Carlo tree search, which is also probabilistic in nature and doesn't give you a solution.

Maybe you can try it for many times until you find one, but in general you have to stop at some point.

Baselines and What Works Best

And all of this works. So I did some results just to kind of see, OK, what's the baseline?

And you can see Sudoku, based on the actual number missing, a different level of complexities. You know, people usually argue, but usually F4, you know, it's the kind of easy expert mode. So easy, medium, hard, and expert.

And if you look at the annealing, CP solver, and DFS, you know, they all solve that. But when you go into this kind of probabilistic methods, like annealing, you know, they kind kind of struggle because, especially in the R problems, they can't find the actual solution.

Of course, you can look at different metrics or how many iterations does it take to solve it and DLX is the best one. You can see it's kind of constant, so it doesn't really depend on the complexity. The other ones have some, they are based on how many numbers are missing.

and so you can do all these things. I mean, like memory usage, et cetera, et cetera, right? So you can also like time by difficulty and this kind of stuff.

So the, well, I mean, the best algorithm so far is the DFS, sorry, the DLX, dancing links beats everyone. It's very deterministic and low memory consumption. You can solve it really fast, right?

Can LLMs Solve Sudoku?

But we have AGI, right? but you know, can we solve this through an LLM, right? And people, you know, everyone now is like, yeah, let's take an old problem,

just try to solve it with an LLM, right? Let's see what does it do, right? Does it work? And so unfortunately it doesn't, right?

And you can try yourself, unless you have a very common Sudoku problem that was seen in training. Yes, it will solve it,

Sudoku Benchmarks and the Memorization Problem

but it's just memorizing the actual output. But people have spent time doing this. So they build the Sudoku bench leather board where essentially they developed like variants of Sudoku puzzles that have never been seen by an LLM.

So it's like not cheating. And unfortunately, yeah, you know, it's really bad, right? If you can look at the average solving rate, even GPT -5 high is like 30 % and

you know, as you go down in the kind of less capable models, like I think Gemma it was mentioned or or when you get like near 0 % with multi -step and single shot obviously is even less on average. So yeah, you can solve that. I guess it's kind of obvious.

You can play with it if you want. The problem is because there's no way for an LLM to kind of understand that there is a constraint and like one thing you can do is to essentially say, hey, like try to create a Sudoku variant and and he doesn't understand the constraints, right?

So this is, people have been doing this,

Beyond Chat LLMs: Specialized Reasoning Models

and then Yann LeCun, which is a famous professor, and he used to work for Meta, and in general he said, okay, I'm going to leave Meta, and I'm going to build a new type of AGI

that he calls AMI, and he developed this energy -based model reasoning model and guess what he basically announced that he was able to solve Sudoku puzzles okay and the first comparison he did was

hey I have this specially trained model and I can beat other LLMs which I think he was kind of pointless because it's like you're comparing something very specialized or something very general but you know he has a lot of money and he's quite famous so I thought okay maybe you know I can do like you know I

I don't have like 24 millions in the bank, but maybe I can try to build something that can sell Sudoku with a transformer architecture.

Designing a Transformer That Solves Sudoku

And so the problem is like, well, you know, transformers are trained on text, where you have words, right? So that goes left to right, and you have to complete, you have to predict the next words. It's a very simple autoregressive model,

but you know, how do you do this on Sudoku? I mean, it's like there are no words. There are numbers.

Maybe those are tokens You know, they need to be you know, it's a fixed size. So it's not infinite, right? Like it's not like you have an infinite sentence.

So it's but you like you don't predict something next right yet predict something in between

Data, Encoding, and Augmentation

So the first first thing I did was okay, let's take some real Sudoku puzzles I didn't want to generate them myself just you know, do something real and And yeah, somebody ranked them on Kaggle. It's a free data set, so you can download it, and there's a lot of them. And they're ranked by complexity, right?

And then the next one was like, well, how do I encode this thing, this puzzle? This would be an encoder, an embedding, I need to have attention, then maybe a decoder, and sometimes I need to also augment this data set because I want to be able, well, I want the transformer to see also slight variation of that, so you can rotate it and do some reflections so that you can create more of those samples if you want to.

So I was kind of looking into literature and say, well, is this similar to text? And if you look at other type of transformers called BERT, they kind of do a similar thing, like you have a sentence and you randomly remove some of the words. And the when you train this thing it needs to guess the missing words, you know the the one in the yellow part, right? So it's kind of similar to Sudoku except that the structure, you know The numbers you are guessing are the you know, the ones that you have to feel But of course, there is a some sort of inherent rule between them, right? Like like even if you put five and there's another five on the other side, you know, it doesn't work, right? So there's some sort of regularity that you have to sort of embed into this one

But also is very similar to job scheduling. So if you have a which is also another MP problem, NPR problem, like if you have a hospital and you have like, let's say, NHS is doing this greatly, by the way, if you have 10 people and you have to kind of assign them to different shifts, you have some sort of constraints, and it's kind of the same thing, right? You have some sort of holes and you have to put the right people in the right spaces, and that costs you, I don't know, 20 billions, apparently, pounds a year in tax money. funny.

Architecture and Attention for Sudoku Constraints

So I did a lot of trial and error, to be honest. I played with different embeddings and different tension layers. And at the end, I came up with this architecture, which it's

published. I tried to get Gemini to visualize it, which kind of looks OK. But the idea is

is the vocabulary, it's like 10 tokens, so essentially you have one to nine are the numbers and zero is the one that is missing.

And I sort of structured them to follow these, these, you know, like, you know, six, sorry, nine blocks.

And the trick was, you need to be, like, you need to embed this knowledge of the attention because you need to know that a number in the top left cell cell needs to know that the numbers around him, right?

So he needs to know that there needs to be nine different numbers within that cell. And he also needs to look on the right and on the bottom of it, right? So it's kind of like mask language modeling, but they only look at the left side.

But here, it needs to look within this cell, and then on the right and on the bottom, right?

So I work with these several iterations, different size, different strategies. strategies, different output projections. So it was mostly experimentation.

And after a couple of days, I sort of figured out this 18 million parameter model that was trained with this architecture. I think it's published, so you can go online and check it.

Training Objective and Iterative Inference

But the The idea is you take your training data set, you have the complete puzzle, but also you know the solution, right? But you also have the actual problem, like which things are missing.

So you take all these digits, you embed them in this special matrix, which is 81 by nine in terms of projections, and then you calculate this loss function, which is essentially the cross -entropy loss between the one that is missing in the one that's guessing, right? And you just do this update.

In Firenze, it'll be more complicated, so you have to do, it's kind of like deep seek, I guess, where you do a forward pass, you get some guesses, but then you have to iterate until you find something that has the final solution,

because as you pick these numbers, some of them might be valid, but just within a box, maybe it's valid on the right side, but it's not valid on the vertical, right? So you have to do this kind of loop several times until you find the actual solution.

So it's kind of similar to a diffusion process where you start from a high noise, you know, like this image diffusion models where you start from noise, and then some of the pixels start to become lines, and then it becomes like a color inside, becomes the eyes, et cetera.

So it's kind of an iterative process process where you pick your best guess, and then you go back and see if it was satisfying this constraint until you solve that. And of course, it doesn't mean you will solve it, so sometimes it just won't find the solution, because it's probabilistic, right?

Results: Accuracy, Solve Rate, and Speed

So after a few days, I came with this, which I think it's pretty amazing for just a weekend and project.

So I managed to get 93 .33 % in terms of validation accuracy, in terms of cell level accuracy, which is essentially, you know, how good is to guess a number inside each cell is 92, 93%.

And in terms of solving the full puzzle, like success rate, it's between 75 and 90%, right? Depending on the difficulty.

At the end, I mean, without calculating all my attempts, of course, like, you know, once I figured out the architecture, it took me, like, three, it takes this to train 3 .5 hours on a decent -sized GPU.

And the speed, it's, I mean, it's fast, right? Because there's a GPU doing the inference. It's like half a second, you know, one second, depends on the complexity.

And you can run it on your machine for, like, it is just an and 80 million parameters. You can't use LLM Studio because, yeah, it's not like, yeah, it's not a text model, unfortunately.

I didn't have time to build an app, but maybe I'll put it online.

And yeah, so you can see, like, obviously, like, it doesn't solve, you know, it's sort of, depending on the level, he has better success rates, so the easy ones, it's like, you know, almost 100%. And then when you go to the,

It's kind of interesting, if you see this graph, easy one and hard one is very good, but in between the certain type, the four to five, it was a little bit worse than the other ones. And I don't know, maybe there were less of those examples. I don't know exactly why, but yeah, it's not like a linear scale. It's more like there's a dip in the hard versus extreme model. model.

And yeah, I mean, you can see the cell accuracy change based on the difficulty, and you get this kind of weird trend where it dips on that one. And yeah, so what's the

Key Takeaways: Practical Research Without a Huge Budget

takeaway? Don't be afraid of doing research.

I think you can, you know, you can, like, Like, you know, if, you know, Yanli kind of got, I think, allegedly got 24 million to source Sudoku, which is, you know, the best people from Meta. But I had like probably $12, you know, between a couple of beers and some free credits.

And, you know, I managed to sort of, well, I don't know if he's beating me. He hasn't published any accuracy figures. So, you know, I did it.

So, and yeah, I think the interesting thing is like, you know, I would be on Wirenext, but a lot of people think to do research, you have to have a lot of data, a lot of hardware. But, you know, if you focus on like, you know, if you focus on specialized models to solve specific problems, you know, you don't need to have a mega budget, right?

So you can, you know, you can do it on your own.

Code, Checkpoints, and Next Steps

so yeah there's a github it's not live yet but i'm going to publish tonight so you can you know i publish everything the the actual checkpoint so you don't have to train it um and like yeah there is the you know how do you actually train it the inference time and maybe i will i will put an app if somebody wants to try it but um yeah i thought it was something fun to share uh and uh

Conclusion and Q&A

And yeah, if you have any questions, it's not about what kind of beer, but...