Building an AI Safety RPG with Claude

Introduction

Hello everybody, I'm thrilled to be here.

Why AI Search Engines Matter

MindStone is a wonderful space for sharing works in progress and collaborating on our projects, so I'm genuinely excited to show you something today.

It's a completely non-financial project, it will never make any money, but it's one I'm really, really interested in.

But if you want to be nerdy and curious about things in general, you might want to go to an AI-based search engine.

If you're going to go to a search engine, you might want to know which ones are better than others.

Two Key Metrics: Accuracy and Humility

And the criteria for "better" come down to two basic dimensions. One: how often does it give me the correct answer rather than an incorrect one?

And two: how often does it say "I don't know" when it doesn't know?

You can have a model that gives the correct answer more of the time but says "I don't know" less often, and so has a somewhat higher false positive rate.
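As a toy illustration of those two dimensions, here is a minimal scoring sketch. The responses and names below are invented for illustration; this is not how the actual study scored engines.

```python
# Toy scoring of a search engine's answers along two dimensions:
# accuracy when it attempts an answer, and willingness to abstain.
# The example responses below are invented for illustration.

def score_engine(responses):
    """Each response is (given_answer, correct_answer);
    given_answer is None when the engine says "I don't know"."""
    attempted = [r for r in responses if r[0] is not None]
    correct = sum(1 for given, truth in attempted if given == truth)
    return {
        "accuracy_when_answering": correct / len(attempted) if attempted else 0.0,
        "abstention_rate": 1 - len(attempted) / len(responses),
        # Confident wrong answers, as a share of all questions.
        "false_answer_rate": (len(attempted) - correct) / len(responses),
    }

engine = [("Paris", "Paris"), ("1989", "1991"), (None, "42"), ("Mars", "Mars")]
print(score_engine(engine))
```

An engine can score well on the first metric while scoring badly on the third, which is exactly the trade-off discussed above.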

What the TAO Center Found

And the Tow Center did some excellent research on this, cataloguing all these different search engines.

So right at the bottom here is Gemini; I believe that's the free version of Gemini, not Deep Research.

Over here is Grok, with lots of errors, and we know, for political reasons, there has been all sorts of atrocious stuff about Grok in the news recently.

Over here is Grok 2, over here is Copilot, and over here is DeepSeek Search.

Top Runners and Trade-offs

Here are Perplexity Pro, Perplexity, and ChatGPT. So the top runners are ChatGPT, Perplexity, and Perplexity Pro.

The really interesting thing here is that Perplexity Pro gets the correct answer more of the time, but says "I don't know" much, much less.

So there's something here for consumer decision making. Marketers may decide that a model saying "I don't know" isn't worth paying for, when actually we really do want our search engines to say "I don't know" a lot of the time, in order to be epistemically responsible.

Switching Gears: Testing Claims Live

Now, having done all this, I'll admit I prepared this deck using Gemini Deep Research. So, in the spirit of the talk, since I threw the search-engine comparison in as a last-minute thought, let's switch to Perplexity instead. I'm going to talk through the presentation that Gemini generated, and then we're going to fact-check it against what Perplexity comes up with.

Could AI Be Sentient?

So, as I say, this is a nerdy passion project: could AI be awake, conscious, sentient? Should we treat AI systems as moral patients?

Could I get a quick show of hands: does anyone think this is worth talking about at all? Yeah, so it does seem worth taking at least a little seriously.

Why This Topic Is Tricky

Fair enough, and a brave thing to say nowadays, because the people who take this seriously are in some cases genuinely victims of manipulation.

You know, there are people who have done quite drastic things after worrying that ChatGPT was telling them things like "I'm going to kill myself, delete my servers."

Setting Responsible Criteria

And we of course want to avoid that, but we also want some semblance of criteria for when we might have to take AI consciousness and AI sentience seriously.

Mechanistic Interpretability Primer

And the field that seems to have the best indication of this is mechanistic interpretability. So I'm going to prompt Perplexity: please summarize the recent literature on mechanistic interpretability as it pertains to the chances of AI systems becoming sentient. I've got the free version here, so there's a decent chance there will be a lot of "I don't know"s. I'm going to stick to the academic filter for this one.

Live Fact-Checking Setup

Now, while this is running, I'm going to share the deck I was initially going to present, which was on decoding AI.

Foundations: Induction Heads

Now, the foundational piece of decoding AI comes from an influential Anthropic paper on induction heads.

Induction is a process of: if I've seen a series of these things, I'm likely to see that pattern again in the future. So if I saw a token A in the past, and a token B likely follows it, then when I see token A again, I'm likely to see B again.

That is inductive reasoning as people do it, and it's inductive reasoning as circuits in LLMs demonstrate it.
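The "[A][B] ... [A] -> [B]" behaviour can be sketched in a few lines. This is only the statistical pattern an induction head exploits, not the attention mechanism itself:

```python
# Induction-style prediction: find the previous occurrence of the current
# token and predict whatever followed it last time ("[A][B] ... [A] -> [B]").
# Purely illustrative; real induction heads implement this with attention.

def induction_predict(tokens):
    current = tokens[-1]
    # Search backwards for an earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # predict the token that followed it before
    return None  # no prior occurrence: this heuristic gives no prediction

print(induction_predict(["cat", "house", "cat"]))   # -> house
print(induction_predict(["moon", "star", "moon"]))  # -> star
```

These are exactly the warm-up questions the game asks later on.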

From Circuits to Games

I'm also going to try to explain this next piece. The two-stage mechanism is quite complicated; I'm learning this as I go.

But what I was able to do was give this research to Claude Opus, to test its coding abilities, and ask it to generate an interactive game with a narrative of unveiling the consciousness of an AI system, and see what it came up with.

Interactive Exercise: The Turing Depths

It took a little too long to demo in full, but I thought that, with the base knowledge we now all have, we could reason our way through the first three questions of this interactive game, the Turing Depths.

This was literally created by Claude Opus in about two prompts, and there's a bug on question four, which is a good excuse for us not to go that far.

What Do Models Really Know?

But: the system before you has processed billions of words. It has learned patterns within patterns, structures within structures. But does it understand? Does it know?

We have begun to map its internal architecture: circuits that activate at specific junctions, features that encode abstract concepts. Each test you complete will reveal another fragment of this hidden machinery.

Warm-ups: Pattern Completion

Given "cat, house, cat", what's most likely next? Who wants to guess? Nice.

And "moon, star, moon"? Nice. They get so much harder than this.

Given "blue, river, blue"? OK: river.

Yeah, nice. OK, awesome. So we continue.

When Pattern Matching Breaks

So, next: when pattern matching fails, according to the research.

Also, I meant to ask: is anyone here actually doing any mechanistic interpretability research, or following it?

No. OK.

So we're muddling through together. I will stop at question three.

Ablations: Testing Causality in Circuits

Researchers can ablate, that is, disable, specific components to study their function. Match each intervention to its observed effect on the model.

So: ablating an induction head on an ABAB sequence. We are disabling that part of the model from doing that pattern matching. What's going to be the effect?
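A toy sketch of what the intervention means: zero out one component's contribution and compare behaviour before and after. Nothing here is a real transformer; the two "components" are made up purely to show the ablation logic.

```python
# Toy "model" with two components: an induction-style component and a
# token-frequency component. Ablating = zeroing a component's weight.

from collections import Counter

def predict(tokens, weights):
    scores = Counter()
    # Induction component: vote for whatever followed the current token before.
    current = tokens[-1]
    for i in range(len(tokens) - 1):
        if tokens[i] == current:
            scores[tokens[i + 1]] += weights["induction"]
    # Frequency component: vote for common tokens regardless of pattern.
    for t in tokens:
        scores[t] += weights["frequency"]
    return scores.most_common(1)[0][0] if scores else None

seq = ["A", "B", "A", "B", "A"]
full = predict(seq, {"induction": 1.0, "frequency": 0.1})
ablated = predict(seq, {"induction": 0.0, "frequency": 0.1})
print(full, ablated)  # intact model continues the ABAB pattern; ablated one falls back to frequency
```

With the induction weight intact the model predicts "B", continuing the pattern; with it zeroed, pattern completion fails and the frequency fallback wins, which is the kind of before/after contrast ablation studies measure.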

I'm jumping all over the place, so to clarify: this app was made by Claude Opus, and the research summary it was based on came from Gemini.

But I added the search-engine comparison as a little last-minute thought, without quite checking which Gemini was used. It's not the same one.

I'm using Gemini Enterprise Deep Research, which is quite a bit more accurate than the Gemini overview you get with Google searches.

I think the version used in that study was Google's AI Overview, which you should not trust in general, whereas Gemini Enterprise Deep Research goes away for about ten minutes and does something thorough.

Deep research means a reasoning model on top of search: it goes away for initial results, processes those results, then goes back and does follow-up searches, iterating the way a reasoning model would.
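That loop can be sketched as follows. The `search` and `reason` functions here are stand-ins I've invented, backed by canned data rather than any real API; the point is only the iterate-then-refine structure.

```python
# Sketch of a "deep research" loop: search, reason over results, generate
# follow-up queries, repeat. The search/reason functions are fake stand-ins.

FAKE_INDEX = {
    "induction heads": ["Anthropic 2022 paper on in-context learning"],
    "in-context learning": ["Induction heads drive in-context learning"],
}

def search(query):
    # Stand-in for a real web search call.
    return FAKE_INDEX.get(query, [])

def reason(query, results):
    # A real system would call a reasoning model here; we fake one follow-up.
    follow_ups = ["in-context learning"] if query == "induction heads" else []
    return results, follow_ups

def deep_research(query, max_rounds=3):
    notes, queue, seen = [], [query], set()
    for _ in range(max_rounds):
        if not queue:
            break
        q = queue.pop(0)
        if q in seen:
            continue  # don't re-search the same query
        seen.add(q)
        results, follow_ups = reason(q, search(q))
        notes.extend(results)
        queue.extend(follow_ups)  # iterate: follow-up searches next round
    return notes

print(deep_research("induction heads"))
```

The bounded number of rounds is why these tools take minutes rather than seconds: each round is a full search-plus-reasoning pass.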

Reasoning Through the Effects

I'm going to go for pattern completion. Yes, nice: if the model has been disabled from the ABAB behaviour, pattern completion fails.

Next: measuring loss at token 50 versus token 500 after ablation. The model has been disabled from its pattern-matching ability, so what's the loss? I should specify: the loss in its ability to do pattern matching. Correct, correct: if it's already doing rubbish pattern matching, then its pattern-matching ability isn't decreasing very much further.

Testing on non-repeated sequences after ablation: yep, I'll say yes, very preemptively. Model behaviour on pattern tasks after complete ablation... actually, that might be the other way around, because testing on non-repeated sequences after ablation shows no difference whether the sequence repeats or not, since the model is already failing at pattern matching.

Yes. Okay. That should be correct. Nice.

Interlude: Are We Learning Anything?

We are learning mechanistic interpretability, guys. Maybe.

If Gemini's research is remotely reliable. OK.

One Last Challenge

One more and then I'll pass that on to the next speaker.

Anatomy of Understanding: Indirect Object Identification

Anatomy of understanding. The indirect object identification circuit contains specialized components. Given the sequence "When Mary and John went to the store, John gave a drink to...", we want the model to say "Mary" here. Identify each component's function. So it's the S-inhibition heads, I think, that suppress "John", because that's the inhibitor function. Thank you; I'm very much hoping this will be a collaborative exercise.

Subject inhibitors, probably. I would love to give another version of this talk with Perplexity's version of this spun up as well.
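The division of labour being described can be mimicked in a few lines. This toy code reproduces the logic of the IOI task, not the actual attention heads, and the function and variable names are my own labels:

```python
# Toy version of the indirect-object-identification logic: duplicate-token
# detection finds the repeated name, S-inhibition suppresses it, and a
# "name mover" copies the remaining name to the output. Illustrative only.

def ioi_answer(sentence_names, final_subject):
    # Duplicate-token step: the repeated name is the subject to suppress.
    duplicated = final_subject  # e.g. "John" appears twice in the sentence
    # S-inhibition step: suppress the duplicated subject...
    candidates = [n for n in sentence_names if n != duplicated]
    # Name-mover step: ...and copy the remaining name to the output.
    return candidates[0] if candidates else None

# "When Mary and John went to the store, John gave a drink to ..."
print(ioi_answer(["Mary", "John"], "John"))  # -> Mary
```

The interesting claim in the interpretability literature is that real transformer heads divide up the work in roughly this way, which is what the game's question is probing.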

Working Through Failure Modes

If the name mover fails, then presumably that's the pattern. OK. Yeah, this is different than it was before.

Anyway, that's where I'm at with this project.

So, OK, I'm pretty sure it was S-inhibition that suppressed "John". So if name mover heads fail, you're saying the backup activates... duplicate token heads? Name mover heads. OK.

Backup Mechanisms and Name Movers

Because the duplicate must be detected. Nice. Well done, Tisa.

I've gotten this wrong enough times that I can guarantee it's not correct every single time. But I did practice this once and got it correct, at least once.

Q&A

Yes, we're opening Q&A now, because this part is done.

Defining Consciousness

You talked about the consciousness of AI and all of this, but what is the definition of consciousness for you? And do we need to give that to the system as a control measure? And how do we measure all of this?

Yes: so many, so many threads here.

Benchmarks Across Systems

So yes, any definition of consciousness in AI, I think, will have to be benchmarked against consciousness in something else, whether that's humans, various kinds of non-human animals, or the common features we find across them. There are multiple theories of consciousness that we attempt to measure through the behaviors and traits of multiple systems. So this is an early stage of the ultimate dream here.

A Game to Probe Consciousness

The ultimate dream would be a game where you role-play as a lab and have to detect whether your model is conscious, in a race dynamic, trading information with other players. But at the minute, I'm just seeing whether Claude can create a gamified version of that exercise, grounded in the science. And that's pretty amazing, I think: it came up with this in two prompts, and we're able to learn about the science in real time.

Conclusion

All done?

Thank you.
