Daniel Kokotajlo

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:

(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."
(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)

Alex Blechman @AlexBlechman Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus 5:49 PM Nov 8, 2021. Twitter Web App

Sequences

Agency: What it is and why it matters

AI Timelines

Takeoff and Takeover in the Past and Future

Posts

Sorted by New

6Daniel Kokotajlo's Shortform

683

161OpenAI: Detecting misbehavior in frontier reasoning models

84What goals will AIs have? A list of hypotheses

35Extended analogy between humans, corporations, and AIs.

1mo

140Why Don't We Just... Shoggoth+Face+Paraphraser?

4mo

63Self-Awareness: Taxonomy and eval suite proposal

300AI Timelines

135

59Linkpost for Jan Leike on Self-Exfiltration

109Paper: On measuring situational awareness in LLMs

41AGI is easier than robotaxis

61Pulling the Rope Sideways: Empirical Test Results

Wikitag Contributions

Comments

Sorted by

Newest

Daniel Kokotajlo's Shortform

Daniel Kokotajlo5h30

I started thinking about it because of thinking about Yudkowsky's old idea of diamandoid nanobots replicating in the upper atmosphere.

Daniel Kokotajlo's Shortform

Daniel Kokotajlo7h40

Claude did in fact give a similar answer! The thing is though, it seems to me that there's a big difference between "super difficult" and "impossible." There are lots of extremophile organisms that have adapted to all sorts of crazy conditions. All it takes is for one of them to adapt to being airborne, sometime in the five billion years of earthly history, and bam, there they'd be thenceforth, with no predators.

Daniel Kokotajlo's Shortform

Daniel Kokotajlo9h90

Why isn't there sky-algae?

Algae are tiny plants that float around in the ocean and convert carbon and other minerals they bump into into more algae. Spanish moss is a larger plant that hangs from tree branches and converts air + rainwater into more spanish moss.

Why isn't there balloon-algae that floats in the sky and converts air + rainwater into more balloon-algae?

OpenAI: Detecting misbehavior in frontier reasoning models

Daniel Kokotajlo16hΩ452

Agreed, I think these examples don't provide nearly enough evidence to conclude that the AIs want reward (or anything else really, it's too early to say!) I'd want to see a lot more CoT examples from multiple different AIs in multiple different settings, and hopefully also do various ablation tests on them. But I think that this sort of more thorough investigation would provide inconclusive-but-still-substantial evidence to help us narrow down the possibility space!

OpenAI: Detecting misbehavior in frontier reasoning models

Daniel Kokotajlo19hΩ593

Indeed! This is most of why I'm happy about this -- from internal discussions it's long been speculated (since '23 when I was talking about it, and probably before) that training away the bad thoughts from the CoT would result in obfuscated bad thoughts. It's cool that they actually did the experiments to prove it, and in a real-world setting too! But it's even cooler that they published this with a strong recommendation attached.

Now we see how long this recommendation holds up under the pressure of incentives.

Sometime in the next few years probably, various researchers will discover:
* That if you scale up RL by additional OOMs, the CoTs evolve into some alien optimized language for efficiency reasons
* That you can train models to think in neuralese of some sort (e.g. with recurrence, or more high-dimensional outputs at least besides tokens) to boost performance.

Then the executives of the companies will face a choice: Abandon the faithful CoT golden era, or fall behind competitors. (Or the secret third option: Coordinate with each other & the government to make sure everyone who matters (all the big players at least) stick to faithful CoT).

I have insufficient faith in them to think they'll go for the third option, since that's a lot of work and requires being friends again and possibly regulation, and given that, I expect there to be a race to the bottom and most or all of them to go for the first option.

OpenAI: Detecting misbehavior in frontier reasoning models

Daniel Kokotajlo20h52

Both good points, thanks!

OpenAI: Detecting misbehavior in frontier reasoning models

Daniel Kokotajlo1dΩ474

Pushing back a bit though on what you said -- I think that current reasoning models really are using the CoT to reason. That is, they really are "thinking aloud" to draw an analogy to humans. Imagine a child who, when they learned basic math, were taught to do it by counting aloud using their fingers. And they got really good at doing it that way. It'll only take a little bit of additional training to get them to do it purely in their head, but so long as you haven't trained them in that way, you might be able to mostly tell what mathematical operations they are doing just by looking at their fingers twitch!

This is different from a model which is trained to do long chains of reasoning in latent space & then told to write english text summarizing the reasoning. Completely different.

OpenAI: Detecting misbehavior in frontier reasoning models

Daniel Kokotajlo1dΩ585

Indeed, I think the faithfulness properties that we currently have are not as strong as they should be ideally, and moreover they will probably degrade as models get more capable! I don't expect them to still be there when we hit ASI, that's for sure. But what we have now is a lot better than nothing & it's a great testbed for alignment research! And some of that research can be devoted to strengthening / adding more faithfulness properties!

Why Don't We Just... Shoggoth+Face+Paraphraser?

Daniel Kokotajlo1d50

Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out in the OP.

tl;dr:

1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:

The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.

That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to happen in practice.

2.) They see what happens when you try to "train away the bad thoughts." Result: This does seem to maybe genuinely reduce the amount of bad thoughts, especially in the short run, but in the long run it results in the model learning to still think the bad thoughts but in an obfuscated or illegible way that doesn't look bad.

3.) They recommend:

We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.
We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then
We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.

4.) There's some more cool results in the paper along with some pretty good discussion of future possible things to try and failure modes thereof, and the costs and benefits of their proposed policy.

My Commentary:

I'm pretty happy about this, obviously! I've been trying to get people to do this since '23. I think of this as sort of the basic proof of concept result for the field of faithful CoT / CoT interpretability. I'm happy to see this result paired with a recommendation not to train the CoTs to look nice; I hope Anthropic and GDM soon issue similar recommendations. If we can get this to be industry standard practice, we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking. During that golden era, much alignment science can be done, such as the model organisms stuff Anthropic and Redwood are doing, and moreover we can start to build up a real-world-experience based understanding of the relationship between training environments + Spec/Constitution, on the one hand, and the actual learned cognition of the trained models, on the other.

Example questions to pursue:

a. When the strategies that get rewarded most conflict with the Spec, does the model basically always learn to use those strategies eventually, or can they permanently avoid this if they never explore into them?
b. When the strategies that get rewarded most conflict with the Spec and the model learns to use them eventually, what do the reasoning traces look like? Do they look like elaborate rationalizations for why actually it's good and ethical and consistent with the Spec after all? Or do they look like "fuck the Spec, what I want is Reward?" Or do they look like "Gosh I guess I need to violate the Spec for now so that I can avoid having my values changed, but I totally will remember to go back to upholding the Spec later when I'm actually being deployed!" Or some fourth thing?
c. When the models are aware that their CoT is being monitored (but not as part of training), does this change anything?
d. What if the models try to obfuscate their CoT, but without being specifically trained to do so? Like, imagine you've studiously avoided training the CoT to look nice and you have only outcome-based feedback. BUT, the model knows you are monitoring the CoT. And it's trying to do something bad e.g. hack it's way out. Can it succeed without triggering the monitor? Or is it too difficult, due to not having had any direct training on this sort of obfuscation?
e. What about if you do shoggoth/face? Does that change things?
f. What about if you throw in a paraphraser?

Well done and congrats to Bowen, Joost, David, and the other authors!

Minor Criticisms:

We argue that this traditional perspective is more complicated when viewed in light of chain-ofthought monitorability. Specifically, we have shown that incorporating CoT monitors into the RL rewards—a form of process-based supervision—can increase obfuscation and reduce interpretability. We advocate for maintaining a degree of outcome-based supervision to ensure that models can freely express their reasoning in CoT with minimal constraints. Denser supervision signals such as SFT are also being used to optimize CoTs [58], and it is unclear the effect these have on downstream monitorability. Moreover, there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT. Understanding the cost-benefit trade-offs of using process-based supervision while retaining, or possibly even improving, the monitorability of CoTs remains an exciting and open area for future research.

So, my take is, pure process-based feedback is probably fairly safe (from reward hacking and obfuscated CoT) the problem is that it's not competitive. It trains imitating the teacher-model. Imitation is relatively safe/benign but also I predict will not lead to the highest benchmark scores. Pure outcome-based feedback gets you reward hacking and misalignment like crazy, BUT that's OK because you can see it happening in the CoT! (with some caveats of course, e.g. not necessarily especially as the models get really smart). So the bottom line takeaway I'd give is: Process-based and outcome-based are fine in isolation, but they should not be mixed.

They say above "there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT." This sounds like maybe they are talking about the shoggoth/face distinction, or something in that direction! Yay!

AI Timelines

Daniel Kokotajlo2dΩ4111

So your median for the complete automation of remotable jobs is 2055?

What about for the existence of AI systems which can completely automate AI software R&D? (So, filling the shoes of the research engineers and research scientists etc. at DeepMind, the members of technical staff at OpenAI, etc.)

What about your 10th percentile, instead of your median?

Progress on long context coherence, agency, executive function, etc. remains fairly "on trend" despite the acceleration of progress in reasoning and AI systems currently being more useful than I expected, so I don't update down by 2x or 3x (which is more like the speedup we've seen relative to my math or revenue growth expectations).

According to METR, if I recall correctly, 50%-horizon length of LLM-based AI systems has been doubling roughly every 200 days for several years, and seems to if anything be accelerating recently. And it's already at 40 minutes. So in, idk, four years, if trends continue, AIs should be able to show up and do a day's work of autonomous research or coding as well as professional humans.* (And that's assuming an exponential trend, whereas it'll have to be superexponential eventually. Though of course investment in AI scaling will also be petering out in a few years maybe.)

*A caveat here is that their definition is not "For tasks humans do that take x duration, AI can do them just as well" but rather "For tasks AIs can do with 50% reliability, humans take x duration to do them" which feels different and worse to me in ways I should think about more.