When I first read about AI alignment[1] in the early 2010s, it was a niche concern of weirdo Internet communities. AI was advancing – DeepMind’s AlphaGo beat a professional Go player in 2015 – but the progress felt more like a novelty than a real technological disruption. When I took a class on AI in the fall of 2018(!), the thesis was that AI was hard and progress slow. To fret about AI superintelligence was a bit like worrying about mining rights on asteroids. In the last ~9 months, however, that’s all changed. This generation of LLMs increasingly seems to represent a real step-change in AI capabilities, and AI concerns have broken out into the mainstream.
Though AI alignment is only just entering the public consciousness, concern about intelligences with goals orthogonal to our own has been a part of human civilization for millennia. Foundational stories like Homer’s epics are obsessed with human vulnerability in a world dominated by capricious gods who are more powerful than humans and whose interests do not always align with ours. In the wake of the space age, we began telling stories about aliens far more capable than us visiting Earth and wreaking havoc. Jewish folklore gives us the Golem: a powerful creature created by humans to fulfill a particular goal, which then slips out of our control. Genies are an interesting case as well – they, like a computer program, will do exactly what you say, but sometimes we only realize after the fact that what we said we wanted was not what we actually wanted or what was best for us. In cases like the Golem or the genie, the problem is intelligences that imperfectly model our goals; with the Homeric gods and alien invasions, the problem is other intelligences whose goals do not align with our own. Both can be dangerous. Perhaps because of our early competition with other hominids, we have long understood the risks of a world with other unaligned intelligences.
On the flipside, successfully aligning other intelligences has been a key strategy of civilizational progress. We turned wolves into dogs and made friendly allies out of fierce enemies. We even aligned each other to our mutual benefit: we created legal systems to escape the anarchy of everyone pursuing their own ends no matter the cost to others, and we did the same thing among nations with global treaties and organizations. During the Cold War, we succeeded in (sufficiently) aligning the world’s superpowers to avoid nuclear annihilation.
Though the problem of aligning intelligences is ancient, AI presents an interesting case because (1) it has the potential to be more powerful than any other intelligence we’ve encountered, and (2) we have a unique ability to direct its operations. Wolves never posed an existential threat to humanity, and the Greeks couldn’t reprogram the gods (or the natural forces the gods represented). The stakes are high, but the problem is tractable.
When I say “the stakes are high”, it’s important to clarify that the stakes are high even if you don’t assume we’re at risk of human extinction. All you have to believe is that AI systems, because of their immense potential, will be increasingly connected to every system we touch. You don’t want your radiology AI giving false negatives because it thinks humans don’t like hearing bad news, or your cleaning robot smashing up your china for the fun of it. Well-aligned AIs offer tremendous benefit; poorly aligned AIs will lead to frustration and stagnation, if not catastrophe.
In thinking about alignment, I see two big buckets of concerns. One is represented by aliens and the Homeric gods: we don’t want intelligent systems pursuing destructive goals. Lots of attention has been paid to this set of concerns. When ChatGPT was released, people raced to jailbreak it and show how it could be used for malicious ends. Reinforcement Learning from Human Feedback (RLHF) seems to be the best tool we have at the moment for limiting AIs’ ability to pursue bad ends. The second bucket is a bit subtler and is represented by the Golem and the genie: we don’t want “well-intentioned” intelligences nominally pursuing our goals but in ways that are harmful. The famous paperclip maximizer thought experiment gets at the problem well. Our technology here seems even less mature. I think what you’d want is a way to see into the AI’s “mind” to understand how it interprets your goals, and the ability to course-correct as needed.
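To make the RLHF piece concrete, here’s a minimal sketch of its core preference-learning step. Everything in it is a toy stand-in: random vectors play the role of language-model response embeddings, and the tiny reward model simply learns to score the human-preferred response above the rejected one.

```python
# A toy version of RLHF's reward-modeling step: learn to score the
# human-preferred response above the rejected one (Bradley-Terry loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM = 16  # hypothetical embedding size for this sketch

reward_model = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Fake preference data: each row pairs a preferred and a rejected response.
preferred = torch.randn(64, DIM)
rejected = torch.randn(64, DIM)

for step in range(100):
    # Loss is small when the preferred response out-scores the rejected one.
    loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# In full RLHF, this learned reward would then steer fine-tuning of the
# language model itself (e.g., via PPO), nudging it away from bad ends.
```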
This can all sound somewhat fantastical, and I’ve certainly fed that with the literary comparisons. Fundamentally, though, the worry is that we’re going to increasingly give control over real-world capabilities to what are effectively random number generators. Instead of letting a random number generator play Russian roulette with our energy grid or weapons systems, we should figure out ways to constrain the set of possible outcomes such that even if we can’t perfectly guess or specify the random number generators’ results, we can trust that they won’t be harmful to us.
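One crude way to picture “constraining the set of possible outcomes” is an allowlist guard sitting between the generator and the real world. This is only a toy sketch – the actions and function names are hypothetical – but it shows the shape of the idea: the guard, not the unpredictable generator, decides what actually happens.

```python
# A toy guard around an unpredictable "policy": only vetted actions ever
# reach the real world; anything else falls back to asking a human.
import random

ALLOWED_ACTIONS = {"report_status", "reduce_load", "request_human_review"}

def untrusted_policy() -> str:
    # Stand-in for an AI system whose outputs we can't fully predict.
    return random.choice(["report_status", "reduce_load", "shut_down_grid"])

def constrained_step() -> str:
    proposal = untrusted_policy()
    return proposal if proposal in ALLOWED_ACTIONS else "request_human_review"

print([constrained_step() for _ in range(5)])
```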
I’m excited to see new labs and startups develop these alignment capabilities and make our increasingly promising AI safe and useful for the world’s benefit. Part of my excitement obviously stems from the amazing opportunity to advance human wellbeing, but I’m also excited by the humanistic implications. Our understanding of language, the mind, math, and so much more is intertwined with the puzzles of AI. Might we, ironically, be on the cusp of a boom time for philosophy and art?
We’re standing at the dawn of a new technological age. If you’re working on these topics, I’d love to talk.
[1] As AI alignment is making its way into the mainstream, I’ve heard people say that the term is confusing. I’m going to stick with it here, but “AI reliability” might be clearer?