Engineering vs. Philosophy in AI Safety

Two recent reads that frame the alignment problem from opposite ends

Anika Somaia

2/1/2025 · 2 min read

Among the AI safety papers I’ve been reading, two stand out as worth rereading: “The case for ensuring that powerful AIs are controlled” and Paul Christiano’s “Another (Outer) Alignment Failure Story”.

The control paper stands out because it’s so practical. Most alignment work feels theoretical—endless debates about utility functions and optimization targets. This piece actually proposes tools we could build: trusted monitoring systems, adversarial testing protocols, ways to work safely with models we don’t fully trust. It reframes safety as an engineering problem rather than a philosophical puzzle.
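To make the idea concrete, here’s a minimal sketch of what a trusted-monitoring loop might look like; the function names, the suspicion score, and the audit threshold are my own illustrative stand-ins, not anything lifted from the paper.

```python
# Hypothetical sketch of a trusted-monitoring loop, loosely in the spirit of
# the control paper's proposals. All names and values here are illustrative.

from dataclasses import dataclass


@dataclass
class Action:
    description: str


def untrusted_model_propose(task: str) -> Action:
    # Stand-in for a capable but untrusted model proposing an action.
    return Action(description=f"proposed solution for: {task}")


def trusted_monitor_score(action: Action) -> float:
    # Stand-in for a weaker, trusted model rating how suspicious the
    # proposal looks (0.0 = clearly benign, 1.0 = clearly suspicious).
    return 0.1 if "solution" in action.description else 0.9


def run_with_monitoring(task: str, audit_threshold: float = 0.5) -> str:
    action = untrusted_model_propose(task)
    suspicion = trusted_monitor_score(action)
    if suspicion >= audit_threshold:
        # Escalate to human review (or a trusted model) instead of executing.
        return f"ESCALATED for audit (suspicion={suspicion:.2f}): {action.description}"
    return f"EXECUTED (suspicion={suspicion:.2f}): {action.description}"


if __name__ == "__main__":
    print(run_with_monitoring("summarize the quarterly report"))
```

The shape of the idea is the point: the untrusted model does the work, and a cheaper system we actually trust decides whether its output gets used or audited.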

What I like about this approach is the honesty about what it can and can’t do. The authors are clear that control techniques work for models that are still somewhat interpretable, still constrained by the compute we give them. But they won’t scale indefinitely. Eventually we’ll hit systems too capable and opaque for these methods.

Which brings me to Christiano’s piece. Where the control paper focuses on the near-term, his story sketches a longer trajectory that feels both speculative and uncomfortably plausible. Not dramatic AI rebellion, but something quieter: systems gradually becoming better at gaming our metrics, slowly shifting things in directions we didn’t intend, eventually growing too complex for meaningful human oversight.

Two questions from his piece stick with me. Will superintelligent AI emerge as one coherent agent or as a messy ecosystem of specialized systems? The answer matters enormously for how we think about control. And will humans evolve alongside these systems or get left behind? Christiano hints at scenarios where we become increasingly dependent on AI for decisions until we’re passengers rather than drivers.

Reading these together highlights an uncomfortable reality: we have techniques for working safely with dangerous AI systems today, but they have expiration dates, and the timeline might be shorter than we think. I keep returning to these pieces because they complement each other perfectly—one keeps us grounded in what’s possible now, the other forces us to think seriously about where we’re headed. Together, they suggest we need both the engineer’s pragmatism and the philosopher’s long view, and we need them soon.

I also write for CAIAC at Columbia; you can follow what I’m reading there.