What to Do When Everything Is on Fire

The authors of Crisis Engineering have led teams through chaos again and again—now they’re sharing their toolkit for anyone in a tough spot.

Summary

The authors of Crisis Engineering propose that crises — defined by surprise, chaos, and hard deadlines — follow surprisingly similar patterns. They offer a repeatable toolkit for leaders, based on their core finding: crisis conditions create rare opportunities for rapid organizational change that peacetime simply doesn’t allow.


The just-published book Crisis Engineering is a field guide for leading through high-stakes meltdowns and coming out the other side with an organization transformed. From platform collapses and pandemic logistics to wildfire response, it’s filled with riveting stories. “I thoroughly enjoyed reading every bit,” says Ryan Panchadsaram. “It’s a page-turner.”

Panchadsaram met the three coauthors when they worked in the federal government with systems failing at the highest levels. Mikey Dickerson led the rescue of HealthCare.gov and went on to create the United States Digital Service. Matthew Weaver co-founded the Site Reliability Engineering discipline at Google and later established the Defense Digital Service at the Department of Defense. Marina Nitze was CTO of the U.S. Department of Veterans Affairs and a White House technology advisor. More recently, the three founded the crisis engineering firm Layer Aleph.

Their Google roots gave each of them OKR experience, stress-tested at scale. In this conversation, Weaver mentioned the value of OKRs as an organization scales quickly, with new staff figuring things out in real time.

But facing catastrophes is different. Dickerson, Weaver, and Nitze have spent their careers inside exactly these moments, and they’ve developed a repeatable way through them. We spoke to Weaver to learn more…

[This interview has been edited for length and clarity.]

Panchadsaram: Crises seem to be arriving faster and hitting harder. Is the world actually getting more chaotic, or are we just less prepared for it?

Weaver: I think the reasons are complex and interlocking. There are environmental factors — climate change, the shifting global political order. There’s also the fact that much of the West’s infrastructure — everything from plumbing and wiring up through our institutions and organizations — consists of what I’d call complex systems: They’re part machine, part people, and they’re big and brittle.

Panchadsaram: So how do you and your coauthors define a crisis?

Weaver: We have a short list of criteria, starting with fundamental surprise (the organization encounters circumstances it has no learned memory of handling); no shared story (everyone has a different idea about what’s happening); disruption of core process (Ticketmaster can’t sell tickets); high visibility (you’re on MSNBC, CNN, and Fox News all day); and a rigid deadline or time frame (a merger close date, a regulatory deadline, a launch window) that you simply can’t push back. If you have three of the five, we think you’re there.

Panchadsaram: Who did you write this book for?

Weaver: To do any of the work we call Crisis Engineering — a name we had to coin — you often need action at the leadership level. A lot of leadership is about trying to change the human parts of a complex system and discovering how difficult that is. Crisis conditions change that equation. You can make rapid, directed changes to the human side of an organization during a crisis in ways that simply aren’t possible otherwise.

Panchadsaram: You’ve been through some of the most high-stakes failures in recent memory: HealthCare.gov, the pandemic response. What do those moments teach you that you couldn’t have learned any other way?

Weaver: One thing that’s a little surprising, and you’ll see it reflected in the book, is how similar the fix is, regardless of the context. That’s part of what gave us the confidence to write all this down. The toolkit works regardless of the stakes. We’ve done roughly thirty of these engagements. I did over twenty while inside the federal government. If you just apply the tools, you can proceed through the crisis. The organization can learn to do new things.

Panchadsaram: I love that. If you’d tried to write this in year one or two, even right after HealthCare.gov, you wouldn’t have had the full playbook, the framework, or the patterns.

Weaver: And we would still have believed you could use these tools without the crisis conditions being present. We tried to convince leadership (people who own budget and space inside an organization) to do novel things without the crisis conditions, and that was unsuccessful. We didn’t write about that, except to say: don’t try these things in peacetime.

Panchadsaram: There are a couple of mentions of AI in the book — including a note about the critical difference between machine controllers and human controllers: Machines don’t fight to preserve or expand their autonomy, but humans do. Given where AI is today, with all these agents and autonomous systems, do you have any additional observations?

Weaver: Yes! Despite all the hype and excitement, and despite genuinely interesting applications, AI is still fundamentally a machine component. It has all the problems machine components have, and it magnifies some of them. One thing we’re spending a lot of time thinking about, both as authors and in our firm, is the sheer amount of complexity organizations are pouring into their machine systems right now — and in many cases, it’s being done without the historical controls an organization would normally apply. Security is one example. We’re seeing enormous complexity flow in with all the gates open: Plug it [AI] into all your production databases! Plug it [AI] into all your authentication systems! Over the past thirty years, we haven’t seen that kind of behavior.

Panchadsaram: For us at Measure What Matters, OKRs are a critical tool — they help leaders and organizations stay focused. But a crisis can scramble everything overnight. How do you hold onto clarity when everything’s on fire?

Weaver: The useful thing about everything being on fire is that it usually means at least one thing is true: there is a shared understanding that everything is on fire. A common example we use is Ticketmaster on the day Taylor Swift tickets go on sale and the system doesn’t work. That is a clarifying moment. You can get the entire organization to agree: What we should be doing right now is selling tickets, and we are not doing that.

Then you begin assembling facts one at a time. Is Ticketmaster.com up? You can find out right now, in this room. And you make sure that understanding is shared — ideally on a board in front of people, or in a shared document. You’re building a collective picture of what reality is right now. That’s clarifying. You start there.

Crisis Engineering is available now.