3 links tagged with all of: resilience + incident-response
Links
Cloudflare experienced significant network failures in November and December 2025, prompting the company to launch a "Code Orange: Fail Small" initiative. The plan aims to improve network resilience through controlled rollouts of configuration changes, better failure handling, and streamlined emergency response processes.
The author reflects on their experience during the recent Cloudflare outage, highlighting how system limits and complex failures can lead to unexpected problems. They emphasize the importance of understanding the context behind decisions made during incidents and the value of detailed incident writeups for learning.
Uptime Labs emphasizes the importance of treating non-production incidents seriously to foster a blameless culture and enhance organizational resilience. By analyzing a recent incident where a developer accidentally deleted a resource, the team identified opportunities for systemic improvements while promoting psychological safety and open communication.