OpenAI recently released a detailed report on the December 11 outage that took ChatGPT and Sora offline for 4 hours and 10 minutes. The incident affected many users and drew widespread attention. The report lays out the root cause of the failure, the difficulties engineers ran into once the Kubernetes control plane went down, and the steps that ultimately restored service, offering valuable lessons for other technical teams and a window into how OpenAI responds to emergencies.
Last week (December 11), OpenAI's ChatGPT and Sora services were down for 4 hours and 10 minutes, affecting many users. OpenAI has now officially published its detailed report on the ChatGPT outage.
To put it simply, the root cause was a small change that had serious consequences: engineers were locked out of the control plane at the critical moment and could not deal with the problem in time. Once the problem was discovered, OpenAI engineers quickly pursued several fixes, including scaling down cluster size, blocking network access to the Kubernetes admin APIs, and scaling up the Kubernetes API servers' resources. After several rounds of effort, engineers restored access to part of the Kubernetes control plane and shifted traffic to healthy clusters, ultimately bringing the system back to full recovery.
The incident began at 3:12 PM PST, when engineers deployed a new telemetry service to collect Kubernetes (K8s) control plane metrics. The service was unintentionally configured too broadly, causing every node in every cluster to simultaneously perform resource-intensive Kubernetes API operations. The load quickly crashed the API servers, taking down the control plane and leaving the Kubernetes data plane of most clusters unable to serve traffic.
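The report does not include the telemetry service's configuration, but a minimal Python sketch of the failure pattern it describes, assuming a per-node agent that issues cluster-wide LIST calls against the API server on a fixed schedule, could look like this (the agent and its calls are illustrative assumptions, not OpenAI's code):

```python
# Hypothetical sketch of the failure pattern described above: a per-node agent
# that issues cluster-wide LIST calls against the Kubernetes API server.
# The actual OpenAI telemetry service and its configuration are not public.
import time
from kubernetes import client, config

def collect_metrics_loop(interval_s: float = 15.0) -> None:
    config.load_incluster_config()        # uses the node agent's pod service account
    core = client.CoreV1Api()
    while True:
        # A cluster-wide, unpaginated LIST is expensive for the API server.
        # With thousands of nodes all polling on the same schedule, the
        # aggregate load can overwhelm the control plane.
        pods = core.list_pod_for_all_namespaces(watch=False)
        nodes = core.list_node(watch=False)
        print(f"scraped {len(pods.items)} pods across {len(nodes.items)} nodes")
        time.sleep(interval_s)
```

The cost of each cluster-wide LIST grows with cluster size, so a fleet of nodes polling in lockstep translates into a sudden, sustained spike in API server load.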
It is worth noting that although the Kubernetes data plane can in theory run independently of the control plane, DNS depends on the control plane, so services could no longer find and communicate with each other. With API operations overloaded, service discovery broke down and services went down across the board. Although the problem was located within 3 minutes, engineers could not access the control plane to roll back the offending service, creating a deadlock: the crashed control plane prevented them from removing the problematic service, which in turn blocked recovery.
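To make the DNS dependency concrete, here is a minimal illustration of in-cluster service discovery failing once cluster DNS (which itself watches the API server) stops serving records; the service name is made up:

```python
# Minimal illustration of the DNS dependency described above: in-cluster
# service discovery resolves names like "<service>.<namespace>.svc.cluster.local"
# through cluster DNS, which depends on the Kubernetes API server.
# "payments.default.svc.cluster.local" is a made-up service name.
import socket

def resolve_service(hostname: str = "payments.default.svc.cluster.local") -> None:
    try:
        addrs = socket.getaddrinfo(hostname, 443)
        print(f"{hostname} -> {addrs[0][4][0]}")
    except socket.gaierror as exc:
        # With the control plane down and DNS records no longer being served,
        # lookups fail and callers cannot reach otherwise-healthy pods.
        print(f"service discovery failed for {hostname}: {exc}")

resolve_service()
```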
OpenAI engineers immediately began exploring ways to restore the clusters. They scaled clusters down to reduce load on the Kubernetes API, blocked network access to the Kubernetes admin APIs so the API servers could recover, and scaled up the Kubernetes API servers' resources to better handle requests. After these efforts, engineers regained control of the Kubernetes control plane, deleted the faulty service, and gradually restored the clusters.
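The report does not say how the telemetry service was packaged; assuming it ran as a DaemonSet, the final step of removing it once API access returned might look roughly like this (the DaemonSet name and namespace are hypothetical):

```python
# Hedged sketch of the final remediation step described above: once API server
# access was restored, the offending workload could be removed. This assumes
# the telemetry agent ran as a DaemonSet; the name "telemetry-agent" and
# namespace "monitoring" are illustrative, not taken from the report.
from kubernetes import client, config

def remove_faulty_telemetry(name: str = "telemetry-agent",
                            namespace: str = "monitoring") -> None:
    config.load_kube_config()             # operator credentials outside the cluster
    apps = client.AppsV1Api()
    apps.delete_namespaced_daemon_set(name=name, namespace=namespace)
    print(f"deleted DaemonSet {namespace}/{name}; API server load should subside")

remove_faulty_telemetry()
```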
During this period, engineers also shifted traffic to recovered or newly added healthy clusters to reduce the load on the others. However, because many services tried to recover at the same time, resource limits became saturated, so the recovery required additional manual intervention and some clusters took a long time to come back. OpenAI is expected to learn from this experience and avoid being "locked out" again when similar situations arise.
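On the point about resource-limit saturation during simultaneous recovery: the report does not describe OpenAI's traffic-shifting tooling, but a conceptual sketch of a staggered ramp-up to a recovered cluster, with a hypothetical set_cluster_weight() hook standing in for a real load-balancer API, shows the general idea:

```python
# Conceptual sketch (not OpenAI's actual tooling) of the gradual traffic shift
# described above: ramping a recovered cluster's traffic weight in small steps
# so dependent services can warm up and scale out without hitting their
# resource limits all at once. set_cluster_weight() is a hypothetical stand-in
# for whatever load balancer or service mesh fronts the clusters.
import time

def set_cluster_weight(cluster: str, weight: int) -> None:
    print(f"routing {weight}% of traffic to {cluster}")  # placeholder for a real LB API call

def ramp_traffic(cluster: str, step: int = 10, pause_s: float = 60.0) -> None:
    for weight in range(step, 101, step):
        set_cluster_weight(cluster, weight)
        # In a real rollout this is where error rates and saturation metrics
        # would be checked, pausing or rolling back the shift if they regress.
        time.sleep(pause_s)

ramp_traffic("cluster-east-recovered")
```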
Report details: https://status.openai.com/incidents/ctrsv3lwd797
Highlights:
Cause of failure: A small change to a telemetry service overloaded the Kubernetes API servers, bringing services down.
Engineers' dilemma: The control plane crash locked engineers out, making it impossible to remove the faulty service in time.
⏳ Recovery process: By scaling clusters down and adding API server resources, service was finally restored.
Although this incident caused a service interruption, it also gave OpenAI valuable experience, prompting the company to improve its system architecture and emergency plans and further strengthen service stability, giving users more reliable service. I believe OpenAI will learn from this incident, keep improving its technology, and deliver a better experience to users.