A young engineer's first "delete the database and run away"

Congratulations to me: this is my first production incident, and also my first T0 incident.

To start, here is how I classify incident levels on my own server:

T0: Extremely serious incident. Irreversible loss of user data.
T1: Serious incident. Reversible loss of user data, requiring ops intervention to restore data. Or service functionality is unavailable.
T2: General incident. No loss of user data, only display anomalies or other presentation-related issues.
T3: UI anomaly. Problems caused by UI layout or other UI-related issues.
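
As a rough illustration, the four levels above can be written down as a small decision rule. This is only a sketch: the function name `classify` and its four boolean parameters are hypothetical names chosen for this example, and the ordering of the checks is one reading of the definitions above.

```python
from enum import Enum


class Level(Enum):
    T0 = "Extremely serious incident"
    T1 = "Serious incident"
    T2 = "General incident"
    T3 = "UI anomaly"


def classify(data_lost: bool, recoverable: bool,
             service_down: bool, ui_only: bool) -> Level:
    """Map incident facts to a level (hypothetical helper, not an existing tool)."""
    # Irreversible loss of user data always wins, whatever else happened.
    if data_lost and not recoverable:
        return Level.T0
    # Recoverable data loss, or unavailable functionality, is T1.
    if data_lost or service_down:
        return Level.T1
    # No data loss and purely a layout problem.
    if ui_only:
        return Level.T3
    # No data loss, but something is displayed or presented incorrectly.
    return Level.T2
```

Under this rule, an incident with lost data that cannot be recovered is classified as T0 regardless of the other flags, which matches the incident recorded below.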

Basics#

Number: 0x0001
Date: 2021-04-17
Level: T0

Process#

On the afternoon of April 17th, I logged into the production server and ran yum upgrade, because the Docker version on the server was too old and needed updating. After the command finished, the server restarted on its own without any restart prompt. I initially assumed this was caused by a kernel update. However, the service did not come back even after a long wait. When I connected over VNC to check, I found that the virtual machine could not load its disk. All content on the server was lost.

I contacted the service provider promptly and tried to obtain the damaged disk image for recovery. That was not possible, and the only suggestion I received was to reinstall.

Strange#

After reinstalling from the same system disk template supplied by the service provider, I ran yum upgrade again, but this time there were no kernel updates, in fact no updates at all. So I still do not understand the root cause of this incident.

Impact#

The only stateful service on the server is Miser, and its only user is me. Because I had no scheduled backups in place, the data could only be restored to the version I had manually backed up three months earlier.
The service's files are stored on Backblaze and were not affected by this incident.

Actions#

After this incident, I spent a long time thinking about what it means to "officially operate a product". Fortunately, I have always been the only user of Miser, even though friends have repeatedly said they would like to use it and take part in its development. My own data is not that valuable, but once a real user puts their content here and I lose it, my credibility would be badly damaged.

About Monitoring#

Until now, I had only set up Sentry on the service to monitor unhandled exceptions, specifically to collect 5XX errors in real time. In operation, however, I have run into many odd problems, such as sudden spikes in CPU usage or sudden memory pressure.

Since I am using traditional virtual machines rather than scalable cloud services, system-level monitoring such as CPU and memory should also be put on the agenda. In other words, a full Prometheus + Grafana stack should be deployed in production.
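
To make "system-level monitoring" concrete, here is a minimal sketch of the kind of host metrics Prometheus would scrape, written with only the Python standard library. It assumes a Linux host with `/proc/meminfo`. The metric names `host_load1` and `host_mem_available_bytes` and all function names are made up for this example. In a real deployment these numbers would more likely come from Prometheus's node_exporter than from a hand-written script.

```python
import os


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def render_gauges(samples: dict[str, float]) -> str:
    """Render samples as gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect() -> dict[str, float]:
    """Sample load and available memory (Linux only; names are hypothetical)."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    load1, _, _ = os.getloadavg()  # 1-minute load average
    return {
        "host_load1": load1,
        # /proc/meminfo reports kB; convert to bytes. Falls back to 0 if absent.
        "host_mem_available_bytes": float(mem.get("MemAvailable", 0) * 1024),
    }
```

Calling `render_gauges(collect())` on the server produces text that could be served over HTTP for Prometheus to scrape, with Grafana charting the stored series and alerting on CPU or memory spikes.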

About Reliability and Backup#

Oddly enough, I had not touched the production server for several months, almost half a year, because I have been using GitHub Actions to run a complete CI/CD pipeline. The reason I logged in manually this time was to deploy a backup service. Yet it was precisely this operation that caused the incident, possibly because I was out of practice.

The service provider explained clearly that my procedure had significant problems. He said that he would have taken a snapshot of the disk first and only then performed the operation. If nothing goes wrong, fine. If something does go wrong, the snapshot can be used to restore the entire disk.
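
The order of steps the provider described can be sketched as a small wrapper. This is an illustration only: `take_snapshot` and `restore_snapshot` are hypothetical callables standing in for whatever the hosting provider's console or API offers, not real functions from any library.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_snapshot(take_snapshot: Callable[[], str],
                      restore_snapshot: Callable[[str], None],
                      operation: Callable[[], T]) -> T:
    """Snapshot first, then operate; roll back if the operation fails."""
    # Step 1: never touch the host before a restore point exists.
    snapshot_id = take_snapshot()
    try:
        # Step 2: perform the risky operation (for example, a system upgrade).
        return operation()
    except Exception:
        # Step 3: on failure, restore the whole disk from the snapshot.
        restore_snapshot(snapshot_id)
        raise
```

One caveat: in this incident the failure was a machine that no longer booted, so the restore step would have had to be triggered from outside the server, for example from the provider's control panel. The sketch only captures the ordering: snapshot, operate, restore if needed.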

Indeed, I did not do a proper dry run this time. Until now I had always run Docker commands directly at the container level, and if something went wrong I simply removed the container. This time I operated on the host machine without the caution it deserves.

Therefore:

Minimize direct operations on the host machine. Once the Docker cluster is set up, most operations should be performed at the Docker level.
Raise the priority of the backup service to the highest level, because the data matters most.
Consider how reliable dry runs are, and whether a dev or UAT environment that mirrors production one-to-one is necessary.
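
For the backup item, a minimal sketch of a scheduled backup with retention could look like the following. It assumes, purely for illustration, that the service's state lives in a single file. The function name `backup_file` and its parameters are hypothetical. It would be run from a scheduler such as cron, and the resulting copies should then be shipped off the server, since a backup on the same disk would not have survived this incident.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def backup_file(source: Path, backup_dir: Path, keep: int,
                now: Optional[datetime] = None) -> Path:
    """Copy `source` into `backup_dir` under a timestamped name, keep the newest `keep`."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{source.name}.{stamp}"
    shutil.copy2(source, target)
    # Timestamped names sort chronologically, so the oldest copies come first.
    backups = sorted(backup_dir.glob(f"{source.name}.*"))
    for old in backups[:-keep]:
        old.unlink()
    return target
```

With `keep=3` and one run per day, the directory always holds the three most recent daily copies, so a restore would lose at most one day of data instead of three months.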