In large cloud infrastructure, even a slight performance degradation can waste enormous resources. To address this, Meta developed FBDetect, a system that detects extremely small performance regressions. It monitors a large number of time series and analyzes performance at the subroutine level, which reduces the false positive rate and helps Meta save a large number of servers every year. This article describes how FBDetect works, its core techniques, and its results in practice.
The cost of a small slowdown grows with the size of the fleet. At a company like Meta, a 0.05% slowdown in one application may seem trivial, but with millions of servers running at the same time, it can add up to thousands of wasted servers. Finding and fixing such tiny performance regressions promptly is therefore a major challenge for Meta.
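The arithmetic behind this claim can be sketched in a few lines. This is a simplified illustration, not Meta's capacity model: it assumes that capacity scales linearly with per-server speed, and the function name and fleet sizes are hypothetical (the article only says "millions of servers").

```python
def servers_wasted(fleet_size: int, regression_fraction: float) -> float:
    """Estimate the servers' worth of capacity lost to a fleet-wide slowdown.

    Assumption (not from the article): a slowdown of `regression_fraction`
    on every server costs the same fraction of total fleet capacity.
    """
    if fleet_size < 0 or regression_fraction < 0:
        raise ValueError("fleet_size and regression_fraction must be non-negative")
    return fleet_size * regression_fraction


# A 0.05% regression is the fraction 0.0005.
# The fleet sizes below are hypothetical examples.
for fleet in (1_000_000, 2_000_000, 4_000_000):
    print(f"{fleet:>9,} servers -> ~{servers_wasted(fleet, 0.0005):,.0f} wasted")
```

Under this linear assumption, the loss is proportional to fleet size, which is why a regression that is invisible on one machine becomes significant across millions.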
To solve this problem, Meta built FBDetect, a performance regression detection system for production environments that can catch regressions as small as 0.005%. FBDetect monitors approximately 800,000 time series covering metrics such as throughput, latency, CPU usage, and memory usage, across hundreds of services and millions of servers. By using techniques such as stack trace sampling across the entire server fleet, FBDetect is able to capture subtle performance differences at the subroutine level.
FBDetect focuses on subroutine-level performance analysis. A regression that amounts to only 0.05% at the application level is a much larger relative change in the subroutine where it occurs, so it is far easier to recognize there. Measuring at this level also significantly reduces noise, which makes tracking changes practical.
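A small worked example shows why the subroutine-level view amplifies the signal. It is a sketch under two assumptions that the article does not state: the whole regression comes from a single subroutine, and that subroutine's share of the application's cost (for example, CPU time) is known. The function name and the 1% share are hypothetical.

```python
def subroutine_relative_change(app_regression: float, subroutine_share: float) -> float:
    """Relative cost increase of one subroutine, given the application-level regression.

    app_regression:   fractional increase in total application cost (0.0005 = 0.05%)
    subroutine_share: fraction of total application cost spent in the subroutine
                      before the regression (0.01 = 1%)

    Assumes the entire regression is caused by this one subroutine.
    """
    if not 0 < subroutine_share <= 1:
        raise ValueError("subroutine_share must be in (0, 1]")
    return app_regression / subroutine_share


# A 0.05% application-level regression, caused by a subroutine that
# accounts for 1% of the application's cost (hypothetical share):
change = subroutine_relative_change(0.0005, 0.01)
print(f"subroutine cost changed by {change:.1%}")
```

The smaller the subroutine's share, the larger the relative change it shows for the same application-level regression, which is what makes the change stand out from the noise.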
FBDetect's technical core has three main parts. First, it detects regressions at the subroutine level, which reduces the variance of the performance data so that small regressions can be identified in time. Second, it samples stack traces across the entire server fleet to measure the performance of each subroutine accurately, in effect running a profiler at fleet scale. Finally, for each detected regression, it performs root cause analysis to determine whether the regression stems from a transient issue, a cost shift, or an actual code change.
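The first two ideas can be illustrated with a minimal sketch. It is not FBDetect's actual algorithm: it assumes each sample is a stack trace represented as a sequence of function names, estimates a subroutine's cost as the fraction of samples in which it appears, and compares two time windows with a standard two-proportion z-test chosen here only for illustration. All function names and sample data are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple


def subroutine_hits(samples: Iterable[Sequence[str]], name: str) -> Tuple[int, int]:
    """Count how many sampled stack traces contain the subroutine `name`.

    Returns (hits, total). hits / total estimates the share of time
    spent in `name`, including time spent in its callees.
    """
    hits = total = 0
    for stack in samples:
        total += 1
        if name in stack:
            hits += 1
    return hits, total


def two_proportion_z(hits_before: int, n_before: int,
                     hits_after: int, n_after: int) -> float:
    """z-score for the change in a subroutine's sample share between two windows.

    A large positive value suggests the subroutine became more expensive.
    This is a textbook test used for illustration, not FBDetect's detector.
    """
    if n_before <= 0 or n_after <= 0:
        raise ValueError("both windows need at least one sample")
    p_before = hits_before / n_before
    p_after = hits_after / n_after
    pooled = (hits_before + hits_after) / (n_before + n_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if std_err == 0:
        return 0.0  # subroutine is in none or all of the samples in both windows
    return (p_after - p_before) / std_err


# Hypothetical samples: "parse" appears in 10% of stacks before a change
# and in 15% of stacks after it.
before = [("main", "parse")] * 100 + [("main", "render")] * 900
after = [("main", "parse")] * 150 + [("main", "render")] * 850

z = two_proportion_z(*subroutine_hits(before, "parse"),
                     *subroutine_hits(after, "parse"))
print(f"z-score for 'parse': {z:.2f}")
```

Because the standard error shrinks as the number of samples grows, collecting samples from the whole fleet rather than from a single server is what allows very small shifts in a subroutine's share to become statistically visible.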
Proven over seven years in production, FBDetect is robust to noise and effectively filters out false regression signals. It significantly reduces the number of events developers need to investigate and improves the efficiency of Meta's infrastructure. By catching small regressions, FBDetect helps Meta avoid wasting the resources of about 4,000 servers each year.
For a company like Meta, with millions of servers, performance regression detection is especially important. FBDetect not only catches more of the tiny regressions that would otherwise go unnoticed, but also gives developers a root cause analysis for each one, which helps them fix problems promptly and keeps the whole infrastructure running efficiently.
Paper: https://tangchq74.github.io/FBDetect-SOSP24.pdf
Key points:
FBDetect can detect performance regressions as small as 0.005%.
The system monitors approximately 800,000 time series across multiple performance metrics and analyzes them accurately at fleet scale.
Over seven years in production, FBDetect has helped Meta avoid wasting the resources of about 4,000 servers a year, improving the overall efficiency of its infrastructure.
In short, FBDetect gives Meta's large-scale cloud infrastructure efficient performance regression detection and analysis, reducing resource waste and improving operating efficiency. Its approach offers other large organizations a useful model for improving resource utilization and lowering operating costs.