I have recently been learning about Node.js monitoring. While I don't have the energy to write even a simple monitoring system, I couldn't help digging into how these metrics are obtained. (After consulting a lot of material, I feel there is too little coverage of this topic on the Chinese-language internet; since I am also organizing my server-side Node knowledge, I am summarizing it in this article to share with you.)
Some of the metrics in this article may have problems; discussion and corrections are welcome. In fact, you could organize these data points into a monitoring library and use it in your own small and medium-sized projects. With front-end tools such as BizCharts and G2 for React, you can draw the data dashboard yourself. I think the data dimensions collected by Easy-Monitor are not as comprehensive as what we cover here.
The performance bottlenecks of a server usually lie in the following areas: CPU, memory, disk, and I/O. Let's start with CPU usage and CPU load, both of which reflect how busy a machine is to a certain extent.

CPU usage is the share of CPU resources occupied by running programs; it indicates how busy the machine is running programs at a given point in time. The higher the usage, the more the machine is doing at that moment, and vice versa; and for a given workload, usage is directly related to how powerful the CPU is. Let's first look at the relevant API and some terminology, to help us understand the code that obtains CPU usage.
os.cpus()
returns an array of objects containing information about each logical CPU core.
model: A string specifying the model of the CPU core.
speed: A number specifying the speed of the CPU core in MHz.
times: An object containing the following properties:
  user: The number of milliseconds the CPU has spent in user mode.
  nice: The number of milliseconds the CPU has spent in nice mode.
  sys: The number of milliseconds the CPU has spent in sys mode.
  idle: The number of milliseconds the CPU has spent in idle mode.
  irq: The number of milliseconds the CPU has spent in irq mode.
NOTE: The nice value is POSIX-only. On Windows operating systems, the value of nice is always 0 for all processors.
When you see the user and nice fields, you may be confused about the difference between them, as I was, so I looked carefully into what they mean. Read on.
user indicates the proportion of time the CPU runs in user mode.

Application execution is divided into user mode and kernel mode: in user mode the CPU runs the application's own code logic, usually logical or numerical computation; in kernel mode the CPU executes system calls initiated by the process, usually in response to the process's requests for resources.
A userspace program is any process that is not part of the kernel. Shells, compilers, databases, web servers, and desktop-related programs are all user-space processes. If the processor is not idle, it is normal that most of the CPU time should be spent running user-space processes.
nice represents the proportion of time the CPU runs in low-priority user mode, where low priority means the process has a nice value greater than 0.
sys represents the proportion of time the CPU runs in kernel mode.
Generally speaking, kernel-mode CPU usage should not be too high unless the application initiates a large number of system calls. If it is too high, system calls are taking a long time, for example because of frequent I/O operations.
idle indicates the proportion of time the CPU is in idle state, in which the CPU has no tasks to perform.
irq represents the proportion of time the CPU spends handling hardware interrupts.
The network card interrupt is a typical example: after the network card receives the data packet, it notifies the CPU for processing through a hardware interrupt. If the system network traffic is very heavy, a significant increase in irq usage may be observed.
As a rule of thumb, if user time is below 70%, kernel time is below 35%, and overall usage is below 70%, the CPU can be considered healthy.
The following example illustrates the use of the os.cpus() method in Node.js:
Example 1:
// Node.js program to demonstrate the os.cpus() method
const os = require('os');

// Printing os.cpus() values
console.log(os.cpus());
Output:
[
  {
    model: 'Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz',
    speed: 2712,
    times: { user: 900000, nice: 0, sys: 940265, idle: 11928546, irq: 147046 }
  },
  {
    model: 'Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz',
    speed: 2712,
    times: { user: 860875, nice: 0, sys: 507093, idle: 12400500, irq: 27062 }
  },
  {
    model: 'Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz',
    speed: 2712,
    times: { user: 1273421, nice: 0, sys: 618765, idle: 11876281, irq: 13125 }
  },
  {
    model: 'Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz',
    speed: 2712,
    times: { user: 943921, nice: 0, sys: 460109, idle: 12364453, irq: 12437 }
  }
]
The following is the code for obtaining CPU utilization:
const os = require('os');
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

class OSUtils {
  constructor() {
    this.cpuUsageMSDefault = 1000; // Default sampling period for CPU utilization
  }

  /**
   * Get the CPU utilization over a period of time
   * @param { Number } options.cpuUsageMS [sampling period, default 1000ms, i.e. 1 second]
   * @param { Boolean } options.percentage [true (return the result as a percentage string) | false]
   * @returns { Promise }
   */
  async getCPUUsage(options = {}) {
    let { cpuUsageMS, percentage } = options;
    cpuUsageMS = cpuUsageMS || this.cpuUsageMSDefault;
    const t1 = this._getCPUInfo(); // CPU information at time point t1
    await sleep(cpuUsageMS);
    const t2 = this._getCPUInfo(); // CPU information at time point t2
    const idle = t2.idle - t1.idle;
    const total = t2.total - t1.total;
    let usage = 1 - idle / total;
    if (percentage) usage = (usage * 100.0).toFixed(2) + '%';
    return usage;
  }

  /**
   * Get a snapshot of cumulative CPU time
   * @returns { Object } CPU information
   *   user <number> Milliseconds the CPU has spent in user mode.
   *   nice <number> Milliseconds the CPU has spent in nice mode.
   *   sys  <number> Milliseconds the CPU has spent in system mode.
   *   idle <number> Milliseconds the CPU has spent in idle mode.
   *   irq  <number> Milliseconds the CPU has spent in interrupt request mode.
   */
  _getCPUInfo() {
    const cpus = os.cpus();
    let user = 0, nice = 0, sys = 0, idle = 0, irq = 0;
    for (const cpu of cpus) {
      const times = cpu.times;
      user += times.user;
      nice += times.nice;
      sys += times.sys;
      idle += times.idle;
      irq += times.irq;
    }
    const total = user + nice + sys + idle + irq;
    return { user, sys, idle, total };
  }
}

new OSUtils().getCPUUsage({ percentage: true })
  .then(usage => console.log('cpuUsage:', usage)); // On my machine: 6.15%
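If you want to check these numbers against the health thresholds above (user below 70%, kernel below 35%, overall below 70%), here is a minimal sketch; the function name and sampling period are my own choices:

const os = require('os');
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Sample the cumulative CPU times twice and report the user/sys/overall proportions.
async function checkCPUHealth(ms = 1000) {
  const snapshot = () => {
    let user = 0, nice = 0, sys = 0, idle = 0, irq = 0;
    for (const cpu of os.cpus()) {
      user += cpu.times.user;
      nice += cpu.times.nice;
      sys += cpu.times.sys;
      idle += cpu.times.idle;
      irq += cpu.times.irq;
    }
    return { user, sys, idle, total: user + nice + sys + idle + irq };
  };
  const t1 = snapshot();
  await sleep(ms);
  const t2 = snapshot();
  const total = t2.total - t1.total;
  const user = (t2.user - t1.user) / total;       // user-mode share
  const sys = (t2.sys - t1.sys) / total;          // kernel-mode share
  const usage = 1 - (t2.idle - t1.idle) / total;  // overall usage
  // Rule of thumb from above: user < 70%, kernel < 35%, overall < 70%
  const healthy = user < 0.7 && sys < 0.35 && usage < 0.7;
  return { user, sys, usage, healthy };
}

checkCPUHealth().then(result => console.log(result));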
CPU load (loadavg) is easy to understand: over a given period, it is the average number of processes occupying CPU time plus processes waiting for CPU time. Processes waiting for CPU time here means processes waiting to be woken up, excluding processes in the wait state.
Before that, we need to learn a Node API:
os.loadavg()
returns an array containing the 1, 5, and 15 minute load averages.
The load average is a measure of system activity calculated by the operating system and expressed as a fractional number.
The load average is a Unix-specific concept. On Windows, the return value is always [0, 0, 0].
It describes how busy the operating system currently is, and can be loosely understood as the average number of tasks using or waiting to use the CPU per unit of time. A CPU load that is too high indicates too many processes; in Node this may show up as repeatedly spawning new processes (for example with the child_process module).
const os = require('os');

// Number of CPU threads (logical cores)
const length = os.cpus().length;

// Per-core load: the array of 1, 5, and 15 minute load averages divided by the core count
const loadavg = os.loadavg().map(load => load / length);
console.log(loadavg);
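As a usage sketch, you could poll os.loadavg() periodically and warn when the per-core 1-minute load stays above 1; the threshold and interval here are rules of thumb of my own choosing, not from any library:

const os = require('os');

const cores = os.cpus().length;
setInterval(() => {
  // Per-core load averages over the last 1, 5, and 15 minutes
  const [load1, load5, load15] = os.loadavg().map(load => load / cores);
  if (load1 > 1) {
    // On average, more runnable tasks than cores over the last minute
    console.warn(`High CPU load: 1m=${load1.toFixed(2)} 5m=${load5.toFixed(2)} 15m=${load15.toFixed(2)}`);
  }
}, 5000);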
Let's first explain an API, process.memoryUsage(); otherwise you won't be able to understand the code below for obtaining memory metrics.

This function returns an object with 4 fields; their meanings and differences are as follows:

rss (resident set size): the amount of physical memory occupied by the process, including the heap, code segment, and stack.
heapTotal: the total size of the heap that V8 has allocated.
heapUsed: the portion of the heap that is actually in use.
external: the memory used by C++ objects bound to JavaScript objects managed by V8.
Use the following code to print the memory usage of a child process. You can see that rss is roughly equal to the RES column of the top command. Also, the main process occupies only 33 MB, less than the child process, so their memory usage is clearly counted independently.
var showMem = function () {
  var mem = process.memoryUsage();
  var format = function (bytes) {
    return (bytes / 1024 / 1024).toFixed(2) + ' MB';
  };
  console.log('Process: heapTotal ' + format(mem.heapTotal) +
    ' heapUsed ' + format(mem.heapUsed) +
    ' rss ' + format(mem.rss) +
    ' external: ' + format(mem.external));
  console.log('---------------------------------------------------------------');
};
For Node, a memory leak is not easy to troubleshoot once it occurs. If monitoring shows that memory only rises and never falls, there is almost certainly a memory leak. Healthy memory usage should rise and fall: up when traffic is high, down when traffic drops.
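As a minimal illustration of that heuristic, you could sample heapUsed periodically and warn when it has only risen across a whole window of samples; the window size and interval below are arbitrary choices of mine:

// Warn if heapUsed rises monotonically across N consecutive samples.
const SAMPLES = 10;        // window size (arbitrary)
const INTERVAL_MS = 60000; // sampling interval (arbitrary)
const history = [];

setInterval(() => {
  history.push(process.memoryUsage().heapUsed);
  if (history.length > SAMPLES) history.shift();
  const alwaysRising = history.length === SAMPLES &&
    history.every((value, i) => i === 0 || value > history[i - 1]);
  if (alwaysRising) {
    console.warn(`heapUsed has risen for ${SAMPLES} consecutive samples, possible memory leak`);
  }
}, INTERVAL_MS);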
Here is the code for collecting memory usage metrics:

const os = require('os');

module.exports = {
  memory: () => {
    // Memory usage of the current Node process
    const { rss, heapUsed, heapTotal } = process.memoryUsage();
    // Free and total system memory
    const systemFree = os.freemem();
    const systemTotal = os.totalmem();
    return {
      system: 1 - systemFree / systemTotal, // System memory usage
      heap: heapUsed / heapTotal,           // Heap usage of the current Node process
      node: rss / systemTotal,              // Current Node process's share of system memory
    };
  },
};
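A quick usage sketch, assuming the module above is saved as memory.js:

const monitor = require('./memory');

// e.g. { system: 0.72, heap: 0.43, node: 0.021 } (illustrative values)
console.log(monitor.memory());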
Disk monitoring mainly means monitoring disk usage. Logs are written frequently, so disk space gets used up bit by bit; once the disk runs short, all sorts of problems follow. Set an upper limit on disk usage: once usage exceeds the warning level, the server administrator should archive the logs or clean up the disk.
The following code is based on Easy-Monitor 3.0.
const { execSync } = require('child_process');

const result = execSync('df -P', { encoding: 'utf8' });
const lines = result.split('\n');
const metric = {};
lines.forEach(line => {
  if (line.startsWith('/')) {
    const match = line.match(/(\d+)%\s+(\/.*$)/);
    if (match) {
      const rate = parseInt(match[1] || 0, 10);
      const mounted = match[2];
      if (!mounted.startsWith('/Volumes/') && !mounted.startsWith('/private/')) {
        metric[mounted] = rate;
      }
    }
  }
});
console.log(metric);
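To turn this into the warning described above, here is a simple self-contained sketch; the 90% warning level is a placeholder of mine, tune it to your needs:

const { execSync } = require('child_process');

const WARN_RATE = 90; // percent; placeholder threshold

const result = execSync('df -P', { encoding: 'utf8' });
result.split('\n').forEach(line => {
  const match = line.match(/(\d+)%\s+(\/.*$)/);
  if (line.startsWith('/') && match) {
    const rate = parseInt(match[1], 10);
    const mounted = match[2];
    if (rate > WARN_RATE) {
      console.warn(`Disk usage on ${mounted} is ${rate}%, time to archive logs or clean up`);
    }
  }
});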
I/O load mainly refers to disk I/O, which reflects how much is being read from and written to disk. For the network-service applications Node is mostly used to write, I/O load is unlikely to be too high; much of the read pressure comes from the database.
To obtain I/O metrics, we need to understand a Linux command called iostat; if it is not installed, install it first. Let's take a look at why this command reflects I/O metrics:
iostat -dx
Property description
rrqm/s: Number of merged read operations per second, i.e. rmerge/s (read requests to the device merged per second; the file system merges requests that read the same block).
wrqm/s: Number of merged write operations per second, i.e. wmerge/s (write requests to the device merged per second).
r/s: Number of reads from the I/O device completed per second, i.e. rio/s.
w/s: Number of writes to the I/O device completed per second, i.e. wio/s.
rsec/s: Number of sectors read per second, i.e. rsect/s.
wsec/s: Number of sectors written per second, i.e. wsect/s.
rkB/s: Kilobytes read per second; half of rsect/s, because each sector is 512 bytes.
wkB/s: Kilobytes written per second; half of wsect/s.
avgrq-sz: Average size (in sectors) of each device I/O operation.
avgqu-sz: Average I/O queue length.
await: Average wait time (milliseconds) for each device I/O operation.
svctm: Average service time (milliseconds) of each device I/O operation.
%util: Percentage of each second spent doing I/O, i.e. the share of CPU time consumed by I/O.
We only need to monitor %util.
If %util is close to 100%, too many I/O requests are being generated and the I/O system is running at full capacity; this disk may be a bottleneck.

If await is much larger than svctm, the I/O queue is too long and application response times are getting slower. If the response time exceeds what users can tolerate, consider a faster disk, adjusting the kernel's elevator (I/O scheduling) algorithm, optimizing the application, or upgrading the CPU.
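Here is a sketch of collecting %util from Node, assuming the Linux sysstat version of iostat; column layout differs between versions, so the code locates the %util column from the header line rather than hard-coding an index:

const { execSync } = require('child_process');

// Assumes the sysstat iostat on Linux; the header line names the columns.
function getDiskUtil() {
  const out = execSync('iostat -dx', { encoding: 'utf8' });
  const lines = out.split('\n');
  const headerIdx = lines.findIndex(l => l.includes('%util'));
  if (headerIdx === -1) return {};
  const cols = lines[headerIdx].trim().split(/\s+/);
  const utilIdx = cols.indexOf('%util');
  const metric = {};
  for (const line of lines.slice(headerIdx + 1)) {
    const fields = line.trim().split(/\s+/);
    if (fields.length === cols.length) {
      // First column is the device name, e.g. sda
      metric[fields[0]] = parseFloat(fields[utilIdx]);
    }
  }
  return metric;
}

console.log(getDiskUtil()); // e.g. { sda: 0.52 }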
Monitoring the page response time of Node.js: the solution below is taken from a blog post by Liao Xuefeng.
Recently I wanted to monitor the performance of Node.js. Recording and analyzing logs is too much trouble, and the simplest way is to record the processing time of each HTTP request and return it directly in an HTTP response header.
Recording the time of an HTTP request is very simple: record one timestamp when the request is received, and another when the response is sent; the difference between the two is the processing time.
However, the code that calls res.send() is spread across various JS files, so we can't go and change every URL handler.

The right idea is to use middleware. But Node.js doesn't provide any method to intercept res.send(), so how do we get around that?

In fact, if we change our thinking slightly and set aside the traditional OOP approach, we can view res.send() as a function object: first save the original handler function res.send, then replace res.send with our own handler:
app.use(function (req, res, next) {
  // Record the start time:
  var exec_start_at = Date.now();
  // Save the original handler function:
  var _send = res.send;
  // Bind our own handler function:
  res.send = function () {
    // Send the header:
    res.set('X-Execution-Time', String(Date.now() - exec_start_at));
    // Call the original handler function:
    return _send.apply(res, arguments);
  };
  next();
});
In just a few lines of code, the timing header is done.
There is no need to process the res.render() method because res.render() internally calls res.send().
When calling apply(), it is important to pass in the res object; otherwise this inside the original handler points to undefined, which directly leads to an error.
The measured home page response time is 9 milliseconds.
Glossary:
QPS: Queries Per Second, the number of queries a server can respond to per second; it measures how much traffic a specific query server handles within a given period of time.
On the Internet, the performance of a machine that serves as a domain name system server is often measured by query rate per second.
TPS: short for Transactions Per Second, the number of transactions per second; a unit of measurement for software test results. A transaction is the process of a client sending a request to the server and the server responding. The client starts timing when it sends the request and stops when it receives the server's response, which gives the elapsed time and the count of completed transactions.
QPS vs TPS: QPS is basically similar to TPS, with the difference that a single visit to a page forms one TPS, while that one page request may trigger several requests to the server, each of which counts toward QPS. For example, if visiting a page makes two requests to the server, one visit produces one "T" and two "Q"s.
Response time: the total time it takes from initiating a request to receiving the response data, i.e. the time from the client sending the request to receiving the server's result.
Response time RT (Response-time) is one of the most important indicators of a system. Its numerical value directly reflects the speed of the system.
Concurrency refers to the number of requests the system can handle at the same time, which also reflects the system's load capacity.
The throughput (load-bearing capacity) of a system is closely tied to the CPU cost of each request, external interfaces, I/O, and so on. The higher the CPU consumption of a single request and the slower the external interfaces and I/O, the lower the system's throughput, and vice versa.
Several important parameters of system throughput: QPS (TPS), number of concurrencies, and response time.
QPS (TPS): (Query Per Second) Number of requests/transactions per second
Concurrency: Number of requests/transactions processed by the system at the same time
Response time: generally, the average response time is used.

Once we understand the meaning of these three elements, we can derive the relationship between them: QPS (TPS) = concurrency / average response time.
Let's understand the above concepts through an example. According to the 80/20 rule, if 80% of daily visits are concentrated in 20% of the time, this 20% of time is called the peak time.
1. A single machine serves 3,000,000 PV per day. What QPS does this machine require?
(3000000 * 0.8) / (86400 * 0.2) = 139 (QPS)
2. If one machine supports 58 QPS, how many machines are needed to handle a peak of 139 QPS?

139 / 58 ≈ 3
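The two calculations above are easy to wrap in small helpers; the function names and defaults are mine (0.8 and 0.2 encode the 80/20 rule, 86400 is the number of seconds in a day):

// Peak QPS implied by daily PV under the 80/20 rule
function peakQPS(dailyPV, trafficShare = 0.8, timeShare = 0.2) {
  return (dailyPV * trafficShare) / (86400 * timeShare);
}

// Machines needed to sustain that peak, rounded up
function machinesNeeded(requiredQPS, perMachineQPS) {
  return Math.ceil(requiredQPS / perMachineQPS);
}

console.log(Math.round(peakQPS(3000000))); // 139
console.log(machinesNeeded(139, 58));      // 3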
At this point, if you do the front-end architecture for a typical small or medium project and deploy your own Node services, you know how to estimate how many machines the cluster needs when reporting in a PPT; with the PV you can calculate a rough figure.
We also need to understand stress testing (we rely on a stress test to obtain the QPS). Take the ab command as an example:
command format:
ab [options] [http://]hostname[:port]/path
Common parameters are as follows:
-n requests      Total number of requests
-c concurrency   Number of concurrent requests
-t timelimit     Maximum number of seconds for the test; can be regarded as the request timeout
-p postfile      File containing data to POST
-T content-type  Content-Type header to use for the POST data
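For example, to fire 1000 requests at a local Node service with a concurrency of 100 (the URL is a placeholder for your own service):

ab -n 1000 -c 100 http://127.0.0.1:3000/

The "Requests per second" line in ab's report is the measured QPS.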