Node.js performance monitoring requires a different approach than traditional server environments. The single-threaded event loop, garbage-collected memory model, and asynchronous I/O create unique failure modes that generic CPU and memory metrics alone cannot capture. This article covers the essential metrics and tools you need to keep Node.js applications running smoothly in production.
Why Node.js Performance Monitoring Is Different
Unlike multi-threaded servers, where a slow operation blocks only one thread, a single CPU-heavy operation in Node.js blocks the entire event loop and stalls every concurrent request. Garbage collection pauses can add unpredictable latency spikes. Common failure modes include event loop starvation, memory leaks from closures that keep references alive, callback thrashing, and unhandled promise rejections, which terminate the process by default since Node.js 15 and were only logged as warnings in earlier versions. Understanding these characteristics is the first step toward effective monitoring.
Event Loop Lag
Event loop lag measures the delay between the moment a timer callback is due to run and the moment it actually runs. It is arguably the single most important health indicator for a Node.js process, because every request handler waits on the same loop.
const { monitorEventLoopDelay } = require("perf_hooks");
const histogram = monitorEventLoopDelay();
histogram.enable();
// Report the mean lag every 10 seconds (logger is assumed to be a pino-style logger defined elsewhere):
setInterval(() => {
  const lag = histogram.mean / 1e6; // Convert nanoseconds to milliseconds
  histogram.reset();
  if (lag > 50) logger.warn({ eventLoopLag: lag }, "Event loop lag detected");
}, 10000);
| Lag | Interpretation |
|---|---|
| < 10 ms | Healthy |
| 10–50 ms | Concerning — investigate |
| > 50 ms | Critical — immediate action needed |
Common causes include synchronous CPU-heavy operations, parsing or serializing large JSON payloads, and processing large database result sets on the main thread. The query itself runs asynchronously and does not block the loop; the work done on its results does.
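One common way to reduce this kind of lag is to split a long synchronous loop into chunks and yield to the event loop between them. The sketch below illustrates the idea; `sumSquaresInChunks` and its default chunk size are hypothetical, chosen only for the example.

```javascript
// Hypothetical example: sum the squares of a large array without holding
// the event loop for the whole computation.
function sumSquaresInChunks(items, chunkSize = 10_000) {
  return new Promise((resolve) => {
    let index = 0;
    let total = 0;

    function processChunk() {
      const end = Math.min(index + chunkSize, items.length);
      for (; index < end; index++) {
        total += items[index] * items[index];
      }
      if (index < items.length) {
        // Yield so pending I/O callbacks and timers can run before the next chunk.
        setImmediate(processChunk);
      } else {
        resolve(total);
      }
    }

    processChunk();
  });
}
```

Chunking keeps the work on the main thread and only bounds how long each stall lasts. For computations that are heavy in total, moving the work to a worker thread is usually the better fit.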
Garbage Collection Metrics
V8’s garbage collector performs two kinds of collection: the fast scavenge (minor GC) over the young generation and the slower mark-sweep-compact (major GC) over the old generation. Long or frequent GC pauses directly impact p99 latency.
const { PerformanceObserver } = require("perf_hooks");
const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // entry.detail.kind is a numeric constant, so filter on entryType instead
    if (entry.entryType === "gc") {
      logger.warn({
        gcDuration: entry.duration, // milliseconds
        gcKind: entry.detail.kind, // see perf_hooks.constants.NODE_PERFORMANCE_GC_*
        gcFlags: entry.detail.flags,
      }, "GC pause detected");
    }
  }
});
obs.observe({ entryTypes: ["gc"] });
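The `kind` value on a GC entry is a number, which is hard to read in logs. A small lookup built from `perf_hooks.constants` can translate it; `gcKindName` and the label strings below are hypothetical names used only for illustration.

```javascript
const { constants } = require("node:perf_hooks");

// Hypothetical lookup: numeric GC kind -> readable label.
const GC_KIND_NAMES = new Map([
  [constants.NODE_PERFORMANCE_GC_MINOR, "minor (scavenge)"],
  [constants.NODE_PERFORMANCE_GC_MAJOR, "major (mark-sweep-compact)"],
  [constants.NODE_PERFORMANCE_GC_INCREMENTAL, "incremental marking"],
  [constants.NODE_PERFORMANCE_GC_WEAKCB, "weak callbacks"],
]);

// Returns the label for a GC kind, or a fallback for values not in the table.
function gcKindName(kind) {
  return GC_KIND_NAMES.get(kind) ?? `unknown (${kind})`;
}
```

Inside the observer callback this could be used as `gcKindName(entry.detail.kind)`, which makes it easy to see whether slow pauses come from minor or major collections.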
High allocation rates trigger frequent GC cycles, increasing CPU overhead and pause times. Mitigation strategies include object pooling for hot paths, reducing temporary object allocation in request handlers, and using Buffer.allocUnsafe for large buffers when the buffer is fully overwritten before it is read.
Memory Heap Analysis
Node.js exposes memory usage through process.memoryUsage(), which returns rss, heapTotal, heapUsed, external, and arrayBuffers. For deeper analysis, heap snapshots reveal exactly which objects are retaining memory.
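const v8 = require("v8");
const used = process.memoryUsage();
for (const [key, value] of Object.entries(used)) {
  logger.info({ metric: `memory.${key}`, value: Math.round(value / 1024 / 1024) }, "Memory usage (MB)");
}
// Trigger a snapshot when the heap nears V8's limit. heapTotal grows on demand,
// so heapUsed / heapTotal stays high even in a healthy process.
const { heap_size_limit: heapLimit } = v8.getHeapStatistics();
if (used.heapUsed / heapLimit > 0.8) {
  // writeHeapSnapshot is synchronous and blocks the event loop while it runs.
  v8.writeHeapSnapshot(`/tmp/heap-${Date.now()}.heapsnapshot`);
}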
Analyze snapshots in Chrome DevTools to identify common leak patterns: growing event listener counts, stale caches without eviction, unreleased setInterval timers, and accidental global variable accumulation.
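Before taking a snapshot, a leak often shows up as `heapUsed` that keeps rising from one sample to the next. The sketch below is a deliberately strict heuristic for spotting that trend; `isSteadilyGrowing`, `createHeapSampler`, and the window size are hypothetical, and real heaps fluctuate with GC, so the thresholds would need tuning.

```javascript
// Hypothetical helper: true when every sample is larger than the previous one
// and the total growth across the window is at least minGrowthBytes.
function isSteadilyGrowing(samples, minGrowthBytes) {
  if (samples.length < 2) return false;
  for (let i = 1; i < samples.length; i++) {
    if (samples[i] <= samples[i - 1]) return false;
  }
  return samples[samples.length - 1] - samples[0] >= minGrowthBytes;
}

// Hypothetical sampler: keeps the last `windowSize` heapUsed readings.
function createHeapSampler(windowSize = 6) {
  const samples = [];
  return {
    // Records the current heapUsed and returns a copy of the window.
    sample() {
      samples.push(process.memoryUsage().heapUsed);
      if (samples.length > windowSize) samples.shift();
      return samples.slice();
    },
  };
}
```

Calling `sample()` on a fixed interval and passing the result to `isSteadilyGrowing` gives a cheap signal for deciding when a heap snapshot is worth its cost.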
CPU Profiling
Node.js includes a built-in sampling profiler activated via the --prof flag. It writes a V8 log file, which --prof-process turns into a text summary of where CPU time is spent:
node --prof app.js
# After load testing:
node --prof-process isolate-*.log > processed.txt
Clinic.js provides a more user-friendly suite with the Flame tool for CPU profiling, Bubbleprof for async latency, and Doctor for overall health recommendations. Wide blocks at the top of a flame graph indicate hot functions; deep stacks suggest opportunity for refactoring.
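// Capture a CPU profile programmatically with the inspector module
const inspector = require("inspector");
const fs = require("fs");
const session = new inspector.Session();
session.connect();
session.post("Profiler.enable", () => {
  session.post("Profiler.start", () => {
    // ... after workload ...
    session.post("Profiler.stop", (err, result) => {
      if (err) throw err; // result is undefined on error, so check before reading it
      // Save the profile; the .cpuprofile file opens in Chrome DevTools
      fs.writeFileSync("./profile.cpuprofile", JSON.stringify(result.profile));
      session.disconnect();
    });
  });
});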
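The profile object returned by "Profiler.stop" can also be summarized in code rather than opened in a viewer. The sketch below ranks functions by self samples, assuming the standard CPU profile shape (`nodes` entries with `callFrame` and `hitCount`); `topSelfTime` is a hypothetical helper name.

```javascript
// Hypothetical helper: ranks functions by self samples (hitCount) in a
// profile object shaped like the result of "Profiler.stop".
function topSelfTime(profile, limit = 5) {
  const hits = new Map();
  for (const node of profile.nodes) {
    const { functionName, url, lineNumber } = node.callFrame;
    // callFrame.lineNumber is zero-based, so add 1 for display.
    const key = `${functionName || "(anonymous)"} ${url}:${lineNumber + 1}`;
    hits.set(key, (hits.get(key) ?? 0) + (node.hitCount ?? 0));
  }
  return [...hits.entries()]
    .map(([name, selfSamples]) => ({ name, selfSamples }))
    .sort((a, b) => b.selfSamples - a.selfSamples)
    .slice(0, limit);
}
```

The same function can appear in several nodes when it is reached through different call paths, which is why the counts are merged by name and location before sorting.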
OpenTelemetry Integration
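OpenTelemetry provides a vendor-neutral standard for collecting traces, metrics, and logs. With the auto-instrumentations package, the Node.js SDK instruments HTTP, gRPC, database clients, and messaging libraries without changes to your handlers. Start the SDK before the application loads those modules so they can be patched.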
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: "http://jaeger:4318/v1/traces" }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
For custom business logic, create manual spans:
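const { trace, SpanStatusCode } = require("@opentelemetry/api");
const tracer = trace.getTracer("payment-service");
// Wrapped in an async function because CommonJS modules have no top-level await.
async function processPayment(order) {
  return tracer.startActiveSpan("processPayment", async (span) => {
    span.setAttribute("orderId", order.id);
    try { /* payment logic */ }
    catch (err) { span.recordException(err); span.setStatus({ code: SpanStatusCode.ERROR }); throw err; }
    finally { span.end(); } // always end the span, even on failure
  });
}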
APM Tools Comparison
| Tool | Setup | Strengths | Overhead |
|---|---|---|---|
| Datadog APM | Agent-based | Deep Node.js integration, runtime metrics | Low |
| New Relic | npm install | Auto-instrumentation, browser integration | Medium |
| Elastic APM | Agent-based | Open source, ELK-native | Low |
| OpenTelemetry + SigNoz | Manual setup | Self-hosted, full OTLP support | Low |
| PM2 + Keymetrics | Built-in | Lightweight, process management | Minimal |
Practical Monitoring Setup
Start minimal: export process metrics (event loop lag, memory, garbage collection) via prom-client to Prometheus, then visualize in Grafana. As complexity grows, add OpenTelemetry for distributed tracing and integrate an APM tool for deeper insights.
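const client = require("prom-client");
const { PerformanceObserver } = require("perf_hooks");
// Note: client.collectDefaultMetrics() already registers a metric with this name; use one or the other.
const gcHistogram = new client.Histogram({
  name: "nodejs_gc_duration_seconds",
  help: "Time spent in GC",
  buckets: [0.001, 0.01, 0.1, 1],
});
// Feed each GC pause into the histogram (entry.duration is in milliseconds).
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) gcHistogram.observe(entry.duration / 1000);
}).observe({ entryTypes: ["gc"] });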
Begin monitoring iteratively: event loop lag and memory first, then CPU profiling and GC metrics when issues surface, and finally OpenTelemetry as your application scales into a distributed architecture.
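As a starting point for that first step, event loop lag and memory can be combined into one snapshot using only built-in modules. The sketch below is one possible shape; `classifyLag` and `collectHealthSnapshot` are hypothetical helpers, and the status levels reuse the thresholds from the event loop lag table.

```javascript
// Hypothetical helper: maps lag in milliseconds to the levels from the lag table.
function classifyLag(lagMs) {
  if (lagMs < 10) return "healthy";
  if (lagMs <= 50) return "concerning";
  return "critical";
}

// Hypothetical helper: builds one health snapshot from an event loop delay
// histogram (as returned by perf_hooks.monitorEventLoopDelay) and memory stats.
function collectHealthSnapshot(histogram, memory = process.memoryUsage()) {
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024);
  // The mean is not a finite number until the histogram has recorded samples.
  const lagMs = Number.isFinite(histogram.mean) ? histogram.mean / 1e6 : 0;
  histogram.reset();
  return {
    eventLoopLagMs: lagMs,
    status: classifyLag(lagMs),
    heapUsedMb: toMb(memory.heapUsed),
    rssMb: toMb(memory.rss),
  };
}
```

The snapshot can be logged on an interval or returned from a health check route. Passing the histogram and memory stats as arguments keeps the function easy to test with fixed values.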
