When troubleshooting performance issues on your cluster, it can be helpful to understand the difference between high CPU usage and high memory usage (or Out of Memory/OOM) events. These two symptoms often come from different causes, and identifying the right one helps us apply the correct fix faster.
High CPU: Commonly Caused by Slow or Inefficient Queries
CPU spikes shown in the ScaleGrid Monitoring tool are typically triggered by slow, inefficient, or unindexed queries, regardless of whether you’re using MongoDB, PostgreSQL, MySQL, or Redis. This isn’t always a sign that your server is undersized; it usually means the database is working harder than it should to process expensive queries.
Common Causes Across Engines:
- Missing or suboptimal indexes
- Large joins or aggregations
- Too many concurrent or long-running queries
- Full-table scans
- Poorly designed queries (e.g., selecting all fields, complex filters, etc.)
What this means: High CPU is usually an application/query layer concern, not always a hardware scaling issue.
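One quick way to confirm that an expensive query (rather than undersized hardware) is behind a CPU spike is to look at its query plan. The sketch below is a hypothetical illustration, assuming PostgreSQL, the psycopg2 driver, and an example `orders` table with a `customer_id` column; adapt the connection details and query to your own environment (MongoDB and MySQL have equivalent explain facilities).

```python
# Hypothetical sketch: check whether a suspect query falls back to a full-table scan.
# Assumes PostgreSQL, psycopg2, and an example "orders" table with a "customer_id" column.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=app password=secret host=localhost")
suspect_query = "SELECT * FROM orders WHERE customer_id = 42"

with conn.cursor() as cur:
    # EXPLAIN shows the planner's chosen access path without executing the query.
    cur.execute("EXPLAIN " + suspect_query)
    plan = "\n".join(row[0] for row in cur.fetchall())
conn.close()

print(plan)
if "Seq Scan" in plan:
    # A sequential scan on a large table is a classic source of sustained CPU usage.
    print("Full-table scan detected; an index like the following may help:")
    print("CREATE INDEX idx_orders_customer_id ON orders (customer_id);")
```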
Running a Slow Query Analysis in ScaleGrid
To pinpoint which queries are responsible for the CPU spike, ScaleGrid provides a cross-database query analysis tool. This works for all supported engines that provide query-level metrics (PostgreSQL, MySQL, MongoDB, etc.).
Step-by-Step:

Step 1: Identify CPU Spike Periods
- Review your cluster’s CPU graph.
- Pick a time interval with high CPU usage (30–60 minutes each), as outlined in the steps below.
- Open the Support Console and click "View" under the Monitoring column for the cluster you're troubleshooting.
- Change the time window to a custom range depending on when the issue occurred.
- Set the time zone to UTC from the drop-down.
- Look at the CPU (%) chart and click the expand icon in the top-right corner of that chart.
- Find the time ranges where the CPU hits close to or exactly 100%.
- Pick a time interval, 30–60 minutes long, and note down its UTC start and end time (a conversion sketch follows this step).
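If your monitoring view or application logs are in a local time zone, convert the spike window to UTC before using it in the next step. Below is a minimal, hypothetical conversion sketch; the time zone and timestamps are example values only.

```python
# Hypothetical sketch: convert a local CPU-spike window to the UTC start/end
# times noted in Step 1. The time zone and timestamps are example values.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

local_tz = ZoneInfo("America/New_York")
start_local = datetime(2024, 5, 14, 9, 30, tzinfo=local_tz)
end_local = datetime(2024, 5, 14, 10, 30, tzinfo=local_tz)

start_utc = start_local.astimezone(timezone.utc)
end_utc = end_local.astimezone(timezone.utc)
print("UTC window:", start_utc.strftime("%Y-%m-%d %H:%M"), "to", end_utc.strftime("%Y-%m-%d %H:%M"))
```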
Step 2: Run the EXEC_QUERY_ANALYSIS Action
- Go to Support → Server Actions in the ScaleGrid Console.
- Select EXEC_QUERY_ANALYSIS.
- Provide the UTC time interval you noted earlier.
- Submit the action and note down the Action ID returned.
Step 3: Download the Results
- Run GET_QUERY_ANALYSIS with the same Action ID.
- Download the resulting .txt report.
Step 4: Analyze in Excel
- Open the report in Excel or any spreadsheet tool.
- Create a new column: IMPACT = Query Count × Duration (ms).
- Sort this column from highest to lowest to see which queries had the biggest impact (a scripted version of this calculation is sketched after this step).
- Optional: multiply nScanned × Count (if applicable for your DB engine) to spot frequently run, high-scan queries.
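The IMPACT calculation above can also be scripted instead of done in a spreadsheet. The sketch below is hypothetical: it assumes the GET_QUERY_ANALYSIS report loads as a tab-delimited file with columns named `Query`, `Count`, and `Duration (ms)`; check the actual delimiter and column names in your report before running it.

```python
# Hypothetical sketch of the IMPACT calculation from Step 4.
# Assumes the GET_QUERY_ANALYSIS report loads as a tab-delimited file with
# columns named "Query", "Count", and "Duration (ms)"; verify the real column
# names and delimiter in your report first.
import pandas as pd

report = pd.read_csv("query_analysis.txt", sep="\t")

# IMPACT = Query Count x Duration (ms): total time a query pattern consumed.
report["IMPACT"] = report["Count"] * report["Duration (ms)"]

# Highest-impact queries first; these are the best tuning candidates.
top = report.sort_values("IMPACT", ascending=False).head(10)
print(top[["Query", "Count", "Duration (ms)", "IMPACT"]].to_string(index=False))
```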
Step 5: What to Look For
- Queries without indexes
- A high number of rows returned (nReturned or equivalent)
- Queries running frequently or for long durations
Step 6: Next Step
- Share these findings with the customer to add indexes, rewrite problematic queries, or optimize access patterns.
High Memory or OOM: Usually Caused by Resource Contention
High memory usage or an OOM crash usually means your database process or system is handling too much data or too many tasks at once. This applies across all database types, though memory usage patterns differ between engines.
Common Causes:
- Too many client connections
- Large result sets or data held in memory
- High parallel query execution
- Caching, sorting, or buffering too much data
- Workload spikes on the primary node
What this means: These issues typically indicate resource contention, and often require workload balancing or hardware scaling rather than query tuning.
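As one example, connection pressure is easy to confirm before deciding between workload balancing and scaling. The sketch below is hypothetical and PostgreSQL-specific (psycopg2 and a role that can read pg_stat_activity); MySQL exposes the equivalent via SHOW STATUS LIKE 'Threads_connected', and MongoDB via db.serverStatus().connections.

```python
# Hypothetical sketch: check how close a PostgreSQL node is to its connection limit.
# Assumes psycopg2 and a role allowed to read pg_stat_activity.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=monitor password=secret host=localhost")
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
    cur.execute("SHOW max_connections;")
    limit = int(cur.fetchone()[0])
conn.close()

print(f"{in_use} of {limit} connections in use ({in_use / limit:.0%})")
if in_use / limit > 0.8:
    # Sustained high connection counts inflate per-connection memory usage.
    print("Connection usage is high; consider pooling or spreading reads to replicas.")
```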
Where to Check the Logs
For High CPU:
Review your database logs to find slow queries, lock contention, or performance warnings:
- MongoDB: /var/log/mongodb/mongod.log
- PostgreSQL: /var/log/postgresql/postgresql.log
- MySQL: /var/log/mysql/mysqld.log
- Redis: /var/log/redis/redis.log
Look for entries showing:
- Long-running queries
- Lock waits
- Full collection/table scans
- Missing index hints
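A quick way to surface these entries without reading the whole log by hand is sketched below. It is a hypothetical example for PostgreSQL: it assumes log_min_duration_statement is enabled so slow statements are logged with a "duration: ... ms" prefix; adjust the path and pattern for other engines (MongoDB, for instance, logs "Slow query" entries in mongod.log).

```python
# Hypothetical sketch: pull the slowest statements out of a PostgreSQL log.
# Assumes log_min_duration_statement is enabled, so slow queries appear as
# lines containing "duration: <ms> ms  statement: ...".
import re

LOG_PATH = "/var/log/postgresql/postgresql.log"  # path from the list above
pattern = re.compile(r"duration: ([\d.]+) ms\s+(?:statement|execute .*?): (.*)")

slow = []
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            slow.append((float(match.group(1)), match.group(2).strip()))

# Show the ten slowest statements first.
for duration_ms, statement in sorted(slow, reverse=True)[:10]:
    print(f"{duration_ms:>10.1f} ms  {statement[:120]}")
```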
For High Memory or OOM:
Check the system logs to confirm if the process was killed by the system due to memory limits:
- Linux system log: /var/log/messages
Look for:
- "Out of memory" messages
- oom-killer events
- Any lines indicating which database process was terminated
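A short sketch for scanning the system log is below. It assumes a RHEL-style /var/log/messages (Debian/Ubuntu systems typically log to /var/log/syslog) and that you have read access to the file.

```python
# Hypothetical sketch: confirm whether the kernel's OOM killer terminated the database.
# Assumes a RHEL-style system log at /var/log/messages; Debian/Ubuntu typically
# use /var/log/syslog instead. Reading the file may require elevated privileges.
OOM_MARKERS = ("Out of memory", "oom-killer", "Killed process")

with open("/var/log/messages", errors="replace") as syslog:
    for line in syslog:
        if any(marker in line for marker in OOM_MARKERS):
            print(line.rstrip())
```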
What I’d Recommend
If You’re Facing High CPU:
- Run the Query Analysis tool in ScaleGrid.
- Identify and fix problematic queries or missing indexes.
- Offload reporting/analytics queries to read-only replicas if applicable.
If You’re Facing High Memory or OOM:
- Consider scaling up to a larger instance.
- Split the load across Read-Write and Read-Only replicas to reduce pressure on the primary.
Summary & Final Thoughts
| Symptom | Common Cause | Suggested Actions |
|---|---|---|
| High CPU | Expensive queries, lack of indexing | Run query analysis, add indexes, tune application queries |
| High Memory / OOM | Heavy concurrent usage or caching | Scale up, balance load across nodes, tune memory settings |
Before making changes like scaling or restarts, we always recommend looking at monitoring, logs, and analysis tools — like ScaleGrid’s query analysis — so the root cause is addressed directly.