When troubleshooting performance issues on your cluster, it can be helpful to understand the difference between high CPU usage and high memory usage (or Out of Memory/OOM) events. These two symptoms often come from different causes, and identifying the right one helps us apply the correct fix faster.
High CPU: Commonly Caused by Slow or Inefficient Queries
CPU spikes shown in the ScaleGrid Monitoring tool are typically triggered by slow, inefficient, or unindexed queries, regardless of whether you’re using MongoDB, PostgreSQL, MySQL, or Redis. This isn’t always a sign that your server is undersized; it usually means the database is working harder than it should to process expensive queries.
Common Causes Across Engines:
- Missing or suboptimal indexes
- Large joins or aggregations
- Too many concurrent or long-running queries
- Full-table scans
- Poorly designed queries (e.g., selecting all fields, complex filters, etc.)
What this means: High CPU is usually an application/query layer concern, not always a hardware scaling issue.
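One quick way to confirm that an expensive query (rather than undersized hardware) is behind a CPU spike is to look at its query plan. The sketch below is a hypothetical illustration, assuming PostgreSQL, the psycopg2 driver, and an example `orders` table with a `customer_id` column; adapt the connection details and query to your own environment (MongoDB and MySQL have equivalent explain facilities).

```python
# Hypothetical sketch: check whether a suspect query falls back to a full-table scan.
# Assumes PostgreSQL, psycopg2, and an example "orders" table with a "customer_id" column.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=app password=secret host=localhost")
suspect_query = "SELECT * FROM orders WHERE customer_id = 42"

with conn.cursor() as cur:
    # EXPLAIN shows the planner's chosen access path without executing the query.
    cur.execute("EXPLAIN " + suspect_query)
    plan = "\n".join(row[0] for row in cur.fetchall())
conn.close()

print(plan)
if "Seq Scan" in plan:
    # A sequential scan on a large table is a classic source of sustained CPU usage.
    print("Full-table scan detected; an index like the following may help:")
    print("CREATE INDEX idx_orders_customer_id ON orders (customer_id);")
```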
Running a Slow Query Analysis in ScaleGrid
To pinpoint which queries are responsible for the CPU spike, ScaleGrid provides a cross-database query analysis tool. This works for all supported engines that provide query-level metrics (PostgreSQL, MySQL, MongoDB, etc.).
Step-by-Step:

Step 1: Identify CPU Spike Periods
- Review your cluster’s CPU graph.
- Pick a time interval with high CPU usage (30–60 minutes each), as outlined in the steps below.
- Open the Support Console and click "View" under the Monitoring column for the cluster you're troubleshooting.
- Change the time window to a custom range depending on when the issue occurred.
- Set the time zone to UTC from the drop-down.
- Look at the CPU (%) chart and click the expand icon in the top-right corner of that chart.
- Find the time ranges where the CPU hits close to or exactly 100%.
- Pick a time interval, 30–60 minutes long, and note down its UTC start and end time (a conversion sketch follows this step).
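If your monitoring view or application logs are in a local time zone, convert the spike window to UTC before using it in the next step. Below is a minimal, hypothetical conversion sketch; the time zone and timestamps are example values only.

```python
# Hypothetical sketch: convert a local CPU-spike window to the UTC start/end
# times noted in Step 1. The time zone and timestamps are example values.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

local_tz = ZoneInfo("America/New_York")
start_local = datetime(2024, 5, 14, 9, 30, tzinfo=local_tz)
end_local = datetime(2024, 5, 14, 10, 30, tzinfo=local_tz)

start_utc = start_local.astimezone(timezone.utc)
end_utc = end_local.astimezone(timezone.utc)
print("UTC window:", start_utc.strftime("%Y-%m-%d %H:%M"), "to", end_utc.strftime("%Y-%m-%d %H:%M"))
```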
Step 2: Run the EXEC_QUERY_ANALYSIS Action
- Go to Support → Server Actions in the ScaleGrid Console.
- Select EXEC_QUERY_ANALYSIS.
- Provide the UTC time interval you noted earlier.
- Submit the action and note down the Action ID returned.
Step 3: Download the Results
- Run GET_QUERY_ANALYSIS with the same Action ID.
- Download the resulting .txt report.
Step 4: Analyze in Excel
- Open the report in Excel or any spreadsheet tool.
- Create a new column: IMPACT = Query Count × Duration (ms).
- Sort this column from highest to lowest to see which queries had the biggest impact (a scripted version of this calculation is sketched after this step).
- Optional: multiply nScanned × Count (if applicable for your DB engine) to spot frequently run, high-scan queries.
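The IMPACT calculation above can also be scripted instead of done in a spreadsheet. The sketch below is hypothetical: it assumes the GET_QUERY_ANALYSIS report loads as a tab-delimited file with columns named `Query`, `Count`, and `Duration (ms)`; check the actual delimiter and column names in your report before running it.

```python
# Hypothetical sketch of the IMPACT calculation from Step 4.
# Assumes the GET_QUERY_ANALYSIS report loads as a tab-delimited file with
# columns named "Query", "Count", and "Duration (ms)"; verify the real column
# names and delimiter in your report first.
import pandas as pd

report = pd.read_csv("query_analysis.txt", sep="\t")

# IMPACT = Query Count x Duration (ms): total time a query pattern consumed.
report["IMPACT"] = report["Count"] * report["Duration (ms)"]

# Highest-impact queries first; these are the best tuning candidates.
top = report.sort_values("IMPACT", ascending=False).head(10)
print(top[["Query", "Count", "Duration (ms)", "IMPACT"]].to_string(index=False))
```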
Step 5: What to Look For
- Queries without indexes
- A high number of rows returned (nReturned or equivalent)
- Queries running frequently or for long durations
Step 6: Next Step
- Share these findings with the customer to add indexes, rewrite problematic queries, or optimize access patterns.
High Memory or OOM: Usually Caused by Resource Contention
High memory usage or an OOM crash usually means your database process or system is handling too much data or too many tasks at once. This applies across all database types, though memory usage patterns differ between engines.
Common Causes:
- Too many client connections
- Large result sets or data held in memory
- High parallel query execution
- Caching, sorting, or buffering too much data
- Workload spikes on the primary node
What this means: These issues typically indicate resource contention, and often require workload balancing or hardware scaling rather than query tuning.
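As one example, connection pressure is easy to confirm before deciding between workload balancing and scaling. The sketch below is hypothetical and PostgreSQL-specific (psycopg2 and a role that can read pg_stat_activity); MySQL exposes the equivalent via SHOW STATUS LIKE 'Threads_connected', and MongoDB via db.serverStatus().connections.

```python
# Hypothetical sketch: check how close a PostgreSQL node is to its connection limit.
# Assumes psycopg2 and a role allowed to read pg_stat_activity.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=monitor password=secret host=localhost")
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
    cur.execute("SHOW max_connections;")
    limit = int(cur.fetchone()[0])
conn.close()

print(f"{in_use} of {limit} connections in use ({in_use / limit:.0%})")
if in_use / limit > 0.8:
    # Sustained high connection counts inflate per-connection memory usage.
    print("Connection usage is high; consider pooling or spreading reads to replicas.")
```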
Where to Check the Logs
For High CPU:
Review your database logs to find slow queries, lock contention, or performance warnings:
- MongoDB: /var/log/mongodb/mongod.log
- PostgreSQL: /var/log/postgresql/postgresql.log
- MySQL: /var/log/mysql/mysqld.log
- Redis: /var/log/redis/redis.log
Look for entries showing:
- Long-running queries
- Lock waits
- Full collection/table scans
- Missing index hints
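A quick way to surface these entries without reading the whole log by hand is sketched below. It is a hypothetical example for PostgreSQL: it assumes log_min_duration_statement is enabled so slow statements are logged with a "duration: ... ms" prefix; adjust the path and pattern for other engines (MongoDB, for instance, logs "Slow query" entries in mongod.log).

```python
# Hypothetical sketch: pull the slowest statements out of a PostgreSQL log.
# Assumes log_min_duration_statement is enabled, so slow queries appear as
# lines containing "duration: <ms> ms  statement: ...".
import re

LOG_PATH = "/var/log/postgresql/postgresql.log"  # path from the list above
pattern = re.compile(r"duration: ([\d.]+) ms\s+(?:statement|execute .*?): (.*)")

slow = []
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            slow.append((float(match.group(1)), match.group(2).strip()))

# Show the ten slowest statements first.
for duration_ms, statement in sorted(slow, reverse=True)[:10]:
    print(f"{duration_ms:>10.1f} ms  {statement[:120]}")
```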
For High Memory or OOM:
Check the system logs to confirm if the process was killed by the system due to memory limits:
- Linux system log: /var/log/messages
Look for:
- "Out of memory" messages
- oom-killer events
- Any lines indicating which database process was terminated
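A short sketch for scanning the system log is below. It assumes a RHEL-style /var/log/messages (Debian/Ubuntu systems typically log to /var/log/syslog) and that you have read access to the file.

```python
# Hypothetical sketch: confirm whether the kernel's OOM killer terminated the database.
# Assumes a RHEL-style system log at /var/log/messages; Debian/Ubuntu typically
# use /var/log/syslog instead. Reading the file may require elevated privileges.
OOM_MARKERS = ("Out of memory", "oom-killer", "Killed process")

with open("/var/log/messages", errors="replace") as syslog:
    for line in syslog:
        if any(marker in line for marker in OOM_MARKERS):
            print(line.rstrip())
```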
What I’d Recommend
If You’re Facing High CPU:
- Run the Query Analysis tool in ScaleGrid.
- Identify and fix problematic queries or missing indexes.
- Offload reporting/analytics queries to read-only replicas if applicable.
If You’re Facing High Memory or OOM:
- Consider scaling up to a larger instance.
- Split the load across Read-Write and Read-Only replicas to reduce pressure on the primary.
Summary & Final Thoughts
| Symptom | Common Cause | Suggested Actions |
|---|---|---|
| High CPU | Expensive queries, lack of indexing | Run query analysis, add indexes, tune application queries |
| High Memory / OOM | Heavy concurrent usage or caching | Scale up, balance load across nodes, tune memory settings |
Before making changes like scaling or restarts, we always recommend looking at monitoring, logs, and analysis tools — like ScaleGrid’s query analysis — so the root cause is addressed directly.