EKS Node Group Design Guide — On-Demand, Spot, and System Workloads

Node Group Design Strongly Affects EKS Stability and Cost

A common first setup is one node group for everything.

Over time that causes:

- contention between system and app workloads
- wider impact from Spot interruptions
- poor control over cost versus reliability

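Node groups are not just infrastructure units. They are policy boundaries: each group fixes the purchase option, instance types, labels, taints, and scaling limits for whatever runs on it.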

A Practical Split

Typical production split:

- system node group
- app node group
- Spot worker node group

Example:

system-ng
- CoreDNS
- ingress controller
- cluster autoscaler

app-ng
- APIs
- web services
- standard backends

spot-ng
- batch jobs
- async workers
- interruption-tolerant workloads
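The split above boils down to two questions per workload: is it cluster-critical, and can it survive being killed at any time? A minimal Python sketch of that decision follows. The `Workload` class, its two flags, and the example workload names are hypothetical and only illustrate the rule; they are not part of any EKS or Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload, for illustration only."""
    name: str
    cluster_critical: bool = False       # e.g. CoreDNS, ingress controller
    interruption_tolerant: bool = False  # safe to stop and restart at any time


def choose_node_group(w: Workload) -> str:
    """Map a workload to one of the three node groups from the example."""
    if w.cluster_critical:
        # Critical services win even if they could technically restart.
        return "system-ng"
    if w.interruption_tolerant:
        return "spot-ng"
    return "app-ng"


if __name__ == "__main__":
    examples = (
        Workload("coredns", cluster_critical=True),
        Workload("checkout-api"),
        Workload("thumbnail-batch", interruption_tolerant=True),
    )
    for w in examples:
        print(f"{w.name} -> {choose_node_group(w)}")
```

Checking `cluster_critical` first is deliberate: a critical service should land on the system group even when it restarts cleanly.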

Why System Nodes Should Be Separate

Critical cluster services need maximum stability.

If system components share nodes with application spikes, you get:

- resource contention
- unstable ingress behavior
- degraded cluster services such as DNS resolution and autoscaling

This is why an On-Demand system node group is often worth it.

Where Spot Nodes Fit

Spot works best for workloads that can survive interruption:

- queue-driven workers
- batch jobs
- restart-tolerant background processes

Less suitable:

- ingress controllers
- critical stateful APIs
- cluster-essential components

Spot is not just cheap compute. It is interruptible compute.
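What "can survive interruption" means in practice: when a Spot node is reclaimed, AWS gives a two-minute notice, and a pod that is being drained receives SIGTERM before it is killed. A queue-driven worker can use that signal to finish its current job and stop taking new ones. The sketch below shows the pattern in plain Python, under the assumption that jobs are idempotent and that unstarted jobs remain available to other workers. The in-memory `deque` stands in for a real queue, and all function names are illustrative.

```python
import signal
import threading
from collections import deque

# Set when the process is asked to shut down (for example by SIGTERM on drain).
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Ask the worker loop to stop after the job it is currently running."""
    stop_requested.set()


def run_worker(queue, process):
    """Process jobs until the queue is empty or a stop has been requested.

    Returns the jobs that were completed. Jobs that were never started stay
    in the queue so another worker can pick them up.
    """
    done = []
    while queue and not stop_requested.is_set():
        job = queue.popleft()
        process(job)  # assumed idempotent, so a retry after a hard kill is safe
        done.append(job)
    return done


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    jobs = deque(["job-1", "job-2", "job-3"])
    run_worker(jobs, process=print)
```

The key property is that the stop flag is only checked between jobs, so each job should be short enough to finish well inside the pod's termination grace period.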

Labels and Taints

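Labels alone are often not enough. A label lets a pod select a node group, but it does not keep other pods off those nodes. A taint does.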

Useful pattern:

system-ng
- label: workload=system
- taint: dedicated=system:NoSchedule

spot-ng
- label: lifecycle=spot
- taint: spot=true:NoSchedule

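Workloads then opt in with a matching toleration. A toleration only permits scheduling onto tainted nodes; it does not require it. Workloads that must land on a specific group also need a nodeSelector or node affinity on that group's label.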

This helps avoid accidental scheduling drift.
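To see why labels and taints do different jobs, here is a deliberately simplified model of the scheduler's check in Python. It assumes only `NoSchedule` taints and exact string matching of tolerations, which is far less than the real Kubernetes scheduler does. The `Node` and `Pod` classes and the node and pod names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)  # e.g. {"dedicated=system:NoSchedule"}


@dataclass
class Pod:
    name: str
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)


def can_schedule(pod: Pod, node: Node) -> bool:
    """True if the node satisfies the pod's selector and the pod tolerates every taint."""
    selector_ok = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    taints_ok = node.taints <= pod.tolerations  # every taint must be tolerated
    return selector_ok and taints_ok


if __name__ == "__main__":
    labelled_only = Node("system-a", labels={"workload": "system"})
    tainted = Node(
        "system-b",
        labels={"workload": "system"},
        taints={"dedicated=system:NoSchedule"},
    )
    app_node = Node("app-a")

    api = Pod("api")  # no selector, no tolerations
    coredns = Pod(
        "coredns",
        node_selector={"workload": "system"},
        tolerations={"dedicated=system:NoSchedule"},
    )

    print(can_schedule(api, labelled_only))  # True: the label does not keep it out
    print(can_schedule(api, tainted))        # False: the taint repels it
    print(can_schedule(coredns, tainted))    # True: toleration plus selector
    print(can_schedule(coredns, app_node))   # False: the selector pins it to system nodes
```

The first result is the drift case: an ordinary API pod lands on a system node that is labelled but not tainted.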

Instance Type Strategy

system-ng
- stable On-Demand capacity
- modest but predictable sizing

app-ng
- instance families matched to the application profile

spot-ng
- multiple instance families
- capacity-optimized allocation strategy

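Diversity matters most in Spot groups: the more instance types and Availability Zones a group can draw from, the less likely a single capacity shortage is to interrupt many nodes at once.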
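As a rough illustration, the Python sketch below builds request bodies for two of the groups and refuses a Spot group that lists only one instance type. The field names are intended to mirror the Amazon EKS CreateNodegroup request (`capacityType`, `instanceTypes`, `labels`, `taints`, `scalingConfig`), but treat that as an assumption and check the API reference before relying on it. Required fields such as the cluster name, subnets, and node role are omitted, allocation strategy is left out, and the instance types and sizes are placeholders rather than recommendations.

```python
import json


def node_group_request(name, capacity_type, instance_types, labels,
                       taints=(), min_size=1, max_size=3):
    """Build a partial node group request as a plain dict (no AWS call is made)."""
    if capacity_type not in ("ON_DEMAND", "SPOT"):
        raise ValueError(f"unknown capacity type: {capacity_type}")
    if capacity_type == "SPOT" and len(instance_types) < 2:
        # Enforce the diversity rule from the text for Spot groups.
        raise ValueError("Spot groups should list several instance types")
    return {
        "nodegroupName": name,
        "capacityType": capacity_type,
        "instanceTypes": list(instance_types),
        "labels": dict(labels),
        "taints": [{"key": k, "value": v, "effect": e} for k, v, e in taints],
        "scalingConfig": {
            "minSize": min_size,
            "maxSize": max_size,
            "desiredSize": min_size,
        },
    }


if __name__ == "__main__":
    system_ng = node_group_request(
        "system-ng", "ON_DEMAND", ["m5.large"],  # placeholder instance type
        labels={"workload": "system"},
        taints=[("dedicated", "system", "NO_SCHEDULE")],
        min_size=2, max_size=3,
    )
    spot_ng = node_group_request(
        "spot-ng", "SPOT", ["m5.large", "m5a.large", "m5d.large", "m4.large"],
        labels={"lifecycle": "spot"},
        taints=[("spot", "true", "NO_SCHEDULE")],
        min_size=0, max_size=10,
    )
    print(json.dumps([system_ng, spot_ng], indent=2))
```

Keeping the Spot instance types similar in vCPU and memory makes pod resource requests behave consistently whichever type is launched.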

Autoscaling Must Match the Design

Node groups should also align with:

- Cluster Autoscaler or Karpenter strategy
- scale-up speed
- disruption tolerance
- pod disruption budgets

Node design and autoscaling cannot really be separated.
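Pod disruption budgets are where this coupling becomes concrete. With a `minAvailable` budget, the number of pods that may be evicted voluntarily at a given moment is the healthy count minus the minimum, floored at zero. The small Python sketch below does that arithmetic; the function name is illustrative, and it models only `minAvailable` given as an absolute number.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a minAvailable budget permits right now."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    # 3 healthy replicas, budget requires 2: a node drain can evict one at a time.
    print(allowed_disruptions(3, 2))
    # 2 healthy replicas, budget requires 2: the drain is blocked until a new pod is ready.
    print(allowed_disruptions(2, 2))
```

A budget only governs voluntary evictions such as autoscaler scale-down or a node drain. A Spot reclaim still terminates the node when the notice period ends, so replicas of a budgeted service should not all sit in the same Spot group.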

Common Mistakes

- putting all workloads in one node group
- running ingress on Spot
- using labels without taints

These usually show up later as operational instability.

Closing Thoughts

Good EKS node group design is really about separating workload classes:

- system workloads
- standard services
- interruption-tolerant workers

That separation improves both reliability and cloud cost control.