EKS Node Group Design Guide — On-Demand, Spot, and System Workloads
A practical guide to EKS node group design. Covers how to separate system nodes, application nodes, and Spot worker nodes using labels, taints, and workload boundaries for better cost and stability.
Node Group Design Strongly Affects EKS Stability and Cost
A common first setup is one node group for everything.
Over time that causes:
- contention between system and app workloads
- wider impact from Spot interruptions
- poor control over cost versus reliability
Node groups are not just infrastructure units. They are policy boundaries.
A Practical Split
Typical production split:
- system node group
- app node group
- Spot worker node group
Example:
system-ng
- CoreDNS
- ingress controller
- Cluster Autoscaler
app-ng
- APIs
- web services
- standard backends
spot-ng
- batch jobs
- async workers
- interruption-tolerant workloads
Why System Nodes Should Be Separate
Critical cluster services need maximum stability.
If system components share nodes with application spikes, you get:
- resource contention
- unstable ingress behavior
- reliability problems in cluster-critical add-ons such as DNS and autoscaling
This is why an On-Demand system node group is often worth it.
Where Spot Nodes Fit
Spot works best for workloads that can survive interruption:
- queue-driven workers
- batch jobs
- restart-tolerant background processes
Less suitable:
- ingress controllers
- critical stateful APIs
- cluster-essential components
Spot is not just cheap compute. It is interruptible compute: EC2 can reclaim a Spot instance after a two-minute notice.
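What "can survive interruption" means for a queue-driven worker can be sketched as follows. This is a minimal illustration, not a production worker. It assumes the node is drained when it is reclaimed, so the pod receives SIGTERM before it is killed. The in-memory deque stands in for a real queue, and the names run_worker and handle_sigterm are hypothetical.

```python
import signal
import threading
from collections import deque

# Set once the pod has been asked to shut down.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Ask the worker loop to stop after the job it is currently running."""
    stop_requested.set()


def run_worker(queue, process):
    """Process jobs until the queue is empty or a stop is requested.

    Jobs that were never started stay in the queue, so another worker
    can pick them up after this pod is rescheduled.
    """
    done = []
    while queue and not stop_requested.is_set():
        job = queue.popleft()
        process(job)
        done.append(job)
    return done


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    jobs = deque(["job-1", "job-2", "job-3"])  # placeholder jobs
    finished = run_worker(jobs, process=print)
    print("finished:", finished, "left in queue:", list(jobs))
```

The design point is that the worker never abandons a job halfway and never starts one it may not finish. Workloads that cannot be written this way are poor candidates for spot-ng.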
Labels and Taints
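Labels alone are often not enough. A label lets a workload select a node group, but it does nothing to keep other pods off those nodes. Taints do that.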
Useful pattern:
system-ng
- label: workload=system
- taint: dedicated=system:NoSchedule
spot-ng
- label: lifecycle=spot
- taint: spot=true:NoSchedule
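Workloads then opt in with a matching toleration, paired with a nodeSelector or node affinity on the group's label. The toleration only permits scheduling onto the tainted nodes; the selector is what actually sends the pod there. System add-ons such as CoreDNS need the same toleration and selector, or they will land on other node groups.
Together these prevent accidental scheduling drift, where pods gradually spread onto nodes they were never meant to use.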
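The opt-in for the spot-ng pattern above can be sketched as a small helper that adds both pieces to a pod spec. The label and taint values come from the pattern above; the helper name pin_to_node_group, the container name, and the image are hypothetical placeholders.

```python
import json


def pin_to_node_group(pod_spec, label_key, label_value, taint_key, taint_value):
    """Return a copy of pod_spec that selects and tolerates one node group.

    nodeSelector makes the pod require the node label; the toleration
    lets it pass the NoSchedule taint on those nodes.
    """
    spec = dict(pod_spec)
    spec["nodeSelector"] = {**pod_spec.get("nodeSelector", {}), label_key: label_value}
    spec["tolerations"] = list(pod_spec.get("tolerations", [])) + [
        {
            "key": taint_key,
            "operator": "Equal",
            "value": taint_value,
            "effect": "NoSchedule",
        }
    ]
    return spec


# Hypothetical worker pod spec; the image name is a placeholder.
worker = {"containers": [{"name": "worker", "image": "example/worker:latest"}]}
spot_worker = pin_to_node_group(worker, "lifecycle", "spot", "spot", "true")

if __name__ == "__main__":
    print(json.dumps(spot_worker, indent=2))
```

The printed structure corresponds to the spec section of a pod template. Dropping either field changes the behavior: without the toleration the pod stays pending, and without the nodeSelector it may schedule on any untainted node.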
Instance Type Strategy
system-ng
- stable On-Demand
- modest but predictable sizing
app-ng
- instance families matched to application profile
spot-ng
- multiple instance families
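- capacity-optimized Spot allocation strategy
Diversity matters most in Spot groups: the more instance types the group can draw from, the less likely a shortage in one capacity pool interrupts many nodes at once.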
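The three groups can be written down as data and checked for Spot diversity. This is a sketch under assumptions: the field names are modeled on an eksctl managedNodeGroups entry and should be verified against the eksctl schema, the instance types are illustrative rather than recommended, the workload=app label is invented for the example, and the threshold of three families is arbitrary.

```python
import json

# Node group definitions as plain dicts (assumed, eksctl-like field names).
NODE_GROUPS = [
    {
        "name": "system-ng",
        "instanceTypes": ["m5.large"],
        "spot": False,
        "labels": {"workload": "system"},
        "taints": [{"key": "dedicated", "value": "system", "effect": "NoSchedule"}],
    },
    {
        "name": "app-ng",
        "instanceTypes": ["m5.xlarge"],
        "spot": False,
        "labels": {"workload": "app"},  # assumed label, not from the guide
        "taints": [],
    },
    {
        "name": "spot-ng",
        # Same vCPU and memory size across several families.
        "instanceTypes": ["m5.xlarge", "m5a.xlarge", "m5d.xlarge", "m6i.xlarge"],
        "spot": True,
        "labels": {"lifecycle": "spot"},
        "taints": [{"key": "spot", "value": "true", "effect": "NoSchedule"}],
    },
]


def instance_families(group):
    """Return the set of instance families in a group, e.g. {'m5', 'm6i'}."""
    return {itype.split(".")[0] for itype in group["instanceTypes"]}


def under_diversified_spot_groups(groups, min_families=3):
    """Names of Spot groups that draw from fewer than min_families families."""
    return [
        g["name"]
        for g in groups
        if g["spot"] and len(instance_families(g)) < min_families
    ]


if __name__ == "__main__":
    print(json.dumps(NODE_GROUPS, indent=2))
    print("needs more families:", under_diversified_spot_groups(NODE_GROUPS))
```

Keeping the Spot instance types at the same size is a deliberate choice here: an autoscaler that models one node shape per group makes better decisions when every type in the group offers similar capacity.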
Autoscaling Must Match the Design
Node groups should also align with:
- Cluster Autoscaler or Karpenter strategy
- scale-up speed
- disruption tolerance
- pod disruption budgets
Node design and autoscaling cannot really be separated.
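One concrete link between node design and autoscaling is the pod disruption budget: when an autoscaler drains a node, evictions are checked against it. A minimal sketch of generating such a manifest follows. The resource name, namespace, and labels are hypothetical; the structure follows the policy/v1 PodDisruptionBudget API.

```python
import json


def pod_disruption_budget(name, namespace, match_labels, max_unavailable=1):
    """Build a policy/v1 PodDisruptionBudget manifest as a dict.

    max_unavailable caps how many matching pods a voluntary disruption,
    such as a node drain during scale-down, may take out at once.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


# Hypothetical names for an ingress controller running on system-ng.
ingress_pdb = pod_disruption_budget(
    name="ingress-pdb",
    namespace="ingress",
    match_labels={"app": "ingress-controller"},
)

if __name__ == "__main__":
    # JSON is valid YAML, so this output can be applied as a manifest.
    print(json.dumps(ingress_pdb, indent=2))
```

A budget only helps if there are enough replicas to satisfy it. With a single replica and maxUnavailable set to 0, drains will block, so replica counts and budgets need to be planned together with node group scale-down behavior.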
Common Mistakes
- putting all workloads in one node group
- running ingress on Spot
- using labels without taints
These usually show up later as operational instability.
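The three mistakes above are simple enough to check mechanically before they turn into incidents. The following is a rough sketch only: the dict shape, including the dedicated flag and the workloads list, is an assumption made for this example and not an EKS or eksctl format.

```python
def lint_node_groups(groups):
    """Return warnings for the three mistakes listed above.

    Each group is a dict with "name", "spot" (bool), "dedicated" (bool),
    "labels" (dict), "taints" (list) and "workloads" (list of names).
    """
    warnings = []
    if len(groups) == 1:
        warnings.append(f"{groups[0]['name']}: all workloads share one node group")
    for group in groups:
        name = group["name"]
        # Ingress on interruptible capacity.
        if group["spot"] and any("ingress" in w for w in group["workloads"]):
            warnings.append(f"{name}: ingress is running on Spot")
        # A dedicated group that is labelled but not tainted keeps nobody out.
        if group["dedicated"] and group["labels"] and not group["taints"]:
            warnings.append(f"{name}: labels without taints")
    return warnings


if __name__ == "__main__":
    single_group = [
        {
            "name": "default-ng",
            "spot": True,
            "dedicated": True,
            "labels": {"lifecycle": "spot"},
            "taints": [],
            "workloads": ["ingress-controller", "api"],
        }
    ]
    for warning in lint_node_groups(single_group):
        print(warning)
```

A check like this could run in CI against whatever file defines the node groups, after adapting the field names to that file's real format.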
Closing Thoughts
Good EKS node group design is really about separating workload classes:
- system workloads
- standard services
- interruption-tolerant workers
That separation improves both reliability and cloud cost control.