Argo Rollouts Guide — How to Run Progressive Delivery on Kubernetes

Why Argo Rollouts matters

Teams often say they do canary or blue-green deployment, but the harder questions are:

how much traffic should move at each step?
what metrics define success?
where should manual approval exist?
when should rollback happen automatically?

Argo Rollouts helps express those rules as Kubernetes resources.

Core components

A typical setup includes:

a Rollout resource
one or more AnalysisTemplates
ingress or service mesh integration
a metric source such as Prometheus

The real value is not just deployment, but deployment plus verification.
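
As a rough sketch of how these pieces fit together, a Rollout looks much like a Deployment with a richer strategy block. The name my-app, the image, and the replica count below are placeholders chosen for illustration; success-rate-check is the AnalysisTemplate name used later in this guide.

```yaml
# Minimal sketch, not a production manifest. my-app and the image are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  template:                    # same pod template you would put in a Deployment
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.2.3
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 10        # deployment step
        - analysis:            # verification step
            templates:
              - templateName: success-rate-check
```

Without a traffic router configured, the controller approximates the weight by adjusting the ratio of canary to stable pods, so small percentages are only as precise as the replica count allows.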

Why not plain Deployment

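A standard Deployment focuses on replacing ReplicaSets: a rolling update advances as new pods become ready, with no notion of traffic weight or metric checks.
Argo Rollouts focuses on controlled progressive release.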

That gives you:

staged traffic weights
pause steps
automatic analysis
failure detection
rollback or promotion control

Example canary strategy

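strategy:
  canary:
    steps:
      - setWeight: 10             # target 10% canary weight
      - pause: { duration: 300 }  # plain integers are seconds; 5m is equivalent
      - analysis:                 # blocks until the analysis run finishes
          templates:
            - templateName: success-rate-check
      - setWeight: 30
      - pause: { duration: 300 }
      - setWeight: 60
      - pause: { duration: 300 }
      # after the last step the new version is promoted to 100%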

The important point is not the percentages themselves, but what each stage is trying to validate.

What should define failure

Useful rollout metrics often include:

5xx rate
P95 or P99 latency
success rate for business-critical APIs
pod restarts
downstream dependency error rates

The right question is not “are the pods alive” but “is service quality holding up.”

Prometheus-based analysis example

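metrics:
  - name: success-rate
    interval: 1m
    # Illustrative count: take 5 measurements, then finish. With an interval
    # but no count the metric keeps running, so an inline analysis step
    # would never complete.
    count: 5
    # Prometheus returns a vector, so read the first element.
    successCondition: result[0] >= 99
    provider:
      prometheus:
        # Assumed address; replace with your Prometheus endpoint.
        address: http://prometheus.example.com:9090
        # Share of non-5xx requests over 5 minutes, as a percentage.
        # This query is service-wide; add a label filter to isolate canary pods.
        query: |
          100 * (
            sum(rate(http_requests_total{status!~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
          )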

That is where Rollouts becomes operationally meaningful: verification is part of the release itself.
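
The same pattern extends to other signals from the list above, such as latency. The following is a sketch of a complete AnalysisTemplate, assuming a Prometheus histogram named http_request_duration_seconds_bucket with a service label; the metric name, the label, the Prometheus address, and the 500 ms threshold are all illustrative assumptions, not values from a real system.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check              # hypothetical template name
spec:
  args:
    - name: service-name           # supplied by the Rollout's analysis step
  metrics:
    - name: p95-latency
      interval: 1m
      count: 5                     # five measurements, then the run completes
      failureLimit: 1              # tolerate one bad measurement
      # Pass while P95 latency stays below 0.5 s (illustrative threshold).
      successCondition: result[0] < 0.5
      provider:
        prometheus:
          address: http://prometheus.example.com:9090   # assumed endpoint
          query: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m])) by (le)
            )
```

A Rollout step would reference it by templateName and pass service-name under args, so one template can serve several services.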

Ingress and traffic shifting concerns

Traffic movement is usually handled through:

NGINX Ingress
AWS Load Balancer Controller (formerly ALB Ingress Controller)
Istio or Linkerd

Important caveats:

sticky sessions can distort traffic distribution
caches can hide issues in the new version
internal and external traffic may behave differently

So “10 percent traffic” is not always equal to “10 percent real user impact.”
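
For reference, a minimal sketch of NGINX Ingress traffic routing might look like the following. The Service and Ingress names are placeholders, and both Services and the stable Ingress are assumed to exist already.

```yaml
strategy:
  canary:
    canaryService: my-app-canary       # receives the canary share of traffic
    stableService: my-app-stable       # receives the remainder
    trafficRouting:
      nginx:
        stableIngress: my-app-ingress  # existing Ingress pointing at my-app-stable
    steps:
      - setWeight: 10
      - pause: { duration: 5m }
```

With this in place the weight is applied at the ingress layer rather than by pod ratio. That weight is per request, which is why sticky sessions, caches, and traffic that bypasses the Ingress can all make the observed share differ from the configured one.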

Where to put manual approval

Full automation is not always the right choice.

Manual approval is still useful for:

payment and checkout paths
authentication changes
large migrations
pre-event production releases

Lower-risk services usually benefit more from fully automated analysis.
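
One way to express a manual gate is a pause step with no duration, which holds the rollout until someone promotes it. The sketch below places the gate after an automated check, so a person only reviews releases that have already passed analysis; the weights and durations are illustrative.

```yaml
strategy:
  canary:
    steps:
      - setWeight: 10
      - analysis:
          templates:
            - templateName: success-rate-check
      - pause: {}                 # no duration: waits for manual promotion
      - setWeight: 50
      - pause: { duration: 10m }
```

With the kubectl plugin installed, `kubectl argo rollouts promote my-app` resumes a rollout held at that step (my-app is a placeholder name).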

Rollback must be fast and predictable

In a bad rollout, recovery speed matters more than discussion speed.

Practical rules:

define failure conditions in advance
automate rollback where possible
distinguish rollout pause from rollback
connect analysis failures to alerting

Operations should never have to guess whether a rollout is paused, degraded, or already reverted.
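
The pause versus rollback distinction can be encoded in the metric itself. In the sketch below, which uses illustrative thresholds and an assumed Prometheus address, a measurement that meets neither condition is inconclusive and pauses the rollout for a human decision, while a measurement that meets the failure condition fails the analysis and aborts the rollout back to the stable version.

```yaml
metrics:
  - name: error-rate
    interval: 1m
    count: 5
    successCondition: result[0] <= 1   # 5xx share at or below 1%: healthy
    failureCondition: result[0] >= 5   # 5xx share at or above 5%: failed
    failureLimit: 0                    # first failed measurement fails the run
    provider:
      prometheus:
        address: http://prometheus.example.com:9090   # assumed endpoint
        query: |
          100 * (
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
          )
```

With the kubectl plugin, `kubectl argo rollouts get rollout my-app` shows whether the rollout is paused or degraded, and `kubectl argo rollouts abort my-app` stops a rollout by hand.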

Common mistakes

using lagging or noisy metrics
analyzing whole-service metrics instead of rollout-specific signals
making pause windows too short to observe real behavior
forgetting alerting and runbooks after failed analysis
trusting infrastructure metrics without business metrics

A sensible adoption path

define service-level indicators first
validate Prometheus queries and thresholds
start with low-risk services
define which paths need manual approval
wire rollback and alerting into the deployment workflow
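
For the last step, Argo Rollouts notifications can be subscribed to through annotations on the Rollout. This sketch assumes a Slack notification service has already been configured for the controller and that the built-in trigger catalog is installed; deploy-alerts is a placeholder channel, and the trigger names should be checked against the version you run.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  annotations:
    # Notify when the rollout is aborted (manually or by failed analysis).
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: deploy-alerts
    # Notify when an analysis run fails.
    notifications.argoproj.io/subscribe.on-analysis-run-failed.slack: deploy-alerts
```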

Closing thoughts

Argo Rollouts is not just a deployment tool. It is a way to turn deployment into an observable, testable operational process.

Many teams understand canary delivery conceptually. Fewer teams connect release stages to real service quality signals and automatic rollback. That is where progressive delivery becomes real operations instead of release theater.