Case Studies

Website Deployed but Not Updated? — The Story Behind Debugging a Dashboard That Said "Success" When It Wasn’t

What happens when your CI/CD system reports every deploy as successful, but your website still shows content from 3 days ago? Here’s a step-by-step debugging case every DevOps team should understand.

17 Mar 2026 · 8 min read
DevOps · Kubernetes · CI/CD · Debugging · Infrastructure · Case Study

What Happened?

One day, we published a new article to the website. The CI/CD pipeline built successfully ✅, deployed successfully ✅, and the dashboard was green across the board — but when we opened the site, it was still showing content from 3 days ago. The new article was nowhere to be found.

And it didn’t happen just once. Looking back, we found that 3 deployments in a row had all been reported as "successful" — yet none of them had actually updated the site.

This is a record of how we traced the issue, from what we saw → what we assumed → what we eventually discovered.


Step 1 — Was It Really "Successful"?

The first thing we did was inspect every stage of the pipeline:

  • Build — the image was built and pushed to the registry ✅
  • Deploy — the configuration was applied to the cluster successfully ✅
  • Rollout — stuck! It timed out after 2 minutes, but the timeout was handled with a fallback warning message instead of an error ❌

The key detail: the system reported "success" because the timeout was treated as a warning, not an error — so the pipeline passed even though the final stage had failed.
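In pseudocode terms, the flawed gate looked roughly like this — a minimal Python sketch, where the state names and return values are illustrative, not the actual pipeline code:

```python
# Minimal sketch of the flawed pipeline gate (illustrative, not the real pipeline).
# A rollout can end in one of three states; the bug is that TIMEOUT is
# downgraded to a warning, so the overall pipeline still "passes".

from enum import Enum

class RolloutResult(Enum):
    SUCCESS = "success"
    TIMEOUT = "timeout"
    ERROR = "error"

def pipeline_passed(result: RolloutResult) -> bool:
    if result is RolloutResult.SUCCESS:
        return True
    if result is RolloutResult.TIMEOUT:
        print("warning: rollout timed out")  # logged as a warning only
        return True                          # BUG: timeout does not fail the pipeline
    return False                             # only hard errors fail the build

pipeline_passed(RolloutResult.TIMEOUT)  # the dashboard stays green
```

With this shape, only an explicit error can ever turn the dashboard red — exactly the failure mode described above.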

First lesson: "No error" does not mean "success" — silent failures are always more dangerous than loud ones.


Step 2 — Check the Most Obvious Hypotheses First

Once we knew the rollout was stuck, the next question was: "Why?" We started with the most common explanations.

Hypothesis A: Not enough resources?

We opened the monitoring dashboard and saw CPU usage at only 7%, memory at 38%, and a very low load average.

❌ Ruled out — there was plenty of capacity.

Hypothesis B: Health checks were failing?

We reviewed the configuration. The health check endpoint was set correctly and had worked in the previous version.

❌ Ruled out — no config changes there.

Hypothesis C: The image couldn’t be pulled?

We checked the registry — the image had been pushed successfully and could be pulled normally.

❌ Ruled out — not an image issue.


Step 3 — Go Back in Time to Find Where It Broke

Once the early hypotheses were eliminated, we changed our approach — go back to the last deployment that actually worked and compare from there:

Version     Result       Time Taken
v1.1.195    ✅ Success    ~14 seconds
v1.1.196    ❌ Timeout    >120 seconds
v1.1.197    ❌ Timeout    >120 seconds
v1.1.198    ❌ Timeout    >120 seconds

The pattern was clear: v1.1.195 completed normally. Starting with v1.1.196, every rollout got stuck.

We checked what changed between those two versions — and found nothing had changed in the infrastructure. The only update was new content on the site.


Step 4 — Root Cause: A Domino Effect

After digging deeper, the real picture started to emerge.

How Rolling Updates Work

The container orchestration system used a rolling update strategy for zero-downtime deployment:

  1. Create new containers first (surge)
  2. Wait until the new ones are ready (readiness check passes)
  3. Then shut down the old ones (terminate)

With the setting that prevents reducing available capacity, the old containers can only be terminated once the new containers are fully ready.
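In Kubernetes terms (the article does not show its manifests, so the exact field values below are an assumption), that behavior corresponds to a rolling-update strategy like this:

```yaml
# Hypothetical Deployment strategy matching the behavior described:
# new pods are surged in first, and no old pod may be terminated until
# a replacement passes its readiness check.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # create up to one extra pod first
    maxUnavailable: 0  # never reduce available capacity
```

With `maxUnavailable: 0`, a rollout whose new pods never become ready cannot make progress — it simply waits, which is what sets up the chain reaction below.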

The Problem: Chain Reaction

  1. v1.1.196 timed out — the rollout didn’t finish within the expected time, so the system left it in an "in-progress" state
  2. v1.1.197 was deployed on top of it — but the system was still processing the previous rollout, so the new one got stuck too
  3. v1.1.198 came next — stacking another layer on top, like falling dominoes

Second lesson: A timed-out rollout doesn’t disappear — it stays there until someone explicitly fixes it.


Step 5 — Why Didn’t the Pipeline Alert Anyone?

This was the most painful part: the pipeline completed every time because the timeout was handled as only a warning.

The logic worked like this:

  • If the rollout succeeds → show "success"
  • If the rollout times out → show "⚠️ timeout" but still treat the pipeline as passed

The result: a green dashboard ✅ every time, while no one realized the site had been stuck on an old version for 3 full days.

Third lesson: Every ignored warning is a future error — a timeout should be a failure, not a warning.


The Fix — 3 Levels

Level 1: Immediate Recovery

Force a restart so the system clears the stuck rollout and starts a clean deployment from scratch.

Level 2: Prevent It from Happening Again

  • Change the pipeline so rollout timeout = failure, not warning
  • Add a step to clear any previous stuck rollout before starting a new deploy
  • Deploy one service at a time — don’t deploy the web app and API together
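The first two changes can be sketched as a hardened deploy step. The function names `clear_stuck_rollout` and `run_rollout` are hypothetical stand-ins for the real pipeline calls (e.g. wrappers around `kubectl rollout undo` and `kubectl rollout status`):

```python
# Sketch of the hardened deploy gate: a timeout is now a hard failure,
# and any stuck rollout left by a previous deploy is cleared first.
# Both callables are hypothetical stand-ins for the real pipeline steps.

def deploy(service: str, clear_stuck_rollout, run_rollout) -> None:
    clear_stuck_rollout(service)   # never deploy on top of a stuck rollout
    result = run_rollout(service)  # returns "success" or "timeout"
    if result != "success":
        # timeout = failure, not warning: fail fast and loud
        raise RuntimeError(f"rollout of {service} did not complete: {result}")
```

The key design choice is that `deploy` has no warning path at all — any outcome other than a confirmed success raises, so the pipeline cannot quietly pass.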

Level 3: Monitoring & Alerting

  • Set an alert when the deployed version ≠ the version actually being served
  • Check the response header after deployment to confirm it matches the latest version
  • Run a smoke test after deploy — if the content doesn’t match, automatically roll back
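The version check in particular can be a few lines run after every deploy. A minimal sketch, assuming the app exposes its version in a response header — the header name `X-App-Version` is an assumption, not something from the original pipeline:

```python
# Post-deploy smoke-check sketch: compare the version actually being
# served against the version the pipeline just deployed. The header
# name "X-App-Version" is an assumption; use whatever your app exposes.

def version_matches(served_headers: dict, expected_version: str) -> bool:
    return served_headers.get("X-App-Version") == expected_version

# In the real pipeline the headers would come from an HTTP request to
# the live site; here we only show the comparison itself.
assert version_matches({"X-App-Version": "v1.1.196"}, "v1.1.196")
assert not version_matches({"X-App-Version": "v1.1.195"}, "v1.1.196")
```

Had a check like this existed, v1.1.196 would have failed loudly the moment the live site kept serving v1.1.195.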

5 Lessons for Every Team

1. Don’t Trust a Green Dashboard

"Build succeeded" ≠ "Deploy succeeded" ≠ "The system is actually working" — always verify the final outcome.

2. Silent Failures Are More Dangerous Than Loud Failures

A noisy failure gets fixed immediately. A quiet one accumulates until it turns into a crisis.

3. Go Back in Time Before Guessing

Instead of guessing "it’s probably this," find the last known good state and compare from there. It’s the fastest way to eliminate bad assumptions.

4. Every Warning Needs an Escalation Path

A warning that happens 3 times in a row is no longer just a warning — it’s an incident.

5. Design Pipelines to "Fail Fast, Fail Loud"

A good system should make noise when something goes wrong, not hide the problem behind a green status indicator.


Key Takeaways

This issue wasn’t caused by broken code, a server outage, or insufficient resources — it came from the gap between a successful build and a real deployment, a gap nobody was monitoring.

In DevOps, what you don’t measure is what you don’t know. And what you don’t know is exactly what comes back to hurt you when you least expect it.


If your organization is dealing with a similar problem — deployments that look successful but don’t actually update the system, pipelines that are too quiet, or infrastructure that needs to be more resilient — talk to the Enersys team. We help design and fix DevOps systems so they work in the real world.

