
Website Deployed but Not Updated? — The Story Behind Debugging a Dashboard That Said "Success" When It Wasn’t

What happens when your CI/CD system reports every deploy as successful, but your website still shows content from 3 days ago? Here’s a step-by-step debugging case every DevOps team should understand.

17 Mar 2026 | 8 min

Tags: DevOps, Kubernetes, CI/CD, Debugging, Infrastructure, Case Study

What Happened?

One day, we published a new article to the website. The CI/CD pipeline built successfully ✅, deployed successfully ✅, and the dashboard was green across the board — but when we opened the site, it was still showing content from 3 days ago. The new article was nowhere to be found.

And it didn’t happen just once. Looking back, we found that 3 deployments in a row had all been reported as "successful" — yet none of them had actually updated the site.

This is a record of how we traced the issue, from what we saw → what we assumed → what we eventually discovered.

Step 1 — Was It Really "Successful"?

The first thing we did was inspect every stage of the pipeline:

Build: the image was built and pushed to the registry ✅
Deploy: the configuration was applied to the cluster successfully ✅
Rollout: stuck. It timed out after 2 minutes, but the pipeline printed a fallback message instead of raising an error ❌

The key detail: the system reported "success" because the timeout was treated as a warning, not an error — so the pipeline passed even though the final stage had failed.

First lesson: "No error" does not mean "success" — silent failures are always more dangerous than loud ones.

Step 2 — Check the Most Obvious Hypotheses First

Once we knew the rollout was stuck, the next question was: "Why?" We started with the most common explanations.

Hypothesis A: Not enough resources?

We opened the monitoring dashboard and saw CPU usage at only 7%, memory at 38%, and a very low load average.

❌ Ruled out — there was plenty of capacity.

Hypothesis B: Health checks were failing?

We reviewed the configuration. The health check endpoint was set correctly and had worked in the previous version.

❌ Ruled out — no config changes there.

Hypothesis C: The image couldn’t be pulled?

We checked the registry — the image had been pushed successfully and could be pulled normally.

❌ Ruled out — not an image issue.

Step 3 — Go Back in Time to Find Where It Broke

Once the early hypotheses were eliminated, we changed our approach: go back to the last deployment that actually worked, then compare every later deployment against it.

Version	Result	Time Taken
v1.1.195	✅ Success	~14 seconds
v1.1.196	❌ Timeout	>120 seconds
v1.1.197	❌ Timeout	>120 seconds
v1.1.198	❌ Timeout	>120 seconds

The pattern was clear: v1.1.195 completed normally. Starting with v1.1.196, every rollout got stuck.
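The same comparison can be automated against a deploy log. The following is a minimal sketch, assuming the history is available as an ordered list; the Rollout class and find_regression function are hypothetical names for illustration, not part of our actual tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    version: str
    succeeded: bool


# Deploy history from the table above, oldest first.
HISTORY = [
    Rollout("v1.1.195", True),
    Rollout("v1.1.196", False),
    Rollout("v1.1.197", False),
    Rollout("v1.1.198", False),
]


def find_regression(history: list[Rollout]) -> tuple[str | None, str | None]:
    """Return (last known good version, the version deployed right after it).

    Either element is None when no such rollout exists in the history.
    """
    last_good_index = None
    for index, rollout in enumerate(history):
        if rollout.succeeded:
            last_good_index = index  # keep the most recent success
    if last_good_index is None:
        return None, None
    next_index = last_good_index + 1
    first_bad = history[next_index].version if next_index < len(history) else None
    return history[last_good_index].version, first_bad
```

Calling find_regression(HISTORY) returns the pair ("v1.1.195", "v1.1.196"), which are the two versions worth diffing.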

We checked what changed between those two versions — and found nothing had changed in the infrastructure. The only update was new content on the site.

Step 4 — Root Cause: A Domino Effect

After digging deeper, the real picture started to emerge.

How Rolling Updates Work

The container orchestration system used a rolling update strategy for zero-downtime deployment:

Create new containers first (surge)
Wait until the new ones are ready (readiness check passes)
Then shut down the old ones (terminate)

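With the setting that prevents reducing available capacity (in Kubernetes, this corresponds to maxUnavailable: 0 in the rolling update strategy), the old containers can only be terminated once the new containers are fully ready. If a new container never becomes ready, the old ones keep serving the old version indefinitely.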
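To make the consequence concrete, here is a toy simulation of the surge, wait, terminate loop. It is a simplified model, not the orchestrator's real algorithm: it replaces one container at a time, and the names (rolling_update, ready_after_ticks, timeout_ticks) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutState:
    status: str        # "complete" or "timed_out"
    old_running: int   # old containers still serving traffic
    new_ready: int     # new containers that passed readiness


def rolling_update(replicas: int, ready_after_ticks: int | None,
                   timeout_ticks: int) -> RolloutState:
    """Replace containers one by one without reducing available capacity.

    ready_after_ticks is how long a new container needs to pass its
    readiness check; None means it never becomes ready.
    """
    old_running, new_ready = replicas, 0
    while old_running > 0:
        # 1. Surge: a new container is started next to the old ones.
        # 2. Wait: it must pass readiness before the timeout.
        became_ready = (ready_after_ticks is not None
                        and ready_after_ticks <= timeout_ticks)
        if not became_ready:
            # The old containers are untouched, so the old version keeps serving.
            return RolloutState("timed_out", old_running, new_ready)
        new_ready += 1
        # 3. Terminate: only now is one old container shut down.
        old_running -= 1
    return RolloutState("complete", old_running, new_ready)
```

With rolling_update(3, None, 5) the result is a timed-out state with all 3 old containers still running, which matches what we saw: no downtime, but also no new content.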

The Problem: Chain Reaction

v1.1.196 timed out — the rollout didn’t finish within the expected time, so the system left it in an "in-progress" state
v1.1.197 was deployed on top of it — but the system was still processing the previous rollout, so the new one got stuck too
v1.1.198 came next — stacking another layer on top, like falling dominoes

Second lesson: A timed-out rollout doesn’t disappear — it stays there until someone explicitly fixes it.

Step 5 — Why Didn’t the Pipeline Alert Anyone?

This was the most painful part: the pipeline passed every time because a rollout timeout was handled only as a warning.

The logic worked like this:

If the rollout succeeds → show "success"
If the rollout times out → show "⚠️ timeout" but still treat the pipeline as passed
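The two branches above can be sketched as a function that maps the rollout result to the pipeline's exit code. This is an illustrative reconstruction, not the pipeline's actual source; RolloutResult, exit_code_lenient, and exit_code_strict are hypothetical names. The strict variant shows the behaviour we wanted instead.

```python
import sys
from enum import Enum


class RolloutResult(Enum):
    SUCCEEDED = "succeeded"
    TIMED_OUT = "timed_out"


def exit_code_lenient(result: RolloutResult) -> int:
    """The behaviour described above: a timeout is only a warning."""
    if result is RolloutResult.SUCCEEDED:
        print("success")
        return 0
    print("WARNING: rollout timed out")
    return 0  # the pipeline still passes, so the dashboard stays green


def exit_code_strict(result: RolloutResult) -> int:
    """The intended behaviour: a timeout fails the pipeline."""
    if result is RolloutResult.SUCCEEDED:
        print("success")
        return 0
    print("ERROR: rollout timed out", file=sys.stderr)
    return 1  # non-zero exit turns the stage red
```

The only difference is one return value, which is why this kind of bug is easy to ship and hard to notice.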

The result: a green dashboard ✅ every time, while no one realized the site had been stuck on an old version for 3 full days.

Third lesson: Every ignored warning is a future error — a timeout should be a failure, not a warning.

The Fix — 3 Levels

Level 1: Immediate Recovery

Force a restart so the system clears the stuck rollout and starts a clean deployment from scratch.

Level 2: Prevent It from Happening Again

Change the pipeline so rollout timeout = failure, not warning
Add a step to clear any previous stuck rollout before starting a new deploy
Deploy one service at a time — don’t deploy the web app and API together
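A deploy step following these three rules might look like the sketch below. It assumes a Kubernetes Deployment managed with kubectl; the function name safe_deploy, the 10-second pre-check, and the choice of kubectl rollout undo to clear a stuck rollout are assumptions for illustration, not a record of what our pipeline runs. The runner parameter exists so the logic can be exercised without a cluster.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def safe_deploy(deployment: str, apply_cmd: Sequence[str],
                timeout_s: int = 120, runner: Runner = run) -> int:
    """Deploy one service; return 0 only if its rollout finishes in time."""
    target = f"deployment/{deployment}"

    # Pre-check: a non-zero exit means an earlier rollout has not settled.
    if runner(["kubectl", "rollout", "status", target, "--timeout=10s"]) != 0:
        # Assumption: rolling back to the previous revision is an
        # acceptable way to clear the stuck rollout before deploying.
        if runner(["kubectl", "rollout", "undo", target]) != 0:
            return 1

    # Apply the new configuration for this one service only.
    if runner(apply_cmd) != 0:
        return 1

    # kubectl exits non-zero if the rollout does not finish before the
    # timeout, and that exit code is passed on as a pipeline failure.
    status = runner(["kubectl", "rollout", "status", target,
                     f"--timeout={timeout_s}s"])
    return 0 if status == 0 else 1
```

In a pipeline, the return value would be passed to sys.exit so that a timeout stops the job, and the function would be called once per service rather than for the web app and API together.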

Level 3: Monitoring & Alerting

Set an alert when the deployed version ≠ the version actually being served
Check the response header after deployment to confirm it matches the latest version
Run a smoke test after deploy — if the content doesn’t match, automatically roll back
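The header check and automatic rollback could be combined roughly as follows. This is a sketch under stated assumptions: the header name X-App-Version is hypothetical (the article does not say which header carries the version), and the rollback action is passed in as a callable because it depends on the platform.

```python
from __future__ import annotations

import urllib.request
from typing import Callable

VERSION_HEADER = "X-App-Version"  # hypothetical header name


def fetch_served_version(url: str, timeout: float = 10.0) -> str | None:
    """Return the version header from the live site, or None if absent."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.headers.get(VERSION_HEADER)


def post_deploy_check(expected: str,
                      fetch: Callable[[], str | None],
                      rollback: Callable[[], None]) -> bool:
    """Compare the served version with the deployed one; roll back on mismatch."""
    served = fetch()
    if served == expected:
        return True
    print(f"version mismatch: expected {expected}, site is serving {served}")
    rollback()
    return False
```

A pipeline would call post_deploy_check with a fetch function that wraps fetch_served_version for the production URL, and fail the job when it returns False.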

5 Lessons for Every Team

1. Don’t Trust a Green Dashboard

"Build succeeded" ≠ "Deploy succeeded" ≠ "The system is actually working" — always verify the final outcome.

2. Silent Failures Are More Dangerous Than Loud Failures

A noisy failure gets fixed immediately. A quiet one accumulates until it turns into a crisis.

3. Go Back in Time Before Guessing

Instead of guessing "it’s probably this," find the last known good state and compare from there. It’s the fastest way to eliminate bad assumptions.

4. Every Warning Needs an Escalation Path

A warning that happens 3 times in a row is no longer just a warning — it’s an incident.
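One simple way to encode that rule is to count the current streak of warnings and escalate at a threshold. This is a minimal sketch; the status strings and the should_escalate name are assumptions, and the threshold of 3 mirrors the three stuck deployments in this incident.

```python
def should_escalate(statuses: list[str], threshold: int = 3) -> bool:
    """Return True when the most recent runs end in `threshold` warnings in a row."""
    streak = 0
    for status in statuses:  # oldest first
        # Any non-warning result resets the streak.
        streak = streak + 1 if status == "warning" else 0
    return streak >= threshold
```

For example, should_escalate(["ok", "warning", "warning", "warning"]) is True, while a history where a successful run interrupts the warnings is not escalated.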

5. Design Pipelines to "Fail Fast, Fail Loud"

A good system should make noise when something goes wrong, not hide the problem behind a green status indicator.

Key Takeaways

This issue wasn’t caused by broken code, a server outage, or insufficient resources — it came from the gap between a successful build and a real deployment, a gap nobody was monitoring.

In DevOps, what you don’t measure is what you don’t know. And what you don’t know is exactly what comes back to hurt you when you least expect it.

If your organization is dealing with a similar problem — deployments that look successful but don’t actually update the system, pipelines that are too quiet, or infrastructure that needs to be more resilient — talk to the Enersys team. We help design and fix DevOps systems so they work in the real world.
