Incident Notification - Some product pages are showing CAT page

Uptime Impact: 5 hours, 20 minutes, and 47 seconds
Resolved
Updated

AI response from production call

Where information is not explicitly stated, I call it out as an assumption.

Production Incident Summary

  1. Incident Overview

Description: A large batch of product rule publishes resulted in massive rule duplication, causing the Rules services to become CPU‑bound and repeatedly restart. This led to product pages, configurators, and admin tools failing or timing out.

Impacted Systems / Applications:

  • Rules API, Rules Admin, Rules Host (Autobahn / AVP)
  • Product configurator
  • PIP / Unified PIP pages
  • Admin (Autobahn) product management
  • PLA validation (for impacted products)

Severity: High / Production Incident (customer‑facing impact confirmed and production notice issued).

Start Time / Current Status:

  • Start: Prior to this call (exact timestamp not stated).
  • Current status: Ongoing during the call; partial service recovery achieved, but impacted products remain broken.

  2. Business Impact

Customer / User Impact:

  • Customers encounter CAT/error pages, broken configurators, and failed PIPs for specific products.
  • Internal users (PEXA, IR, Millwork) unable to reliably access or manage affected products in Admin.

Scope of Impact:

  • Initially thought to be ~5 products, later confirmed more than five, varying by site (BCOM, CA, etc.).
  • Includes high‑traffic, high‑value products (e.g., blackout and roller shades).

Revenue / Operational Impact:

  • Implicit revenue risk due to broken add‑to‑cart and configurator flows (not quantified).

SLA / SLO Risk:

  • High risk: services were timing out and CPU was pegged at ~100% on Rules servers.

  3. Timeline of Events

Times are approximate unless explicitly stated.

Early morning (pre‑call):

  • PEXA publishes large drafts with new rules.
  • Publishing retries occur due to failures, causing exponential rule duplication.

During initial investigation:

  • Rules servers show repeated service stop/start events.
  • CPU on AVP API 05/06 reaches ~100%.

Call begins:

  • Teams confirm issue is customer‑impacting, not just internal.
  • Production notice is initiated.

Mid‑call:

  • Teams instruct everyone to stay out of Admin to reduce load.
  • Configuration queues are shoveled into a temporary queue to allow Rules to recover.
  • CPU and error rates begin to drop.

Later in call:

  • Rules API is scaled horizontally by adding AVP API 01–04 into the target group.
  • Redeploy performed (same code version) to register additional servers.

Discovery:

  • Individual products found to have 20k–28k duplicate rules (some rules duplicated ~127 times).

Decision point:

  • Acknowledgement that platforms are stabilized, but products remain unusable until database cleanup.

End of call segment:

  • Team agrees to disable impacted products and begin database‑level deletion of duplicate rules.

  4. Root Cause (Known / Suspected)

Confirmed / Strongly Supported Root Cause:

  • Rules publishing process is not idempotent.
  • When publishes fail and retry, the same evaluation and execution rules are re‑inserted repeatedly instead of deduplicating.

Contributing Factors:

  • Large, complex products with hundreds of rules.
  • Automatic retry behavior during failures.
  • Configuration revalidation hammering the same Rules API used by live traffic.

Known Gaps:

  • No guardrail to stop excessive retries or cap rule duplication.
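
To make the root cause concrete, the following is a minimal sketch of what an idempotent publish could look like. Everything in it is an assumption for illustration: the `rules` table, its columns, the `publish_rules` helper, the product ID 1001, and the rule bodies are hypothetical, and SQLite is used only so the example runs on its own. The actual Rules schema and database engine are not described in the call notes.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table: the real Rules schema is not described in the notes.
    # The UNIQUE constraint is what makes a replayed insert a no-op.
    conn.execute(
        """
        CREATE TABLE rules (
            id INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL,
            rule_type TEXT NOT NULL,
            rule_body TEXT NOT NULL,
            UNIQUE (product_id, rule_type, rule_body)
        )
        """
    )


def publish_rules(conn, product_id, rules):
    """Insert rules so that replaying the same publish adds no new rows."""
    with conn:  # one transaction per publish attempt
        conn.executemany(
            "INSERT OR IGNORE INTO rules (product_id, rule_type, rule_body) "
            "VALUES (?, ?, ?)",
            [(product_id, rule_type, body) for rule_type, body in rules],
        )


def rule_count(conn, product_id):
    row = conn.execute(
        "SELECT COUNT(*) FROM rules WHERE product_id = ?", (product_id,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
create_schema(conn)

# Hypothetical draft with one evaluation rule and one execution rule.
draft = [("evaluation", "width > 72"), ("execution", "hide option X")]

for _ in range(5):  # simulate one publish followed by four retries
    publish_rules(conn, 1001, draft)

print(rule_count(conn, 1001))  # 2
```

With a uniqueness key in place, a retried publish leaves the rule count unchanged instead of growing it. Which columns define "the same rule" in the real system is a design decision for the Rules team.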

  5. Actions Taken

  • Issued a production notice acknowledging customer impact.
  • Instructed teams to stop accessing Admin and products.
  • Shoveled configuration messages to a temporary queue to reduce load.
  • Scaled Rules API horizontally by adding 4 additional servers (AVP API 01–04).
  • Redeployed Rules services (no code change) to stabilize infrastructure.
  • Identified and documented impacted products and duplication magnitude.
  • Began planning database‑level cleanup of duplicate rules.

  6. Current Status

Overall state: Mitigated but not resolved.

Working:

  • Core site pages for non‑impacted products.
  • Rules services no longer pegged at 100% CPU after scaling.

Still broken:

  • Impacted products’ PIPs, configurators, and Admin pages.
  • Products with tens of thousands of duplicated rules remain unusable.

Risk:

  • Any access to impacted products triggers rules execution and can re‑stress the system.

  7. Next Steps

| Action | Owner | Notes |
|--------|-------|-------|
| Disable impacted products across sites | Chandler Ullery / PEXA | Prevent further customer hits |
| Generate list of affected products (by site) | Chandler Ullery | Screenshot already shared in meeting chat |
| Write and execute DB delete script to remove duplicate rules | David Grothe & Bigmike Mcrorey | Target one product first |
| Keep config messages parked until cleanup completes | Dylan Yates | Prevent re-triggering |
| Review Rules engine PRs & backlog (duplication prevention) | Dylan Yates / Anish Patel | JIRA shared during call |

(Timelines not explicitly stated — assume immediate / same‑day due to severity.)
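
As an illustration of the "target one product first" cleanup step, here is a hedged sketch of a duplicate-removal routine with a dry-run mode. The `rules` table, its columns, and the product IDs are hypothetical, and the SQL is written against SQLite so the example is runnable; the production schema and database engine are not stated in the notes, so the real script would need to be adapted and reviewed against them.

```python
import sqlite3

# Hypothetical schema and grouping key: rows are treated as duplicates when
# they share product_id, rule_type, and rule_body. The lowest id in each
# group is kept; every other row in the group matches this filter.
DUPLICATE_FILTER = """
    product_id = ?
    AND id NOT IN (
        SELECT MIN(id) FROM rules
        WHERE product_id = ?
        GROUP BY rule_type, rule_body
    )
"""


def count_duplicates(conn, product_id):
    """Number of rows that a cleanup of this product would remove."""
    sql = "SELECT COUNT(*) FROM rules WHERE " + DUPLICATE_FILTER
    return conn.execute(sql, (product_id, product_id)).fetchone()[0]


def delete_duplicates(conn, product_id, dry_run=True):
    """Remove duplicate rows for one product; dry_run only reports the count."""
    if dry_run:
        return count_duplicates(conn, product_id)
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "DELETE FROM rules WHERE " + DUPLICATE_FILTER,
            (product_id, product_id),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rules (id INTEGER PRIMARY KEY, product_id INTEGER, "
    "rule_type TEXT, rule_body TEXT)"
)
# Two rules duplicated 127 times each on product 1001, one clean rule on 1002.
rows = (
    [(1001, "evaluation", "rule A")] * 127
    + [(1001, "execution", "rule B")] * 127
    + [(1002, "evaluation", "rule A")]
)
conn.executemany(
    "INSERT INTO rules (product_id, rule_type, rule_body) VALUES (?, ?, ?)", rows
)
conn.commit()

would_delete = delete_duplicates(conn, 1001)               # dry run, no changes
deleted = delete_duplicates(conn, 1001, dry_run=False)     # actual cleanup
remaining = conn.execute(
    "SELECT COUNT(*) FROM rules WHERE product_id = 1001"
).fetchone()[0]
print(would_delete, deleted, remaining)  # 252 252 2
```

Running the dry run first and comparing its count with the expected duplication magnitude is one way to reduce the risk of removing valid rules before the delete is executed.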

  8. Risks & Blockers

  • Data risk: Deleting rules incorrectly could remove valid rules.
  • Operational risk: Admin remains unusable for affected products until cleanup completes.
  • Technical debt: Known Rules engine bugs remain unfixed.
  • Visibility gap: No automatic detection for runaway duplication.
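
The visibility gap could be narrowed with a periodic check along the following lines. This is only a sketch: the input shape (pairs of product ID and a rule key), the function name `find_runaway_products`, and both thresholds are assumptions chosen for illustration, not values taken from the incident.

```python
from collections import Counter


def find_runaway_products(rows, max_copies=1, max_rules=5000):
    """Flag products whose rules look duplicated or abnormally numerous.

    rows: iterable of (product_id, rule_key) pairs, one per stored rule.
    max_copies and max_rules are hypothetical thresholds.
    """
    rows = list(rows)
    copies = Counter(rows)                     # (product_id, rule_key) -> count
    totals = Counter(pid for pid, _ in rows)   # product_id -> total rules

    # Highest copy count of any single rule, per product.
    worst = {}
    for (pid, _), n in copies.items():
        worst[pid] = max(worst.get(pid, 0), n)

    flagged = {}
    for pid, total in totals.items():
        if worst[pid] > max_copies or total > max_rules:
            flagged[pid] = {"total_rules": total, "max_copies": worst[pid]}
    return flagged


# Hypothetical sample: product 1001 has one rule stored 127 times.
sample = [(1001, "a")] * 127 + [(1001, "b"), (1002, "a"), (1002, "b")]
print(find_runaway_products(sample))
# {1001: {'total_rules': 128, 'max_copies': 127}}
```

In practice the same counts could come from a grouped database query, with the flagged products feeding an alert.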

  9. Key Decisions Made

  • Confirmed customer‑impacting incident → production notice issued.
  • No rollback (issue not caused by new deployment).
  • Scale out before fixing data to stabilize platform.
  • Disable products temporarily rather than keep retrying publishes.
  • Database cleanup approved as necessary remediation.

  10. Follow‑Up Items

Postmortem required: ✅ Yes

Preventative / Long‑Term Fixes:

  • Make rule publishing idempotent.
  • Add retry caps / circuit breakers.
  • Separate configuration revalidation traffic from live Rules API.
  • Add monitoring for abnormal rule growth per product.
  • Require safer publish gates for large product updates.
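
For the retry cap and circuit breaker item, a minimal sketch is shown below. The class name `PublishBreaker`, its parameters, and the default limits are all hypothetical; it illustrates the general pattern (bounded attempts per publish, then a pause after repeated failures) rather than how the Rules services actually retry. Backoff between attempts is omitted to keep the example short.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when publishing is paused after repeated failures."""


class PublishBreaker:
    # All limits are illustrative assumptions, not values from the incident.
    def __init__(self, max_attempts=3, failure_threshold=5,
                 cooldown_seconds=60.0, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None

    def call(self, publish):
        """Run publish() with capped retries; refuse to run while open."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                raise CircuitOpenError("publishing paused after repeated failures")
            # Cooldown elapsed: close the circuit and allow a fresh attempt.
            self._opened_at = None
            self._consecutive_failures = 0

        last_error = None
        for _ in range(self.max_attempts):
            try:
                result = publish()
            except Exception as exc:
                last_error = exc
                self._consecutive_failures += 1
                if self._consecutive_failures >= self.failure_threshold:
                    self._opened_at = self._clock()  # open the circuit
                    break
            else:
                self._consecutive_failures = 0
                return result
        raise last_error


attempts = []


def always_fails():
    attempts.append(1)
    raise TimeoutError("Rules API timed out")


breaker = PublishBreaker(max_attempts=3, failure_threshold=5, clock=lambda: 0.0)
for _ in range(4):
    try:
        breaker.call(always_fails)
    except (CircuitOpenError, TimeoutError):
        pass

print(len(attempts))  # 5
```

In this run, four publish requests result in only five underlying attempts: the first request uses its three attempts, the second trips the breaker on its second attempt, and the last two are rejected without touching the Rules API.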

Escalation Highlight

This incident exposes systemic risk in the Rules engine affecting both availability and data integrity. Leadership escalation is warranted to prioritize architectural fixes, not just cleanup.

GCC Technology Team
Resolved

We've now resolved the incident. Thanks for your patience.

GCC Technology Team
Updated

The system is up and running for all products except for the following that are still impacted:

  • 500969
  • 500971
  • 500972
  • 505226
  • 507663
  • 507665
  • 548080
  • 548461
  • 554260
  • 661918
GCC Technology Team
Identified

We've confirmed there is a problem, we're working to resolve it.

GCC Technology Team
Investigating

Summary:

[What are the issues?]

Impact

  • Incident Started: 7:47am
  • Affecting Customers: Yes
  • Affecting CEC: Yes
  • Ongoing Impact:
    • Currently unable to view certain products and getting broken CAT pages

Detail:

[What are we doing to fix it?]

  • Our engineering team is aware of the issue and is working to get everything back up and running as quickly as possible.

ETA:

Currently, no ETA.

Please remain attentive to forthcoming updates, as additional information will be disseminated promptly upon availability. We sincerely apologize for any inconvenience this may cause to your work. We acknowledge the importance of your access to our services and appreciate your patience and understanding as we work diligently to resolve the issue.

GCC Technology Team
Began at:

Affected components
  • Blinds.com
    • PIP
    • Others
  • AmericanBlinds.com
    • PIP
    • Configurator
  • JustBlinds.com
    • PIP
    • Configurator
  • Blinds.ca
    • PIP
    • Configurator