AI response from production call
Where information is not explicitly stated, I call it out as an assumption.
Production Incident Summary
Description: A large batch of product rule publishes resulted in massive rule duplication, causing the Rules services to become CPU-bound and repeatedly restart. This led to product pages, configurators, and admin tools failing or timing out.
Impacted Systems / Applications:
- Rules API, Rules Admin, Rules Host (Autobahn / AVP)
- Product configurator
- PIP / Unified PIP pages
- Admin (Autobahn) product management
- PLA validation (for impacted products)
Severity: High / Production Incident (customer-facing impact confirmed and production notice issued).
Start Time / Current Status:
- Start: Prior to this call (exact timestamp not stated).
- Current status: Ongoing during the call; partial service recovery achieved, but impacted products remain broken.
Customer / User Impact:
- Customers encounter CAT/error pages, broken configurators, and failed PIPs for specific products.
- Internal users (PEXA, IR, Millwork) are unable to reliably access or manage affected products in Admin.
Scope of Impact:
- Initially thought to be ~5 products, later confirmed to be more than five, varying by site (BCOM, CA, etc.).
- Includes high-traffic, high-value products (e.g., blackout and roller shades).
Revenue / Operational Impact:
- Implicit revenue risk due to broken add-to-cart and configurator flows (not quantified).
SLA / SLO Risk:
- High risk: services were timing out and CPU was pegged at ~100% on Rules servers.
Times are approximate unless explicitly stated.
Early morning (pre-call):
- PEXA publishes large drafts with new rules.
- Publishes fail and are retried, causing exponential rule duplication.
During initial investigation:
- Rules servers show repeated service stop/start events.
- CPU on AVP API 05/06 reaches ~100%.
Call begins:
- Teams confirm the issue is customer-impacting, not just internal.
- Production notice is initiated.
Mid-call:
- Teams instruct everyone to stay out of Admin to reduce load.
- Configuration queues are shoveled into a temporary queue to allow Rules to recover.
- CPU and error rates begin to drop.
Later in call:
- Rules API is scaled horizontally by adding AVP API 01–04 into the target group.
- Redeploy performed (same code version) to register the additional servers.
Discovery:
- Individual products found to have 20k–28k duplicate rules (some rules duplicated ~127 times).
Decision point:
- Acknowledgement that platforms are stabilized, but products remain unusable until database cleanup.
End of call segment:
- Team agrees to disable impacted products and begin database-level deletion of duplicate rules.
Confirmed / Strongly Supported Root Cause:
- Rules publishing process is not idempotent. When publishes fail and retry, the same evaluation and execution rules are re-inserted repeatedly instead of being deduplicated.
Contributing Factors:
- Large, complex products with hundreds of rules.
- Automatic retry behavior during failures.
- Configuration revalidation hammering the same Rules API used by live traffic.
Known Gaps:
- No guardrail to stop excessive retries or cap rule duplication.
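As an illustration of what idempotent publishing could look like, here is a minimal sketch. It is not the Rules engine's actual implementation: the `rules` table, its columns, and the helper names `rule_key`, `publish_rules`, and `count_rules` are all hypothetical, and SQLite stands in for whatever database the Rules services really use. The idea is that each rule gets a stable fingerprint derived from its product and content, and the database rejects a second insert of the same fingerprint, so a retried publish adds nothing.

```python
import hashlib
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical rules table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rules ("
        "rule_key TEXT PRIMARY KEY, product_id TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def rule_key(product_id: str, rule: dict) -> str:
    """Stable fingerprint: same product and same rule content give the same key."""
    canonical = json.dumps(rule, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{product_id}|{canonical}".encode("utf-8")).hexdigest()


def count_rules(conn: sqlite3.Connection, product_id: str) -> int:
    row = conn.execute(
        "SELECT COUNT(*) FROM rules WHERE product_id = ?", (product_id,)
    ).fetchone()
    return row[0]


def publish_rules(conn: sqlite3.Connection, product_id: str, rules: list) -> int:
    """Insert rules for a product; returns how many rows were actually added.

    INSERT OR IGNORE skips any row whose rule_key already exists, so
    re-running a failed publish cannot create duplicates.
    """
    before = count_rules(conn, product_id)
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO rules (rule_key, product_id, body) VALUES (?, ?, ?)",
            [
                (rule_key(product_id, r), product_id, json.dumps(r, sort_keys=True))
                for r in rules
            ],
        )
    return count_rules(conn, product_id) - before
```

With this shape, calling `publish_rules` twice with the same payload adds rows only the first time. A uniqueness constraint in the database is used here rather than an application-side check so that concurrent or retried publishes are rejected at the point of insert.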
- Issued a production notice acknowledging customer impact.
- Instructed teams to stop accessing Admin and products.
- Shoveled configuration messages to a temporary queue to reduce load.
- Scaled Rules API horizontally by adding 4 additional servers (AVP API 01–04).
- Redeployed Rules services (no code change) to stabilize infrastructure.
- Identified and documented impacted products and duplication magnitude.
- Began planning database-level cleanup of duplicate rules.
Overall state: Mitigated but not resolved.
Working:
- Core site pages for non-impacted products.
- Rules services no longer pegged at 100% CPU after scaling.
Still broken:
- Impacted products' PIPs, configurators, and Admin pages.
- Products with tens of thousands of duplicated rules remain unusable.
Risk:
- Any access to impacted products triggers rules execution and can re-stress the system.
Action | Owner | Notes
Disable impacted products across sites | Chandler Ullery / PEXA | Prevent further customer hits
Generate list of affected products (by site) | Chandler Ullery | Screenshot already shared in meeting chat
Write and execute DB delete script to remove duplicate rules | David Grothe & Bigmike Mcrorey | Target one product first
Keep config messages parked until cleanup completes | Dylan Yates | Prevent re-triggering
Review Rules engine PRs & backlog (duplication prevention) | Dylan Yates / Anish Patel | JIRA shared during call
(Timelines not explicitly stated — assume immediate / same‑day due to severity.)
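The duplicate-rule cleanup, scoped to one product first, could be sketched roughly as follows. This is an illustration only: the real schema was not described on the call, so the `rules` table and its `id`, `product_id`, `rule_type`, and `body` columns are assumptions, as is the choice to treat rows with identical `rule_type` and `body` as duplicates and keep the lowest `id`. SQLite is used purely so the sketch runs on its own.

```python
import sqlite3

# Hypothetical schema: rules(id, product_id, rule_type, body).
# Selects every row for the product except the lowest-id row of each
# (rule_type, body) group, i.e. the copies that would be deleted.
DUPLICATE_IDS_SQL = """
    SELECT id FROM rules
    WHERE product_id = ?
      AND id NOT IN (
          SELECT MIN(id) FROM rules
          WHERE product_id = ?
          GROUP BY rule_type, body
      )
"""


def dedupe_product(conn: sqlite3.Connection, product_id: str, dry_run: bool = True) -> int:
    """Return the number of duplicate rows for one product.

    With dry_run=True (the default) nothing is deleted, so the count can be
    reviewed before any data is touched.
    """
    ids = [row[0] for row in conn.execute(DUPLICATE_IDS_SQL, (product_id, product_id))]
    if not dry_run and ids:
        with conn:  # single transaction; rolls back if any delete fails
            conn.executemany("DELETE FROM rules WHERE id = ?", [(i,) for i in ids])
    return len(ids)
```

Running with `dry_run=True` first and comparing the count against the expected duplication for that product is one way to reduce the chance of deleting valid rules. Taking a backup of the affected rows before the real run would be a sensible extra precaution.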
- Data risk: Deleting rules incorrectly could remove valid rules.
- Operational risk: Admin remains unusable for affected products until cleanup completes.
- Technical debt: Known Rules engine bugs remain unfixed.
- Visibility gap: No automatic detection for runaway duplication.
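To make the visibility gap concrete, the following is a rough sketch of a duplication check that a scheduled job could run. Everything in it is an assumption for illustration: the input is taken to be (product id, rule fingerprint) pairs, and the `max_factor` threshold of 1.5 is arbitrary. It computes, per product, the ratio of total rules to distinct rules and reports any product whose ratio exceeds the threshold.

```python
from collections import Counter, defaultdict


def duplication_report(rules, max_factor=1.5):
    """Flag products whose rules are heavily duplicated.

    rules: iterable of (product_id, fingerprint) pairs, where fingerprint is
    any hashable value identifying the rule's content (hypothetical input).
    Returns {product_id: total_rules / distinct_rules} for products whose
    ratio is greater than max_factor.
    """
    total = Counter()
    distinct = defaultdict(set)
    for product_id, fingerprint in rules:
        total[product_id] += 1
        distinct[product_id].add(fingerprint)

    report = {}
    for product_id, count in total.items():
        factor = count / len(distinct[product_id])
        if factor > max_factor:
            report[product_id] = factor
    return report
```

A healthy product has a factor of 1.0. In this incident, where some rules were duplicated ~127 times, a check like this would report a factor far above any reasonable threshold.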
- Confirmed customer-impacting incident → production notice issued.
- No rollback (issue not caused by new deployment).
- Scale out before fixing data to stabilize platform.
- Disable products temporarily rather than keep retrying publishes.
- Database cleanup approved as necessary remediation.
Postmortem required: ✅ Yes
Preventative / Long-Term Fixes:
- Make rule publishing idempotent.
- Add retry caps / circuit breakers.
- Separate configuration revalidation traffic from live Rules API.
- Add monitoring for abnormal rule growth per product.
- Require safer publish gates for large product updates.
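As a sketch of the retry cap and circuit breaker idea, the wrapper below limits attempts per publish and stops calling the publish function entirely after too many consecutive failures. It is a generic pattern, not the Rules engine's code: the class name `PublishGuard`, the `CircuitOpenError` exception, and all default values are hypothetical, and the breaker here only reopens through a manual `reset()`.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and publishes are being refused."""


class PublishGuard:
    def __init__(self, max_attempts=3, failure_threshold=5, base_delay=0.5, sleep=time.sleep):
        self.max_attempts = max_attempts            # retry cap per publish call
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.base_delay = base_delay                # seconds; doubled after each failed attempt
        self.sleep = sleep                          # injectable so tests need not wait
        self.consecutive_failures = 0

    def reset(self):
        """Close the breaker again, e.g. after an operator confirms recovery."""
        self.consecutive_failures = 0

    def call(self, publish, *args, **kwargs):
        if self.consecutive_failures >= self.failure_threshold:
            raise CircuitOpenError("too many consecutive publish failures")
        last_error = None
        for attempt in range(self.max_attempts):
            try:
                result = publish(*args, **kwargs)
            except Exception as exc:
                last_error = exc
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.failure_threshold:
                    break  # breaker just opened; stop retrying immediately
                if attempt < self.max_attempts - 1:
                    self.sleep(self.base_delay * 2 ** attempt)  # exponential backoff
            else:
                self.consecutive_failures = 0
                return result
        raise last_error
```

A guard like this bounds how many times a failing publish can hit the database, which limits the damage even if the publish itself is not yet idempotent. Whether the thresholds should be per product, per publisher, or global is a design decision this sketch does not address.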
Escalation Highlight
This incident exposes systemic risk in the Rules engine affecting both availability and data integrity. Leadership escalation is warranted to prioritize architectural fixes, not just cleanup.