Real-Time Event Ingestion Degradation (Enterprise US Cluster)
Resolved
Apr 23 at 12:18am UTC
Status: Resolved
Start: 23 Apr 2025 00:18 UTC | End: 23 Apr 2025 01:47 UTC
Total duration: 1 h 29 m
Description
Between 00:18 UTC and 01:47 UTC our Enterprise US ingestion endpoints were unable to accept new connections. End-users experienced HTTP 503 errors and some real-time events during this window were not recorded.
Root Cause
A scheduled down-scaling event reduced the number of nodes in the cluster. The connections handed off to the remaining nodes pushed those VMs past their maximum process limit. Once that limit was reached, the nodes rejected further connections even though CPU and memory remained well within capacity. The limit had not been revisited as traffic grew organically over time, which is why the hand-off exceeded it.
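For readers curious about the class of limit involved: on Linux VMs this kind of ceiling typically corresponds to the RLIMIT_NPROC resource limit. As an illustrative sketch (not our actual tooling or configuration), it can be inspected from Python's standard library:

```python
# Illustrative sketch, not our production tooling: inspect RLIMIT_NPROC,
# the per-user process ceiling of the kind involved in this incident.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
# resource.RLIM_INFINITY means no limit; otherwise the value is the maximum
# number of processes the current user may create. New processes (and hence
# new connection handlers) beyond the soft limit are refused by the kernel.
print(f"soft={soft} hard={hard}")
```

When the soft limit is hit, process creation fails even if CPU and memory are idle, which matches the behaviour we observed.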
Resolution
At 01:24 UTC we identified the process limit as the bottleneck, immediately raised it, and redeployed the affected nodes. Traffic began flowing normally at 01:42 UTC, and the incident was fully resolved by 01:47 UTC.
Post-incident actions
We have increased the VM process limit 2× across all clusters. Additionally, we have set up an automated alert that fires at 80% process utilisation.
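The new alert condition amounts to a simple threshold check. A minimal sketch, assuming a hypothetical check function (the name, parameters, and default threshold below are illustrative, not our actual monitoring configuration):

```python
# Illustrative sketch of the 80% process-utilisation alert; the function
# name and parameters are hypothetical, not our monitoring configuration.
def should_alert(process_count: int, process_limit: int,
                 threshold: float = 0.80) -> bool:
    """Return True when process utilisation reaches the alert threshold."""
    return process_count >= threshold * process_limit
```

For example, with a limit of 1000 processes the alert would fire at 800 running processes, well before the hard limit starts rejecting connections.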
Customer Impact & Next Steps
Enterprise US Ingestion API endpoints were degraded for up to 89 minutes, and some events sent during this window were not recorded.
We continue to monitor process utilisation alongside CPU and memory metrics, and we are refining our scaling rules to ensure capacity is provisioned for both compute and connection counts.
If you have any questions or notice remaining anomalies, please reach out to support@userpilot.co or via the in-app chat.
We apologize for the inconvenience caused. Thank you for your patience and trust.
Affected services
Event Ingestion (US - Enterprise)