Problems accessing Stackify Retrace
Incident Report for Stackify
Postmortem

At roughly 2:06 PM CDT, our Azure Application Gateway started having connectivity issues reaching our backend Azure application. As a result, the gateway decided that our application was offline and started responding with 502 errors.
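
As a rough illustration of why a healthy application can still produce 502 errors, the sketch below models a gateway that marks a backend as offline after several failed health probes. It is a simplified model and not Azure Application Gateway's actual implementation; the class name, function name, and threshold value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Hypothetical model of how a gateway tracks one backend's health."""

    unhealthy_threshold: int = 3  # assumed value, chosen for illustration
    consecutive_failures: int = 0

    def record_probe(self, succeeded: bool) -> None:
        # A successful probe resets the counter; a failed probe increments it.
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.unhealthy_threshold


def gateway_status(backend: BackendHealth) -> int:
    """Return the status a client would see for a request through the gateway."""
    # The gateway answers 502 when it believes the backend is offline, even if
    # the application is fine and only the network path to it is failing.
    return 200 if backend.healthy else 502
```

In this model, the application never has to fail: a network problem between the gateway and the backend is enough to fail the probes and trigger 502 responses.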

Note: This only impacted the Retrace website user interface, and not the several API applications we serve. All APM, error, and log data was still being ingested.

As soon as our monitoring alerted us to the issue, we quickly identified that our application was working fine and that this was more of an internet traffic routing issue. We changed our DNS to bypass the gateway, but that made the custom dashboards feature in Retrace unavailable. We also disabled proxying traffic through Cloudflare. After about 30 minutes, the gateways started working again and we changed the DNS back to using the gateway. This fully restored the functionality of custom dashboards.

We also saw problems in our QA environment and issues with our Retrace monitoring agents reporting data to our staging environment. These were all different but similar internet access and traffic routing problems.

After further research, we believe the root cause was Cloudflare. We have also discovered that Cloudflare has had some networking issues today, and they have a status page up about issues on their end. Stackify uses Cloudflare to proxy our traffic for CDN and performance purposes. Intermittent problems in Cloudflare were the likely root cause of today’s problem. One of the steps we took during the outage was disabling Cloudflare’s proxy feature.

We were using Cloudflare’s traffic proxying in some places where we don’t necessarily need it. We have disabled it in those places going forward.

Posted Mar 17, 2020 - 16:56 CDT

Resolved
The Retrace UI continues to work just fine. The issues we were seeing across multiple Azure environments have all resolved as well.
Posted Mar 17, 2020 - 16:34 CDT
Monitoring
The issue with the Azure Gateway seems to have resolved. We are keeping a close eye on it. Custom dashboards are now working again.
Posted Mar 17, 2020 - 15:15 CDT
Identified
The issue has been identified. There appears to be an issue with Azure Application Gateways. We are seeing it in both our production and QA environments. We have redirected s1.stackify.com to bypass the gateway. Retrace's main UI now works except for custom dashboards. We are still investigating.
Posted Mar 17, 2020 - 14:37 CDT
Investigating
Stackify users may be experiencing issues accessing the login to Retrace at https://s1.stackify.com

Our team is working on this and we will update as more news becomes available. Please check back for an update. Thank you for your patience.
Posted Mar 17, 2020 - 14:18 CDT
This incident affected: Management Portal.