At roughly 2:06 PM CDT our Azure Application Gateway started having connectivity issues reaching our backend Azure application. The gateway decided our application was offline and began responding with 502 errors.
Note: This impacted only the Retrace website user interface, not the several API applications we serve. All APM, error, and log data was still being ingested.
As soon as our monitoring alerted us to the issue, we quickly confirmed that our application itself was healthy and that this was an internet traffic routing problem. We changed our DNS to bypass the gateway, but that caused the custom dashboards feature in Retrace to be unavailable. We also disabled proxying traffic through Cloudflare. After about 30 minutes the gateways started working again, and we changed DNS back to route through the gateway, which fully restored custom dashboards functionality.
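One quick way to confirm whether a DNS or proxy change like the one above has actually taken effect is to inspect response headers: when Cloudflare's proxy is in the path, responses carry a `server: cloudflare` header (and a `cf-ray` header), while direct-to-origin traffic does not. A minimal sketch of such a check, using a canned header block and a placeholder hostname rather than our real one:

```shell
#!/bin/sh
# Hypothetical helper: decide from raw HTTP response headers (on stdin)
# whether the request passed through Cloudflare's proxy. When the proxy
# is enabled, Cloudflare sets "server: cloudflare" on responses.
is_proxied() {
  grep -qi '^server: *cloudflare'
}

# Offline example with canned headers; in practice you would pipe live
# headers in, e.g.:  curl -sI https://app.example.com | is_proxied
printf 'HTTP/2 200\nserver: cloudflare\ncf-ray: 4f2d1a\n' | is_proxied \
  && echo "proxied via Cloudflare" \
  || echo "direct to origin"
# → proxied via Cloudflare
```

Running the same check before and after a DNS cutover shows when resolvers have picked up the change.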
We also saw problems in our QA environment, as well as issues with our Retrace monitoring agents reporting data to our staging environment. These symptoms were all distinct, but all were similar internet access/traffic routing types of problems.
After further research, we believe the root cause was Cloudflare. We have discovered that Cloudflare had networking issues today and has a status page up about problems on their end. Stackify uses Cloudflare to proxy our traffic for CDN and performance reasons. Intermittent failures within Cloudflare were the likely root cause of today's problem; one of the steps we took during the outage was disabling Cloudflare's proxy feature.
We were using Cloudflare's traffic proxying in some places where we don't strictly need it. We have disabled it in those places going forward.