Issue with making changes to data

Apr 11 at 12:15pm BST

Affected services

Breww core platform (inc. mobile app)

Resolved
Apr 11 at 12:15pm BST

We're really sorry, but we had a problem that prevented you from making changes to your data. Breww remained fully usable in a "view-only" capacity, but most data "writes" were failing. The issue started around midnight and lasted until just before 7:30am (UK time).

The issue was caused by one of our security mechanisms (known as CSRF protection) erroneously rejecting users' requests to apply changes to their data. When you send data update requests to Breww, we verify that they come from a trusted source. During the incident, this check wrongly concluded that requests were untrustworthy, even though they were, in fact, from trusted sources.

We have extensive monitoring of our systems in place to ensure that we know about any issues with our platform. This monitoring will wake our team up in the middle of the night if needed. Unfortunately, our monitoring didn't detect this issue, so it didn't wake us up to fix it. This is why the problem took so long to be fixed. Once our team was aware of the problem, it was resolved within 10 minutes.

Why did it occur in the first place?

Ironically, we had deployed a change intended to improve reliability. We deliberately made it during the period of the day when Breww is at its quietest, which is late evening for our UK-based team. The change was well tested in our non-production environments. However, our production environment differs in how the web application firewall (WAF) and reverse proxies work together, and as a result an HTTP header was being dropped in production that wasn't dropped in the other environments. Without that header, the CSRF checks rejected the requests to change data.
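
To illustrate how a single dropped header can cause every write to be rejected, here is a minimal sketch in Python. It is not Breww's actual code: the header name (`Origin`), the trusted origin URL, and the function names are all assumptions made for the example, since the specific header involved isn't named here.

```python
from typing import Dict, Mapping

# Hypothetical list of origins the application trusts (assumption for this sketch).
TRUSTED_ORIGINS = {"https://app.example.com"}


def csrf_check_passes(headers: Mapping[str, str]) -> bool:
    """Accept a write request only if it declares a trusted origin.

    A missing header is treated the same as an untrusted one, which is the
    safe default but also why a dropped header blocks legitimate users.
    """
    origin = headers.get("Origin")
    return origin is not None and origin in TRUSTED_ORIGINS


def lossy_proxy(headers: Mapping[str, str]) -> Dict[str, str]:
    """Simulate an intermediary (e.g. a WAF or reverse proxy) that drops the header."""
    return {name: value for name, value in headers.items() if name != "Origin"}


# A legitimate request from a trusted source.
request_headers = {
    "Origin": "https://app.example.com",
    "Content-Type": "application/json",
}

direct_ok = csrf_check_passes(request_headers)                # header intact
proxied_ok = csrf_check_passes(lossy_proxy(request_headers))  # header dropped
```

In this sketch the same legitimate request passes when the header arrives intact and fails once the intermediary removes it, which is why read-only pages kept working while writes did not.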

Because of the nature of the update, we scheduled it for Breww's quietest period to reduce the impact of any unexpected problems. Unfortunately, this worked against us in this case. The change appeared to have gone live perfectly, and we hadn't encountered any problems during testing in our non-production environments, so our team went to bed believing all was well. We hadn't anticipated that it could fail in the way it did.

As we have customers all around the world but are based in the UK, our out-of-hours isn't out-of-hours for everyone.

Going forward

This became such a significant problem because our monitoring did not alert us to it. This is arguably the biggest failing here. We'll be introducing new monitoring metrics to detect issues like this in the future and ensure the team is notified (and woken up if required). We're going to start monitoring application response codes for elevated levels or spikes, and ensure this triggers the same alerting processes as our many other already-monitored metrics.
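
As a rough illustration of what monitoring response codes for spikes can look like, here is a minimal sketch in Python. It is not our production monitoring: the class name, window size, threshold, and minimum sample count are all hypothetical values chosen for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Track recent HTTP status codes and flag an elevated error rate."""

    def __init__(self, window_size: int = 200, threshold: float = 0.2,
                 min_samples: int = 50) -> None:
        # Only the most recent `window_size` responses are kept.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold      # fraction of errors that triggers an alert
        self.min_samples = min_samples  # avoid alerting on a handful of requests

    def record(self, status_code: int) -> bool:
        """Record one response and return True if an alert should fire."""
        self.statuses.append(status_code)
        return self.should_alert()

    def error_rate(self) -> float:
        """Fraction of responses in the window with a 4xx or 5xx status."""
        if not self.statuses:
            return 0.0
        errors = sum(1 for status in self.statuses if status >= 400)
        return errors / len(self.statuses)

    def should_alert(self) -> bool:
        enough_data = len(self.statuses) >= self.min_samples
        return enough_data and self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window_size=100, threshold=0.2, min_samples=50)

# Normal traffic: every request succeeds, so no alert fires.
alerts_during_normal = [monitor.record(200) for _ in range(100)]

# Writes start being rejected (403), as they would when a CSRF check fails.
alerts_during_incident = [monitor.record(403) for _ in range(30)]
```

Counting 4xx responses as well as 5xx matters for an incident like this one, because rejected writes look like client errors rather than server crashes. A real setup would feed the alert into the same on-call paging used for other metrics.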

We're sorry

We hold our hands up. We let you down today. We know that Breww is vital to your business and that issues like this are not acceptable. We've already learnt from it and will have improved processes and monitoring in place very soon.

We appreciate all the kind messages and thank you for being so understanding that problems do happen despite our best efforts. This will only make Breww stronger and more resilient going forward.