1Password Outage (28/04/22)
Breakdown

Curious about outages, distributed systems, DevOps, backend, and SRE-related topics.
Another day, another outage, another breakdown. Here I will share some insights on the 1Password outage of 28 April 2022, when users faced login errors and failed connections. As always, let's take a deep dive into the engineering lessons we can take away from it.
What Happened?
On April 27th they performed an upgrade on the database, and the outage appeared the next day as a consequence of that change. It was not a security issue, and users' data stayed safe (good news for a password manager app). The upgrade itself was successful; it had been planned to improve the performance of their database.
Later it was revealed that a few queries that were meant to optimize the system ended up hurting it, because the upgraded database responded differently to those queries. The thing to notice here is that the same issue didn't appear in their test environment under controlled parameters, so they were completely surprised to see it. Ultimately, all of this led to a temporary service disruption that impacted syncing data across devices, access to 1Password.com administrative interfaces, new account signups, and performance of the 1Password Connect server.
What they did
Incident handling is what makes DevOps/SRE roles cool, in my opinion. Let's see how 1Password handled the outage. Beforehand, they had tested extensively and concluded that upgrading their current version of MySQL to the latest one would give them an edge, and they did the upgrade in a scheduled window that had been planned in advance.
During the incident they saw a large number of database connections held open by queries that were not completing efficiently. The inefficient queries were causing lock contention, which in turn used up the connection limit.
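To see why slower queries can eat through a connection limit, a rough sketch using Little's Law helps: average connections in use is roughly queries per second multiplied by the time each query holds a connection. All numbers below are made up for illustration and do not come from 1Password's postmortem.

```python
def connections_in_use(queries_per_second: float, avg_query_seconds: float) -> float:
    """Little's Law: average concurrency = arrival rate * time in system."""
    return queries_per_second * avg_query_seconds


# Hypothetical values, chosen only to illustrate the effect.
MAX_CONNECTIONS = 1000
QPS = 5000

healthy = connections_in_use(QPS, 0.020)    # 20 ms per query
contended = connections_in_use(QPS, 0.400)  # 400 ms per query while waiting on locks

print(f"healthy:   {healthy:.0f} of {MAX_CONNECTIONS} connections")
print(f"contended: {contended:.0f} of {MAX_CONNECTIONS} connections")
```

With the same traffic, a query that takes 20 times longer needs about 20 times as many connections, so the limit can be hit without any increase in load.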
So they brought down the sync services to give the system some time to recover (a cool move).
Working from their new hypothesis, they optimized the SQL queries and deployed the changes to production, scaled the database up to the capacity they had originally targeted, and verified that it could handle the extra load too. It worked out.
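The same idea can be automated as load shedding. The sketch below is only an illustration of the pattern, not how 1Password did it (as far as the postmortem says, they took sync down by hand). The class name `LoadShedder` and both thresholds are my own assumptions.

```python
from dataclasses import dataclass


@dataclass
class LoadShedder:
    """Pauses non-critical work when the database nears its connection limit."""
    max_connections: int
    shed_above: float = 0.8    # assumed: start shedding at 80% utilisation
    resume_below: float = 0.5  # assumed: resume only at 50%, to avoid flapping
    shedding: bool = False

    def allow_sync(self, open_connections: int) -> bool:
        utilisation = open_connections / self.max_connections
        if utilisation >= self.shed_above:
            self.shedding = True
        elif utilisation <= self.resume_below:
            self.shedding = False
        # Between the two thresholds the previous state is kept.
        return not self.shedding


shedder = LoadShedder(max_connections=1000)
for open_conns in (300, 850, 700, 400):
    state = "sync allowed" if shedder.allow_sync(open_conns) else "sync paused"
    print(open_conns, state)
```

The gap between the two thresholds is deliberate: sync stays paused at 700 connections even though that is below 80%, so the service does not bounce on and off while the database is still recovering.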
What they could have done
Better capacity planning for higher traffic/load
Better test planning, with load that looks more like production
Keeping the sync service as independent as possible
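One common way to keep a service like sync from dragging others down is the bulkhead pattern: give each service its own connection budget. This is a minimal sketch of that idea, not something taken from 1Password's architecture; the `Bulkhead` class, the service names, and the limits are all hypothetical.

```python
import threading


class Bulkhead:
    """Caps how many database connections one service may hold at once."""

    def __init__(self, name: str, limit: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        # Non-blocking: fail fast instead of queueing behind a stuck service.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


# Hypothetical budgets: sync may hold at most 2 slots, sign-in has its own 2.
sync = Bulkhead("sync", limit=2)
signin = Bulkhead("signin", limit=2)

# Sync gets stuck and holds on to every slot it is given...
sync_results = [sync.try_acquire() for _ in range(3)]
# ...but sign-in still has capacity of its own.
signin_ok = signin.try_acquire()

print("sync:", sync_results)
print("signin:", signin_ok)
```

Even when sync has used up its whole budget, sign-in can still get a connection, which is the kind of independence the last point is asking for.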
Thanks for reading this far. Suggestions and corrections of my errors are most welcome.
Here are more resources if you want to know more:
https://blog.1password.com/update-on-our-recent-service-disruption/
https://1password.statuspage.io/incidents/4dkxnp84p27p

