MyFitnessPal Outage (22/04/22)
Breakdown

Curious about outages, distributed systems, DevOps, backend and SRE-related stuff.
Another day, another outage, another breakdown. The fitness app MyFitnessPal faced an outage on its web app, iOS app and Android app all at once. I don't want to look into the scale of the impact; I will focus on the type of impact and on what we can learn from the outage from an engineering perspective. Let's go for a deep dive, folks...
What happened?
It started on 22 April 2022 and kept going until 30 April. Web app users could not create new items, edit existing ones, or add existing items to their cart/diary. Mobile app users faced similar errors. Along with that, some of them were getting login errors and server errors. A few days in, data sync errors appeared too: data was being stored only locally. This was a consequence of the backend server issue, which corrupted the services responsible for syncing data with the cloud.
What did they do?
Initially they were unable to find any workaround, which is what happens in most cases, but if you stick with a problem long enough, ideas slowly start to pop up. One of the major reasons for this outage was a migration they were performing behind the scenes. I have been part of a migration myself for the last year, and I know the effort and pain that go into one. There is always the worry that something will go wrong.
Well, they found the error and solved it in a few days. Next came updating the database for the newly migrated system. When almost all of the services were back in place, the app started misbehaving again. They resolved this one too, then deployed the respective code changes, cleaned up the corrupt data and rebuilt the databases. This almost eradicated the login and sync issues.
What they could have done
A proper migration strategy. That's it. Migrations come in different types. A migration performed under live load, without bringing the site down, is always risky because no one can reliably predict which services will respond badly to it. Migration strategy is a vast topic in itself; I will write an article on it sometime later, but until then, here is an article that has a detailed explanation of different migration plans.
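To make the live-migration risk a bit more concrete, here is a minimal sketch of one common approach, dual writes: the old store stays the source of truth while every write is mirrored to the new one, and cutover only happens once the two agree. This is not what MyFitnessPal did (their setup is not public); `DictStore`, `DualWriter` and every method name below are hypothetical stand-ins I made up for illustration.

```python
"""Dual-write migration sketch. All names here are hypothetical."""
import logging
from typing import Dict, List, Optional

log = logging.getLogger("migration")


class DictStore:
    """In-memory stand-in for a real database (assumption for the demo)."""

    def __init__(self, fail_writes: bool = False) -> None:
        self.rows: Dict[str, str] = {}
        self.fail_writes = fail_writes  # flip to simulate an unavailable store

    def put(self, key: str, value: str) -> None:
        if self.fail_writes:
            raise ConnectionError("store unavailable")
        self.rows[key] = value

    def get(self, key: str) -> Optional[str]:
        return self.rows.get(key)


class DualWriter:
    """Writes to the old store first, then mirrors the write to the new one."""

    def __init__(self, old: DictStore, new: DictStore) -> None:
        self.old = old
        self.new = new
        self.missed: List[str] = []  # keys whose mirror write failed

    def put(self, key: str, value: str) -> None:
        self.old.put(key, value)  # old store stays the source of truth
        try:
            self.new.put(key, value)
        except ConnectionError:
            # A failing new store must not break the user's request.
            log.warning("mirror write failed for %s", key)
            self.missed.append(key)

    def backfill(self) -> None:
        """Retry missed keys; keep the ones that fail again."""
        still_missing = []
        for key in self.missed:
            try:
                self.new.put(key, self.old.rows[key])
            except ConnectionError:
                still_missing.append(key)
        self.missed = still_missing

    def mismatches(self) -> List[str]:
        """Keys that differ between the stores; empty means cutover looks safe."""
        return sorted(k for k, v in self.old.rows.items() if self.new.get(k) != v)
```

The point of the design is that a broken new system only produces warnings and a backfill list, never a user-facing error, and `mismatches()` gives a simple gate to check before switching reads over.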
- Many users lost their data during the downtime because they uninstalled the app, which removed the locally saved data. This could have been easily tackled if the app kept its data even after uninstallation (e.g. WhatsApp).
- They had a single point of failure in the data sync service, which was handed to a third party, and the third party is clearly not at fault. If they remove this single point of failure, the sync issue should not appear. Probable solution: keep an in-house sync service that depends as little as possible on other services.
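The two points above can be sketched together: keep unsynced entries in a local outbox until some backend acknowledges them, and try more than one backend so a single provider going down does not stall sync. This is only an illustration under my own assumptions; `SyncClient`, the `Backend` callable type and the idea of an in-house fallback are hypothetical, and a real app would persist the outbox to disk instead of holding it in memory.

```python
"""Sync-with-fallback sketch. All names here are hypothetical."""
from typing import Callable, List, Sequence, Tuple

Entry = Tuple[str, str]            # (entry id, payload)
Backend = Callable[[Entry], None]  # expected to raise ConnectionError on failure


class SyncClient:
    def __init__(self, backends: Sequence[Backend]) -> None:
        self.backends = list(backends)  # tried in order: primary first
        self.outbox: List[Entry] = []   # in memory here; persist it in a real app

    def record(self, entry: Entry) -> None:
        """Store the entry locally first, so nothing depends on the network."""
        self.outbox.append(entry)

    def _send(self, entry: Entry) -> bool:
        """Return True as soon as any backend accepts the entry."""
        for backend in self.backends:
            try:
                backend(entry)
                return True
            except ConnectionError:
                continue  # this backend is down, try the next one
        return False

    def flush(self) -> int:
        """Send queued entries in order; stop at the first one nobody accepts."""
        sent = 0
        while self.outbox:
            if not self._send(self.outbox[0]):
                break  # keep this entry and everything after it for later
            self.outbox.pop(0)  # remove only after a backend accepted it
            sent += 1
        return sent
```

With a client built as `SyncClient([third_party, in_house])`, an outage of the first backend simply routes entries to the second, and if both are down the entries stay in the outbox for the next `flush()`.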
Thanks for reading up to here.
Here are a few more resources if you want to know more:
- https://support.myfitnesspal.com/hc/en-us/articles/5716717249933-Unable-to-Create-New-or-Edit-Existing-Foods-Meals-Recipes-Quick-Add-Sync-and-or-Login

