Last month there was an outage in the Azure South Central US region which, by reports, seemed to have some knock-on effects for other regions.
This was reported at:
In the discussions that followed with our customers, particularly those currently considering their digital transformation strategies (including moves to Office 365 and/or Azure), some expressed varying levels of concern. This prompted some very valuable debate within Adexis about what we feel are important viewpoints when it comes to digital transformation. Here are some of our thoughts:
Even with the enterprise-grade resources of Microsoft (or Amazon), 100% uptime of any service over a long period of time is not realistic. Between hardware issues, software bugs, scheduled downtime and human error, something will go wrong at some point, just like in your on-premise environment. With all the buzz around cloud, it can be easy to forget that this is essentially just an IT environment somewhere else, maintained by someone else. Like any IT environment, it still relies on humans and physical hardware, which will inevitably experience failures of service from time to time.
Control and visibility
When an outage happens on-premise, the local IT team is able to remediate, has as much information as it is possible to have, and can provide users with detailed information regarding the restoration of service. Everything is in the hands of the local IT team (or the company to which it has been outsourced).
When an outage happens with Azure, the amount of information the local IT team has is minimal in comparison. Microsoft’s communication during O365/Azure outages varies; however, ETAs and other information are generally vague at best. All control is with Microsoft, and all the local IT team can say to staff is “Microsoft are working on it”. While Microsoft may be able to resolve the situation faster than you could on site (or not), the lack of visibility and control can sometimes be daunting. It’s not all doom and gloom though. In situations where the issue would need to be escalated to Microsoft anyway (i.e. Premier Support), the criticality of an international user base can often mean a greater focus from Microsoft, and therefore a faster resolution than would be achieved for your single company.
Azure has many features which enable site resilience to protect against a single data centre failure, but sometimes these are not used. This could be down to flawed service design or simple cost saving. When architecting your environment (or engaging the experts at Adexis to provide these specialist services), it’s important that you carefully consider your DR and BCP plans and ensure the redundancy built into your environment matches those requirements. This is not unique to either cloud or on-premise and must always be carefully considered.
It’s not uncommon for on-premise service outages to be “fixed” by a reboot. Root cause analysis and effective problem management are nice to have, but not many IT teams have the time to complete them.
Microsoft have the resources to perform these functions in great depth, and in fact their brand depends on it. A complete root cause analysis feeds back into the improvement of their overall operations, which leads to greater consumer confidence and therefore greater penetration into the market. They also literally have access to the source code for the operating systems and many apps, in addition to strong relationships with hardware vendors, which lets them get patches/fixes in timeframes that the rest of us can only dream of.
While Microsoft have been known to hold their cards close to their chest at times when it comes to releasing the real root cause of outages, they are definitely invested in resolving those root causes behind the scenes and preventing further outages. This means that the environment remains far more up to date and, typically, far more robust than an on-premise environment.
While Microsoft might suffer reputational damage as the result of an outage, do not expect any form of meaningful compensation.
The financially backed SLA that salespeople spruik is a joke: http://www.microsoftvolumelicensing.com/DocumentSearch.aspx?Mode=3&DocumentTypeId=37
This is the table for many services (although it does vary depending on the specific service):
|Monthly Uptime %|Service Credit|
A 31-day month has 44,640 minutes, and 2,232 minutes is 5% of that. So the service would have to be down for a whopping 37.2 hours for you to get back 100% of your fees, for that month only, and the compensation comes in the form of a service credit off next month’s bill.
How to claim this service credit is detailed on page 5 of the document. Basically, the onus is on you to prove that there was an outage and to submit the paperwork within 2 months. A separate claim must be created for each service. What this essentially means is that logging a claim for the service credits is usually more effort than it’s worth.
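The arithmetic above is easy to reproduce for other month lengths or thresholds. The following is a minimal sketch only: the function name `downtime_to_breach` is our own, and the 95% uptime threshold is an assumption inferred from the 5% figure quoted above rather than read from the SLA document, so check the current SLA for the services you actually use.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_to_breach(uptime_threshold_pct: float, days_in_month: int = 31) -> float:
    """Return the minutes of downtime at which monthly uptime falls to the threshold.

    Hypothetical helper for illustration; not part of any Microsoft tooling.
    """
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (100 - uptime_threshold_pct) / 100


# Assumed threshold: uptime below 95% triggers the 100% service credit.
minutes = downtime_to_breach(95)
print(f"{minutes:,.0f} minutes = {minutes / 60:.1f} hours")
```

For a 30-day month the same calculation gives 2,160 minutes, or 36 hours, so the bar barely moves with the length of the month.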
Outages for cloud services must be anticipated, just like outages to on-premise services. The attitude of “It’s in the cloud so it’s not our problem” is simply not realistic and is likely to catch you out unprepared.
If you have vital services that you are considering moving to Azure (or AWS, or anywhere else), rest assured it can be safe to do so, but make sure you allow for site resiliency in your design and costing.
Adexis is neither pro- nor anti-cloud. Unlike many other vendors, we have no skin in the game and no incentive to push you in one direction or the other. We are completely independent and can provide you with unbiased specialist advice on what is best for your environment and your business, including the pros and cons of staying on-premise or moving to the cloud for each service.
Every environment is different when it comes to security requirements, IT skillset, hardware availability, CapEx vs OpEx spend and a range of other factors – and these all feed into what is the best solution for your business.
If you’d like to explore your IT strategy further, please be sure to give us a call.