The Skype mystery: Why blame the August Windows updates?
What I don't get, though, is why this didn't happen in July. Microsoft puts these updates out every month, so why'd the crash happen now?
Like me, Internet Storm Center handler John Bambenek doesn't think Skype is doing a very good job of explaining what happened, so I asked John what questions to put to Skype. His questions and Skype's answers are below.
Warning: if you're hoping for a straight answer on any of this, you're going to be disappointed. These answers come from Jennifer Caukin, a Skype spokeswoman. To her credit, she warned me first that there's nobody in the US who can answer questions in any detail today. Maybe by tomorrow we'll get some real answers.
Q -- Why did it take a full 24 hours after patching and rebooting for the
outage to occur?
A: The disruption was triggered by a massive restart of our users'
computers across the globe within a very short timeframe as they
re-booted after receiving a routine set of patches via Windows Update.
The high number of restarts affected Skype's network resources. This
caused a flood of log-in requests, which, combined with the lack of
peer-to-peer network resources, prompted a chain reaction that had a
critical impact. The 36 hours required to get the network back up was
due to the time needed to get the proper number of available
peer-to-peer network resources up and running.
OK, I don't think she quite got this question. Maybe Skype can explain why the outage didn't start on Tuesday or Wednesday, when Microsoft's patches were released.
Q -- With the reboots distributed across many timezones, how did they end up
buckling your capacity?
Why didn't it happen last month too (and months prior)?
A: Normally Skype's peer-to-peer network has an inbuilt ability to self-heal, however, the day's traffic patterns combined with the large number of reboots revealed a previously unseen fault in the network resource allocation algorithm Skype uses. Consequently, the peer-to-peer network's self-healing function didn't work quickly enough. Regrettably, and as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days.
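Since Skype isn't giving details, here is a toy model of the kind of chain reaction these answers describe: reboots take helper nodes offline and create a login backlog at the same moment, and overload knocks out more helpers before self-healing can catch up. To be clear, this is only a sketch. The function name, the "supernode" framing, and every number in it (capacity per node, the 2x overload threshold, the 10% loss, the promotion rate) are assumptions made up for illustration, not Skype's actual algorithm or figures.

```python
def ticks_to_recover(reboot_percent, clients=100_000, supernodes=1_000,
                     logins_per_supernode=50, promotions_per_tick=20,
                     max_ticks=1_000):
    """Toy model: count time steps until every rebooted client is logged in.

    All parameters are hypothetical and chosen only for illustration.
    """
    # Rebooted machines drop off together, so the network loses helper
    # nodes at the same moment that the login backlog appears.
    online = supernodes * (100 - reboot_percent) // 100
    pending = clients * reboot_percent // 100
    ticks = 0
    while pending > 0 and ticks < max_ticks:
        # Chain reaction (assumed rule): if the backlog is more than twice
        # what the online helpers can serve, 10% of them fall over.
        if pending > 2 * online * logins_per_supernode:
            online -= online // 10
        # The surviving helpers serve as many logins as they can.
        served = min(pending, online * logins_per_supernode)
        pending -= served
        # Self-healing (assumed rule): a fixed number of helpers is
        # promoted per step, never exceeding the original count.
        online = min(supernodes, online + promotions_per_tick)
        ticks += 1
    return ticks


if __name__ == "__main__":
    for percent in (5, 60):
        print(percent, "% rebooted ->", ticks_to_recover(percent), "ticks")
```

In this model a small wave of reboots clears in a single step, while a large simultaneous wave takes several, because the overload rule removes helpers faster than the healing rule restores them. It says nothing about why August was different from July, which is still the question Skype hasn't answered.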
Q -- How do you know it wasn't a DoS?
A: The issue has now been identified explicitly within Skype. We can confirm categorically that no malicious activities were attributed or that our users' security was not, at any point, at risk.
Q -- Has Microsoft been contacted and what is their take on the situation?
A: Yes they have been contacted.
Microsoft told me that they didn't do anything different with their updates in August (they've blogged about the issue here). So why did this release kick off the problem? Nobody is saying.
Q -- What are the details of the bug that they fixed? Was it a result of
something added recently?
A: The "abnormality" occurred in Skype software. To clarify: Skype's peer-to-peer core was not properly tuned to cope with the load and core size changes that occurred on 16th August. The reboots resulting from software patching merely served as a catalyst. This combination of factors created a situation where the self-healing needed outside intervention by our engineers.
Q -- What are your plans to avoid similar capacity problems?
A: This disruption was unprecedented in terms of its impact and scope. We would like to point out that very few technologies or communications networks today are guaranteed to operate without interruptions. We are very proud that over the four years of its operation, Skype has provided a technically resilient communications tool to millions of people worldwide. Skype has now identified and already introduced a number of improvements to its software to ensure that our users will not be similarly affected in the unlikely possibility of this combination of events recurring.
More comment on the thinness of Skype's explanation can be found here and here.