Where we are today
We are pleased to announce that we completed the first round of update reboots on the evening of Thursday, January 11th. These reboots applied updated kernels with Kernel Page Table Isolation (KPTI), along with CPU firmware (microcode) updates for a handful of our production systems, namely those on Intel Haswell, Broadwell, and Skylake architectures.
In our last update, we explained that there would likely be multiple reboots over a period of time in order to update CPU firmware (microcode). At this time, we have not settled on a timeline for these updates beyond our original estimate of 2-4 weeks out. The current limiting factor is a series of reports from our industry peers indicating reliability issues with some of the currently available microcode updates.
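For readers who want to confirm the KPTI mitigation on their own Linux systems, here is a minimal sketch; it is not a description of our internal tooling. It assumes a kernel that exposes /sys/devices/system/cpu/vulnerabilities/meltdown (added upstream in Linux 4.15 and backported by many distributions). The function names are illustrative.

```python
from pathlib import Path

# Sysfs file provided by kernels that include the vulnerability reporting
# interface; on older kernels this file does not exist.
MELTDOWN_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def classify_meltdown_status(text):
    """Map the kernel's status string to a coarse category."""
    status = text.strip()
    if status.startswith("Mitigation:"):
        return "mitigated"       # e.g. "Mitigation: PTI"
    if status.startswith("Not affected"):
        return "not_affected"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def read_meltdown_status(path=MELTDOWN_STATUS):
    """Classify the running kernel, or return 'unknown' if the file is missing."""
    try:
        return classify_meltdown_status(path.read_text())
    except OSError:
        return "unknown"


if __name__ == "__main__":
    print(read_meltdown_status())
```

A result of "unknown" only means the kernel does not report its status through this file; it does not by itself mean the system is unpatched.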
We are coordinating with our OEM vendor Dell and industry peers to ensure we are balancing confidence in the stability of microcode updates with the security considerations of the Meltdown & Spectre vulnerabilities.
What was the downtime impact?
Our team scheduled and executed thousands of reboots in the last week, and the vast majority of systems returned online without incident. Operations of this kind are never easy to organize, and they always uncover a mix of software and hardware issues.
We had an incidence rate of 1.27%, which we define as the share of systems that did not return online without intervention. Our internal tracking recorded each of these incidents at the point at which it was handed off to our data center operations team. We then validated patches against these systems and ran a full audit yesterday to identify any systems we may have missed updating. The audit found a little under two dozen systems that were missed, skipped, or required additional work to patch; these were handled as part of last night’s maintenance window.
The average downtime per system through Saturday, January 6th was 8 minutes 39 seconds. The longest downtime (including systems with incidents) was 2 hours 19 minutes. A single system returned online after an extended fsck with a corrupted data volume. The volume in question was restored from our continuous data protection backups and, given the restore window, was effectively offline for 3 hours 31 minutes.
The average downtime per system from Sunday, January 7th to Thursday, January 11th was 2 minutes 51 seconds. The longest downtime (including systems with incidents) was 41 minutes.
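To make the definitions behind these figures concrete, here is a small sketch of how average downtime, longest downtime, and incidence rate can be derived from per-reboot records. The record layout and the sample values are made up for illustration and are not our tracking data.

```python
from statistics import mean


def format_duration(seconds):
    """Render a number of seconds as 'Hh MMm SSs'."""
    seconds = int(round(seconds))
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


def summarize(reboots):
    """reboots: list of (downtime_seconds, needed_intervention) tuples."""
    downtimes = [downtime for downtime, _ in reboots]
    incidents = sum(1 for _, needed in reboots if needed)
    return {
        "average": format_duration(mean(downtimes)),
        "longest": format_duration(max(downtimes)),
        # Share of systems that did not return online without intervention.
        "incidence_rate_pct": round(100 * incidents / len(reboots), 2),
    }


# Made-up sample records, not our actual data.
sample = [(120, False), (180, False), (160, False), (2460, True)]
print(summarize(sample))
```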
The tangible reduction in downtime and incidence rate on and after Sunday, January 7th was the result of improvements in our procedures and the introduction of kexec. Using kexec allowed our team to boot into the upgraded kernel without fully power cycling our servers, avoiding lengthy BIOS/POST boot delays.
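As a rough illustration of what a kexec-based reboot involves, the sketch below builds the two commands typically used: one to stage the new kernel in memory and one to shut down cleanly and jump into it. This is not our exact procedure. The file naming follows the RHEL/CentOS convention and is an assumption, as are the function names, and the script only prints the commands unless dry_run is turned off.

```python
import platform
import subprocess


def kexec_commands(kernel_release, boot_dir="/boot"):
    """Build the commands for a kexec-based reboot into an installed kernel.

    The vmlinuz-/initramfs- naming is the RHEL/CentOS convention and is an
    assumption; other distributions name these files differently.
    """
    kernel = f"{boot_dir}/vmlinuz-{kernel_release}"
    initrd = f"{boot_dir}/initramfs-{kernel_release}.img"
    return [
        # Stage the new kernel in memory, reusing the running kernel's command line.
        ["kexec", "-l", kernel, f"--initrd={initrd}", "--reuse-cmdline"],
        # Stop services cleanly, then jump into the staged kernel (no BIOS/POST).
        ["systemctl", "kexec"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run for the currently running release; in a real upgrade you would
    # pass the release string of the newly installed kernel instead.
    run(kexec_commands(platform.release()))
```

Because kexec skips the firmware stage entirely, anything that depends on a full power cycle is not exercised by this path.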
What is the performance impact?
Outside of downtime itself, performance is the most frequent topic of inquiry our customer service team has received regarding the updates. This is understandable, but the amount of media attention citing broad performance reductions is a bit misleading.
We’ve thoroughly tested our platform for the workloads important to our customers. These tests measured performance before and after the kernel updates for PHP execution time, FPM threading, static content requests to Apache, and requests per second to Varnish, Redis, and MySQL. We’ve observed a small but measurable performance impact averaging about 5%.
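For anyone running a similar before/after comparison on their own workload, the relative change can be computed as in the sketch below. The numbers are hypothetical and are not taken from our benchmarks.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent (negative means a drop)."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (after - before) / before


# Hypothetical measurements, not our benchmark results.
baseline_rps = 2000.0  # requests per second before the kernel update
patched_rps = 1900.0   # requests per second after the kernel update
print(f"{percent_change(baseline_rps, patched_rps):+.1f}%")
```

For latency-style metrics such as execution time, the sign reads the other way: a positive change means the workload got slower.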
That said, there is a caveat here. We have found that systems and resources that are heavily loaded toward their upper performance limits are significantly impacted post-updates. We have very few systems within our infrastructure that we would describe as anywhere near their upper performance limits (overloaded). This is due to very strict user density limits on how our servers are filled and equally strict hardware build requirements across all our product lines.
We break out our infrastructure by logical region boundaries, similar to other providers. We have been closely monitoring metrics on a per-region (e.g. US-midwest-1) and per-role (e.g. cluster load balancer) basis. Our data over the last week reinforces that there has been no broad performance impact.
Below are graphs for our two largest regions, US-midwest-1 and UK-south-1. These graphs represent the overall Load Average (a broad measure of system utilization) and CPU Idle Time (a measure of CPU utilization) for all systems in the respective regions. The time span of these graphs is from January 1st, 2018 00:00:00 UTC through January 8th, 2018 12:00:00 UTC. The red vertical lines indicate the point at which reboots were conducted in the respective regions.
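For context on the two metrics, the sketch below shows one common way to derive them on a single Linux host from /proc/stat and /proc/loadavg. It is an illustration, not our monitoring pipeline, and the function names are ours. Note that it counts only the idle column as idle time; some tools also fold iowait into idle.

```python
def parse_cpu_times(stat_text):
    """Return the aggregate CPU counters (user .. steal) from /proc/stat contents."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Columns: user nice system idle iowait irq softirq steal [guest guest_nice]
            # Guest time is already included in user/nice, so stop at steal.
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def idle_percent(earlier, later):
    """Percentage of CPU time spent idle between two /proc/stat samples."""
    deltas = [b - a for a, b in zip(parse_cpu_times(earlier), parse_cpu_times(later))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * deltas[3] / total  # index 3 is the idle column


def load_averages(loadavg_text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg contents."""
    one, five, fifteen = loadavg_text.split()[:3]
    return float(one), float(five), float(fifteen)
```

To use it, read /proc/stat twice a few seconds apart and pass both snapshots to idle_percent; a region-wide figure would then be an aggregate of these per-host values.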
The next steps for our team consist of monitoring for continued performance impacts and assessing CPU firmware (microcode) updates.
We will continue monitoring for performance impacts. Each day, we thoroughly assess the performance of our platform against the KPTI kernel updates. If at any time we discover that the performance impact is greater than already stated, we will issue an update with details.
CPU Firmware (microcode) Updates
Our teams are monitoring the status of CPU firmware (microcode) updates broadly across the technology space and preparing to validate the reliability and performance of those updates. Once we reach a level of confidence in those updates that balances reliability, performance, and security, we will announce our next round of maintenance reboots.
We will attempt to use ‘kexec’ during the CPU firmware (microcode) updates, if they can be applied with the Linux kernel microcode loader, to minimize downtime. We are already working on a process for these updates.
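One simple way to verify that a microcode update actually took effect is to compare the revision the kernel reports before and after the reboot. The sketch below is an illustration rather than our validation process; it assumes an x86 Linux system where /proc/cpuinfo includes a "microcode" field per logical CPU, and the function names are hypothetical.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported across all logical CPUs."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


def revision_changed(before_text, after_text):
    """True if the reported revisions differ between two /proc/cpuinfo snapshots."""
    return microcode_revisions(before_text) != microcode_revisions(after_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(sorted(microcode_revisions(f.read())))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

More than one revision in the result would suggest that not every CPU in the system received the same update, which is worth investigating before relying on it.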
When we have more information available and more accurate timelines for continued updates, we will send out notifications accordingly as we begin to schedule emergency maintenance windows.
We appreciate your understanding and patience as we complete this process. If you have any questions or concerns, please reach out to our Support team via https://core.thermo.io
Our earlier Meltdown & Spectre Vulnerability posts can be found at: