This is Why Windows 10 is So Unreliable (Premium)

It can be painfully hard to watch, but an ex-Microsoft employee has posted a video explaining why Windows 10 is so unreliable. He’s right, but I can fill in a few more details to provide a more complete picture of the problem.

The background: Jerry Berg is a former senior software developer at Microsoft. He was let go after 15 years at the company and decided to pursue self-employment via a tech enthusiast YouTube channel. And, yes, he’s almost certainly correct about what happened to Windows 10 from a reliability perspective.

Here’s the story, according to Berg and with additional information he doesn’t provide.

Before the release of Windows 10, the software giant employed thousands of people who were dedicated solely to testing Windows quality. There were multiple teams, each covering specific parts of the OS, and representatives of each would meet daily to determine if the previous night’s build could be moved upstream in the build process towards winmain, which was “basically everyone’s code smashed together from all the different [parts] of the operating system, like UI, networking, in-box applications, drivers, display, kernel, HAL [hardware abstraction layer], etc.”

These meetings were critical, Berg says: It was a chance for real human beings to come together and debate whether code was good enough to ship. The bugs that they found were present on real hardware, he says, during automated tests run in a lab with thousands of individual computers. (I’ve actually seen testing labs like the ones he describes, multiple times, at Microsoft’s campus during various visits over the years.) These machines were representative of the hardware diversity that was out in the market, at least as much as was possible.

Though Berg says that Microsoft still does some testing like this today, the software giant laid off the entire Windows Test Team, “with a few minor exceptions,” and “basically replaced it with the team that was testing Windows Phone.”

A bit of background here.

As you may recall, when Steven Sinofsky was ousted from Microsoft after the Windows 8 and Surface RT debacles, Terry Myerson, who had been leading the Windows Phone team previously, took over Windows. In doing so, he removed all of Sinofsky’s key lieutenants over time, and reversed all of Sinofsky’s terrible policies, most notably around development secrecy. This was both good and bad: The now-ineffectual Windows Insider Program came out of this change, which was initially good. But the people who had been creating and testing Windows for years were replaced, frankly, with a bunch of B-Teamers from the Windows Phone group. And it’s very clear now that they had no idea what they were doing.

“The whole reason the layoffs occurred is because three different divisions at Microsoft that were [essentially] sub-companies—Xbox, Windows Phone, and Windows—were all merged together … They wanted them all to share the same architecture.” This, too, was good and bad—yet another architectural shift certainly contributed to the Windows Phone failure—but whatever. It happened.

Today, most Windows testing is done via automation on virtual machines, not real hardware. The problem with virtual machines, Berg says, is the lack of diversity. Testing on real hardware, the previous Windows Test Team was able to find “little transient issues” that would only crop up when using so many diverse configurations. So the only bugs Microsoft can find now are the ones that are obvious and basically impact all PCs.

He doesn’t mention this, but using virtual machines saved Microsoft a lot of money—one assumes that Terry Myerson’s bonus at the end of FY16 reflected that—but it raised an issue: Microsoft still needed to find the more esoteric bugs that its Testing Team used to find. And self-hosting—an internal policy by which employees run daily builds—could only help so much. Besides, by the time Windows 10 arrived, fewer and fewer employees were doing this.

So the second innovation of the Myerson years appeared: Telemetry.

Telemetry is forced, automatic data collection from real users’ PCs out in the wild. But because Windows 10 was literally new, it needed a way to collect this telemetry before the OS shipped. And it did so via the Windows Insider Program, of course.

“The problem is, most of the bugs aren’t something crashing,” Berg says, “but rather something not working properly. When something doesn’t work properly, it doesn’t generate the information that Microsoft needs in the form of a dump file. So, instead, they rely on users reporting that. But a lot of Windows Insiders don’t report these issues unless they’re catastrophic. So only a small percentage of the people hitting the issue is actually reporting it.”

It’s even worse than he says: While Dona Sarkar claims that she is “wrangling 16 million ninja techies” in her Twitter profile, the number of active Windows Insiders is far, far less than that. Like a tiny, single-digit percentage. And that means that only a very small user base is even available to do more than feed the ineffectual telemetry database that Microsoft uses as its primary data point for Windows 10 reliability.

“Even if [Insiders] are reporting [bugs] they can’t give Microsoft the information they need to reproduce [the bugs],” Berg continues. “Microsoft has basically replaced flesh and blood humans [who] were creating automated test cases and unit tests that were running daily against [Windows] builds with … us, the consumers who are now testing this software and [automatically] sending them information from their computers.”

According to Berg, telemetry isn’t all bad: It’s great for tuning performance, he says, and can in some cases actually help Microsoft fix bugs. The problem, he says, is that telemetry can’t help Microsoft or its users find and fix bugs that occur outside of the process that crashed. Telemetry can collect mini-dumps, which are process-specific, but not full dumps, which capture a complete picture of the PC’s memory, including the other processes that often contribute to issues. Full dumps are simply too big to send to Microsoft regularly.

Put simply, telemetry isn’t completely ineffective. But it is ineffective. Since they’re unable to see the complete picture for bugs and crashes, Microsoft’s developers are forced to look at a bug database and see where most problems occur. And even when they fix such a bug, they have no way of knowing whether it solves the problem, beyond pumping it out to Windows Insiders, hoping for the best, and seeing what that build’s telemetry shows. Sometimes, these fixes only fix part of the problem. Sometimes they cause new issues.

The problems with this system are obvious: We have multiple examples of Windows 10 Feature Updates that have shipped with major reliability issues, and two examples where Microsoft pulled back a Feature Update after it had literally shipped it to customers. This is why Windows 10 is now rolled out in waves, Berg says.

But even that system is flawed: The first people who get any given Windows 10 upgrade or update are essentially guinea pigs. “They’re dogfooding the software,” Berg says, using an old internal Microsoft term for running your own code. “They might as well be Windows Insiders. Microsoft doesn’t have high confidence in that software. Otherwise, they would just roll it out to everybody just like they did with updates before 2016.”

Exactly right.

Berg says that Microsoft has finally acknowledged the problems with this system—I’ll add that Myerson is gone, most notably—and is taking baby steps to address them. But he’s right that each feature update seems to arrive with a new set of problems. By not ever effectively testing the reliability of Windows 10, Microsoft has put itself and its customers on a roulette wheel. And you never know when the next issue will hit.

“These are issues that would have been found before,” he claims. “They would have been found by the Test Team that was at Microsoft, [who] were basically the gatekeepers” on product quality. “That’s actually what they called us.”

Berg recommends that Microsoft hire back the testers, or at least recreate that kind of team again. “The actual Windows Test Team,” he notes, “not the Windows Phone migrated team!”

It’s a great idea. It will never happen.

Full disclosure: Berg falls on the wrong side of the Windows 10 privacy debate. And he gets a few things wrong in this video, for example stating that RTM means “release to media,” when in fact it means “release to manufacturing.” But none of that makes his central point, his story, incorrect. And Windows 10 will never be truly reliable, and thus trustworthy, until Microsoft fixes the underlying testing process. It’s a shame.
