Customer story

How Microsoft Used Outset to Increase Copilot Retention by 5%

The Challenge: Evaluating a Product Where Every Experience Is Different

Microsoft’s Copilot is a conversational AI product that helps users complete tasks and answer questions through natural language interaction.

Because the experience unfolds as a conversation, every interaction with Copilot can look different. Users bring their own questions, context, and goals — meaning no two experiences are exactly the same.

That variability made the product difficult to evaluate with traditional UX research. Small-sample research could provide depth, but it couldn’t capture the diversity of experiences users were having.

At the same time, traditional AI evaluation methods — like benchmarks or expert judgments — often miss an equally important signal: how real users experience a system in the context of their own goals and questions.

“As a UX researcher, a lot of times it was good enough to talk to 10 or 20 people,” said Christopher Monnier, Principal UX Researcher at Microsoft AI. “But the fundamental nature of Copilot means we need to do qualitative research with thousands of people.”

The Microsoft AI team needed a way to understand how Copilot performed for real users in real-world situations. And to do it at scale.

“The fundamental nature of Copilot means we need to do qualitative research with thousands of people.”

The Solution: Unlocking UX Evals with Outset

To solve this challenge, the Microsoft AI team turned to Outset.

They began with traditional human-moderated interviews to understand how people were using Copilot and what aspects of the experience mattered most to them.

From these conversations, the team identified several dimensions of the experience they wanted to evaluate more systematically — things like usefulness, clarity, and response quality.

They then built a study in Outset to evaluate these dimensions across a much larger group of participants.

Participants were asked to:

  • Interact with Copilot

  • Try a competing product

  • Rate their experience across several dimensions

  • Explain their ratings through dynamic follow-up questions

Outset’s AI interviewer allowed these sessions to run in parallel, dramatically increasing the scale of qualitative research and the level of variability the Microsoft AI team could capture.

“I can only interview one person at a time,” Monnier explained. “But Outset can interview 10 or 30 or more people at once.”

The study captured not only ratings but also the why behind them, through smart, dynamic follow-up questions. And it did so not with a handful of people, but with thousands, in just a few days.

The approach eventually became known as UX evals — a method (now widely used even outside of Microsoft) for evaluating AI systems through first-person interactions with real users.

The Results: Clearer Insight into What Users Actually Value

Running UX evals through Outset gave the Microsoft AI team a much clearer picture of how people were actually experiencing Copilot.

In several cases, the findings challenged assumptions held internally by product teams.

“As a UX researcher, it’s always great when you can bust some myths that teams have about how people are using the product,” Monnier said.

By analyzing hundreds of real interactions, the team was able to pinpoint specific aspects of Copilot responses that mattered most to users — and identify gaps between what the system was capable of delivering and what people were actually experiencing.

Just as importantly, the research made those insights easy to communicate across the organization.

Researchers could pair statistical findings with video clips of real users explaining their experiences and screenshots of actual responses.

“That combination of quantitative analysis, qualitative insight, and real examples makes the evidence undeniable,” Monnier said.

For the team, this combination of scale and depth uncovered insights that would have been impossible to surface through traditional interviews or model evaluations alone.

“That combination of quantitative analysis, qualitative insight, and real examples makes the evidence undeniable.”

The Impact: A 5% Increase in Copilot Retention

The changes informed by the UX evals helped close the gap between what Copilot was capable of delivering and what users were actually experiencing.

According to internal estimates, those improvements increased Copilot retention by roughly 5 percent.

“For a product with the scale of Copilot, that’s a big number,” Monnier said. “Retention is one of the clearest signals that people are getting real value from the product.”

More importantly, the research helped the team improve the product in ways that would have been difficult to identify otherwise.

By grounding evaluation in the experiences of real users, rather than relying solely on benchmarks or expert judgments, the team gained a deeper understanding of how Copilot performs in real-world situations.

“We had a sense that Copilot could be delivering more value,” Monnier said. “This research helped us understand where that gap was and how to close it.”

“For a product with the scale of Copilot, [a 5% increase in retention] is a big number.”

The Takeaway

As conversational AI becomes a core part of more products, understanding user experience becomes more complex.

When every interaction can unfold differently, small-sample qualitative research and traditional benchmarks alone may miss important signals.

For the Microsoft AI team, Outset made it possible to combine qualitative depth with large-scale research, revealing insights that would have been difficult to uncover otherwise.

“The real unlock is scale,” Monnier said. “AI products create so much variability that you need to understand the experience across far more users.”

By making large-scale qualitative research possible, Outset helped the team improve the Copilot experience for millions of users.

Interested in learning more? Book a personalized demo today!


Find out how Outset accelerates every step of research

Test out a demo interview to see how Outset transforms research speed, scale, and insight quality.
