AI adoptionTeams

Why your AI pilot didn't scale (and what to do about it)

Successful pilots that quietly die are the most common pattern in AI adoption. The reasons are predictable, and mostly fixable.

Good Transformer28 May 202610 min read

The pilot worked. That's what makes this frustrating. A small group tried AI on a real workflow, the results were good, everyone was encouraged. And then, somehow, nothing changed at scale. Six months later the wider team works exactly as it did before, and the pilot is a slide in an old deck.

This is the single most common pattern we see. It almost never means the technology failed. It means the pilot proved the wrong thing.

If you're reading this having lived through that exact sequence (the initial excitement, the quietly encouraging results, and then the slow return to business as usual), this post is for you. And if you're about to run a pilot, it's for you too, because the mistakes are almost always made before the pilot begins, not during it. What follows is an attempt to name what's really going on, and what to do differently.

A pilot proves feasibility. Scaling needs something else.

A good pilot answers "can this work?" The honest answer is usually yes: a motivated team with leadership attention can make most things work. But scaling asks a harder question: "will this work when the people doing it aren't the enthusiasts, don't have special support, and have a hundred other priorities?"

Those are different questions, and a pilot designed only to answer the first one tells you very little about the second.

The gap between them is where most AI adoption programmes quietly die. The pilot was optimised for results, not for conditions. It ran in a protected environment and produced numbers that said little about ordinary conditions.

The three gaps that kill it

The enthusiast gap. Pilots are run by volunteers who want them to succeed. These are the people who signed up, who stayed late figuring out the prompts, who covered for each other when the tool returned something odd. They're genuinely useful people to learn from. But they are not representative of your organisation at scale.

The same scenario plays out again and again in professional services firms: a team of four analysts runs a six-week pilot using AI to draft first-pass client reports. They save around three hours per report. Leadership sees the numbers, approves the rollout, and hands it to the broader analyst cohort of thirty. Six months later, maybe six of those thirty use the tool regularly. The other twenty-four find reasons not to: the outputs don't match the firm's style, nobody told them how to adapt the prompts, and asking for help feels like admitting weakness in a culture that rewards self-sufficiency.

The friction the original four smoothed over by sheer enthusiasm came roaring back the moment it met people who hadn't opted in. The pilot hadn't failed. It had simply never been tested under the conditions it was meant to survive.

The support gap. During a pilot, someone is always available to unstick people. There's a channel, a champion, probably a weekly check-in. Problems get solved before they become reasons to quit. That infrastructure feels normal during the pilot. Remove it at scale and the experience collapses.

The mechanism here is simple: AI tools for professional work are not plug-and-play in most organisations. Prompts need calibration. Outputs need editing. Edge cases appear that no one anticipated. During a pilot, a person exists whose job is to handle all of this. At scale, that person is one human facing a hundred questions. And most of those questions never get asked, because asking them requires admitting you haven't figured it out, and most people would rather quietly not use the tool.

A good example from operations: a logistics company piloted an AI tool for summarising supplier communications. The pilot team had a dedicated Slack channel where the AI lead answered questions within the hour. The rollout had a one-page guide and a generic inbox. Adoption among the wider team hit roughly twenty percent within three months and didn't move. The tool wasn't worse. The support infrastructure had simply disappeared.

The integration gap. Pilots run beside the real system, on the side of someone's desk. The team logs into a separate tool, copies text across, pastes results back. This works when the pilot is the main thing on their plate. It doesn't work when it's one of fourteen things.

Scaling means putting the workflow into the real system: the tools people already open every morning, the steps already embedded in their daily routine. A workflow that requires someone to remember to go somewhere new, log in, copy, switch back, and paste will lose to the path of least resistance every single time.

The integration gap is mechanical, but the consequences are behavioural. In finance, a team piloting AI-assisted contract review used a standalone browser tool that sat outside their document management system. The pilot team bookmarked it and used it enthusiastically. The wider team, managing five concurrent deals at any given moment, simply didn't open it. Not because they rejected the idea. Opening it required a context switch that felt optional in a busy day. When the same capability was embedded directly in the document management system six months later, adoption tripled within eight weeks.

Scaling isn't a bigger pilot. It's a different problem that only looks the same.

Why good pilot numbers lie

Even when a pilot produces genuinely good metrics (high satisfaction scores, measurable time savings, strong quality ratings), those numbers can be almost entirely misleading as a predictor of scale. The MIT Media Lab's GenAI Divide report found high adoption sitting alongside very little measurable business return, and one of its central conclusions was that value depends on genuine integration and learning infrastructure, not just initial capability demonstration. Pilots tend to produce the first without the second.

The reason is selection and attention. Pilot participants are self-selected or manager-selected for openness to the tool. They receive more support, more attention, and more visible interest from leadership than they will ever receive again. The Hawthorne effect (the documented tendency for people to perform differently when they know they're being observed) means that pilot conditions inflate almost every metric that matters.

Satisfaction scores reflect the experience of being part of something new and supported, not the experience of using the tool in an ordinary week. Speed improvements reflect the effort of enthusiasts who spent their own time getting good at the tool, not the performance you'll get from a median user who had a thirty-minute onboarding and hasn't thought about it since. Quality scores reflect outputs that were reviewed and refined during the pilot, when scrutiny was high.

None of this means the pilot numbers are fabricated. They're real within the conditions of the pilot. The problem is that those conditions don't exist at scale, and the metrics don't warn you about that. A pilot that reports 40% time savings among six volunteers tells you very little about what will happen when the same workflow lands with sixty people who had no say in the matter.

This is why the post-pilot confidence often exceeds what the evidence actually supports. The numbers look good, the presentation goes well, and the decision to scale gets made, all on the basis of data that was only ever measuring a best-case scenario.

What to do instead

The answer is not to abandon pilots. It's to design them from the start for the conditions of scale rather than the conditions of a controlled experiment. Here are specific actions to take before the next pilot begins.

Define the average user, not the ideal one. Before you recruit pilot participants, write a one-paragraph description of the person who will be using this tool in month six: their existing workload, their level of enthusiasm for new tools, their tolerance for friction. Then ask whether your pilot participants resemble that person. If they don't, your results won't travel.

Remove yourself from the support channel on day one. This is the single most revealing stress test in a pilot. On the first day, make yourself unavailable for questions and see who figures it out alone, who asks a colleague, who gives up, and what question nobody asks because asking feels too embarrassing. The support gap shows up here immediately. If the tool can't survive forty-eight hours without its champion, it can't survive scale.

Include at least two sceptics by design. Not people who are hostile for the sake of it, but people who are genuinely uncertain whether this is worth their time. Their friction is the most useful signal in the pilot. What stops them? Is it the tool, the workflow, the training, or the culture? Each of those has a different fix.

Measure adoption among people who weren't in the room. Before the pilot ends, identify five people who weren't involved and show them the tool and the workflow. Measure how long it takes them to produce a usable output without any help. If the answer is "they couldn't" or "it took a very long time," the workflow isn't ready to scale regardless of what the pilot participants reported.

Build integration requirements into the pilot brief, not the rollout plan. The question "how does this connect to the tools people already use?" should be answered before the pilot starts, not after it ends. If the integration is hard, the pilot should surface that. If it's impossible, better to know at week two than at month seven.

Track the non-users. Most pilots measure the people who use the tool. The most important data is the people who tried it once and stopped. Interview them. Their reasons tell you what will block adoption across the wider team.

What to do next

If you're sitting with a pilot that worked and a rollout that didn't, or if you're designing a pilot that hasn't started yet, here are three actions worth taking this week.

First, pull the list of people who were in the pilot and compare it to the list of people in the planned rollout. For each person in the rollout who wasn't in the pilot, write one sentence about what might stop them using the tool. If you can't write that sentence, you don't yet know your rollout population well enough to deploy.

Second, map every step of the workflow the pilot used and mark which steps required a dedicated tool, a separate login, or a copy-paste action outside of the team's normal systems. Each of those marks is an integration gap. Decide before rollout whether you're going to close it or accept the adoption cost.

Third, find one person in your organisation who is representative of the median user (not a technology enthusiast, not resistant, just ordinary) and ask them to complete the pilot workflow without any help from you. Watch what happens. Don't rescue them. What breaks is what you need to fix.

The pilot that worked is genuinely good evidence that something is possible. The work that follows is figuring out whether it's possible under ordinary conditions, for ordinary people, in an ordinary week. That's a different test. Run it deliberately, and the answer stops being a slide in an old deck.

Sources and further reading

MIT Media Lab (Project NANDA), The GenAI Divide: State of AI in Business 2025. Industry report, not peer-reviewed; the headline figures have been contested. The argument that scaling requires genuine integration and learning infrastructure, not just initial capability demonstration.

Work with Good Transformer

Turn this thinking into working practice.

Explore team advisory