How to measure training when nobody wants to measure training
The frameworks exist. The methodology is not the problem. The problem is that measuring outcomes requires being accountable to them — and that is a different conversation entirely.
Practice Enablement · June 2026 · 10 min read
At the end of a partner enablement program I ran several years ago, a stakeholder asked me how it went. I told him: completion rate was 100%, assessment scores averaged 87%, and the post-program feedback was strong. He nodded. I nodded. We both understood that this conversation had ended. What neither of us said — what the structure of the conversation made it unnecessary to say — was that none of those numbers told us whether a single Wipro consultant could now handle a live client engagement without calling back to the product team. That question had no number attached to it. So it didn't get asked.
I have been thinking about that conversation for years. Not because it was unusual, but because it was completely normal. It is the default register of almost every training evaluation conversation I have been part of — an exchange of numbers that feel like evidence, between people who have quietly agreed not to examine what the numbers actually measure. The frameworks for doing better have existed for decades. Kirkpatrick published his four-level model in 1959. Brinkerhoff's Success Case Method has been in circulation since the early 2000s. Phillips built an entire ROI methodology on top of Kirkpatrick's Level 4. The methodology is not the problem. The problem is structural. And it is worth naming clearly before we get to what good measurement actually looks like.
Why the wrong metrics won
L&D functions did not gravitate to vanity metrics because they were lazy or incurious. They gravitated to them because the incentive structure made it rational. Completion rates are measurable, reportable, and defensible. Assessment scores are quantifiable and producible on a timeline that fits a quarterly report. Post-program NPS — that particular piece of borrowed consumer research methodology — gives you a single number that travels well in a slide deck. These metrics exist because they are reachable. They live inside the room. L&D controls them.
Business outcome metrics — the ones that would actually tell you whether training worked — live outside the room. They belong to line managers, to operations teams, to revenue functions. Getting access to them requires relationships, political capital, and a willingness to be held accountable to numbers you did not entirely control. Most L&D functions, in most organisations, have never been asked to build those relationships. So they built a reporting infrastructure around the numbers they could reach, and the organisation learned to accept those numbers as the story of training impact.
A 100% completion rate on a useless program is a perfect score on the wrong test. The metric is not lying. It is just answering a question nobody needed answered.
The deeper problem is that the wrong metrics do not just fail to measure impact — they actively obscure the absence of it. When a program produces a 4.6 out of 5 facilitator rating and an 84% post-test score, it looks like it worked. The report gets filed. The budget line gets justified. The next cohort gets scheduled. And somewhere, a practitioner who passed the assessment is quietly struggling in a production environment because the program that certified them was designed to produce scores, not capability.
The taxonomy of bad metrics
Not all vanity metrics are equally harmful. Some are merely uninformative. Others are actively misleading in ways that damage both participants and the credibility of the L&D function over time. It is worth being precise about what each one actually measures — and what it does not.
The metric: Completion rate. Measures attendance and administrative compliance. A 100% completion rate tells you everyone showed up and stayed. It tells you nothing about what happened while they were there.
What it actually measures: Participation, not capability. The honest version: this is a logistics metric, not a learning metric. Track it for scheduling. Do not include it in an impact report.
The metric: Post-training assessment score. Measures performance on the assessment instrument, which is only as good as the assessment's fidelity to the actual job. Most technical assessments test recall and procedural replication under controlled conditions.
What it actually measures: Assessment performance, not job performance. Useful only if the assessment genuinely simulates the conditions and judgment demands of real work. Most don't. Passing a test about cloud architecture is not the same as designing one under client pressure.
The metric: Post-program NPS / smile sheets. Measures whether the participant found the experience satisfying. Enjoyment, engagement, and perceived relevance at the moment of exit. Borrowed wholesale from consumer research methodology and applied to learning without examination.
What it actually measures: Experience satisfaction, not transfer. Enjoyment and capability are not the same variable. A participant can rate a program highly and transfer nothing. A participant can find a program demanding and uncomfortable and exit genuinely capable. NPS conflates the two.
The metric: Hours of training delivered. Measures volume. How much content was delivered to how many people across how many hours. Frequently used as a proxy for programme output in L&D reporting.
What it actually measures: Volume, not value. This metric incentivises the wrong behaviour at a structural level. More hours delivered is not a better outcome; it is a bigger budget consumption. Optimising for this metric actively works against effective design.
The metric: Cost per learner. Measures delivery efficiency. How much it cost to put one person through one program. Widely used in procurement conversations and vendor evaluations.
What it actually measures: Efficiency, not effectiveness. You can have a very low cost per learner for a program that produces no behaviour change. Cost per learner without a measure of what the learner can now do is a measure of how cheaply you ran something that may not have mattered.
The common thread across all of these is that they measure the training event itself — what happened inside the room, under L&D's control. The question they do not answer is what changed in the business. That question requires different data from different sources, and it requires L&D to go and get it.
The metrics that actually mean something
Every useful training metric shares one characteristic: it exists outside the training room. It is a business observation about participant behaviour after the program, not a record of what happened during it. This is what makes these metrics harder to collect — and why most L&D functions never track them. But it is also what makes them worth tracking.
Time-to-competence. How long from program exit to unsupported, consistent performance on the job? This is the most direct measure of whether a training program produced what it claimed to produce. It requires manager observation and a clear definition of what "competent" looks like, and most organisations put neither in place in advance. That failure is a design problem, not a measurement problem. If you have not defined the exit state before you designed the program, you cannot measure whether the participant reached it.
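As a rough illustration of the arithmetic, here is a minimal Python sketch of time-to-competence for one cohort. Everything in it is assumed for the example: the record layout, the sample dates, and the idea that a manager sign-off date marks "unsupported, consistent performance". A real definition of the exit state would have to come from the program design.

```python
from datetime import date
from statistics import median
from typing import Optional

# Hypothetical records: (participant, program exit date, date a manager
# signed off unsupported, consistent performance; None if not yet reached).
records = [
    ("A", date(2026, 1, 12), date(2026, 2, 20)),
    ("B", date(2026, 1, 12), date(2026, 3, 30)),
    ("C", date(2026, 1, 12), None),
    ("D", date(2026, 1, 12), date(2026, 2, 6)),
]


def days_to_competence(exit_date: date, signoff: Optional[date]) -> Optional[int]:
    """Days from program exit to manager sign-off, or None if not signed off."""
    if signoff is None:
        return None
    return (signoff - exit_date).days


def summarize(rows):
    """Median days for those who reached the exit state, plus who has not."""
    days = [days_to_competence(exit_date, signoff) for _, exit_date, signoff in rows]
    reached = [d for d in days if d is not None]
    return {
        "median_days": median(reached) if reached else None,
        "not_yet_competent": sum(d is None for d in days),
        "cohort_size": len(rows),
    }


summary = summarize(records)
print(summary)
```

Reporting the count of participants who have not yet reached the exit state alongside the median is a deliberate choice in this sketch: a median computed only over those who got there would otherwise flatter the program.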
Error rate on first independent tasks. After a participant returns to their role, how often do they produce work that needs to be corrected, escalated, or redone? How frequently do they call back to the facilitator or support team for help with problems the program was supposed to have equipped them for? This is measurable if you build the relationship with managers to get the data. Most L&D functions do not build that relationship. The data exists. It just belongs to someone else.
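One way to make this concrete, sketched in Python under stated assumptions: a task log supplied by line managers, with each task marked as accepted, corrected, escalated, or redone. The log format, the outcome labels, and the window of the first three tasks are all hypothetical choices for illustration, not a standard.

```python
from collections import defaultdict

# Hypothetical task log from line managers:
# (participant, task sequence number after program exit, outcome).
task_log = [
    ("A", 1, "accepted"), ("A", 2, "corrected"), ("A", 3, "accepted"),
    ("B", 1, "redone"), ("B", 2, "escalated"), ("B", 3, "accepted"),
    ("B", 4, "accepted"),
]

# Outcomes that count as needing rework (an assumption for this sketch).
REWORK = {"corrected", "escalated", "redone"}


def rework_rate(log, first_n=3):
    """Share of each participant's first `first_n` tasks that needed rework."""
    by_person = defaultdict(list)
    # Sort by participant, then task order, so the window is the earliest tasks.
    for person, _seq, outcome in sorted(log, key=lambda row: (row[0], row[1])):
        by_person[person].append(outcome)
    rates = {}
    for person, outcomes in by_person.items():
        window = outcomes[:first_n]
        rates[person] = sum(o in REWORK for o in window) / len(window)
    return rates


rates = rework_rate(task_log)
print(rates)
```

The code is the easy part. The log itself is the data that belongs to someone else, which is why the relationship with managers comes first.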
Manager-observed behaviour change (structured, not anecdotal). Kirkpatrick Level 3, done properly, is not a checkbox survey sent three months after training asking managers whether they noticed any changes. It is a structured observation protocol built around specific, observable behaviours identified during the design phase: the behaviours the participant was trained to demonstrate, and whether they are now demonstrating them on the job. If you did not identify those behaviours during design, you cannot observe them during evaluation, which is another way of saying that measurement starts at the curriculum level, not at the reporting level.
Escalation frequency. In any role that involves judgment — and most technical roles do — the frequency with which a trained practitioner escalates a decision they should be able to own is a signal. High escalation frequency after training means the program produced knowledge without confidence, or capability without enough scenario depth to handle real-world variability. It is a leading indicator of transfer failure before the failure becomes visible in output quality.
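A minimal sketch of how escalation frequency might be tracked, assuming weekly counts of in-scope decisions (ones the practitioner should be able to own) and how many of those were escalated. The data shape, the sample numbers, and the simple "never rises week over week" check are assumptions made for illustration only.

```python
# Hypothetical weekly counts for one practitioner after program exit:
# (week number, in-scope decisions faced, in-scope decisions escalated).
weekly = [(1, 10, 6), (2, 12, 6), (3, 8, 2), (4, 10, 2)]


def escalation_rates(rows):
    """Escalated share of in-scope decisions per week (weeks with none skipped)."""
    return {week: escalated / faced for week, faced, escalated in rows if faced > 0}


def is_declining(rates):
    """True if the escalation rate never rises from one week to the next."""
    ordered = [rates[week] for week in sorted(rates)]
    return all(later <= earlier for earlier, later in zip(ordered, ordered[1:]))


rates_by_week = escalation_rates(weekly)
print(rates_by_week)
print(is_declining(rates_by_week))
```

Dividing by decisions faced rather than counting raw escalations matters here: a practitioner who handles more decisions will escalate more in absolute terms even while owning a growing share of them.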
First-attempt pass rate on real production tasks. Not on the assessment. On the actual work. The first client engagement. The first configuration that goes to production. The first time a new hire handles a case without supervision. Whether they get it right the first time, and how far off they are when they don't, is the closest thing to a ground truth that L&D has access to.