How to measure training when nobody wants to measure training

At the end of a partner enablement program I ran several years ago, a stakeholder asked me how it went. I told him: completion rate was 100%, assessment scores averaged 87%, and the post-program feedback was strong. He nodded. I nodded. We both understood that this conversation had ended. What neither of us said — what the structure of the conversation made it unnecessary to say — was that none of those numbers told us whether a single Wipro consultant could now handle a live client engagement without calling back to the product team. That question had no number attached to it. So it didn't get asked.

I have been thinking about that conversation for years. Not because it was unusual, but because it was completely normal. It is the default register of almost every training evaluation conversation I have been part of — an exchange of numbers that feel like evidence, between people who have quietly agreed not to examine what the numbers actually measure. The frameworks for doing better have existed for decades. Kirkpatrick published his four-level model in 1959. Brinkerhoff's Success Case Method has been in circulation since the early 2000s. Phillips built an entire ROI methodology on top of Kirkpatrick's Level 4. The methodology is not the problem. The problem is structural. And it is worth naming clearly before we get to what good measurement actually looks like.

Why the wrong metrics won

L&D functions did not gravitate to vanity metrics because they were lazy or incurious. They gravitated to them because the incentive structure made it rational. Completion rates are measurable, reportable, and defensible. Assessment scores are quantifiable and producible on a timeline that fits a quarterly report. Post-program NPS — that particular piece of borrowed consumer research methodology — gives you a single number that travels well in a slide deck. These metrics exist because they are reachable. They live inside the room. L&D controls them.

Business outcome metrics — the ones that would actually tell you whether training worked — live outside the room. They belong to line managers, to operations teams, to revenue functions. Getting access to them requires relationships, political capital, and a willingness to be held accountable to numbers you did not entirely control. Most L&D functions, in most organisations, have never been asked to build those relationships. So they built a reporting infrastructure around the numbers they could reach, and the organisation learned to accept those numbers as the story of training impact.

A 100% completion rate on a useless program is a perfect score on the wrong test. The metric is not lying. It is just answering a question nobody needed answered.

The deeper problem is that the wrong metrics do not just fail to measure impact — they actively obscure the absence of it. When a program produces a 4.6 out of 5 facilitator rating and an 84% post-test score, it looks like it worked. The report gets filed. The budget line gets justified. The next cohort gets scheduled. And somewhere, a practitioner who passed the assessment is quietly struggling in a production environment because the program that certified them was designed to produce scores, not capability.

The taxonomy of bad metrics

Not all vanity metrics are equally harmful. Some are merely uninformative. Others are actively misleading in ways that damage both participants and the credibility of the L&D function over time. It is worth being precise about what each one actually measures — and what it does not.

The metric: Completion rate. Measures attendance and administrative compliance. A 100% completion rate tells you everyone showed up and stayed. It tells you nothing about what happened while they were there.

What it actually measures: Participation, not capability. The honest version: this is a logistics metric, not a learning metric. Track it for scheduling. Do not include it in an impact report.

The metric: Post-training assessment score. Measures performance on the assessment instrument, which is only as good as the assessment's fidelity to the actual job. Most technical assessments test recall and procedural replication under controlled conditions.

What it actually measures: Assessment performance, not job performance. Useful only if the assessment genuinely simulates the conditions and judgment demands of real work. Most don't. Passing a test about cloud architecture is not the same as designing one under client pressure.

The metric: Post-program NPS / smile sheets. Measures whether the participant found the experience satisfying. Enjoyment, engagement, and perceived relevance at the moment of exit. Borrowed wholesale from consumer research methodology and applied to learning without examination.

What it actually measures: Experience satisfaction, not transfer. Enjoyment and capability are not the same variable. A participant can rate a program highly and transfer nothing. A participant can find a program demanding and uncomfortable and exit genuinely capable. NPS conflates the two.

The metric: Hours of training delivered. Measures volume. How much content was delivered to how many people across how many hours. Frequently used as a proxy for programme output in L&D reporting.

What it actually measures: Volume, not value. This metric incentivises the wrong behaviour at a structural level. More hours delivered is not a better outcome; it is a bigger budget consumption. Optimising for this metric actively works against effective design.

The metric: Cost per learner. Measures delivery efficiency. How much it cost to put one person through one program. Widely used in procurement conversations and vendor evaluations.

What it actually measures: Efficiency, not effectiveness. You can have a very low cost per learner for a program that produces no behaviour change. Cost per learner without a measure of what the learner can now do is a measure of how cheaply you ran something that may not have mattered.

The common thread across all of these is that they measure the training event itself — what happened inside the room, under L&D's control. The question they do not answer is what changed in the business. That question requires different data from different sources, and it requires L&D to go and get it.

The metrics that actually mean something

Every useful training metric shares one characteristic: it exists outside the training room. It is a business observation about participant behaviour after the program, not a record of what happened during it. This is what makes these metrics harder to collect — and why most L&D functions never track them. But it is also what makes them worth tracking.

Time-to-competence. How long from program exit to unsupported, consistent performance on the job? This is the most direct measure of whether a training program produced what it claimed to produce. It requires manager observation and a clear definition of what "competent" looks like — both of which most organisations fail to specify in advance. That failure is a design problem, not a measurement problem. If you have not defined the exit state before you designed the program, you cannot measure whether the participant reached it.
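To make the dependency on a pre-agreed definition concrete, here is a minimal sketch of how time-to-competence might be computed from manager observations. Everything in it is an assumption for illustration: the `Observation` record shape, the rule that "consistent" means three consecutive observations that were both unsupported and up to standard, and the choice to stop the clock on the day that streak is completed. A real program would agree those terms with the line manager before the first session runs.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """One manager observation of a real task (hypothetical record shape)."""
    day: date
    unsupported: bool   # the participant completed the task without help
    met_standard: bool  # the output met the agreed definition of "competent"


def time_to_competence(exit_day: date,
                       observations: list[Observation],
                       streak: int = 3) -> Optional[int]:
    """Days from program exit to the day the participant completes `streak`
    consecutive observations that were unsupported and met the standard.
    Returns None if that has not happened yet."""
    ordered = sorted((o for o in observations if o.day >= exit_day),
                     key=lambda o: o.day)
    run_length = 0
    for obs in ordered:
        if obs.unsupported and obs.met_standard:
            run_length += 1
            if run_length >= streak:
                return (obs.day - exit_day).days
        else:
            # Any supported or below-standard task resets the streak.
            run_length = 0
    return None
```

The function cannot be written at all without the two boolean fields, and those fields cannot be filled in unless someone has defined "unsupported" and "up to standard" in advance. That is the design problem in miniature.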

Error rate on first independent tasks. After a participant returns to their role, how often do they produce work that needs to be corrected, escalated, or redone? How frequently do they call back to the facilitator or support team for help with problems the program was supposed to have equipped them for? This is measurable if you build the relationship with managers to get the data. Most L&D functions do not build that relationship. The data exists. It just belongs to someone else.

Manager-observed behaviour change — structured, not anecdotal. Kirkpatrick Level 3, done properly, is not a checkbox survey sent three months after training asking managers whether they noticed any changes. It is a structured observation protocol built around specific, observable behaviours identified during the design phase. The behaviours the participant was trained to demonstrate. Whether they are demonstrating them. If you did not identify those behaviours during design, you cannot observe them during evaluation — which is another way of saying that measurement starts at the curriculum level, not at the reporting level.
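A structured protocol of this kind can be sketched in a few lines. The behaviour identifiers and descriptions below are invented examples, not taken from any real program, and the input format (pairs of behaviour id and a yes/no from the manager) is an assumption. The point of the sketch is only that the list of behaviours is fixed at design time and the evaluation code refuses anything that was not on it.

```python
from collections import defaultdict

# Behaviours fixed during design (hypothetical examples for illustration).
TARGET_BEHAVIOURS = {
    "scopes_requirements": "Confirms requirements with the client before configuring",
    "tests_before_handover": "Runs the agreed test script before handing work over",
}


def demonstration_rates(observations):
    """observations: iterable of (behaviour_id, demonstrated) pairs recorded
    by managers. Returns {behaviour_id: share of observations in which the
    behaviour was demonstrated}, with None where nothing was observed."""
    counts = defaultdict(lambda: [0, 0])  # behaviour_id -> [demonstrated, total]
    for behaviour_id, demonstrated in observations:
        if behaviour_id not in TARGET_BEHAVIOURS:
            # Evaluation can only cover what the design phase specified.
            raise ValueError(f"Behaviour not defined at design time: {behaviour_id}")
        counts[behaviour_id][1] += 1
        if demonstrated:
            counts[behaviour_id][0] += 1
    return {
        b: (counts[b][0] / counts[b][1] if b in counts else None)
        for b in TARGET_BEHAVIOURS
    }
```

A `None` in the result is itself a finding: it marks a behaviour the program trained for that no manager has yet been asked, or been able, to observe.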

Escalation frequency. In any role that involves judgment — and most technical roles do — the frequency with which a trained practitioner escalates a decision they should be able to own is a signal. High escalation frequency after training means the program produced knowledge without confidence, or capability without enough scenario depth to handle real-world variability. It is a leading indicator of transfer failure before the failure becomes visible in output quality.

First-attempt pass rate on real production tasks. Not on the assessment. On the actual work. The first client engagement. The first configuration that goes to production. The first time a new hire handles a case without supervision. Whether they get it right the first time, and how far off they are when they don't, is the closest thing to a ground truth that L&D has access to.
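The three task-level measures described above (error rate on first independent tasks, escalation frequency, and first-attempt pass rate) can all come from the same simple log of real work. The sketch below assumes a hypothetical `TaskRecord` with two manager-supplied flags; the field names are mine, and what counts as "accepted without rework" or "an escalation the role should own" would have to be agreed with the people who hold the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """One real production task after training (hypothetical fields)."""
    participant: str
    first_attempt_ok: bool  # accepted without correction, escalation, or rework
    escalated: bool         # a decision the role should own was handed upward


def task_metrics(records: list[TaskRecord]) -> dict[str, float]:
    """Summarise a cohort's first independent tasks as simple proportions."""
    if not records:
        raise ValueError("no task records to summarise")
    total = len(records)
    passed = sum(r.first_attempt_ok for r in records)
    escalated = sum(r.escalated for r in records)
    return {
        "first_attempt_pass_rate": passed / total,
        "error_rate": (total - passed) / total,
        "escalation_frequency": escalated / total,
    }
```

The arithmetic is trivial. The hard part is the input: every row in that log is owned by a manager or an operations team, which is why the relationship has to exist before the numbers can.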

For partner-facing programs specifically — which is where this question became unavoidable for me — the business metrics are even more precise. Time-to-first-deployment on a client engagement. Deal velocity for partner-led implementations. Escalation frequency back to the product team. Defect rate on partner-delivered configurations. These are measures of practice performance, and they exist in the business whether the L&D function tracks them or not. The question is whether L&D is willing to connect its work to them.

The moment you agree to be measured by those numbers, you have agreed to be accountable for a line of causality that extends well beyond what happened in the room. That is the right level of accountability. It is also why most organisations never get there.

The purpose of measuring — before the metrics

There is a prior question that most measurement conversations skip over: what are you actually trying to learn by measuring? This sounds obvious. It is not.

If you are measuring to justify budget, you will select metrics that make the investment look defensible. Completion rates and cost-per-learner figures serve this purpose well. They are not lies. They are a curated truth assembled to support a conclusion already reached.

If you are measuring to improve design, you need different data entirely. You need to know where the capability gap persists after training, not how many people completed the program. You need to know which scenarios in the capstone predicted real-world performance, so you can weight them differently next time. You need early signals of transfer failure — not to assign blame, but to identify whether the failure originated in the design, the deployment conditions, or the participant's post-training environment.

If you are measuring to demonstrate business impact, you need the business metrics. Full stop. No training metric substitutes for a business outcome. And if you have not built the stakeholder relationships that give you access to those metrics — with operations, with line management, with the revenue or performance function that owns the numbers — that is the first problem to solve. Not the methodology.

Measurement is not the last step of a training program. It is a design decision made at the beginning. You cannot retrospectively measure what you did not define in advance.

Why the right metrics don't get used

The frameworks have existed for decades. Kirkpatrick's four levels. Brinkerhoff's Success Case Method. Phillips' ROI chain. Every serious L&D function is aware of them. Very few evaluate beyond Kirkpatrick's Level 1 (reaction) and Level 2 (learning). The question is why, and the honest answer is not that the methodology is too complex or the data too hard to collect. The honest answer is accountability aversion.

Measuring to Level 3 — behaviour change back on the job — requires the L&D function to accept that its program is a contributing variable in a business outcome that it does not fully control. The manager's behaviour after training matters. The team environment matters. Whether the participant gets to apply what they learned within the first two weeks matters enormously. L&D can design for all of these things — it can brief managers, it can build application frameworks, it can create structured re-entry conditions — but it cannot guarantee them. And agreeing to be measured by an outcome you cannot fully control is a different risk posture than reporting a completion rate.

The deepest irony is that the L&D functions which resist outcome measurement are often the most sophisticated methodologically. They know the frameworks. They understand the distinction between Level 1 and Level 4. They could build the measurement architecture if asked. The barrier is not knowledge. It is that nobody upstream has required it, and the current reporting structure gives them perfectly adequate numbers to present in a review meeting without ever touching a business outcome. The incentive to stay in that lane is real.

It takes a stakeholder willing to ask a harder question — or an L&D leader willing to offer accountability before it is demanded — to break the pattern. In my experience, it is almost never the measurement methodology that changes first. It is the commissioning conversation. Someone, somewhere in the organisation, starts asking whether the training worked in the way that matters. And that question, when it arrives seriously, reorganises everything downstream of it.

The honest version of the metrics conversation

If your organisation is currently measuring training by completion rates, assessment scores, and post-program NPS, you are not measuring training impact. You are measuring training delivery. Those are different things, and conflating them is a choice — often an unconscious one — with real consequences for how programs get designed, funded, and evaluated.

The path forward is not to immediately replace your current metrics with a Level 4 results analysis or a full ROI calculation. That conversation requires relationships and access that take time to build. The path forward is to be honest, in the next measurement conversation you have, about what your current numbers actually tell you. Completion rate: here is what it measures and what it doesn't. Assessment score: here is its fidelity to real-world performance. NPS: here is what satisfaction and capability have in common, and what they don't.

Then ask the question that comes next: what would it take to know whether this actually worked? That question, asked seriously and followed to its conclusion, is where good measurement begins. It almost never begins with the metrics.

Why I never got to implement this

I have spent most of this article explaining what good measurement looks like. I should be equally direct about this: outside of the partner enablement work at Apttus, I have not implemented it at scale. Not because the methodology was unclear. Because the clients were not interested.

The training organisations I worked with — and most training organisations, if they are honest — are not in the business of measuring outcomes. They are in the business of delivering training. The commercial model is built around coverage: how many topics, how many days, how many heads in seats. The commissioning conversation is almost always the same. What needs to be covered? What is the timeline? When can we start? The question of what a participant should be able to do independently at the end — and how we would know if they got there — does not appear on the agenda. It is not suppressed. It simply has no natural home in a transaction structured around delivery.

The consequence of that model is a ceiling. Bloom's taxonomy has six levels. The training industry, as currently structured, reliably delivers the first three — Remember, Understand, Apply — and leaves the rest to chance. Participants exit a program at the Apply level, equipped to replicate a demonstrated procedure in a controlled environment. Whether they get to Analyse, Evaluate, or Create is left to them and whatever happens to be in front of them when they return to the job. Most do not get there. The conditions are not created. The scaffolding is not built. And before the gap has time to show up as a visible performance problem, the participant has been reassigned to a different project, a different suite of applications, and the cycle repeats with a fresh cohort and another set of completion certificates.

The training industry does not have a measurement problem. It has a business model problem. Measuring outcomes would require being accountable to them — and the current model is not structured around outcomes. It is structured around events.

What I carry from the Apttus partner programs is the knowledge of what becomes possible when the business stakes are high enough to change that conversation. When a consulting practice's revenue depends on its consultants performing in client rooms, the commissioning conversation shifts. Outcome metrics stop being theoretical and become operationally necessary. The training function stops being a delivery service and starts being a capability infrastructure. The difference in what gets measured — and what gets designed — is significant.

That experience is also why I am clear about what I am looking for next. Not another organisation where the ceiling is completion rates and smile sheets. An organisation willing to have a different conversation about what training is actually for.

What a data-driven trainer looks like — and why it matters to the business

The phrase gets used loosely. In most L&D contexts it means someone who can pull LMS reports with a degree of fluency and present completion data in a dashboard. That is not what I mean. A genuinely data-driven trainer is something structurally different — and the difference shows up not just in the metrics they track but in how they approach every stage of the work, from the first commissioning conversation to the last capstone assessment.

They define the exit state before they design anything. The data-driven trainer's first question is not what needs to be covered. It is: describe the work this person will be doing in ninety days, in enough detail that I can build an assessment around it. That question forces a precision that most commissioning conversations never reach. It also creates the measurement anchor — because you cannot track time-to-competence if you have not defined what competence looks like, and you cannot observe behaviour change if you have not specified which behaviours you are watching for.

They instrument the design, not just the delivery. Assessment is not an end-of-program event in a data-driven curriculum. It is distributed throughout. Every exercise, every simulation, every guided practice attempt generates a signal. Which problems trip participants consistently? Where does independent performance diverge from guided performance? Which scenarios in the capstone predict real-world performance and which do not? A data-driven trainer is collecting and reading those signals in real time, not waiting for a post-program survey to tell them something went wrong.

They follow the participant out of the room. The conversation with line managers is not optional. Neither is access to the post-training performance data that tells you whether transfer actually happened. A data-driven trainer builds those relationships as part of the program design — not as a reporting afterthought. They define what managers need to observe, when, and how to report it. They establish what "good" looks like in the first independent tasks. They create the conditions for the data to exist before the program runs, because they know it will not materialise on its own afterwards.

They use the data to improve the next cohort, not just report on the last one. This is where the compounding value lives. An organisation that runs the same onboarding program for three years without tracking post-training performance has no way of knowing whether the program improved, degraded, or stayed irrelevant. An organisation with a data-driven trainer has a longitudinal dataset: which design decisions predicted capability, which scenarios proved to be noise, which cohort characteristics correlated with faster time-to-competence. That dataset is an asset. It makes every subsequent program more precise and every budget conversation more defensible.

The impact of a data-driven training function is not confined to the L&D reporting line. When training is connected to business outcome metrics, it changes what is visible to the business units it serves. Managers stop receiving completion certificates and start receiving capability assessments. Operations teams get leading indicators of ramp time before new hires hit the floor. Revenue functions can trace the line between enablement investment and practitioner performance in client-facing roles. The training function stops being a cost centre that delivers events and starts being an intelligence function that tracks and accelerates capability development across the organisation.

That is a different kind of L&D. It requires a different kind of trust from the organisation — access to data, access to managers, access to the post-training environment. But it produces a different quality of result. Not just better-trained participants. A business that actually knows what its people can do.

I have described this in some detail because I want to be transparent about what I am offering and what I am looking for. The capability transfer philosophy, the curriculum design method, the assessment architecture — those I have developed and applied across eighteen years and five countries. The full measurement infrastructure, connected end-to-end from design through to business outcome data, is something I have built in part and theorised in full. What I have not found yet is an organisation with the appetite to build it completely.

That organisation exists. The business case for it is not complicated. Training that can demonstrate its own impact compounds over time — better design, faster ramp, lower error rates, more defensible investment decisions. The only thing required is a willingness to measure what actually matters, and to hold the training function accountable to it. I am looking for the organisation that is ready for that conversation. When I find it, the measurement architecture described in this article is where we start.

References & Further Reading

Kirkpatrick, D.L., & Kirkpatrick, J.D. (2006). Evaluating Training Programs: The Four Levels. 3rd Edition. Berrett-Koehler. — The foundational framework. The gap between what Kirkpatrick described and what most organisations actually implement at Level 3 and 4 is the central measurement problem in L&D.

Brinkerhoff, R.O. (2003). The Success Case Method: Find Out Quickly What's Working and What's Not. Berrett-Koehler. — An alternative to comprehensive evaluation: find the extreme cases (what worked best, what failed most) and interrogate them. More actionable than aggregate data for most design improvement purposes.

Brinkerhoff, R.O. (2006). Telling Training's Story: Evaluation Made Simple, Credible, and Effective. Berrett-Koehler. — On why most training measurement misses the question it should be asking, and how to frame impact in terms stakeholders outside L&D can actually use.

Phillips, J.J. (1997). Return on Investment in Training and Performance Improvement Programs. Butterworth-Heinemann. — On extending Kirkpatrick's framework to Level 5: isolating the effect of training from other variables and converting it to financial terms. Useful less for the arithmetic than for the discipline it requires at the design stage.

Anderson, L.W., & Krathwohl, D.R. (Eds.) (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Longman. — On the six levels of cognitive demand, and the structural gap between the Apply level most training reaches and the Analyse, Evaluate, and Create levels most participants need.

