Transfer Pricing | 25 March 2026 | 11 min read

AI comparables in transfer pricing: an honest assessment

Which parts of TP are genuinely improved by AI, which are being oversold, and how we benchmark our own comparables engine against the status quo.

Lena Park

Lead Transfer Pricing Engineer, FiscalEyes

In a nutshell

AI clearly improves search-cone construction, functional tagging and data-quality flagging — measured against an analyst baseline.
AI is oversold for royalty CUPs, synthetic comparables and one-click quartile selection.
Our internal benchmark: 88% precision / 91% recall on search; 94% agreement on routine distributor profiles.
Audit defense is a narrative problem. AI without a traceable audit trail will fail.
Use AI for distribution and routine services. Keep value-chain analysis human.

The state of TP comparables in 2026

Transfer pricing comparables are the single most labour-intensive activity in the international tax workflow. A typical TNMM benchmark for a routine distributor — the kind a multinational runs dozens of times per year — still consumes between 25 and 60 analyst hours across database queries, manual rejection of false positives, financial-data normalization, qualitative review, and the writing of the local file. The economics are unforgiving: at $200/hour blended cost, a single benchmark is $5,000–$12,000 of internal time, and a typical mid-sized multinational runs sixty to eighty benchmarks per year. The total is a defensible, but not negligible, line on the tax-function budget.
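The arithmetic behind those figures can be sketched as below. The hourly rate, the hour range and the benchmark counts come from the paragraph above; the function name is illustrative, and the annual totals are derived here rather than reported budget numbers.

```python
def benchmark_cost(hours: float, hourly_rate: float = 200.0) -> float:
    """Internal cost of one benchmark at a blended hourly rate."""
    return hours * hourly_rate

low_cost = benchmark_cost(25)    # lower bound: 25 analyst hours
high_cost = benchmark_cost(60)   # upper bound: 60 analyst hours

# Annual range: the cheapest case is 60 benchmarks at 25 hours each,
# the most expensive case is 80 benchmarks at 60 hours each.
annual_low = 60 * low_cost
annual_high = 80 * high_cost

print(f"Per benchmark: ${low_cost:,.0f} to ${high_cost:,.0f}")
print(f"Per year: ${annual_low:,.0f} to ${annual_high:,.0f}")
```

On these inputs the annual range works out to $300,000 to $960,000 of internal time.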

Into this workflow has arrived a wave of AI-powered TP tools, ours among them. The vendor pitch is uniform: faster benchmarks, lower cost, fewer errors. The problem with that pitch is that it collapses several quite different problems into a single “AI does TP” story. Some of those problems are genuinely solvable today; others are not. We think the honest version of the assessment is more useful than the marketing version, both for our prospective customers and for the broader profession.

What follows is the framework we use internally — and benchmark ourselves against — when we assess where AI improves a TP workflow and where it does not yet.

Where AI genuinely improves the workflow

Three layers of the comparables process are well-suited to AI, and we see meaningful improvements over a manual baseline.

1. Search-cone construction

The starting point of any benchmark is a search strategy: a set of activity codes (NACE, SIC, NAICS), a geography, a size band, a list of independence criteria, and an exclusion list of obviously non-comparable entities. The classical approach is for an analyst to write the search by hand, run it against Bureau van Dijk's Amadeus or Orbis, and iterate until the result set looks defensible. The work is mechanical and the analyst is mostly remembering things.

A well-trained model can construct a defensible first-pass search from a tested-party functional description. Our internal benchmark, run against an analyst-built test set of 200 prior benchmarks, shows that the model-built search matches the analyst's search at 88% precision and 91% recall in the initial result set. The work that remains is iteration, and that is where the analyst's judgment still matters.
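For readers who want the metric pinned down, here is a minimal sketch of precision and recall computed as set overlap between the model's result set and the analyst's. This is one plausible reading of the comparison, not a description of our scoring code: the article does not say whether scores are pooled or averaged across the 200 benchmarks, and the entity identifiers below are made up.

```python
def precision_recall(model_hits: set[str], analyst_hits: set[str]) -> tuple[float, float]:
    """Treat the analyst's result set as ground truth for the model's result set."""
    overlap = len(model_hits & analyst_hits)
    # Precision: share of model-returned entities the analyst also returned.
    precision = overlap / len(model_hits) if model_hits else 0.0
    # Recall: share of analyst-returned entities the model also found.
    recall = overlap / len(analyst_hits) if analyst_hits else 0.0
    return precision, recall

# Toy data: five model hits, six analyst hits, four in common.
model = {"A", "B", "C", "D", "E"}
analyst = {"B", "C", "D", "E", "F", "G"}
p, r = precision_recall(model, analyst)
print(f"precision={p:.2f} recall={r:.2f}")
```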

2. Functional-analysis tagging

Reading the published descriptions of 1,500 candidate entities and tagging each with a functional profile is exhausting work that takes most of a junior analyst's day. Modern language models do this reliably and consistently. On the same test set, the model-tagged functional profile agrees with the senior analyst's review 94% of the time on routine distribution profiles, dropping to ~82% on more nuanced profiles (limited-risk distributors, contract R&D providers, principal-fragmented chains). The error mode is consistent: the model retains fewer edge cases than a senior reviewer would.
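An agreement rate split by profile type, as quoted above, can be computed along these lines. The grouping labels, tag strings and sample rows are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict

def agreement_by_group(rows):
    """rows: iterable of (group, model_tag, reviewer_tag) tuples."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for group, model_tag, reviewer_tag in rows:
        totals[group] += 1
        matches[group] += int(model_tag == reviewer_tag)
    # Share of entities in each group where the two tags are identical.
    return {group: matches[group] / totals[group] for group in totals}

sample = [
    ("routine", "distributor", "distributor"),
    ("routine", "distributor", "distributor"),
    ("routine", "manufacturer", "distributor"),
    ("routine", "distributor", "distributor"),
    ("complex", "limited-risk distributor", "limited-risk distributor"),
    ("complex", "distributor", "limited-risk distributor"),
]
rates = agreement_by_group(sample)
print(rates)
```

Exact string equality is the simplest possible agreement test; a real comparison would need a fixed tag vocabulary so that near-synonyms are not counted as disagreements.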

3. Data-quality flagging

The least glamorous and most useful application. Comparables data is dirty: dormant entities, accounting changes, M&A noise, group-only filings, currency redenominations. A model trained on historical rejection patterns flags candidates that are likely to be rejected in qualitative review. In our benchmark the pre-review flag catches roughly 70% of analyst rejections, at a 4% false-positive rate on accepted entities, which materially shortens the manual review stage.
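Two rates matter for a pre-review flag: the share of analyst rejections it catches, and the share of accepted entities it flags by mistake. A minimal sketch of both follows. The entity identifiers are invented, and the toy sets are deliberately sized so the output lands on round numbers; they are not our benchmark data.

```python
def flag_metrics(flagged: set[str], rejected: set[str], accepted: set[str]) -> tuple[float, float]:
    """Return (catch_rate, false_positive_rate) for a pre-review flag."""
    # Share of the analyst's eventual rejections that the flag identified up front.
    catch_rate = len(flagged & rejected) / len(rejected) if rejected else 0.0
    # Share of the analyst's accepted entities that the flag wrongly marked.
    false_positive_rate = len(flagged & accepted) / len(accepted) if accepted else 0.0
    return catch_rate, false_positive_rate

rejected = {f"R{i}" for i in range(10)}        # 10 entities the analyst rejected
accepted = {f"A{i}" for i in range(25)}        # 25 entities the analyst accepted
flagged = {f"R{i}" for i in range(7)} | {"A0"}  # 7 true flags, 1 false flag

catch, false_pos = flag_metrics(flagged, rejected, accepted)
print(f"catch rate={catch:.0%} false-positive rate={false_pos:.0%}")
```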

Where AI is being oversold

Three layers are being marketed aggressively and, in our view, should be approached with skepticism.

1. Royalty CUP databases

Several vendors now offer “AI-curated royalty rate databases” for the comparable uncontrolled price (CUP) method. The marketing implies that AI extracts royalty rates from public licensing agreements and produces a defensible comparable set. The reality is that public licensing agreements almost never disclose enough about the licensed property, territory, exclusivity, term, sublicensing rights, or commercial context to support a CUP without significant qualitative work. AI extraction is fast; the underlying data is too sparse to support the conclusion. Tax authorities know this and increasingly reject CUP analyses sourced from AI-curated databases without supporting qualitative work.

2. Synthetic comparables

The most aggressive vendor pitch is “synthetic comparables” — generated entities with imputed financials based on what a comparable entity would have looked like in a jurisdiction where actual comparables are scarce. There is no tax authority on earth that accepts this, and we would not defend it. If the rejection rate of an analyst-built search in a particular jurisdiction is 95% (as it is for several Latin American and African markets), the answer is to use the next-best inter-quartile range with a documented adjustment, or to apply a different method entirely. Synthetic comparables are not a workaround.

3. Automatic quartile selection

Tax authorities apply jurisdiction-specific rules to the construction of the inter-quartile range — the use of multi-year data, the inclusion or exclusion of loss-making entities, the application of working-capital adjustments, and the choice of the financial indicator (operating margin, Berry ratio, ROCE). These are not algorithmic decisions. They are jurisdiction-specific and audit-driven. A “one-click quartile” result that ignores the jurisdiction-specific construction rules will not survive contact with a real transfer-pricing audit.
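To show why the construction rules change the answer, here is a sketch in which the same candidate set yields two different ranges under two rule sets. Everything in it is an assumption for illustration: the RangeRules fields, the sample margins (in percent), and the choice of the "inclusive" quantile method from Python's statistics module. Real jurisdictions specify their own quartile conventions, and working-capital adjustments and indicator choice are left out entirely.

```python
from dataclasses import dataclass
from statistics import mean, quantiles

@dataclass(frozen=True)
class RangeRules:
    """Hypothetical rule set; actual rules vary by jurisdiction."""
    multi_year_average: bool = True
    exclude_loss_makers: bool = False

def interquartile_range(margins_by_entity: dict[str, list[float]], rules: RangeRules):
    """Return (lower quartile, median, upper quartile) under the given rules."""
    points = []
    for yearly in margins_by_entity.values():
        # Either average all years or take the most recent year only.
        value = mean(yearly) if rules.multi_year_average else yearly[-1]
        if rules.exclude_loss_makers and value < 0:
            continue
        points.append(value)
    q1, median, q3 = quantiles(points, n=4, method="inclusive")
    return q1, median, q3

# Operating margins in percent, three years per entity (made-up data).
margins = {
    "A": [2.0, 3.0, 4.0],
    "B": [1.0, 2.0, 3.0],
    "C": [-3.0, -1.0, -2.0],  # loss-making in every year
    "D": [5.0, 6.0, 7.0],
    "E": [4.0, 4.0, 4.0],
}
with_losses = interquartile_range(margins, RangeRules())
without_losses = interquartile_range(margins, RangeRules(exclude_loss_makers=True))
print(with_losses, without_losses)
```

Keeping the rules in an explicit, named object rather than inside the calculation is what makes the choice reviewable: the rule set used for a given file can be logged next to the result.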

Our honest benchmark: precision/recall on a curated test set

We maintain an internal benchmark of 200 closed-out TP benchmarks across 14 jurisdictions, where we have the analyst output, the senior reviewer's overrides, and (in 31 of them) the post-audit result. We run our own model against the benchmark every quarter. The current numbers, against a senior-analyst baseline:

Search-cone construction. Precision 88% / Recall 91%. Steady over three quarters.
Functional-profile tagging. Agreement 94% on routine distributors; 82% on complex profiles.
Pre-review rejection flagging. 70% of analyst rejections caught; 4% false-positive rate on accepted entities.
Final inter-quartile range vs. analyst. Median delta of 38 basis points on the lower quartile and 51 basis points on the upper. Within the noise band of analyst-to-analyst variation.
Audit defensibility (n=31). Of the 31 benchmarks where we have post-audit data, 28 of the AI-assisted analyses survived audit on the same terms as the corresponding human-only analyses. The three that moved did so because of qualitative arguments, not because of the comparable set.
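The inter-quartile comparison in the list above is expressed in basis points. A minimal sketch of that kind of measure follows, assuming the delta is the absolute difference between the model's and the analyst's quartile for each benchmark; the article does not state whether the reported figure is absolute or signed, and the margins below are invented.

```python
from statistics import median

def median_delta_bps(model_values: list[float], analyst_values: list[float]) -> float:
    """Median absolute gap, in basis points, between paired margins given as fractions."""
    if len(model_values) != len(analyst_values):
        raise ValueError("inputs must be paired one-to-one")
    # 1 basis point = 0.0001, so multiply the fractional gap by 10,000.
    deltas = [abs(m - a) * 10_000 for m, a in zip(model_values, analyst_values)]
    return median(deltas)

# Lower quartiles from three hypothetical benchmarks, as fractions of revenue.
model_q1 = [0.021, 0.034, 0.015]
analyst_q1 = [0.018, 0.030, 0.016]
delta = median_delta_bps(model_q1, analyst_q1)
print(f"median lower-quartile delta: {delta:.0f} bps")
```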

The headline: at the comparable-set construction layer, AI and a senior analyst converge. At the qualitative defense layer, the analyst still wins.

FiscalEyes in this workflow. The Transfer Pricing module ships with the benchmark above as a built-in regression test. Every model update has to clear the same bar before it goes live. The numbers in this article are the numbers in the product — create a free account and run a benchmark on your own tested party.
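A regression gate of the kind described here could be as simple as the sketch below. It is not the product's test harness. The metric names are hypothetical, and the floors simply reuse the scores quoted in this article as placeholders, since the actual pass bar is not stated.

```python
# Hypothetical floors: the article reports current scores, not the pass bar.
FLOORS = {
    "search_precision": 0.88,
    "search_recall": 0.91,
    "tagging_agreement_routine": 0.94,
    "rejection_catch_rate": 0.70,
}

def failing_metrics(results: dict[str, float], floors: dict[str, float] = FLOORS) -> list[str]:
    """Return the metrics below their floor; a missing metric counts as a failure."""
    return sorted(
        name for name, floor in floors.items()
        if results.get(name, float("-inf")) < floor
    )

candidate = {
    "search_precision": 0.89,
    "search_recall": 0.90,  # below the 0.91 floor
    "tagging_agreement_routine": 0.95,
    "rejection_catch_rate": 0.70,
}
blocked = failing_metrics(candidate)
print("release blocked by:", blocked)
```

Only higher-is-better metrics are included; a rate where lower is better, such as false positives, would need a ceiling rather than a floor.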

The audit-defense problem

A point that is rarely raised in vendor pitches but matters enormously in practice: a benchmark is not a deliverable, it is a narrative. The local file has to explain why the comparable set was constructed the way it was, why each rejection was made, why the financial indicator was chosen, why the period was selected, why the adjustments were applied. In an audit, the auditor walks through the narrative and tests its consistency. AI-generated benchmarks that produce a number without a defensible narrative do not survive this process — even if the number is correct.

The implication is that AI tooling should produce traceable artifacts, not opaque results. Every rejection should carry a reason; every adjustment should be logged; every search-strategy iteration should be reconstructable. We've learned this the hard way over three years of product iteration, and we'd be skeptical of any TP tool — ours included — that cannot produce a full audit trail of the decisions the model made.
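As an illustration of what a traceable artifact can mean in practice, the sketch below logs each decision as a structured record and refuses to store one without a reason. The class names, field names and example entity are all hypothetical; this is not the FiscalEyes schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Decision:
    entity_id: str
    step: str        # e.g. "search", "rejection", "adjustment"
    action: str      # e.g. "reject", "accept", "apply"
    reason: str
    decided_by: str  # "model" or an analyst identifier
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class AuditTrail:
    """Append-only log of decisions; entries without a reason are refused."""

    def __init__(self) -> None:
        self._entries: list[Decision] = []

    def record(self, decision: Decision) -> None:
        if not decision.reason.strip():
            raise ValueError("every decision must carry a reason")
        self._entries.append(decision)

    def to_json_lines(self) -> str:
        # One JSON object per line, in the order the decisions were made.
        return "\n".join(json.dumps(asdict(d), sort_keys=True) for d in self._entries)

trail = AuditTrail()
trail.record(Decision("ENT-001", "rejection", "reject",
                      "dormant entity, no revenue in period", "model"))
print(trail.to_json_lines())
```

Rejecting reason-free entries at write time is the design choice that matters: it makes a missing rationale a failure during the benchmark rather than a gap discovered during an audit.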

How we use AI internally

For the in-house TP teams using FiscalEyes, the workflow we recommend (and use ourselves) is:

Functional analysis: human first. The tested-party characterization is the most consequential decision in the benchmark. AI should consume it; AI should not produce it.
Search construction: AI first, analyst review. Let the model build the first-pass search. The analyst spends time on the iteration, not on the boilerplate.
Candidate review: AI for rejection, analyst for acceptance. The model proposes rejections. The analyst confirms or overrides each. Acceptances are reviewed in the same pass.
Quartile construction: deterministic rules, audit-trail logged. Use the explicit jurisdictional rule set; do not rely on a black-box result.
Local file: human-written. The narrative is the deliverable. AI can draft sections; the analyst owns the final document.

This workflow cuts a typical benchmark from 35 hours to ~12 hours of analyst time. It does not eliminate the analyst. Anyone selling that pitch is not telling the full story.

Recommendations for in-house TP teams

Use AI for distribution and routine services. These are the highest-volume, lowest-risk benchmarks. The AI uplift is largest here, and the audit consequence of an error is bounded.
Keep value-chain analysis human. Profit-split methods, residual analyses, and DEMPE allocations should remain analyst-led. AI can structure the data; it should not draw the conclusion.
Demand a regression-test benchmark from your vendor. If the vendor cannot show a closed-loop benchmark of their own model against analyst output, you do not yet know what the model can do.
Audit the audit trail. Run a sample of AI-generated benchmarks past your most senior reviewer before you put them into a local file. The first round of adoption is where errors propagate; spend the time.
Plan for the OECD's simplification of baseline distribution. The Amount B / simplified distribution profile is now in force in many adopting jurisdictions. Where it applies, the comparables work collapses to a fixed-rate determination — and the AI question becomes irrelevant for that population.

The bottom line

AI is genuinely useful in transfer pricing. It is most useful at the parts of the workflow that were already mechanical, and least useful at the parts that were always going to be qualitative. A TP team that treats AI as a replacement for analyst judgement will produce benchmarks that fail audit. A TP team that treats AI as a triage layer — and keeps the narrative, the qualitative review, and the audit defense firmly in human hands — will run twice as many benchmarks at higher quality. We build for the second team.

Take it further

Run your next TP benchmark in FiscalEyes.

Free account, no credit card. Plug in a tested-party profile and see the search-cone, functional tagging, candidate review and inter-quartile range — with the full audit trail behind every decision the model made.
