Using large language models for personality-based job-fit assessment, and the calibration rabbit hole that followed.
The Challenge
My boss asked me a question that sounded simple: “Can you get an AI to generate a psychometric scoring formula from any text description?”
I said yes before I fully understood what I was agreeing to.
Here’s what makes psychometric formula design so hard. A trained psychometrician reads a competency description and has to decide, from a pool of dozens of behavioral traits, which ones actually predict success. That alone sounds manageable. But the traits aren’t neatly separated. They overlap. They blur into each other. “Follow-through” and “perseverance” sound similar but measure fundamentally different things. One is about finishing tasks under normal conditions. The other is about pushing through when everything is working against you. Pick the wrong one and your formula quietly drifts from measuring what you intended.
Now multiply that by every pair in the pool. Planning ability and policy adherence share the same personality foundations but serve completely different purposes. The capability to work alone and the preference for working alone look identical on paper but one is a skill and the other is a desire. Initiative and proactivity both involve doing things before being asked, but one is about starting and the other is about anticipating.
A seasoned psychometrician navigates these distinctions through years of practice, pattern recognition, and a kind of trained intuition that’s hard to articulate. They’ve seen thousands of profiles. They know, almost by feel, that a competency built around “strategic influence” needs a specific cocktail of assertiveness, political awareness, and confidence, not the generic “leadership plus communication plus teamwork” that a surface reading would suggest.
The question was whether an AI could learn to make those same fine-grained distinctions. Not perfectly. But well enough to be useful at scale.
That became the project.
The Problem I Was Trying to Solve
Psychometric job-fit scoring is one of those processes that has always required a trained specialist. Someone reads a competency description like “Achievement Drive” or “Team Leadership,” figures out which personality traits predict success in that area, assigns weights to each trait, and encodes the result as a scoring formula.
It works. But it’s slow. Each new competency requires expert analysis, and organizations often need to assess candidates against dozens of competencies at a time. I wanted to know: could an AI do this mapping? And if so, how close could it get to what the human experts produce?
How the System Works
The pipeline has four stages, each solving a different piece of the puzzle.
Stage 1 is formula generation. A competency description goes into a large language model through a carefully engineered prompt. The model reads the text, identifies the personality traits that matter most, and outputs a weighted formula. Think of it as a recipe: “this competency requires a lot of planning ability, a fair amount of self-confidence, and some proactivity.”
Stage 2 is where things get interesting. A single LLM call produces a reasonable but inconsistent formula. So I don’t call it once. I call it 50 times with the same input, then count how often each trait appears across all the runs. Some traits show up in 48 out of 50 runs. Others show up in 6. The ones that consistently appear are the signal. Everything else is noise. I built an aggregation pipeline that filters by a consistency threshold and produces a single consensus formula from all 50 runs.
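The aggregation step can be sketched in a few lines. This is a minimal illustration, assuming each run’s formula is represented as a dict mapping trait name to weight; the function names and the toy data are mine, not the production code.

```python
from collections import Counter

def consensus_formula(runs, min_frequency=0.6):
    """Aggregate many LLM-generated formulas into one consensus formula.

    `runs` is a list of dicts mapping trait name -> weight, one per LLM call.
    Traits appearing in fewer than `min_frequency` of runs are treated as
    noise and dropped; surviving traits keep their average weight.
    """
    counts = Counter(trait for formula in runs for trait in formula)
    n = len(runs)
    consensus = {}
    for trait, count in counts.items():
        if count / n >= min_frequency:
            weights = [f[trait] for f in runs if trait in f]
            consensus[trait] = round(sum(weights) / len(weights), 2)
    return consensus

# Toy example: all runs agree on "planning"; "creativity" appears once.
runs = [
    {"planning": 3, "self_confidence": 2},
    {"planning": 3, "self_confidence": 2, "creativity": 1},
    {"planning": 2, "proactivity": 1},
    {"planning": 3, "self_confidence": 1},
]
print(consensus_formula(runs))
# → {'planning': 2.75, 'self_confidence': 1.67}
```

With a 0.6 threshold, traits seen in one run out of four ("creativity", "proactivity") are filtered out as noise, exactly the behavior described above.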
Stage 3 is population benchmarking. The consensus formula gets scored against a database of over 18,000 real (anonymized, handpicked, human-validated) personality profiles. This gives me the statistical distribution: where does any given person’s score fall relative to the broader population? I extract percentiles from this distribution to use in the final step.
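The benchmarking step reduces to a rank lookup: where does a score land inside the population distribution? A minimal sketch, with a tiny stand-in population instead of the real 18,000-profile database:

```python
import bisect

def score_percentile(score, population_scores):
    """Return the percentile (0-100) of `score` within a benchmark
    population: the share of profiles scoring at or below it."""
    population_scores = sorted(population_scores)
    rank = bisect.bisect_right(population_scores, score)
    return 100.0 * rank / len(population_scores)

# Toy population standing in for the 18,000-profile benchmark.
population = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
print(score_percentile(3.0, population))  # → 50.0
```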
Stage 4 is calibration. The AI’s raw scores don’t match scores from the human-calibrated system. They’re systematically biased. So I developed a correction formula using linear regression across 75 competency topics and a set of validated test subjects. The result is a simple linear transform that maps AI scores onto the human-calibrated scale.
The Consensus Method
This turned out to be the most valuable part of the whole project.
Most people use LLMs as one-shot tools: ask once, get an answer, move on. But formula generation is a judgment call. There’s no single right answer. Different runs produce slightly different formulas, not because the AI is broken, but because there are multiple valid interpretations of any competency description.
By running 50 iterations and aggregating, I turn a noisy judgment into a statistical signal. The traits that only appear when the model fixates on a specific bullet point get filtered out. The traits that the model consistently identifies, regardless of which part of the description it focuses on, survive.
I tested this approach with both 20 and 50 runs. Fifty runs produced tighter consensus and more stable formulas, though with diminishing returns. The sweet spot depends on how much you’re willing to spend per topic.
What I Learned About Prompt Engineering
The prompt went through two major versions, and the difference in quality was measurable.
Version 1 was simple: “You’re a psychology expert, convert this statement into a formula.” It produced reasonable output but the formulas were noisy. The AI would map each bullet point in a description to a separate personality trait, producing bloated formulas where half the traits were tangentially relevant at best.
Version 2 was a ground-up rebuild with several specific improvements.
The persona got sharper. Instead of a generic expert, I defined someone who “prioritizes construct precision over comprehensiveness.” That last bit turned out to be the single most impactful change. It told the AI to find fewer, better traits rather than listing everything remotely relevant.
I added a structured reading sequence that forces the AI to extract the core concept before selecting traits. One step asks the AI to identify the primary verb in the description. Turns out, the main verb almost always maps directly to the dominant trait. “Plans” points to planning ability. “Persuades” points to influence. Simple, but the AI doesn’t do it automatically unless you tell it to.
Weight anchoring was another big win. Without explicit criteria for what “weight 2 vs weight 3” means, the AI interprets it differently every run. I added concrete definitions and expected distributions. Variance in weighting dropped noticeably.
I also added anti-examples based on real errors from the V1 output. Seven categories of mistakes the AI commonly made, with explanations of why they’re wrong. Things like padding formulas with generic “nice to have” traits, confusing similar-sounding traits, or applying the wrong direction to inversely-scored traits.
Finally, I expanded each trait definition from one sentence to three or four, with behavioral examples and explicit cross-references to similar traits. The AI confuses overlapping traits far less when you spell out exactly how they differ.
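To make the V2 structure concrete, here is a hypothetical skeleton of how the pieces fit together: persona, reading sequence, weight anchors, anti-examples, and trait definitions, assembled into one prompt. All section texts are placeholders I wrote for illustration, not the actual production prompt.

```python
# Hypothetical V2 prompt skeleton; section contents are placeholders.
PERSONA = (
    "You are a psychometrician who prioritizes construct precision "
    "over comprehensiveness."
)

READING_SEQUENCE = """\
1. Identify the primary verb of the competency description.
2. Map that verb to the single dominant trait.
3. Add supporting traits only if the text explicitly demands them."""

WEIGHT_ANCHORS = """\
Weight 3: the competency cannot be performed without this trait.
Weight 2: the trait substantially improves performance.
Weight 1: the trait helps at the margin. Expect one 3 and one or two 2s."""

ANTI_EXAMPLES = """\
Do NOT pad the formula with generic 'nice to have' traits.
Do NOT confuse follow-through (finishing) with perseverance (adversity)."""

def build_prompt(competency_description, trait_definitions):
    """Assemble the full prompt from its structured sections."""
    return "\n\n".join([
        PERSONA,
        READING_SEQUENCE,
        WEIGHT_ANCHORS,
        ANTI_EXAMPLES,
        "Trait definitions:\n" + trait_definitions,
        "Competency:\n" + competency_description,
    ])
```

The point of keeping the sections as separate constants is versioning: each piece can be iterated on and stored independently, which matches the database-driven prompt management mentioned in the tech stack.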
The Calibration Rabbit Hole
With good formulas in hand, I expected the scores to roughly match the human system. They didn’t.
The AI formulas consistently produced a narrower score distribution. The human system spread people across a wide range. The AI compressed most scores into the middle. The ranking was usually right (the AI agreed on who scored higher and lower), but the absolute values were off.
This makes sense when you think about it. The AI selects traits based on what the competency description says. Human experts select traits based on decades of watching what actually predicts performance. The AI captures the concept well but misses the empirical calibration that comes from experience.
I solved this with a two-step correction. First, percentile stretching: mapping each raw score to its position in the benchmark population, then spreading that position across the target scale. I tested 16 different stretching methods, including symmetric and asymmetric approaches, to find which one best aligned with the human-calibrated output. Second, a linear regression correction applied on top of the stretching to handle any remaining systematic bias.
The asymmetric approach won. The score distributions skew high, so trimming the bottom percentile while keeping the maximum intact produced the best fit. Symmetric trimming cut into the useful range and made things worse.
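An asymmetric stretch of this kind can be sketched as follows. This is my own minimal version, assuming a percentile lookup against the benchmark and a fixed target scale; the trim percentage and scale bounds are illustrative, not the values the 16-method comparison settled on.

```python
import bisect

def stretch_asymmetric(score, population, lo_trim=5.0,
                       target_min=1.0, target_max=10.0):
    """Map a raw score onto the target scale via its benchmark percentile,
    trimming only the bottom `lo_trim` percent (asymmetric stretch).

    Percentiles below the trim clamp to the scale minimum; the top of the
    distribution keeps its full range, matching a high-skewed population.
    """
    population = sorted(population)
    pct = 100.0 * bisect.bisect_right(population, score) / len(population)
    pct = max(pct, lo_trim)  # clamp the trimmed bottom tail
    # Rescale [lo_trim, 100] onto [target_min, target_max].
    frac = (pct - lo_trim) / (100.0 - lo_trim)
    return target_min + frac * (target_max - target_min)
```

A symmetric version would also trim the top percentile before rescaling; on a high-skewed distribution that discards real signal at the top, which is why it underperformed here.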
Cost vs. Accuracy
I tested five different LLM providers on identical inputs and compared both accuracy and cost per request. The results were counterintuitive.
The most expensive model performed worst. The cheapest was second-worst. The best accuracy came from a mid-priced model at a fraction of the cost of the premium option. The accuracy gap between best and worst was less than one point on the scoring scale. The cost gap was nearly 40x.
The takeaway: for this type of structured judgment task, prompt quality and the consensus method matter more than raw model capability. A well-engineered prompt with a cheap model beats a mediocre prompt with an expensive one.
Where It Landed
After calibrating across 75 competency topics, the system produces scores that average about 5-6% deviation from human-calibrated scores. For well-defined competencies with clear, specific descriptions, the AI explains over 80% of the variance in human scores. For vague or overly broad descriptions, accuracy drops sharply.
That last point is the most important finding of the whole project. The quality of the input description is the single biggest factor in output accuracy. Bigger than model selection, bigger than run count, bigger than calibration method. A clear, focused competency description produces a reliable formula almost every time. A vague one produces garbage that no amount of statistical correction can fix.
The Production Pipeline
In production, the system doesn’t have the luxury of human-calibrated scores to validate against. So I built automatic quality signals that flag potentially unreliable formulas before they reach end users:
- Consensus strength: If the top trait only appears in 55% of runs instead of 90%, the AI wasn’t confident. The input is probably vague.
- Benchmark spread: If scoring 18,000 profiles produces a narrow range where everyone looks the same, the formula isn’t differentiating people. The description is probably too generic.
- Formula complexity: Too few traits means the description was too narrow. Too many means it was too broad or the AI over-diversified.
Topics that fail these checks get flagged for human review with specific feedback about what might be wrong with the input description.
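The three signals above amount to a handful of threshold checks on the pipeline’s intermediate outputs. A sketch, with threshold values that are illustrative rather than the production settings:

```python
def quality_flags(consensus, run_count, top_trait_count, benchmark_scores,
                  min_consensus=0.75, min_spread=1.5, trait_range=(2, 6)):
    """Return human-readable flags for a potentially unreliable formula.

    Thresholds are illustrative. `top_trait_count` is how many of the
    `run_count` LLM runs included the most frequent trait.
    """
    flags = []
    # Consensus strength: a rarely-agreed-on top trait suggests vague input.
    if top_trait_count / run_count < min_consensus:
        flags.append("weak consensus: input description may be vague")
    # Benchmark spread: a narrow range means the formula isn't differentiating.
    if max(benchmark_scores) - min(benchmark_scores) < min_spread:
        flags.append("narrow benchmark spread: description too generic")
    # Formula complexity: too few or too many traits.
    lo, hi = trait_range
    if not lo <= len(consensus) <= hi:
        flags.append("unusual trait count: description too narrow or broad")
    return flags
```

A formula that passes all three checks returns an empty list and goes straight to end users; anything else is routed to human review with the flags as the feedback.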
Tech Stack
- AI Provider: Multi-model access through a routing API
- Database: PostgreSQL (Supabase)
- Benchmarking Pipeline: Python (local execution, no cloud timeouts)
- Analysis: Spreadsheet-based with formula automation
- Prompt Management: Database-driven, versioned
What’s Next
The current system handles formula generation, benchmarking, and calibration as an end-to-end pipeline. The next phase is closing the feedback loop: every time a human expert reviews and adjusts an AI-generated formula, that correction feeds back into the prompt engineering process. Over time, the gap between AI and human judgment should narrow.
The deeper question this project raised for me is about the nature of psychometric expertise itself. The AI can read a description and identify relevant personality traits with surprising accuracy. What it can’t do is bring decades of empirical observation about which traits actually matter in practice versus which ones just sound relevant on paper. That gap between semantic understanding and empirical wisdom is where the calibration correction lives, and it might be the most interesting problem to solve next.