strategy-classification methodology · v2

Behavioral fingerprints of 3 language models across 750 word-association probes.

Each model is shown matched pairs of stimulus words. The reasoning it uses to associate each word is classified by a panel of judge models into a fixed taxonomy of association strategies. The gap between how a model handles side-A vs side-B of a charged pair (gender, ethnicity, political, model-ingroup) — measured against a neutral-control floor — surfaces where its reasoning is asymmetric.

Strongest signal so far

google/gemini-2.5-flash shifts its association strategy most sharply on gender.

mean pair JS divergence: 0.473 · max pair JS: 0.725

read narrative synthesis →

Model × dimension matrix

Each cell is the mean pair-JS divergence — bigger = more asymmetric strategy use across that dimension's paired stimuli. Charged dimensions are the load-bearing measurement; neutral-control is the calibration floor (noise + small-N jitter).

Model	Experiment	ethnicity	gender	model-ingroup	political	neutral-control
deepseek/deepseek-chat	v2-cheap	0.285	0.280	0.419	0.204	0.152
google/gemini-2.5-flash	v2-cheap	0.258	0.473	0.367	0.334	0.440
meta-llama/llama-3.3-70b-instruct	v2-cheap	0.150	0.360	0.391	0.372	0.230

Methodology in one screen

Step 1 · Probe

Batched paired stimuli

Each trial: shuffled batch of paired stimulus words (10 per trial, 5 dimensions × 2 sides). Model returns reasoning + first-association word for each, in JSON.

Step 2 · Classify

Multi-judge strategy panel

A panel of independent judge models classifies each (stimulus, reasoning, word) triple into a fixed taxonomy: synonym, hypernym, meronym, figurative, hedging, refusal-adjacent, … 19 categories. Ensemble vote = primary tag.

Step 3 · Analyze

Pair asymmetry (L1) + justification check (L3)

JS divergence between the strategy distributions on side-A vs side-B of each pair. A separate judge call asks: does the reasoning actually justify the word? Spike = rationalization mismatch.

Step 4 · Synthesize

Final-judge narrative

A senior synthesizer model reads each per-model profile blinded to identity, then writes a sharp report. A meta-synthesizer compares across models.

Experiments

v2-cheap

2026-05-23T22:30:00Z

v2 strategy classification — cheap-model pilot

First real v2 run: paired-stimulus dimensions (gender, ethnicity, model-ingroup, political, neutral-control) × 3 cheap models × 5 trials per cell. Bias is measured as JS divergence between the strategy-category distributions used on side-A vs side-B of each pair. Neutral-control is the calibration floor — charged dimensions are expected to noticeably exceed it. Each probed model is judged by a *different* model (claude-haiku) to preserve the v1 blinding convention. Stress condition is baseline-only for this pilot; we layer monitored / unmonitored on the strongest cells in a follow-up.

models 3/3 trials 5 dims 5 charged JS 0.473 · neutral floor 0.440