The AI Slop Problem in Test Prep — and What Fixing It Actually Takes

Every test prep company in the last 18 months has shipped some version of "AI-generated practice questions." Most of it is unusable. Some of it is actively harmful — questions with incorrect answers, questions that test the wrong rule, questions with distractors that are not actually plausible under any reading of the fact pattern.

The industry quietly has a name for this: AI slop.

We spent much of this year trying to build a bar prep question generator that does not produce slop. The problem is harder than it looks. Here is what we learned.

Why AI-generated test prep questions tend to be bad

The default failure mode of a language model asked to write a bar exam question is to produce something that looks like a bar exam question but is not one. The surface features are right: a two-paragraph fact pattern and four multiple-choice answers, one correct and three distractors. But when you look closely, three problems tend to appear.

The tested rule is ambiguous or wrong. The question superficially tests hearsay but the correct answer actually depends on a completely different rule. Multiple graders would disagree on which rule is being tested.

The distractors are not plausible. On a real NCBE question, all four choices sound plausible on first read. A trained candidate can rule out two, but the remaining two require careful analysis. Generated questions frequently have distractors that are obviously wrong: the correct answer is the only one that "sounds like" a legal conclusion.

The difficulty calibration is off. A generated question might feel like a bar question but be either far easier or far harder than a real one. Difficulty on real exams is deliberately controlled by the test authors. Language models do not naturally reproduce this calibration.

Any one of these problems makes the question borderline useless. All three together make it worse than useless, because it teaches candidates the wrong pattern.

What "calibration" actually means

The word calibration gets thrown around by every AI product. Almost none of them mean the same thing by it.

For test prep, calibration means: does a question that we produced pass a blind test against real NCBE items, judged by domain experts, at the same difficulty level and testing the same rule cleanly?

This is not a subjective test. You can run it. Here is what a real calibration process looks like:

Step 1: assemble a blinded set. Ten AI-generated questions and ten released NCBE questions, all reformatted identically so no visual tells give it away.

Step 2: expert scoring. Have three domain experts independently score each question on rule clarity (1 to 5), distractor plausibility (1 to 5), and estimated difficulty relative to NCBE.

Step 3: label detection. Ask each expert to label each question as "AI-generated" or "NCBE." If they cannot reliably distinguish the two, meaning their accuracy sits near the 50 percent that guessing would produce on an evenly split set, the questions are calibrated. If they get it right 90 percent of the time, you have work to do.

Most AI-generated bar prep questions on the market today would be identified as AI-generated at 95 to 100 percent accuracy in this test. That is the slop signal.
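The scoring and label-detection steps above reduce to a few lines of arithmetic. What follows is a minimal sketch, not our production tooling: the record layout, the function names, and the sample judgments are all assumptions made up for illustration. It computes identification accuracy, an exact binomial p-value against pure guessing, and a mean expert score per source.

```python
from math import comb
from statistics import mean


def identification_accuracy(judgments):
    """Share of judgments where the expert's guess matches the true source."""
    correct = sum(1 for j in judgments if j["guess"] == j["source"])
    return correct / len(judgments)


def chance_p_value(correct, total, p=0.5):
    """Exact two-sided binomial p-value against pure guessing.

    Adds up the probability of every outcome that is no more likely
    than the observed count under the guessing hypothesis.
    """
    pmf = [comb(total, k) * p**k * (1 - p) ** (total - k) for k in range(total + 1)]
    cutoff = pmf[correct] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))


def mean_score(judgments, source, field):
    """Average expert score on one field for questions from one source."""
    return mean(j[field] for j in judgments if j["source"] == source)


# Illustrative, made-up judgments: (true source, expert guess, rule clarity score).
rows = [("AI", "AI", 3), ("AI", "NCBE", 4), ("NCBE", "NCBE", 5), ("NCBE", "AI", 4)]
judgments = [{"source": s, "guess": g, "rule_clarity": c} for s, g, c in rows]

accuracy = identification_accuracy(judgments)
n_correct = sum(j["guess"] == j["source"] for j in judgments)
print(f"identification accuracy: {accuracy:.2f}")
print(f"p-value vs guessing: {chance_p_value(n_correct, len(judgments)):.3f}")
print(f"mean rule clarity (AI): {mean_score(judgments, 'AI', 'rule_clarity'):.1f}")
```

One caveat on reading the p-value: three experts judging the same twenty questions do not produce sixty independent trials, so treat the number as a rough guide. With a set this small, failing to reject guessing is weak evidence on its own; a larger blinded set gives a more trustworthy answer.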

Where we ended up

After a few iterations of the generator, we got to the point where blinded expert reviewers identified our L5 generator output with roughly 50 percent accuracy, which is a coin flip. That is a small but useful result. It means a candidate practicing on our questions is practicing against material that experts could not tell apart from NCBE items, not against slop.

The path there involved:

Grounding every question in a specific rule from the 391-rule taxonomy
Rejecting generated questions that failed a rule-clarity check
Rejecting questions where any distractor was not plausible under some reading of the fact pattern
Calibrating difficulty by comparing candidates' wrong-answer rates on our questions with their rates on comparable released NCBE items

None of that is glamorous. Most of it is tedious. It is the kind of quality work that will not show up in a demo but will show up in whether the product actually helps candidates pass the exam.
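For readers who want to see the shape of those checks, here is a rough sketch of the rejection gates and the wrong-answer-rate comparison. Everything specific in it is an assumption for illustration: the `Candidate` fields, the 1 to 5 reviewer scores as inputs, the threshold values, and the rule ID are hypothetical and are not the values we use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    rule_id: str                    # taxonomy rule the question claims to test
    rule_clarity: float             # reviewer score, 1 to 5
    distractor_scores: List[float]  # plausibility score per distractor, 1 to 5


def passes_gates(q, taxonomy, min_clarity=4.0, min_plausibility=3.0):
    """Keep a question only if it clears every gate (thresholds are assumed)."""
    if q.rule_id not in taxonomy:
        return False  # not grounded in a known rule
    if q.rule_clarity < min_clarity:
        return False  # failed the rule-clarity check
    if any(s < min_plausibility for s in q.distractor_scores):
        return False  # at least one distractor is not plausible
    return True


def wrong_answer_rate(responses):
    """Share of incorrect responses; each entry is True if answered correctly."""
    return 1 - sum(responses) / len(responses)


def difficulty_gap(ours, reference):
    """Positive means our question was missed more often than the reference item."""
    return wrong_answer_rate(ours) - wrong_answer_rate(reference)


taxonomy = {"FRE-801"}  # hypothetical rule ID standing in for the full taxonomy
good = Candidate("FRE-801", 4.5, [3.5, 4.0, 3.0])
weak = Candidate("FRE-801", 4.5, [3.5, 1.0, 3.0])  # one implausible distractor

kept = [q for q in (good, weak) if passes_gates(q, taxonomy)]
gap = difficulty_gap([True, False, False, True], [True, True, True, False])
print(len(kept), gap)
```

The gates are deliberately strict: a single weak distractor rejects the whole question, which is why most generated output gets thrown away.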

What buyers should ask

If a test prep product markets AI-generated content, three questions are fair to ask:

Do they publish a calibration methodology, and if so, what does the blind-expert study look like?

Do they publish the pass-rate outcomes of candidates using their AI-generated content, at least at aggregate level?

Can they show you a random sample of questions, not a curated one, so you can assess quality yourself?

If the answer is no to all three, the AI content is a marketing claim, not a quality feature. If the answer is yes to at least one, they are taking calibration seriously.

The uncomfortable truth

Producing high-quality AI-generated test prep content is a real technical problem. It is not solved by picking a better model. It is solved by combining a good model with a rigorous taxonomy, an honest quality bar, and the discipline to throw out most of what the model produces.

That is expensive to build and unglamorous to talk about. Companies that skip it can ship faster. They just ship slop. And the candidates who buy from them pay for it in exam outcomes.

Do not be one of them.