Framework

Acceptance criteria tell engineering: "This feature is done when..."

For AI features, AC must include:

  • Functional: MVP works (recommendations generate, latency <200ms)
  • Accuracy: Meets the success threshold (e.g., ≥80% accuracy on a holdout set)
  • Safety: No unacceptable failure modes manifest (no hallucinations on edge cases)
  • Monitoring: Metrics dashboards live and alerting configured
  • Rollback: Kill-switch exists and tested

Examples

Bad AC: "Recommendations feature is complete"

Good AC:

  • Recommendations generate for 95% of users
  • Accuracy >= 75% on holdout test set
  • Latency < 300ms p99
  • Hallucination rate < 0.5%
  • Monitoring dashboard live with alerts
  • One-click disable flag tested end-to-end
  • User thumbs-down button present and logging

The Problem: Vague AI Acceptance Criteria

Traditional acceptance criteria work for deterministic features:

  • "Search results appear in <200ms" ✓ Measurable
  • "Users can upload files up to 100MB" ✓ Clear boundary
  • "Customer name displays on profile" ✓ Testable

The same structure fails for AI acceptance criteria:

  • "Recommendations are relevant" ✗ What's relevant? To whom?
  • "AI response is helpful" ✗ Helpful how? When is it not helpful?
  • "Model accuracy is good" ✗ Good = 60%? 80%? Compared to what?

Real-world cost: You ship. Engineering built what they thought "relevant" meant. Users get poor recommendations. Month 2: Rollback, rewrite, rebuild.


Framework: The 5-Layer Probabilistic AC Model

Layer 1: Functional AC

- Feature generates recommendations for 98%+ of users
- System handles the new-user edge case (shows template recommendations)
- Response latency <500ms p99

Layer 2: Accuracy AC

- Accuracy: ≥75% on holdout test set (10% of data, held out from training)
- Precision: ≥80% (when the model recommends X, users like X 80%+ of the time)
- Click-through rate: ≥12% (users find recommendations relevant)
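The Layer 2 numbers above can be computed with a few lines of code. A minimal sketch, assuming you have labeled holdout data; the toy labels and the 75% threshold below are illustrative:

```python
# Layer 2 sketch: accuracy and precision on a holdout set.
# y_true/y_pred are toy placeholders for a real holdout evaluation run.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Of the items the model recommended (predicted positive),
    what fraction did users actually like?"""
    recommended = [(t, p) for t, p in zip(y_true, y_pred) if p == positive]
    if not recommended:
        return 0.0
    return sum(t == positive for t, _ in recommended) / len(recommended)

# Toy holdout labels: 1 = user liked the recommendation
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

ACC_THRESHOLD = 0.75
acc = accuracy(y_true, y_pred)          # 6 of 8 correct = 0.75
prec = precision(y_true, y_pred)        # 4 of 5 recommendations liked = 0.8
passes_layer_2 = acc >= ACC_THRESHOLD
```

The point is that every Layer 2 criterion reduces to a number and a threshold, which makes pass/fail unambiguous.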

Layer 3: Safety AC

- Hallucination rate: <0.5% (never recommend non-existent product)
- Bias: Recommendations don't skew toward luxury items by more than 20% over their baseline share
- Staleness: <5% recommendations are out-of-stock
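The hallucination and staleness checks are straightforward set lookups. A minimal sketch, assuming you can enumerate the live catalog and in-stock items; the SKUs and batch below are toy data:

```python
# Layer 3 sketch: hallucination and staleness rates over a batch of
# recommendations. Catalog, stock, and thresholds are illustrative.

catalog = {"sku-1", "sku-2", "sku-3", "sku-4"}
in_stock = {"sku-1", "sku-2", "sku-4"}

# Recommendations produced for a batch of users (toy data)
recommended = ["sku-1", "sku-2", "sku-9", "sku-4", "sku-3", "sku-2"]

# A hallucination is a recommendation for a product that doesn't exist
hallucinated = [r for r in recommended if r not in catalog]
# A stale recommendation exists but is out of stock
stale = [r for r in recommended if r in catalog and r not in in_stock]

hallucination_rate = len(hallucinated) / len(recommended)
staleness_rate = len(stale) / len(recommended)

passes_safety = hallucination_rate < 0.005 and staleness_rate < 0.05
```

On this toy batch, "sku-9" is a hallucination and "sku-3" is stale, so the safety gate fails, which is exactly the behavior you want before launch.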

Layer 4: Monitoring AC

- Accuracy dashboard live (shows daily accuracy trend)
- Alert if accuracy drops >5% from baseline (model drift)
- User feedback: thumbs-down ratio tracked (goal: <15%)
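The drift alert in Layer 4 is a single comparison against a floor derived from the baseline. A minimal sketch; the baseline value is illustrative, and the 5% drop is read as relative:

```python
# Layer 4 sketch: fire an alert when daily accuracy drops more than 5%
# (relative) below the launch baseline. Values are illustrative.

BASELINE_ACCURACY = 0.78
DRIFT_TOLERANCE = 0.05  # alert on a >5% relative drop

def should_alert(daily_accuracy, baseline=BASELINE_ACCURACY,
                 tolerance=DRIFT_TOLERANCE):
    """Return True when accuracy has drifted below the alert floor."""
    floor = baseline * (1 - tolerance)  # 0.741 with the values above
    return daily_accuracy < floor

# Three days of accuracy readings: healthy, healthy, drifted
alerts = [should_alert(a) for a in (0.79, 0.77, 0.73)]
```

In production the same comparison would live in your monitoring system (e.g., as an alerting rule), not in application code; the sketch only shows that the criterion is mechanically checkable.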

Layer 5: Rollback AC

- Kill-switch exists and tested (feature flag)
- Disable in <5 minutes if needed
- Fallback: show template recommendations instead
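The kill-switch plus fallback in Layer 5 can be sketched as a feature flag guarding the model call. The dict-based flag store and function names here are stand-ins; in production the flag would come from your feature-flag service:

```python
# Layer 5 sketch: a feature flag that, when off, falls back to template
# recommendations. FLAGS is a stand-in for a real flag service.

FLAGS = {"ml_recommendations": True}

TEMPLATE_RECS = ["best-sellers", "new-arrivals", "staff-picks"]

def model_recommendations(user_id):
    # Stand-in for the real model call.
    return [f"personalized-{user_id}-{i}" for i in range(3)]

def recommendations(user_id):
    """Serve model output when the flag is on, templates otherwise."""
    if FLAGS["ml_recommendations"]:
        return model_recommendations(user_id)
    return TEMPLATE_RECS

live = recommendations("u42")              # model output
FLAGS["ml_recommendations"] = False        # one-click disable
fallback = recommendations("u42")          # template recommendations
```

"Kill-switch exists and tested" means this exact flip has been exercised end-to-end in staging, not just written.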

Real-World Example: Recommendations AC Done Right

Wrong AC (Vague):

"Recommendations are relevant"
"System is fast"
"Users like recommendations"

Result: 3% CTR (failure). Nobody defined success upfront.

Right AC (Specific):

Functional: 95%+ of users see recommendations
Accuracy: ≥12% CTR, measured on 10k holdout users
Safety: <1% hallucinations, ≥3 categories per user
Monitoring: CTR dashboard live day 1, alert if <10%
Rollback: Kill-switch disables the feature in <10 minutes

Result: 14% CTR (exceeds 12%). Launch clean. Month 2 drift detected by monitoring.

Anti-Pattern: "We'll Just Test It in Production"

The Problem: No AC defined. Launch, measure, realize too late it's broken.

The Fix:

  • Define AC in PRD (before engineering starts)
  • Test on holdout data (split data before training)
  • Staging tests + canary rollout (1% of users first)
  • Gate general availability on AC pass/fail
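Gating general availability on the canary is a mechanical check once the AC are numeric. A minimal sketch; the thresholds mirror the example AC in this article, and the metric values are toy placeholders:

```python
# Sketch: promote from 1% canary to GA only if the canary cohort
# meets the acceptance criteria. Thresholds and metrics are illustrative.

AC = {"min_ctr": 0.12, "min_accuracy": 0.75, "max_hallucination_rate": 0.005}

def gate_general_availability(canary_metrics, ac=AC):
    """Return True only when every canary metric meets its AC."""
    return (canary_metrics["ctr"] >= ac["min_ctr"]
            and canary_metrics["accuracy"] >= ac["min_accuracy"]
            and canary_metrics["hallucination_rate"]
                <= ac["max_hallucination_rate"])

promote = gate_general_availability(
    {"ctr": 0.14, "accuracy": 0.78, "hallucination_rate": 0.002})
hold = gate_general_availability(
    {"ctr": 0.09, "accuracy": 0.78, "hallucination_rate": 0.002})
```

The second cohort fails on CTR alone, so it stays at 1%; no judgment call, no debate.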

Actionable Steps

Step 1: Use the 5-Layer AC Template

Fill in values for your feature. Write 3-5 specific, measurable criteria per layer.

Step 2: Reserve a Golden Dataset

Before training: set aside 10% of data (holdout set). After training, test accuracy against it. If accuracy meets AC, ship. If not, retrain.
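The holdout split is one shuffle and one slice. A minimal sketch using the standard library; real pipelines often use scikit-learn's train_test_split instead, and the fixed seed here just keeps the split reproducible:

```python
# Sketch: reserve 10% of records as the golden/holdout set before training.
import random

def split_holdout(records, holdout_frac=0.10, seed=42):
    """Shuffle once, then carve off the last holdout_frac as the golden set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))   # toy stand-in for labeled examples
train, golden = split_holdout(records)
```

The golden set must be carved off before any training run touches the data; testing on examples the model has seen inflates accuracy and defeats the gate.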

Step 3: Create Acceptance Test Plan

Write tests for Functional + Accuracy + Safety + Monitoring + Rollback. Gate: All 5 pass → Ship.
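The whole test plan collapses to one boolean per layer. A minimal sketch, assuming your staging or canary run reports the measured values below (all toy placeholders):

```python
# Sketch: the 5-layer acceptance gate. Ship only when all five pass.
# Measured values are illustrative placeholders for a staging/canary run.

measured = {
    "coverage": 0.97,             # Functional: users receiving recommendations
    "p99_latency_ms": 310,        # Functional: latency
    "accuracy": 0.78,             # Accuracy: holdout accuracy
    "hallucination_rate": 0.003,  # Safety
    "dashboard_live": True,       # Monitoring
    "kill_switch_tested": True,   # Rollback
}

checks = {
    "functional": measured["coverage"] >= 0.95
                  and measured["p99_latency_ms"] < 500,
    "accuracy": measured["accuracy"] >= 0.75,
    "safety": measured["hallucination_rate"] < 0.005,
    "monitoring": measured["dashboard_live"],
    "rollback": measured["kill_switch_tested"],
}

ship = all(checks.values())
failed = [layer for layer, ok in checks.items() if not ok]
```

When a layer fails, the failed list names it, which turns "is it done?" into a concrete follow-up instead of an argument.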

Step 4: Lock AC Before Engineering Starts

Don't modify AC mid-build. Lock it upfront. Forces you to think about what success actually means.

Step 5: Measure Against AC Post-Launch

Week 1: Check all 5 layers. If any fail, investigate. If all pass, keep running.


PMSynapse Connection

Probabilistic AC for AI is complex—you need to balance Functional + Accuracy + Safety + Monitoring + Rollback. PMSynapse's Spec Scorer auto-generates AI AC templates, catching gaps: "You defined accuracy but forgot monitoring. Here's the template." By integrating with your PRD, PMSynapse ensures your AI features ship with complete, testable acceptance criteria.


Key Takeaways

  • AI AC differs from deterministic AC. You need 5 layers: Functional + Accuracy + Safety + Monitoring + Rollback.

  • Ambiguous AC ships ambiguous features. "Relevant" isn't AC. "CTR ≥12%" is.

  • Test on holdout data before shipping. Reserve 10% of data, test accuracy on it, gate launch on pass/fail.

  • Safety isn't optional. Hallucinations, bias, staleness—define them as unacceptable upfront.

  • Monitoring + Rollback are part of AC. If you ship without dashboards or kill-switches, you failed AC.

Probabilistic Acceptance Criteria: Specifying Non-Deterministic AI Behavior

Article Type

SPOKE Article — Links back to pillar: /prd-writing-masterclass-ai-era

Target Word Count

2,500–3,500 words

Writing Guidance

Reference PRD Section 6.2.2 PAC directly. Cover: accuracy ranges, safety constraints, edge case coverage, and golden dataset creation. Provide before/after examples of vague vs. specific AI acceptance criteria. Soft-pitch: PMSynapse generates adversarial test cases from plain-English acceptance criteria.

Required Structure

1. The Hook (Empathy & Pain)

Open with an extremely relatable, specific scenario from PM life that connects to this topic. Use one of the PRD personas (Priya the Junior PM, Marcus the Mid-Level PM, Anika the VP of Product, or Raj the Freelance PM) where appropriate.

2. The Trap (Why Standard Advice Fails)

Explain why generic advice or common frameworks don't address the real complexity of this problem. Be specific about what breaks down in practice.

3. The Mental Model Shift

Introduce a new framework, perspective, or reframe that changes how the reader thinks about this topic. This should be genuinely insightful, not recycled advice.

4. Actionable Steps (3-5)

Provide concrete actions the reader can take tomorrow morning. Each step should be specific enough to execute without further research.

5. The Prodinja Angle (Soft-Pitch)

Conclude with how PMSynapse's autonomous PM Shadow capability connects to this topic. Keep it natural — no hard sell.

6. Key Takeaways

3-5 bullet points summarizing the article's core insights.

Internal Linking Requirements

  • Link to parent pillar: /blog/prd-writing-masterclass-ai-era
  • Link to 3-5 related spoke articles within the same pillar cluster
  • Link to at least 1 article from a different pillar cluster for cross-pollination

SEO Checklist

  • Primary keyword appears in H1, first paragraph, and at least 2 H2s
  • Meta title under 60 characters
  • Meta description under 155 characters and includes primary keyword
  • At least 3 external citations/references
  • All images have descriptive alt text
  • Table or framework visual included