Marcus, a mid-level PM at a fintech startup, is caught in a "Benchmark War."

His engineering team wants to use the latest open-source model (Llama-4-Ultra-Light) because it’s "cool," it’s cheaper, and it has a high score on the "MMLU" benchmark. His CEO, however, just read a tweet about Claude 3.6 Opus and wants to use that because it’s "the smartest model in the world."

Marcus is looking at his PRD for a "Loan Approval Risk Analyzer." He needs the model to be extremely accurate (High Stakes), follow 50 different compliance rules (Instruction Following), and respond in under 2 seconds (Latency).

"Which model wins?" the CEO asks.

Marcus’s answer: "Neither. Benchmarks are for researchers. Models are for Products. We don't need a model that can write poetry and pass the Bar Exam. We need a model that can follow our specific risk-logic without hallucinating."

In 2026, model selection isn't about which model is "best." It’s about which model is Best for the Job.


1. The Benchmark Trap: Why MMLU Doesn't Matter to Your PRD

Most PMs start their model selection by looking at leaderboard charts. They see that GPT-X is 2 points higher than Claude-Y on a global reasoning task and assume GPT-X is "better."

The problem? Benchmarks are broad and academic. Your product is narrow and specific. A model that is great at "General Reasoning" might be terrible at "Formatting JSON for a Fintech API."

The PM's Job: Stop looking at benchmarks. Start looking at Product-Specific Evals. (See our Guide to Eval Frameworks).


2. The 4-Dimension Model Selection Framework

When selecting a model, you should ignore the "Hype" and evaluate against these four dimensions:

Dimension 1: Capability (The "Brain" Tier)

  • High-Tier (GPT-4o, Claude 3.5 Opus): For complex reasoning, multi-step planning, and multi-modal tasks. Use these when the cost of an error is higher than the cost of inference.
  • Mid-Tier (GPT-4o-mini, Claude Haiku): For classification, summarization, and basic extraction. The "workhorse" tier.
  • Specialized/Small (Llama 8B, Mistral): For single-task highly-controlled outputs or local/on-prem requirements.
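
The tier split above can be expressed as a simple routing table. This is a minimal sketch, not a real SDK call; the task names and model identifiers are illustrative assumptions, and the right defaults depend on your own eval results.

```python
# Hypothetical tier map: task types and model names are illustrative only.
TIER_MAP = {
    "complex_reasoning": "gpt-4o",         # High-Tier: error cost > inference cost
    "classification":    "gpt-4o-mini",    # Mid-Tier "workhorse"
    "summarization":     "gpt-4o-mini",
    "pii_extraction":    "llama-8b-local", # Small/on-prem for residency needs
}

def pick_model(task_type: str) -> str:
    """Return the tier assigned to a task, defaulting to the safest tier."""
    return TIER_MAP.get(task_type, "gpt-4o")
```

The useful property is that tier decisions live in data, not scattered through application code, so a PM-level policy change is a one-line edit.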

Dimension 2: Reliability (Instruction Following)

How well does the model obey your system prompt? Does it follow formatting constraints (JSON/Markdown) 100% of the time, or does it "deviate" when the input gets long?
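
Reliability is measurable. A minimal sketch of a formatting-compliance check, assuming your system prompt demands pure JSON output: run the prompt N times and count how often the raw output actually parses.

```python
import json

def format_reliability(outputs: list[str]) -> float:
    """Fraction of raw model outputs that parse as valid JSON."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass  # preamble like "Sure! Here is the JSON:" fails the check
    return ok / len(outputs) if outputs else 0.0

# Illustrative outputs: the second one "deviates" with a chatty preamble.
samples = [
    '{"risk": "low"}',
    'Sure! Here is the JSON: {"risk": "high"}',
    '{"risk": "medium"}',
]
```

A model scoring 98% here may still be unusable for a loan-approval pipeline; decide your threshold before you benchmark, not after.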

Dimension 3: Operational Constraints (Latency & Cost)

  • Latency: How quickly does the first token arrive (time-to-first-token)? For a hard budget like Marcus's 2-second SLA, this matters as much as total generation time.
  • Cost: What is the cost per 1M tokens? (See AI Trade-offs Guide).
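
Cost-per-request is worth computing per feature, not per model. A quick sketch of the arithmetic; the prices are placeholder assumptions, so check your provider's current pricing page.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one request given per-1M-token prices."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Illustrative: 2,000 input tokens, 500 output tokens at $2.50/$10.00 per 1M.
# 0.005 + 0.005 = $0.01 per request -> $10,000 per 1M requests.
estimate = cost_per_request(2000, 500, 2.50, 10.00)
```

Multiply by expected daily volume before the model decision meeting; a one-cent request looks cheap until the traffic forecast arrives.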

Dimension 4: Compliance & Residency

Can you send your data to a third-party API (OpenAI/Anthropic)? Or do you need to host the model yourself in your own VPC (Llama/Mistral) for regulatory reasons?


3. The "Tiered" Selection Protocol

Don't just pick one model. Pick a Model Strategy.

Step 1: The "Gold" Baseline

Build your prototype with the most powerful model available (the "Gold Model"). This tells you if the feature is even possible. If Claude Opus can’t do it, no other model will.

Step 2: The "Minimum Viable Model" (MVM) Search

Once it works on the Gold Model, try to "down-model" to a cheaper/faster model. If GPT-4o-mini produces the same result on your Gold Set, ship with the mini.
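
The "down-model" decision reduces to a pass rate against the Gold Set. A sketch under the assumption that both models' answers have been collected into dicts keyed by test case; the cases and answers are invented for illustration.

```python
def mvm_pass_rate(gold_answers: dict, candidate_answers: dict) -> float:
    """Share of Gold Set cases where the cheaper model matches the Gold Model."""
    matches = sum(1 for case, answer in gold_answers.items()
                  if candidate_answers.get(case) == answer)
    return matches / len(gold_answers)

# Illustrative: the mini model diverges on one of three risk decisions.
gold = {"case1": "approve", "case2": "deny", "case3": "review"}
mini = {"case1": "approve", "case2": "deny", "case3": "approve"}
```

For a high-stakes feature like loan approval, the bar is effectively 100%: one mismatched "review" vs. "approve" is a compliance incident, not a rounding error.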

Step 3: The Redundancy Plan

In 2026, model providers go down, change their behavior (model drift), or get acquired. Your product should be Model Agnostic. Design your prompt and code so you can swap GPT for Claude in 10 minutes if the API starts acting up.
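
"Swappable in 10 minutes" usually means the rest of the codebase only ever talks to one interface. A minimal sketch using structural typing; the provider classes here are stubs (real implementations would wrap each vendor's SDK), and the class and method names are assumptions, not any vendor's actual API.

```python
from typing import Protocol

class ChatProvider(Protocol):
    def complete(self, system: str, user: str) -> str: ...

class OpenAIProvider:
    def complete(self, system: str, user: str) -> str:
        # Real code would call the OpenAI SDK here; stubbed for the sketch.
        return "openai-response"

class AnthropicProvider:
    def complete(self, system: str, user: str) -> str:
        # Real code would call the Anthropic SDK here; stubbed for the sketch.
        return "anthropic-response"

def get_provider(name: str) -> ChatProvider:
    """Swap vendors with a config change, not a code rewrite."""
    return {"openai": OpenAIProvider, "anthropic": AnthropicProvider}[name]()
```

When the swap is a one-line config change, "the API is acting up" becomes an ops decision instead of an emergency refactor.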


4. Proprietary vs. Open-Source: The PM Context

  • Proprietary (OpenAI, Anthropic):
    • Pros: Higher quality, zero infrastructure overhead, "ahead" of the curve.
    • Cons: Vendor lock-in, data privacy concerns, API rate limits.
  • Open-Source (Llama, Falcon, Mistral):
    • Pros: Full data control, customizable via fine-tuning, no per-token API fees (if hosted yourself).
    • Cons: Significant engineering overhead to host and scale, often 6-12 months "behind" the proprietary leaders in reasoning.

5. Model Drift: The Silent Product Failure

Models aren't static. OpenAI and Anthropic "update" their models constantly. A prompt that worked perfectly on Monday might start returning garbage on Friday because of a behavioral update at the provider level.

The Strategy: Version your models. If you’re using gpt-4o, pin it to the specific date version (e.g., gpt-4o-2024-08-06) and don't update until you've re-run your entire Eval suite.
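
In code, pinning is the difference between a dated snapshot identifier and a floating alias. A sketch of the convention; the payload shape mirrors common chat-completion APIs but is illustrative, and the snapshot date is the one the text uses as an example.

```python
# Pin the dated snapshot, never the floating alias.
MODEL_PINNED = "gpt-4o-2024-08-06"   # bump only after re-running the full eval suite
MODEL_FLOATING = "gpt-4o"            # avoid: the provider can change behavior underneath you

def request_payload(prompt: str) -> dict:
    """Build a request that always names the pinned snapshot."""
    return {
        "model": MODEL_PINNED,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Keeping the pinned identifier in one constant also gives you a single diff line to review when someone proposes an upgrade.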


6. The Prodinja Angle: Automated Model Routing

Selecting the right model is the core of PRD Engine 2 at PMSynapse. Our Model Selector automatically benchmarks your specific PRD requirements against current model providers.

It predicts the latency and cost of each model path and identifies which parts of your feature can be "routed" to a cheaper model and which parts require the "Gold Tier" brain. It moves you from "Picking a model" to "Designing an Inference Strategy."

For the foundational guide on managing the teams that will argue for their own favorite models, see the Complete Guide to Stakeholder Management and the AI PM Pillar Guide.


Key Takeaways

  • Ignore General Benchmarks: Use specific evals tailored to your product data.
  • Start Big, Scale Small: Prove feasibility with the "Gold Tier" then optimize for cost.
  • Stay Agnostic: Don't marry a single model provider. Build for swappability.
  • Pin Your Versions: Avoid "Model Drift" by using dated model identifiers.
  • The PM Owns the Decision: The engineer owns the hosting; the PM owns the Trade-off between Capability, Cost, and Compliance.
