When Infrastructure Becomes a Crisis

Scenario: Data Pipeline Outage

Day 1: Database crashes under a 2M events/day load. Analytics offline for 4 hours.

CEO: "How long until it's fixed?" Engineering: "We don't know. The current pipeline wasn't designed for this scale. We need to rebuild it."

CEO: "How long?" Engineering: "3 months."

CEO: "We don't have 3 months. Can't we just add more servers?" Engineering: "No. The architecture doesn't scale vertically."

Result:

  • 3 months of crisis mode
  • 30% of engineering time spent firefighting
  • No new features (all focus on stability)
  • Customer churn from unreliable product

If infrastructure had been spec'd upfront:

  • 6 months prior: Scaling limit identified before it's hit
  • PRD created: "Pipeline must scale to 100M events/day"
  • Planned 3-month rebuild during low-impact quarter
  • Rebuild completed; scales to 10x current load
  • Crisis prevented

Framework: Infrastructure PRD Components

Business Case (Why Now?)

PROBLEM:
- Current system: 500K events/day capacity, already at 2M events/day
- Risk: Outage at current scale = analytics offline, customer-visible
- Growth forecast: 100% YoY = 4M events/day next year (8x over capacity)
- Technical debt: Unmaintainable codebase (60% of oncall time debugging)

COST OF DOING NOTHING:
- Immediate: 3-4 oncall incidents/quarter (engineers unavailable for features)
- 6 months: System hits wall at 3M events/day, manual partitioning workaround
- 12 months: Full system failure, no recovery path (data loss risk)
- Team impact: 30-40% of engineering cycles spent firefighting instead of features

COST OF REBUILDING NOW:
- 3 months, 3 engineers, ~$300K loaded cost
- Zero new feature development during rebuild
- But: 3-year runway (100M events/day capacity), 50% oncall reduction

ROI:
- 3 months cost vs. 12 months of crisis firefighting + customer churn
- Payback: Within the first 6 months after rebuild (50% less on-call = 1 engineer freed)
- 3-year value: $2M+ in prevented outages + engineering velocity
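The payback claim above can be sanity-checked with back-of-the-envelope math. The rebuild cost is the PRD's figure; the monthly incident-cost saving is a placeholder assumption for illustration, not a number from this document:

```python
# Back-of-the-envelope payback check for the rebuild.
# rebuild_cost comes from the PRD; prevented_incident_savings is a
# placeholder assumption, not a real financial figure.

rebuild_cost = 300_000                   # 3 engineers x 3 months, loaded
eng_month_cost = rebuild_cost / (3 * 3)  # ~$33K per engineer-month

freed_engineer_savings = eng_month_cost  # 50% less on-call ~= 1 engineer freed
prevented_incident_savings = 17_000      # assumed $/month from fewer incidents

monthly_savings = freed_engineer_savings + prevented_incident_savings
payback_months = rebuild_cost / monthly_savings

print(f"Payback: ~{payback_months:.0f} months")  # ~6 months with these inputs
```

Swapping in your own loaded costs and incident estimates is the point of the exercise: the structure of the calculation matters more than the placeholder values.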

Non-Functional Requirements (NFRs)

SCALABILITY:
- Throughput: 1M events/sec sustained (100x current peak)
- Storage: 10TB/year data retention (compressed)
- Horizontal scaling: Add nodes without redeployment

PERFORMANCE:
- Event ingestion latency: <100ms P99 (event received → in database)
- Query latency: <1 second for analytics queries (currently 5-10 min)
- Dashboard refresh: <2 seconds (currently 10-30 sec with waits)

RELIABILITY:
- Uptime SLA: 99.9% (43 minutes downtime/month)
- Recovery time objective (RTO): <30 minutes for any component failure
- Recovery point objective (RPO): <5 minutes (max data loss if disaster)
- Failover: Automatic (no manual intervention for common failures)
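The downtime figure next to the uptime SLA is simple arithmetic: an uptime target leaves (1 - SLA) of the period as the downtime budget. A minimal sketch:

```python
# Downtime budget implied by an uptime SLA: (1 - SLA) of the period.

def downtime_budget_minutes(sla: float, days: float = 30) -> float:
    """Allowed downtime in minutes over a period of `days` days."""
    return days * 24 * 60 * (1 - sla)

for sla in (0.999, 0.9995, 0.9999):
    print(f"{sla:.2%} uptime -> {downtime_budget_minutes(sla):.1f} min/month")
# 99.90% uptime -> 43.2 min/month, matching the ~43 minutes quoted above
```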

SECURITY & COMPLIANCE:
- Encryption: TLS in transit, AES-256 at rest
- Audit logs: All access to data logged for compliance
- Retention: Enforce data deletion policies automatically
- Access control: Team-based, no individual data access

OPERATIONAL:
- Monitoring: Real-time dashboards for throughput, latency, errors
- Alerting: Page on-call engineer for SLA breaches
- Cost: <$X per billion events (benchmark: $0.50/1B)
- Maintenance window: No more than 2 hours/quarter

Architecture Decision Records (ADRs)

DECISION 1: Use Event Streaming (Kafka) vs. Database Queue

OPTION A: Kafka-based (Recommended)
+ Can handle 1M events/sec (battle-tested at scale)
+ Replayable (can rebuild views if needed)
+ Decouples producers from consumers
- New operational overhead (broker and cluster management; ZooKeeper or KRaft)
- $X/month cost

OPTION B: Database Queue (PostgreSQL/RabbitMQ)
+ Simpler (one less system to operate)
- Doesn't scale past 100K events/sec
- Single point of failure: a database crash risks data loss

DECISION: Kafka
Reason: Scalability requirement (1M/sec) rules out database queue.
Operational overhead is acceptable given scale.

---

DECISION 2: OLAP Warehouse (ClickHouse vs. Redshift vs. Snowflake)

OPTION A: ClickHouse (Recommended)
+ Best cost per TB for analytics queries
+ Self-hosted (no vendor lock-in)
- Operationally complex (requires expertise)
- No managed option in our cloud (AWS)

OPTION B: Redshift
+ Managed AWS service (easier ops)
+ Good query performance
- Higher cost ($X vs. $Y/month)
- Less suitable for event-scale analytics

OPTION C: Snowflake
+ Easiest operations (fully managed)
- Highest cost (~3x ClickHouse)
- Overkill for our scale

DECISION: ClickHouse
Reason: Cost efficiency is key. We'll hire/train on ClickHouse operations.
Playbook: Hire one contractor for 6-month ops handoff.

---

DECISION 3: Batch vs. Real-Time Analytics

OPTION A: Real-time (Kafka Streams / Flink)
+ Dashboards update instantly
- High operational complexity
- Overkill for most use cases (dashboard refresh every 30 sec is fine)

OPTION B: Batch (Spark / dbt)
+ Simpler (cron jobs + SQL)
+ Easier debugging
- Latency: one batch interval between an event and its visibility in analytics
+ Good enough for most use cases

DECISION: Batch (1-minute refresh)
Reason: Operational simplicity wins. 1-minute latency acceptable for analytics.
Real-time can be added later if the product needs <30-second refresh.

Measurable Success Criteria

PERFORMANCE:
✓ Event ingestion latency: <100ms P99 (measure with timestamp comparison)
✓ Query latency: <1 second for daily dashboard queries (measure with APM)
✓ Dashboard refresh: Complete <2 seconds (measure user-facing timing)
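As one way to implement the "timestamp comparison" measurement above, P99 ingestion latency can be computed from (received, stored) timestamp pairs. The field layout and nearest-rank percentile method here are illustrative assumptions, not this pipeline's actual instrumentation:

```python
# Sketch: P99 ingestion latency from (event_received, event_stored)
# timestamps in epoch seconds. Nearest-rank percentile, for illustration.
import math

def p99_latency_ms(samples: list[tuple[float, float]]) -> float:
    """P99 of (stored - received) latencies, reported in milliseconds."""
    latencies = sorted((stored - received) * 1000 for received, stored in samples)
    idx = math.ceil(0.99 * len(latencies)) - 1  # nearest-rank index
    return latencies[idx]

# 99 events stored 50ms after receipt, plus one slow outlier at 500ms
samples = [(float(t), t + 0.05) for t in range(99)] + [(99.0, 99.5)]
print(f"P99 ingestion latency: {p99_latency_ms(samples):.0f} ms")
```

In production the same comparison would run over a monitoring window (for example, the last 5 minutes of events) rather than a fixed list.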

RELIABILITY:
✓ Uptime: >99.9% (tracked with synthetic monitoring)
✓ Mean time to recovery: <30 minutes for component failure (tracked in incidents)
✓ Data loss incidents: Zero (track backup recovery tests)

OPERATIONAL:
✓ On-call: <2 incidents/quarter (down from 3-4/quarter)
✓ Mean time to incident resolution: <1 hour (currently 4-6 hours)
✓ Unplanned maintenance: <30 min/quarter (currently 5-10 hours)

COST:
✓ Cost per billion events: <$X (benchmark: $0.50)
✓ Cloud spend: <$Y/month (down from $Z/month with optimization)

TEAM VELOCITY:
✓ Feature team velocity: +20% (freed from firefighting)
✓ On-call load: 50% reduction in pages (fewer incidents)
✓ Engineer satisfaction: NPS improved (less toil)

Risks & Mitigation

RISK 1: Rebuild takes longer than 3 months → Further delays

Probability: Medium (new technology, learning curve)
Impact: High (pushes the problem further down the road)

Mitigation:
- Plan for 4-month estimate (add 33% buffer)
- Parallel testing: Run old + new pipeline, compare results
- Rollback plan: Keep old pipeline running for 30 days post-launch
- Success gate: Must handle 2x current load before switching

---

RISK 2: New system is operationally complex → High on-call load

Probability: Medium (ClickHouse + Kafka have steep learning curve)
Impact: High (defeats purpose of reducing on-call)

Mitigation:
- Hire ClickHouse expert for 6-month ops transfer
- Document all runbooks before launch
- Post-launch: 4-week "intensive ops" period with 2x staffing
- Playbook: If ops load stays high, evaluate managed alternatives (e.g., ClickHouse Cloud)

---

RISK 3: Old pipeline failure during migration → Data loss

Probability: Low (with proper safeguards)
Impact: Critical (customer trust destroyed)

Mitigation:
- Parallel run: Events streamed to both old + new for 2 weeks
- Validation: New pipeline produces identical results to old
- Audit: Daily data volume comparison (old vs. new)
- Rollback: If mismatch detected, revert to old immediately
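The daily volume audit in the mitigation list can be as simple as comparing per-day event counts from both pipelines and flagging anything outside a small tolerance. A sketch; the tolerance, dates, and counts are illustrative assumptions:

```python
# Sketch of the daily audit: flag days where old/new pipeline event
# counts diverge by more than a small tolerance (0.1% here, an assumption).

def audit_daily_counts(old: dict[str, int], new: dict[str, int],
                       tolerance: float = 0.001) -> list[str]:
    """Return the days whose old/new counts differ by more than `tolerance`."""
    mismatched = []
    for day in sorted(set(old) | set(new)):
        a, b = old.get(day, 0), new.get(day, 0)
        if a == 0 or abs(a - b) / a > tolerance:
            mismatched.append(day)
    return mismatched

old_counts = {"2024-01-01": 2_000_000, "2024-01-02": 2_050_000}
new_counts = {"2024-01-01": 2_000_000, "2024-01-02": 2_047_000}
print(audit_daily_counts(old_counts, new_counts))  # day 2 is off by ~0.15%
```

Any day the audit flags becomes the rollback trigger described above.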

Rollout Plan

PHASE 1: Development & Testing (Week 1-8)

- Build Kafka cluster (1 week)
- Build ClickHouse cluster (1 week)
- Develop ETL pipelines (3 weeks)
- Testing & debugging (2 weeks)
- Exit criteria: New pipeline processes 10M test events with results identical to the old pipeline's

PHASE 2: Shadow Traffic (Week 9-10)

- Duplicate all production events to new pipeline
- Monitor for discrepancies (data validation, latency, errors)
- Keep old pipeline as source-of-truth
- Exit criteria: 1 week of shadow traffic with 100% data match

PHASE 3: Cutover (Week 11)

- Thursday night: Switch analytics queries to new pipeline
- Friday morning: Monitor dashboards (all team members on standby)
- Saturday-Sunday: Run in parallel; ready to rollback
- Monday morning: Demote old pipeline to standby (retire after the 30-day rollback window)
- Exit criteria: No customer reports of incorrect analytics

PHASE 4: Optimization & Cleanup (Week 12+)

- Performance tuning (query optimization, compression)
- Remove old pipeline infrastructure
- Document all runbooks and procedures
- Train on-call team on new system

Real-World Example: Data Infrastructure PRD

Bad Infrastructure PRD (No Structure)

PROJECT: Rebuild Data Pipeline

SCOPE:
Redesign our data pipeline to be faster and more scalable.

TIMELINE:
3 months

TEAM:
3 engineers

STATUS:
In progress

Problems:

  • No business case (why now?)
  • No requirements (what's "faster"?)
  • No success metrics (how do we know it's done?)
  • No risks (what could go wrong?)
  • No rollout plan (how do we migrate without breaking things?)

Result:

  • 6 months later: "Still 20% done"
  • Budget blown; nobody knows if it's on track
  • Business can't make decisions

Good Infrastructure PRD (Structured)

PROJECT: Data Pipeline Rebuild (Kafka + ClickHouse)

BUSINESS CASE:

Current pipeline: 500K events/day design capacity, already at 2M. At next year's forecast of 4M, it fails outright.
Cost of inaction: System outages; 30-40% of eng time firefighting.
Cost of rebuild: $300K (3 eng × 3 months).
ROI: 6 months (payback from freed engineering time + prevented incidents).

REQUIREMENTS:

Throughput: 1M events/sec (100x scaling)
Query latency: <1 second (currently 5-10 min)
Uptime: 99.9%
Cost: <$0.50 per 1B events

ARCHITECTURE:

Kafka cluster: 3 brokers, replication factor 3 per partition
ClickHouse warehouse: 3 nodes, 2TB storage per node
ETL: dbt models for SQL transforms; Spark for heavy transformations
Monitoring: Datadog dashboards + PagerDuty alerts

RISKS & MITIGATIONS:

Risk: Migration loses data
Mitigation: Shadow run 1 week; 100% validation before cutover

Risk: New system is complex to operate
Mitigation: Hire ClickHouse expert for 6-month handoff; 4-week intensive ops period post-launch

TIMELINE:

Week 1-8: Development + testing
Week 9-10: Shadow traffic + validation
Week 11: Cutover (rollback-ready)
Week 12+: Optimization + cleanup

SUCCESS METRICS:

[Dashboard queries]
Query latency: <1 second (currently 5-10 min) ✓
Event ingestion: <100ms P99 ✓
Uptime: 99.9%+ ✓

[Operations metrics]
On-call incidents: <2/quarter (down from 3-4/quarter) ✓
Mean time to resolution: <1 hour (down from 4-6 hours) ✓

[Team impact]
Feature velocity: +20% (freed engineering) ✓
Engineer satisfaction: Less toil, higher NPS ✓

APPROVAL:

- Engineering Lead: Reviewed architecture, approved
- Security: Reviewed encryption & compliance, approved
- Finance: Reviewed budget, approved ($300K)

Target Launch: Week 12

Anti-Pattern: "Infrastructure Will Be Fine"

The Problem:

  • PM ignores infrastructure work as "not revenue-generating"
  • Teams patch + patch + patch
  • 2 years later: Complete crisis + multi-month rebuild
  • Crisis rebuild disrupts all product development

The Fix:

  • Treat infrastructure as a product
  • PRD it like any other feature
  • Plan 10-20% of engineering cycles for infrastructure
  • Prevent crisis-driven situations

Actionable Steps

Step 1: Identify Infrastructure Debt

Audit: List all systems that are creaking

- Data pipeline: Crashes at 2M events/day (at risk now)
- API server: Single point of failure (one outage takes the product down)
- Search index: Queries take 10 seconds (acceptable but slow)
- Deployment process: Manual, error-prone (operational overhead)

Risk ranking: High (pipeline), High (API), Medium (search), Low (deployment)

Step 2: Write Business Case

INFRASTRUCTURE PROJECT: Data Pipeline Rebuild

Business problem: Current system is far past its designed capacity (2M events/day on a 500K/day design) and will hit a hard wall within 6 months.
Business impact if nothing: Outages, analytics offline, $X in customer churn.
Business benefit of fixing now: 3-year runway, 50% less on-call load, +20% feature velocity.
Cost: $300K (3 eng × 3 months)
ROI: Payback in 6 months

Recommendation: Prioritize now (do not defer)

Step 3: Define Requirements (Not Implementation)

Requirements (WHAT):
- Scale to 100M events/day
- Query latency <1 second
- 99.9% uptime SLA

Implementation (HOW, engineer's choice):
- Option A: Kafka + ClickHouse
- Option B: Redshift
- Option C: Snowflake

PM specifies WHAT; engineers propose HOW.

Step 4: Build Success Metrics

Before:
- Event ingestion: 5-10 minutes
- Query latency: 5-10 minutes
- Uptime: 98%

After:
- Event ingestion: <100ms
- Query latency: <1 second
- Uptime: 99.9%

Measure:
- Dashboard analytics (automated tracking)
- On-call incidents (reduced from 3-4/quarter to <2/quarter)
- Feature velocity (freed engineering cycles)

Step 5: Plan Rollout Carefully

Shadow run: New system processes data alongside old, validated
Cutover: Switch analytics to new with rollback ready
Cleanup: Old system runs on standby for 30 days, then is retired
Documentation: Runbooks written before launch
Training: On-call trained on new system before going live

PMSynapse Connection

Infrastructure PRDs are often sloppy because they're "invisible." PMSynapse's Platform Template auto-generates infrastructure PRDs: "Project name, business case, requirements, success metrics, risk mitigation, rollout plan." By standardizing infrastructure specs, PMSynapse ensures platform work gets the rigor it deserves.


Key Takeaways

  • Infrastructure is a product. The "user" is your engineering team, but it still deserves a PRD.

  • Business case is everything. If you can't articulate why infrastructure work matters to the business, defer it.

  • Specify requirements, not implementation. "Scale to 1M events/sec" (requirement) vs. "Use Kafka" (implementation choice).

  • Parallel testing prevents disasters. Shadow run, validate, then cutover with rollback ready.

  • Infrastructure prevents crises. Plan it upfront; don't react to outages.

Constraints:

  • Must support schema evolution (new event types, new attributes)
  • Must integrate with existing auth layer
  • Must be operationalizable by current ops team (no new specialties)

### 3. Create a Success Metric

Infrastructure projects need clear "done" criteria:

Success Metrics:

  • Pipeline handles 100M events/day with 0 loss
  • Query latency: 50th percentile <1 second, 99th <5 seconds
  • Deployment cadence: Stay at 1x/day (no slowdown from new architecture)
  • Operational cost: Within 10% of current spend

Risks:

  • If query latency exceeds 5 seconds, product performance suffers
  • If deployment slows, development velocity regresses
  • If cost exceeds current spend by more than 15%, ROI becomes questionable

### 4. Define Phasing & Rollout

Phase 1 (Weeks 1-6): Build in parallel, validate on subset of data

  • Build new pipeline; ingest 5% of events
  • Run validation: results match old pipeline
  • Run load tests: sustained handling of 10M events/day

Phase 2 (Weeks 7-8): Gradual cutover

  • Switch 10% of events to new pipeline
  • Monitor for 1 week; validate correctness
  • Increase to 50%, then 100%
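One common way to implement the 10% → 50% → 100% ramp above is deterministic hash-based routing, so the same key (for example, a customer ID) always lands on the same pipeline while the percentage increases. A sketch, not this team's actual mechanism:

```python
# Sketch: route a stable rollout_pct% of keys to the new pipeline by
# hashing the key, so each key's routing is deterministic across events.
import hashlib

def routes_to_new_pipeline(key: str, rollout_pct: int) -> bool:
    """True if `key` falls in the first `rollout_pct` of 100 hash buckets."""
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct

# At 10%, roughly one key in ten lands on the new pipeline
keys = [f"customer-{i}" for i in range(10_000)]
share = sum(routes_to_new_pipeline(k, 10) for k in keys) / len(keys)
print(f"~{share:.0%} of keys routed to new pipeline")
```

A nice property for rollback: lowering `rollout_pct` from 50 back to 10 returns exactly the later cohort to the old pipeline while the original 10% stays on the new one.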

Phase 3 (Week 9): Decommission old pipeline

  • Ensure nothing still queries old pipeline
  • Archive old data (compliance hold)
  • Deallocate old infrastructure (cost savings)

Rollback plan: If correctness issues are found, we can shift traffic back to the old pipeline within 2 hours.


### 5. Document Constraints & Assumptions

Key Assumptions:

  • Development team stays at 4 engineers (no headcount add)
  • We can run old + new pipeline simultaneously for 2 weeks
  • Dependencies on auth, logging, monitoring APIs stable

Tech Debt Cleared:

  • Removes 30% of "why is this slow?" support tickets
  • Unblocks: multi-region support, new analytics features

## Key Takeaways

- **Infrastructure is not CapEx bureaucracy.** Treat it like any other investment: clear ROI, success metrics, risk management.
- **Invisible work becomes visible when it breaks.** Doing PRDs for infrastructure prevents surprises.
- **Bad infrastructure slows your competitive velocity.** The startup with good data pipelines ships features 2x faster than one with technical debt.

# Writing PRDs for Platform & Infrastructure: The Invisible Product

## Article Type

**SPOKE Article** — Links back to pillar: /prd-writing-masterclass-ai-era

## Target Word Count

2,500–3,500 words

## Writing Guidance

Cover: how to define internal 'users,' developer experience as a product requirement, platform metrics, and making the business case for work that's invisible to customers. Soft-pitch: PMSynapse helps PMs translate platform investment into business impact narratives.

## Required Structure

### 1. The Hook (Empathy & Pain)

Open with an extremely relatable, specific scenario from PM life that connects to this topic. Use one of the PRD personas (Priya the Junior PM, Marcus the Mid-Level PM, Anika the VP of Product, or Raj the Freelance PM) where appropriate.

### 2. The Trap (Why Standard Advice Fails)

Explain why generic advice or common frameworks don't address the real complexity of this problem. Be specific about what breaks down in practice.

### 3. The Mental Model Shift

Introduce a new framework, perspective, or reframe that changes how the reader thinks about this topic. This should be genuinely insightful, not recycled advice.

### 4. Actionable Steps (3-5)

Provide concrete actions the reader can take tomorrow morning. Each step should be specific enough to execute without further research.

### 5. The Prodinja Angle (Soft-Pitch)

Conclude with how PMSynapse's autonomous PM Shadow capability connects to this topic. Keep it natural — no hard sell.

### 6. Key Takeaways

3-5 bullet points summarizing the article's core insights.

## Internal Linking Requirements

- Link to parent pillar: /blog/prd-writing-masterclass-ai-era
- Link to 3-5 related spoke articles within the same pillar cluster
- Link to at least 1 article from a different pillar cluster for cross-pollination

## SEO Checklist

- [ ] Primary keyword appears in H1, first paragraph, and at least 2 H2s
- [ ] Meta title under 60 characters
- [ ] Meta description under 155 characters and includes primary keyword
- [ ] At least 3 external citations/references
- [ ] All images have descriptive alt text
- [ ] Table or framework visual included