When Infrastructure Becomes a Crisis

Scenario: Data Pipeline Outage

Day 1: Database crashes under a 2M events/day load. Analytics offline for 4 hours.

CEO: "How long until it's fixed?" Engineering: "We don't know. The current pipeline wasn't designed for this scale. We need to rebuild it."

CEO: "How long?" Engineering: "3 months."

CEO: "We don't have 3 months. Can't we just add more servers?" Engineering: "No. The architecture doesn't scale vertically."

Result:

  • 3 months of crisis mode
  • 30% of engineering time spent firefighting
  • No new features (all focus on stability)
  • Customer churn from unreliable product

If infrastructure had been spec'd upfront:

  • 6 months prior: Scaling limit identified before it's hit
  • PRD created: "Pipeline must scale to 100M events/day"
  • Planned 3-month rebuild during low-impact quarter
  • Rebuild completed; scales to 10x current load
  • Crisis prevented

Framework: Infrastructure PRD Components

Business Case (Why Now?)

PROBLEM:
- Current system: 500K events/day capacity, already at 2M events/day
- Risk: Outage at current scale = analytics offline, customer-visible
- Growth forecast: 100% YoY = 4M events/day next year (8x over capacity)
- Technical debt: Unmaintainable codebase (60% of oncall time debugging)

COST OF DOING NOTHING:
- Immediate: 3-4 oncall incidents/quarter (engineers unavailable for features)
- 6 months: System hits wall at 3M events/day, manual partitioning workaround
- 12 months: Full system failure, no recovery path (data loss risk)
- Team impact: 30-40% of engineering cycles spent firefighting instead of features

COST OF REBUILDING NOW:
- 3 months, 3 engineers, ~$300K loaded cost
- Zero new feature development during rebuild
- But: 3-year runway (100M events/day capacity), 50% oncall reduction

ROI:
- 3 months cost vs. 12 months of crisis firefighting + customer churn
- Payback: Within the first 6 months after rebuild (50% less on-call = 1 engineer freed)
- 3-year value: $2M+ in prevented outages + engineering velocity
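The payback claim above can be sanity-checked with back-of-the-envelope math. The rebuild cost is the PRD's figure; the monthly incident-cost saving is a placeholder assumption for illustration, not a number from this document:

```python
# Back-of-the-envelope payback check for the rebuild.
# rebuild_cost comes from the PRD; prevented_incident_savings is a
# placeholder assumption, not a real financial figure.

rebuild_cost = 300_000                   # 3 engineers x 3 months, loaded
eng_month_cost = rebuild_cost / (3 * 3)  # ~$33K per engineer-month

freed_engineer_savings = eng_month_cost  # 50% less on-call ~= 1 engineer freed
prevented_incident_savings = 17_000      # assumed $/month from fewer incidents

monthly_savings = freed_engineer_savings + prevented_incident_savings
payback_months = rebuild_cost / monthly_savings

print(f"Payback: ~{payback_months:.0f} months")  # ~6 months with these inputs
```

Swapping in your own loaded costs and incident estimates is the point of the exercise: the structure of the calculation matters more than the placeholder values.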

Non-Functional Requirements (NFRs)

SCALABILITY:
- Throughput: 1M events/sec sustained (100x current peak)
- Storage: 10TB/year data retention (compressed)
- Horizontal scaling: Add nodes without redeployment

PERFORMANCE:
- Event ingestion latency: <100ms P99 (event received → in database)
- Query latency: <1 second for analytics queries (currently 5-10 min)
- Dashboard refresh: <2 seconds (currently 10-30 sec with waits)

RELIABILITY:
- Uptime SLA: 99.9% (43 minutes downtime/month)
- Recovery time objective (RTO): <30 minutes for any component failure
- Recovery point objective (RPO): <5 minutes (max data loss if disaster)
- Failover: Automatic (no manual intervention for common failures)
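The downtime figure next to the uptime SLA is simple arithmetic: an uptime target leaves (1 - SLA) of the period as the downtime budget. A minimal sketch:

```python
# Downtime budget implied by an uptime SLA: (1 - SLA) of the period.

def downtime_budget_minutes(sla: float, days: float = 30) -> float:
    """Allowed downtime in minutes over a period of `days` days."""
    return days * 24 * 60 * (1 - sla)

for sla in (0.999, 0.9995, 0.9999):
    print(f"{sla:.2%} uptime -> {downtime_budget_minutes(sla):.1f} min/month")
# 99.90% uptime -> 43.2 min/month, matching the ~43 minutes quoted above
```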

SECURITY & COMPLIANCE:
- Encryption: TLS in transit, AES-256 at rest
- Audit logs: All access to data logged for compliance
- Retention: Enforce data deletion policies automatically
- Access control: Team-based, no individual data access

OPERATIONAL:
- Monitoring: Real-time dashboards for throughput, latency, errors
- Alerting: Page on-call engineer for SLA breaches
- Cost: <$X per billion events (benchmark: $0.50/1B)
- Maintenance window: No more than 2 hours/quarter

Architecture Decision Records (ADRs)

DECISION 1: Use Event Streaming (Kafka) vs. Database Queue

OPTION A: Kafka-based (Recommended)
+ Can handle 1M events/sec (battle-tested at scale)
+ Replayable (can rebuild views if needed)
+ Decouples producers from consumers
- New operational overhead (broker and cluster management; ZooKeeper or KRaft)
- $X/month cost

OPTION B: Database Queue (PostgreSQL/RabbitMQ)
+ Simpler (one less system to operate)
- Doesn't scale past 100K events/sec
- Single point of failure: a database crash risks data loss

DECISION: Kafka
Reason: Scalability requirement (1M/sec) rules out database queue.
Operational overhead is acceptable given scale.

---

DECISION 2: OLAP Warehouse (ClickHouse vs. Redshift vs. Snowflake)

OPTION A: ClickHouse (Recommended)
+ Best cost per TB for analytics queries
+ Self-hosted (no vendor lock-in)
- Operationally complex (requires expertise)
- No managed option in our cloud (AWS)

OPTION B: Redshift
+ Managed AWS service (easier ops)
+ Good query performance
- Higher cost ($X vs. $Y/month)
- Less suitable for event-scale analytics

OPTION C: Snowflake
+ Easiest operations (fully managed)
- Highest cost (~3x ClickHouse)
- Overkill for our scale

DECISION: ClickHouse
Reason: Cost efficiency is key. We'll hire/train on ClickHouse operations.
Playbook: Hire one contractor for 6-month ops handoff.

---

DECISION 3: Batch vs. Real-Time Analytics

OPTION A: Real-time (Kafka Streams / Flink)
+ Dashboards update instantly
- High operational complexity
- Overkill for most use cases (dashboard refresh every 30 sec is fine)

OPTION B: Batch (Spark / dbt)
+ Simpler (cron jobs + SQL)
+ Easier debugging
- Latency: one batch interval between an event and its visibility in analytics
+ Good enough for most use cases

DECISION: Batch (1-minute refresh)
Reason: Operational simplicity wins. 1-minute latency acceptable for analytics.
Real-time can be added later if the product needs <30-second refresh.

Measurable Success Criteria

PERFORMANCE:
✓ Event ingestion latency: <100ms P99 (measure with timestamp comparison)
✓ Query latency: <1 second for daily dashboard queries (measure with APM)
✓ Dashboard refresh: Complete <2 seconds (measure user-facing timing)
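As one way to implement the "timestamp comparison" measurement above, P99 ingestion latency can be computed from (received, stored) timestamp pairs. The field layout and nearest-rank percentile method here are illustrative assumptions, not this pipeline's actual instrumentation:

```python
# Sketch: P99 ingestion latency from (event_received, event_stored)
# timestamps in epoch seconds. Nearest-rank percentile, for illustration.
import math

def p99_latency_ms(samples: list[tuple[float, float]]) -> float:
    """P99 of (stored - received) latencies, reported in milliseconds."""
    latencies = sorted((stored - received) * 1000 for received, stored in samples)
    idx = math.ceil(0.99 * len(latencies)) - 1  # nearest-rank index
    return latencies[idx]

# 99 events stored 50ms after receipt, plus one slow outlier at 500ms
samples = [(float(t), t + 0.05) for t in range(99)] + [(99.0, 99.5)]
print(f"P99 ingestion latency: {p99_latency_ms(samples):.0f} ms")
```

In production the same comparison would run over a monitoring window (for example, the last 5 minutes of events) rather than a fixed list.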

RELIABILITY:
✓ Uptime: >99.9% (tracked with synthetic monitoring)
✓ Mean time to recovery: <30 minutes for component failure (tracked in incidents)
✓ Data loss incidents: Zero (track backup recovery tests)

OPERATIONAL:
✓ On-call: <2 incidents/quarter (down from 3-4/quarter)
✓ Mean time to incident resolution: <1 hour (currently 4-6 hours)
✓ Unplanned maintenance: <30 min/quarter (currently 5-10 hours)

COST:
✓ Cost per billion events: <$X (benchmark: $0.50)
✓ Cloud spend: <$Y/month (down from $Z/month with optimization)

TEAM VELOCITY:
✓ Feature team velocity: +20% (freed from firefighting)
✓ On-call load: 50% reduction in pages (fewer incidents)
✓ Engineer satisfaction: NPS improved (less toil)

Risks & Mitigation

RISK 1: Rebuild takes longer than 3 months → Further delays

Probability: Medium (new technology, learning curve)
Impact: High (pushes the problem further down the road)

Mitigation:
- Plan for 4-month estimate (add 33% buffer)
- Parallel testing: Run old + new pipeline, compare results
- Rollback plan: Keep old pipeline running for 30 days post-launch
- Success gate: Must handle 2x current load before switching

---

RISK 2: New system is operationally complex → High on-call load

Probability: Medium (ClickHouse + Kafka have steep learning curve)
Impact: High (defeats purpose of reducing on-call)

Mitigation:
- Hire ClickHouse expert for 6-month ops transfer
- Document all runbooks before launch
- Post-launch: 4-week "intensive ops" period with 2x staffing
- Playbook: If ops load stays high, evaluate managed alternatives (e.g., ClickHouse Cloud)

---

RISK 3: Old pipeline failure during migration → Data loss

Probability: Low (with proper safeguards)
Impact: Critical (customer trust destroyed)

Mitigation:
- Parallel run: Events streamed to both old + new for 2 weeks
- Validation: New pipeline produces identical results to old
- Audit: Daily data volume comparison (old vs. new)
- Rollback: If mismatch detected, revert to old immediately
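The daily volume audit in the mitigation list can be as simple as comparing per-day event counts from both pipelines and flagging anything outside a small tolerance. A sketch; the tolerance, dates, and counts are illustrative assumptions:

```python
# Sketch of the daily audit: flag days where old/new pipeline event
# counts diverge by more than a small tolerance (0.1% here, an assumption).

def audit_daily_counts(old: dict[str, int], new: dict[str, int],
                       tolerance: float = 0.001) -> list[str]:
    """Return the days whose old/new counts differ by more than `tolerance`."""
    mismatched = []
    for day in sorted(set(old) | set(new)):
        a, b = old.get(day, 0), new.get(day, 0)
        if a == 0 or abs(a - b) / a > tolerance:
            mismatched.append(day)
    return mismatched

old_counts = {"2024-01-01": 2_000_000, "2024-01-02": 2_050_000}
new_counts = {"2024-01-01": 2_000_000, "2024-01-02": 2_047_000}
print(audit_daily_counts(old_counts, new_counts))  # day 2 is off by ~0.15%
```

Any day the audit flags becomes the rollback trigger described above.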

Rollout Plan

PHASE 1: Development & Testing (Week 1-8)

- Build Kafka cluster (1 week)
- Build ClickHouse cluster (1 week)
- Develop ETL pipelines (3 weeks)
- Testing & debugging (2 weeks)
- Exit criteria: New pipeline processes 10M test events with results identical to the old pipeline's

PHASE 2: Shadow Traffic (Week 9-10)

- Duplicate all production events to new pipeline
- Monitor for discrepancies (data validation, latency, errors)
- Keep old pipeline as source-of-truth
- Exit criteria: 1 week of shadow traffic with 100% data match

PHASE 3: Cutover (Week 11)

- Thursday night: Switch analytics queries to new pipeline
- Friday morning: Monitor dashboards (all team members on standby)
- Saturday-Sunday: Run in parallel; ready to rollback
- Monday morning: Demote old pipeline to standby (retire after the 30-day rollback window)
- Exit criteria: No customer reports of incorrect analytics

PHASE 4: Optimization & Cleanup (Week 12+)

- Performance tuning (query optimization, compression)
- Remove old pipeline infrastructure
- Document all runbooks and procedures
- Train on-call team on new system

Real-World Example: Data Infrastructure PRD

Bad Infrastructure PRD (No Structure)

PROJECT: Rebuild Data Pipeline

SCOPE:
Redesign our data pipeline to be faster and more scalable.

TIMELINE:
3 months

TEAM:
3 engineers

STATUS:
In progress

Problems:

  • No business case (why now?)
  • No requirements (what's "faster"?)
  • No success metrics (how do we know it's done?)
  • No risks (what could go wrong?)
  • No rollout plan (how do we migrate without breaking things?)

Result:

  • 6 months later: "Still 20% done"
  • Budget blown; nobody knows if it's on track
  • Business can't make decisions

Good Infrastructure PRD (Structured)

PROJECT: Data Pipeline Rebuild (Kafka + ClickHouse)

BUSINESS CASE:

Current pipeline: 500K events/day design capacity, already at 2M. At next year's forecast of 4M, it fails outright.
Cost of inaction: System outages; 30-40% of eng time firefighting.
Cost of rebuild: $300K (3 eng × 3 months).
ROI: 6 months (payback from freed engineering time + prevented incidents).

REQUIREMENTS:

Throughput: 1M events/sec (100x scaling)
Query latency: <1 second (currently 5-10 min)
Uptime: 99.9%
Cost: <$0.50 per 1B events

ARCHITECTURE:

Kafka cluster: 3 brokers, replication factor 3 per partition
ClickHouse warehouse: 3 nodes, 2TB storage per node
ETL: dbt models for SQL transforms; Spark for heavy transformations
Monitoring: Datadog dashboards + PagerDuty alerts

RISKS & MITIGATIONS:

Risk: Migration loses data
Mitigation: Shadow run 1 week; 100% validation before cutover

Risk: New system is complex to operate
Mitigation: Hire ClickHouse expert for 6-month handoff; 4-week intensive ops period post-launch

TIMELINE:

Week 1-8: Development + testing
Week 9-10: Shadow traffic + validation
Week 11: Cutover (rollback-ready)
Week 12+: Optimization + cleanup

SUCCESS METRICS:

[Dashboard queries]
Query latency: <1 second (currently 5-10 min) ✓
Event ingestion: <100ms P99 ✓
Uptime: 99.9%+ ✓

[Operations metrics]
On-call incidents: <2/quarter (down from 3-4/quarter) ✓
Mean time to resolution: <1 hour (down from 4-6 hours) ✓

[Team impact]
Feature velocity: +20% (freed engineering) ✓
Engineer satisfaction: Less toil, higher NPS ✓

APPROVAL:

- Engineering Lead: Reviewed architecture, approved
- Security: Reviewed encryption & compliance, approved
- Finance: Reviewed budget, approved ($300K)

Target Launch: Week 12

Anti-Pattern: "Infrastructure Will Be Fine"

The Problem:

  • PM ignores infrastructure work as "not revenue-generating"
  • Teams patch + patch + patch
  • 2 years later: Complete crisis + multi-month rebuild
  • Crisis rebuild disrupts all product development

The Fix:

  • Treat infrastructure as a product
  • PRD it like any other feature
  • Plan 10-20% of engineering cycles for infrastructure
  • Prevent crisis-driven situations

Actionable Steps

Step 1: Identify Infrastructure Debt

Audit: List all systems that are creaking

- Data pipeline: Crashes at 2M events/day (at risk now)
- API server: Single point of failure (one outage takes the product down)
- Search index: Queries take 10 seconds (acceptable but slow)
- Deployment process: Manual, error-prone (operational overhead)

Risk ranking: High (pipeline), High (API), Medium (search), Low (deployment)

Step 2: Write Business Case

INFRASTRUCTURE PROJECT: Data Pipeline Rebuild

Business problem: Current system is far past its designed capacity (2M events/day on a 500K/day design) and will hit a hard wall within 6 months.
Business impact if nothing: Outages, analytics offline, $X in customer churn.
Business benefit of fixing now: 3-year runway, 50% less on-call load, +20% feature velocity.
Cost: $300K (3 eng × 3 months)
ROI: Payback in 6 months

Recommendation: Prioritize now (do not defer)

Step 3: Define Requirements (Not Implementation)

Requirements (WHAT):
- Scale to 100M events/day
- Query latency <1 second
- 99.9% uptime SLA

Implementation (HOW, engineer's choice):
- Option A: Kafka + ClickHouse
- Option B: Redshift
- Option C: Snowflake

PM specifies WHAT; engineers propose HOW.

Step 4: Build Success Metrics

Before:
- Event ingestion: 5-10 minutes
- Query latency: 5-10 minutes
- Uptime: 98%

After:
- Event ingestion: <100ms
- Query latency: <1 second
- Uptime: 99.9%

Measure:
- Dashboard analytics (automated tracking)
- On-call incidents (reduced from 3-4/quarter to <2/quarter)
- Feature velocity (freed engineering cycles)

Step 5: Plan Rollout Carefully

Shadow run: New system processes data alongside old, validated
Cutover: Switch analytics to new with rollback ready
Cleanup: Old system runs on standby for 30 days, then is retired
Documentation: Runbooks written before launch
Training: On-call trained on new system before going live

PMSynapse Connection

Infrastructure PRDs are often sloppy because they're "invisible." PMSynapse's Platform Template auto-generates infrastructure PRDs: "Project name, business case, requirements, success metrics, risk mitigation, rollout plan." By standardizing infrastructure specs, PMSynapse ensures platform work gets the rigor it deserves.


Key Takeaways

  • Infrastructure is a product. The "user" is your engineering team, but it still deserves a PRD.

  • Business case is everything. If you can't articulate why infrastructure work matters to the business, defer it.

  • Specify requirements, not implementation. "Scale to 1M events/sec" (requirement) vs. "Use Kafka" (implementation choice).

  • Parallel testing prevents disasters. Shadow run, validate, then cutover with rollback ready.

  • Infrastructure prevents crises. Plan it upfront; don't react to outages.

Constraints:

  • Must support schema evolution (new event types, new attributes)
  • Must integrate with existing auth layer
  • Must be operationalizable by current ops team (no new specialties)

### 3. Create a Success Metric

Infrastructure projects need clear "done" criteria:

Success Metrics:

  • Pipeline handles 100M events/day with 0 loss
  • Query latency: 50th percentile <1 second, 99th <5 seconds
  • Deployment cadence: Stay at 1x/day (no slowdown from new architecture)
  • Operational cost: Within 10% of current spend

Risks:

  • If query latency exceeds 5 seconds, product performance suffers
  • If deployment slows, development velocity regresses
  • If cost exceeds current spend by more than 15%, ROI becomes questionable

### 4. Define Phasing & Rollout

Phase 1 (Weeks 1-6): Build in parallel, validate on subset of data

  • Build new pipeline; ingest 5% of events
  • Run validation: results match old pipeline
  • Run load tests: sustained handling of 10M events/day

Phase 2 (Weeks 7-8): Gradual cutover

  • Switch 10% of events to new pipeline
  • Monitor for 1 week; validate correctness
  • Increase to 50%, then 100%
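One common way to implement the 10% → 50% → 100% ramp above is deterministic hash-based routing, so the same key (for example, a customer ID) always lands on the same pipeline while the percentage increases. A sketch, not this team's actual mechanism:

```python
# Sketch: route a stable rollout_pct% of keys to the new pipeline by
# hashing the key, so each key's routing is deterministic across events.
import hashlib

def routes_to_new_pipeline(key: str, rollout_pct: int) -> bool:
    """True if `key` falls in the first `rollout_pct` of 100 hash buckets."""
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct

# At 10%, roughly one key in ten lands on the new pipeline
keys = [f"customer-{i}" for i in range(10_000)]
share = sum(routes_to_new_pipeline(k, 10) for k in keys) / len(keys)
print(f"~{share:.0%} of keys routed to new pipeline")
```

A nice property for rollback: lowering `rollout_pct` from 50 back to 10 returns exactly the later cohort to the old pipeline while the original 10% stays on the new one.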

Phase 3 (Week 9): Decommission old pipeline

  • Ensure nothing still queries old pipeline
  • Archive old data (compliance hold)
  • Deallocate old infrastructure (cost savings)

Rollback plan: If correctness issues are found, we can shift traffic back to the old pipeline within 2 hours.


### 5. Document Constraints & Assumptions

Key Assumptions:

  • Development team stays at 4 engineers (no headcount add)
  • We can run old + new pipeline simultaneously for 2 weeks
  • Dependencies on auth, logging, monitoring APIs stable

Tech Debt Cleared:

  • Removes 30% of "why is this slow?" support tickets
  • Unblocks: multi-region support, new analytics features

## Key Takeaways

- **Infrastructure is not CapEx bureaucracy.** Treat it like any other investment: clear ROI, success metrics, risk management.
- **Invisible work becomes visible when it breaks.** Doing PRDs for infrastructure prevents surprises.
- **Bad infrastructure slows your competitive velocity.** The startup with good data pipelines ships features 2x faster than one with technical debt.

# Writing PRDs for Platform & Infrastructure: The Invisible Product

## Article Type

**SPOKE Article** — Links back to pillar: /prd-writing-masterclass-ai-era

## Target Word Count

2,500–3,500 words

## Writing Guidance

Cover: how to define internal 'users,' developer experience as a product requirement, platform metrics, and making the business case for work that's invisible to customers. Soft-pitch: PMSynapse helps PMs translate platform investment into business impact narratives.

## Required Structure

### 1. The Hook (Empathy & Pain)

Open with an extremely relatable, specific scenario from PM life that connects to this topic. Use one of the PRD personas (Priya the Junior PM, Marcus the Mid-Level PM, Anika the VP of Product, or Raj the Freelance PM) where appropriate.

### 2. The Trap (Why Standard Advice Fails)

Explain why generic advice or common frameworks don't address the real complexity of this problem. Be specific about what breaks down in practice.

### 3. The Mental Model Shift

Introduce a new framework, perspective, or reframe that changes how the reader thinks about this topic. This should be genuinely insightful, not recycled advice.

### 4. Actionable Steps (3-5)

Provide concrete actions the reader can take tomorrow morning. Each step should be specific enough to execute without further research.

### 5. The Prodinja Angle (Soft-Pitch)

Conclude with how PMSynapse's autonomous PM Shadow capability connects to this topic. Keep it natural — no hard sell.

### 6. Key Takeaways

3-5 bullet points summarizing the article's core insights.

## Internal Linking Requirements

- Link to parent pillar: /blog/prd-writing-masterclass-ai-era
- Link to 3-5 related spoke articles within the same pillar cluster
- Link to at least 1 article from a different pillar cluster for cross-pollination

## SEO Checklist

- [ ] Primary keyword appears in H1, first paragraph, and at least 2 H2s
- [ ] Meta title under 60 characters
- [ ] Meta description under 155 characters and includes primary keyword
- [ ] At least 3 external citations/references
- [ ] All images have descriptive alt text
- [ ] Table or framework visual included