When Infrastructure Becomes a Crisis
Scenario: Data Pipeline Outage
Day 1: Database crashes from 2M events/day. Analytics offline for 4 hours.
CEO: "How long until it's fixed?" Engineering: "We don't know. The current pipeline wasn't designed for this scale. We need to rebuild it."
CEO: "How long?" Engineering: "3 months."
CEO: "We don't have 3 months. Can't we just add more servers?" Engineering: "No. The architecture doesn't scale vertically."
Result:
- 3 months of crisis mode
- 30% of engineering time spent firefighting
- No new features (all focus on stability)
- Customer churn from unreliable product
If infrastructure had been spec'd upfront:
- 6 months prior: Scaling limit identified
- PRD created: "Pipeline must scale to 100M events/day"
- Planned 3-month rebuild during low-impact quarter
- Rebuild completed; scales to 10x current load
- Crisis prevented
Framework: Infrastructure PRD Components
Business Case (Why Now?)
PROBLEM:
- Current system: designed for 500K events/day, already handling 2M events/day (4x design load)
- Risk: Outage at current scale = analytics offline, customer-visible
- Growth forecast: 100% YoY = 4M events/day next year (8x over capacity)
- Technical debt: Unmaintainable codebase (60% of oncall time debugging)
COST OF DOING NOTHING:
- Immediate: 3-4 oncall incidents/quarter (engineers unavailable for features)
- 6 months: System hits wall at 3M events/day, manual partitioning workaround
- 12 months: Full system failure, no recovery path (data loss risk)
- Team impact: 30-40% of engineering cycles spent firefighting instead of features
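The runway implied by the forecast above can be sanity-checked in a few lines. This is a sketch assuming smooth month-over-month compounding of the 100% YoY growth; real traffic is spikier and hits limits sooner:

```python
import math

def months_until(current_per_day, limit_per_day, yoy_growth=2.0):
    """Months until daily volume reaches limit_per_day, assuming
    smooth compounding at yoy_growth per year."""
    monthly_factor = yoy_growth ** (1 / 12)
    return math.log(limit_per_day / current_per_day, monthly_factor)

# 2M events/day today, 100% YoY growth:
print(round(months_until(2e6, 3e6)))  # → 7  (the 3M/day "wall")
print(round(months_until(2e6, 4e6)))  # → 12 (next year's 4M/day forecast)
```

Smooth compounding puts the 3M/day wall about 7 months out; bursty real-world traffic will hit it earlier, which is why the PRD treats it as a 6-month risk.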
COST OF REBUILDING NOW:
- 3 months, 3 engineers, ~$300K loaded cost
- Zero new feature development during rebuild
- But: 3-year runway (100M events/day capacity), 50% oncall reduction
ROI:
- 3 months cost vs. 12 months of crisis firefighting + customer churn
- Payback: within 6 months of rebuild (50% less on-call = 1 engineer freed)
- 3-year value: $2M+ in prevented outages + engineering velocity
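The payback claim works as a hedged back-of-envelope. The engineer cost falls out of the PRD's own figures; the prevented-churn number below is an illustrative placeholder, not a sourced value:

```python
# Back-of-envelope payback using the PRD's own figures.
ENG_MONTH_COST = 300_000 / (3 * 3)   # $300K / (3 engineers x 3 months)
rebuild_cost = 300_000

freed_oncall = 1 * ENG_MONTH_COST    # 50% on-call cut ≈ 1 engineer freed
avoided_churn_per_month = 20_000     # placeholder assumption, not sourced

monthly_savings = freed_oncall + avoided_churn_per_month
payback_months = rebuild_cost / monthly_savings
print(round(payback_months, 1))      # → 5.6
```

With even a modest churn-avoidance assumption, the rebuild pays for itself inside the first 6 months, consistent with the ROI line above.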
Non-Functional Requirements (NFRs)
SCALABILITY:
- Throughput: 100M events/day sustained (50x current load), with burst headroom
- Storage: 10TB/year data retention (compressed)
- Horizontal scaling: Add nodes without redeployment
PERFORMANCE:
- Event ingestion latency: <100ms P99 (event received → in database)
- Query latency: <1 second for analytics queries (currently 5-10 min)
- Dashboard refresh: <2 seconds (currently 10-30 sec with waits)
RELIABILITY:
- Uptime SLA: 99.9% (43 minutes downtime/month)
- Recovery time objective (RTO): <30 minutes for any component failure
- Recovery point objective (RPO): <5 minutes (max data loss if disaster)
- Failover: Automatic (no manual intervention for common failures)
SECURITY & COMPLIANCE:
- Encryption: TLS in transit, AES-256 at rest
- Audit logs: All access to data logged for compliance
- Retention: Enforce data deletion policies automatically
- Access control: Team-based, no individual data access
OPERATIONAL:
- Monitoring: Real-time dashboards for throughput, latency, errors
- Alerting: Page on-call engineer for SLA breaches
- Cost: <$X per billion events (benchmark: $0.50/1B)
- Maintenance window: No more than 2 hours/quarter
Architecture Decision Records (ADRs)
DECISION 1: Use Event Streaming (Kafka) vs. Database Queue
OPTION A: Kafka-based (Recommended)
+ Can handle 1M events/sec (battle-tested at scale)
+ Replayable (can rebuild views if needed)
+ Decouples producers from consumers
- New operational overhead (ZooKeeper, broker management)
- $X/month cost
OPTION B: Database Queue (PostgreSQL/RabbitMQ)
+ Simpler (one less system to operate)
- Doesn't scale past 100K events/sec
- Loses data on database crash
DECISION: Kafka
Reason: Durability, replayability, and headroom for 50x growth rule out a database queue.
Operational overhead is acceptable at this scale.
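The replayability argument is easiest to see with a toy append-only log (a conceptual sketch, not real Kafka):

```python
class EventLog:
    """Toy append-only log illustrating the replayability property
    that motivates Kafka in Decision 1; not a real broker."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1   # offset of the stored event

    def replay(self, from_offset=0):
        # Consumers re-read from any offset to rebuild a view,
        # unlike a destructive queue that deletes consumed rows.
        return list(self._events[from_offset:])

log = EventLog()
for e in ("signup", "click", "purchase"):
    log.append(e)

# Rebuild a derived count view entirely from the log
counts = {}
for e in log.replay():
    counts[e] = counts.get(e, 0) + 1
print(counts)   # → {'signup': 1, 'click': 1, 'purchase': 1}
```

A database queue that deletes rows on consumption cannot rebuild a corrupted downstream view this way; the log can, from any offset.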
---
DECISION 2: OLAP Warehouse (ClickHouse vs. Redshift vs. Snowflake)
OPTION A: ClickHouse (Recommended)
+ Best cost per TB for analytics queries
+ Self-hosted (no vendor lock-in)
- Operationally complex (requires expertise)
- No managed option in our cloud (AWS)
OPTION B: Redshift
+ Managed AWS service (easier ops)
+ Good query performance
- Higher cost ($X vs. $Y/month)
- Less suitable for event-scale analytics
OPTION C: Snowflake
+ Easiest operations (fully managed)
- Highest cost (~3x ClickHouse)
- Overkill for our scale
DECISION: ClickHouse
Reason: Cost efficiency is key. We'll hire/train on ClickHouse operations.
Playbook: Hire one contractor for 6-month ops handoff.
---
DECISION 3: Batch vs. Real-Time Analytics
OPTION A: Real-time (Kafka Streams / Flink)
+ Dashboards update instantly
- High operational complexity
- Overkill for most use cases (dashboard refresh every 30 sec is fine)
OPTION B: Batch (Spark / dbt)
+ Simpler (cron jobs + SQL)
+ Easier debugging
- Latency: minutes between an event and its visibility in analytics (set by refresh interval)
+ Good enough for most use cases
DECISION: Batch (1-minute refresh)
Reason: Operational simplicity wins. 1-minute latency acceptable for analytics.
Real-time can be added later if the product needs <30-second refresh.
Measurable Success Criteria
PERFORMANCE:
✓ Event ingestion latency: <100ms P99 (measure with timestamp comparison)
✓ Query latency: <1 second for daily dashboard queries (measure with APM)
✓ Dashboard refresh: Complete <2 seconds (measure user-facing timing)
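The "timestamp comparison" measurement reduces to a nearest-rank percentile over per-event latency samples (database write time minus event-received time); a minimal sketch:

```python
import math

def p99_ms(samples_ms):
    """Nearest-rank P99 over ingestion latency samples, where each
    sample is (database write timestamp - event received timestamp)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

# 100 samples: 99 fast events and one slow outlier
samples = [40] * 99 + [250]
print(p99_ms(samples))   # → 40
```

Note that P99 deliberately tolerates the worst 1% of events; a single outlier does not fail the <100ms target, but 2 slow events in 100 would.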
RELIABILITY:
✓ Uptime: >99.9% (tracked with synthetic monitoring)
✓ Mean time to recovery: <30 minutes for component failure (tracked in incidents)
✓ Data loss incidents: Zero (track backup recovery tests)
OPERATIONAL:
✓ On-call: <2 incidents/quarter (down from 3-4/quarter)
✓ Mean time to incident resolution: <1 hour (currently 4-6 hours)
✓ Unplanned maintenance: <30 min/quarter (currently 5-10 hours)
COST:
✓ Cost per billion events: <$X (benchmark: $0.50)
✓ Cloud spend: <$Y/month (down from $Z/month with optimization)
TEAM VELOCITY:
✓ Feature team velocity: +20% (freed from firefighting)
✓ Oncall load: 50% reduction in pages (fewer incidents)
✓ Engineer satisfaction: NPS improved (less toil)
Risks & Mitigation
RISK 1: Rebuild takes longer than 3 months → Further delays
Probability: Medium (new technology, learning curve)
Impact: High (pushes problem further down road)
Mitigation:
- Plan for 4-month estimate (add 33% buffer)
- Parallel testing: Run old + new pipeline, compare results
- Rollback plan: Keep old pipeline running for 30 days post-launch
- Success gate: Must handle 2x current load before switching
---
RISK 2: New system is operationally complex → High on-call load
Probability: Medium (ClickHouse + Kafka have steep learning curve)
Impact: High (defeats purpose of reducing on-call)
Mitigation:
- Hire ClickHouse expert for 6-month ops transfer
- Document all runbooks before launch
- Post-launch: 4-week "intensive ops" period with 2x staffing
- Playbook: If ops remains too complex, evaluate managed alternatives (e.g., Confluent Cloud, ClickHouse Cloud)
---
RISK 3: Old pipeline failure during migration → Data loss
Probability: Low (with proper safeguards)
Impact: Critical (customer trust destroyed)
Mitigation:
- Parallel run: Events streamed to both old + new for 2 weeks
- Validation: New pipeline produces identical results to old
- Audit: Daily data volume comparison (old vs. new)
- Rollback: If mismatch detected, revert to old immediately
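The daily volume audit can be a few lines of glue over per-day counts pulled from each pipeline. The dates, counts, and zero tolerance below are illustrative:

```python
def daily_volume_audit(old_counts, new_counts, tolerance=0.0):
    """Compare per-day event counts from the old vs. new pipeline.
    Returns the days whose counts diverge beyond `tolerance`."""
    mismatches = []
    for day, old in old_counts.items():
        new = new_counts.get(day, 0)
        if old == 0:
            if new != 0:
                mismatches.append(day)
        elif abs(new - old) / old > tolerance:
            mismatches.append(day)
    return mismatches

old = {"2024-01-01": 2_000_000, "2024-01-02": 2_050_000}
new = {"2024-01-01": 2_000_000, "2024-01-02": 1_900_000}
print(daily_volume_audit(old, new))   # → ['2024-01-02']
```

Any non-empty result triggers the rollback path; volume matching is a necessary check but not sufficient on its own, so pair it with row-level spot validation.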
Rollout Plan
PHASE 1: Development & Testing (Week 1-8)
- Build Kafka cluster (1 week)
- Build ClickHouse cluster (1 week)
- Develop ETL pipelines (3 weeks)
- Testing & debugging (2 weeks)
- Exit criteria: New pipeline processes 10M test events, identical to old
PHASE 2: Shadow Traffic (Week 9-10)
- Duplicate all production events to new pipeline
- Monitor for discrepancies (data validation, latency, errors)
- Keep old pipeline as source-of-truth
- Exit criteria: 1 week of shadow traffic with 100% data match
PHASE 3: Cutover (Week 11)
- Thursday night: Switch analytics queries to new pipeline
- Friday morning: Monitor dashboards (all team members on standby)
- Saturday-Sunday: Run in parallel; ready to rollback
- Monday morning: Turn off old pipeline (if successful)
- Exit criteria: No customer reports of incorrect analytics
PHASE 4: Optimization & Cleanup (Week 12+)
- Performance tuning (query optimization, compression)
- Remove old pipeline infrastructure
- Document all runbooks and procedures
- Train on-call team on new system
Real-World Example: Data Infrastructure PRD
Bad Infrastructure PRD (No Structure)
PROJECT: Rebuild Data Pipeline
SCOPE:
Redesign our data pipeline to be faster and more scalable.
TIMELINE:
3 months
TEAM:
3 engineers
STATUS:
In progress
Problems:
- No business case (why now?)
- No requirements (what's "faster"?)
- No success metrics (how do we know it's done?)
- No risks (what could go wrong?)
- No rollout plan (how do we migrate without breaking things?)
Result:
- 6 months later: "Still 20% done"
- Budget blown; nobody knows if it's on track
- Business can't make decisions
Good Infrastructure PRD (Structured)
PROJECT: Data Pipeline Rebuild (Kafka + ClickHouse)
BUSINESS CASE:
Current pipeline was designed for 500K events/day and is already handling 2M. Next year's forecast of 4M will overwhelm it.
Cost of inaction: System outages; 30-40% of eng time firefighting.
Cost of rebuild: $300K (3 eng × 3 months).
ROI: 6 months (payback from freed engineering time + prevented incidents).
REQUIREMENTS:
Throughput: 100M events/day (50x current load)
Query latency: <1 second (currently 5-10 min)
Uptime: 99.9%
Cost: <$0.50 per 1B events
ARCHITECTURE:
Kafka cluster: 3 brokers, 3 replicas per partition
ClickHouse warehouse: 3 nodes, 2TB storage per node
ETL: dbt models (Spark for transformations)
Monitoring: Datadog dashboards + PagerDuty alerts
RISKS & MITIGATIONS:
Risk: Migration loses data
Mitigation: Shadow run 1 week; 100% validation before cutover
Risk: New system is complex to operate
Mitigation: Hire ClickHouse expert; 4-week intensive ops transfer
TIMELINE:
Week 1-8: Development & testing
Week 9-10: Shadow traffic + validation
Week 11: Cutover (rollback-ready)
Week 12+: Optimization + cleanup
SUCCESS METRICS:
[Dashboard queries]
Query latency: <1 second (currently 5-10 min) ✓
Event ingestion: <100ms P99 ✓
Uptime: 99.9%+ ✓
[Operations metrics]
On-call incidents: <2/quarter (down from 4/quarter) ✓
Mean time to resolution: <1 hour (down from 4-6 hours) ✓
[Team impact]
Feature velocity: +20% (freed engineering) ✓
Engineer satisfaction: Less toil, higher NPS ✓
APPROVAL:
- Engineering Lead: Reviewed architecture, approved
- Security: Reviewed encryption & compliance, approved
- Finance: Reviewed budget, approved ($300K)
Target Launch: Week 12
Anti-Pattern: "Infrastructure Will Be Fine"
The Problem:
- PM ignores infrastructure work as "not revenue-generating"
- Teams patch + patch + patch
- 2 years later: Complete crisis + multi-month rebuild
- Crisis rebuild disrupts all product development
The Fix:
- Treat infrastructure as a product
- PRD it like any other feature
- Plan 10-20% of engineering cycles for infrastructure
- Prevent crisis-driven situations
Actionable Steps
Step 1: Identify Infrastructure Debt
Audit: List all systems that are creaking
- Data pipeline: Crashes at 2M events/day (at risk now)
- API server: Single point of failure (one host loss takes the product down)
- Search index: Queries take 10 seconds (acceptable but slow)
- Deployment process: Manual, error-prone (operational overhead)
Risk ranking: High (pipeline), High (API), Medium (search), Low (deployment)
Step 2: Write Business Case
INFRASTRUCTURE PROJECT: Data Pipeline Rebuild
Business problem: Current system is running well beyond its design capacity; outright failure projected within 6 months.
Business impact if nothing: Outages, analytics offline, $X in customer churn.
Business benefit of fixing now: 3-year runway, 50% less on-call load, +20% feature velocity.
Cost: $300K (3 eng × 3 months)
ROI: Payback in 6 months
Recommendation: Prioritize now (do not defer)
Step 3: Define Requirements (Not Implementation)
Requirements (WHAT):
- Scale to 100M events/day
- Query latency <1 second
- 99.9% uptime SLA
Implementation (HOW, engineer's choice):
- Option A: Kafka + ClickHouse
- Option B: Redshift
- Option C: Snowflake
PM specifies WHAT; engineers propose HOW.
Step 4: Build Success Metrics
Before:
- Event ingestion: 5-10 minutes
- Query latency: 5-10 minutes
- Uptime: 98%
After:
- Event ingestion: <100ms
- Query latency: <1 second
- Uptime: 99.9%
Measure:
- Dashboard analytics (automated tracking)
- On-call incidents (reduced from 4/quarter to 1-2/quarter)
- Feature velocity (freed engineering cycles)
Step 5: Plan Rollout Carefully
Shadow run: New system processes data alongside old, validated
Cutover: Switch analytics to new with rollback ready
Cleanup: Old system runs standby for 2 weeks, then retired
Documentation: Runbooks written before launch
Training: On-call trained on new system before going live
PMSynapse Connection
Infrastructure PRDs are often sloppy because they're "invisible." PMSynapse's Platform Template auto-generates infrastructure PRDs: "Project name, business case, requirements, success metrics, risk mitigation, rollout plan." By standardizing infrastructure specs, PMSynapse ensures platform work gets the rigor it deserves.
Key Takeaways
- Infrastructure is a product. The "user" is your engineering team, but it still deserves a PRD.
- Business case is everything. If you can't articulate why infrastructure work matters to the business, defer it.
- Specify requirements, not implementation. "Scale to 100M events/day" (requirement) vs. "Use Kafka" (implementation choice).
- Parallel testing prevents disasters. Shadow run, validate, then cut over with rollback ready.
- Infrastructure prevents crises. Plan it upfront; don't react to outages.
Constraints:
- Must support schema evolution (new event types, new attributes)
- Must integrate with existing auth layer
- Must be operable by the current ops team (no new specialist roles)
### 3. Create Success Metrics
Infrastructure projects need clear "done" criteria:
Success Metrics:
- Pipeline handles 100M events/day with 0 loss
- Query latency: 50th percentile <1 second, 99th <5 seconds
- Deployment cadence: Stay at 1x/day (no slowdown from new architecture)
- Operational cost: Within 10% of current spend
Risks:
- If query latency exceeds 5 seconds, product experiences that depend on it degrade
- If deployment slows, development velocity regresses
- If cost exceeds current spend by more than 15%, ROI becomes questionable
### 4. Define Phasing & Rollout
Phase 1 (Weeks 1-6): Build in parallel, validate on subset of data
- Build new pipeline; ingest 5% of events
- Run validation: results match old pipeline
- Run load tests: handle 10M events/day
Phase 2 (Weeks 7-8): Gradual cutover
- Switch 10% of events to new pipeline
- Monitor for 1 week; validate correctness
- Increase to 50%, then 100%
Phase 3 (Week 9): Decommission old pipeline
- Ensure nothing still queries old pipeline
- Archive old data (compliance hold)
- Deallocate old infrastructure (cost savings)
Rollback plan: If correctness issues are found, traffic can be shifted back to the old pipeline within 2 hours.
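One common way to implement the 10% → 50% → 100% split is deterministic hash-based routing. The sketch below is one option, not the plan's stated mechanism; the key format and SHA-256 choice are illustrative assumptions:

```python
import hashlib

def routed_to_new(event_key: str, rollout_pct: int) -> bool:
    """Deterministic percentage routing: the same key always takes
    the same path, keeping per-user analytics consistent mid-cutover."""
    digest = hashlib.sha256(event_key.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < rollout_pct

# Roughly 10% of keys land on the new pipeline at a 10% rollout,
# and a key's assignment never changes between batches.
keys = [f"user-{i}" for i in range(10_000)]
share = sum(routed_to_new(k, 10) for k in keys) / len(keys)
print(f"{share:.0%}")
```

Deterministic routing also makes rollback clean: dropping `rollout_pct` back to 0 reverts exactly the keys that had moved, with no mixed history per user.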
### 5. Document Constraints & Assumptions
Key Assumptions:
- Development team stays at 4 engineers (no headcount add)
- We can run old + new pipeline simultaneously for 2 weeks
- Dependencies on auth, logging, monitoring APIs stable
Tech Debt Cleared:
- Removes 30% of "why is this slow?" support tickets
- Unblocks: multi-region support, new analytics features
## Key Takeaways
- **Infrastructure is not CapEx bureaucracy.** Treat it like any other investment: clear ROI, success metrics, risk management.
- **Invisible work becomes visible when it breaks.** Doing PRDs for infrastructure prevents surprises.
- **Bad infrastructure slows your competitive velocity.** The startup with good data pipelines ships features 2x faster than one with technical debt.
# Writing PRDs for Platform & Infrastructure: The Invisible Product
## Article Type
**SPOKE Article** — Links back to pillar: /prd-writing-masterclass-ai-era
## Target Word Count
2,500–3,500 words
## Writing Guidance
Cover: how to define internal 'users,' developer experience as a product requirement, platform metrics, and making the business case for work that's invisible to customers. Soft-pitch: PMSynapse helps PMs translate platform investment into business impact narratives.
## Required Structure
### 1. The Hook (Empathy & Pain)
Open with an extremely relatable, specific scenario from PM life that connects to this topic. Use one of the PRD personas (Priya the Junior PM, Marcus the Mid-Level PM, Anika the VP of Product, or Raj the Freelance PM) where appropriate.
### 2. The Trap (Why Standard Advice Fails)
Explain why generic advice or common frameworks don't address the real complexity of this problem. Be specific about what breaks down in practice.
### 3. The Mental Model Shift
Introduce a new framework, perspective, or reframe that changes how the reader thinks about this topic. This should be genuinely insightful, not recycled advice.
### 4. Actionable Steps (3-5)
Provide concrete actions the reader can take tomorrow morning. Each step should be specific enough to execute without further research.
### 5. The Prodinja Angle (Soft-Pitch)
Conclude with how PMSynapse's autonomous PM Shadow capability connects to this topic. Keep it natural — no hard sell.
### 6. Key Takeaways
3-5 bullet points summarizing the article's core insights.
## Internal Linking Requirements
- Link to parent pillar: /blog/prd-writing-masterclass-ai-era
- Link to 3-5 related spoke articles within the same pillar cluster
- Link to at least 1 article from a different pillar cluster for cross-pollination
## SEO Checklist
- [ ] Primary keyword appears in H1, first paragraph, and at least 2 H2s
- [ ] Meta title under 60 characters
- [ ] Meta description under 155 characters and includes primary keyword
- [ ] At least 3 external citations/references
- [ ] All images have descriptive alt text
- [ ] Table or framework visual included