Incident Report Template
Purpose
This template ensures consistent documentation of incidents, facilitates learning from failures, and helps prevent recurrence. P1 incidents require a report within 24 hours of resolution; P2 incidents within 48 hours (see Severity Levels below).
🚨 Incident Report Template
1. Incident Summary
Date & Time (Start – End):
Reported By:
Incident ID / Ticket #:
Severity Level: (P1, P2, etc.)
Status: (Resolved / Ongoing)
2. Impact
Affected Systems/Services:
Users Impacted: (e.g. “20% of customers could not log in”)
Business Impact: (e.g. downtime, data loss, revenue loss, SLA breach)
3. Incident Timeline
| Time (UTC) | Event / Action | Responsible |
|------------|----------------------------|----------|
| 10:05 | Monitoring alert triggered | System |
| 10:07 | Engineer paged | On-call |
| 10:15 | Database restarted | Ops Team |
| 10:45 | Service restored | Ops Team |
4. Root Cause
Technical Cause: (e.g. “Redis memory exhaustion due to unbounded queue growth”)
Contributing Factors: (e.g. “Alert thresholds were too high, missing early warnings”)
5. Resolution
Actions Taken: (step-by-step how it was resolved)
Verification: (tests/checks proving system is stable again)
6. Follow-Up / Preventive Actions
Short-Term Fixes: (e.g. increase Redis memory limit, restart policy)
Long-Term Fixes: (e.g. implement backpressure, improve monitoring)
Owner + Deadline:
| Action Item | Owner | Deadline | Status |
|-------------|-------|----------|--------|
|             |       |          |        |
7. Lessons Learned
What went well:
What could be improved:
Severity Levels
P1 - Critical
- Complete service outage
- Payment processing down
- Data breach or security incident
- > 50% users affected
- Response Time: Immediate
- Report Due: 24 hours
P2 - High
- Partial service degradation
- < 50% users affected
- Key features unavailable
- Performance severely degraded
- Response Time: < 30 minutes
- Report Due: 48 hours
P3 - Medium
- Minor feature issues
- Workaround available
- < 10% users affected
- Response Time: < 4 hours
- Report Due: 1 week
P4 - Low
- Cosmetic issues
- Documentation errors
- No user impact
- Response Time: Next business day
- Report Due: Optional
Best Practices
During the Incident
- Communicate early and often - Even if you don’t have all the answers
- Document everything - Times, actions, decisions
- Focus on resolution first - Analysis comes later
- Escalate when needed - Don’t hesitate to wake people up for P1s
Writing the Report
- Be blameless - Focus on systems and processes, not individuals
- Be specific - Include exact error messages, metrics, and timestamps
- Be honest - Document what actually happened, not what should have happened
- Be actionable - Every lesson learned should have a corresponding action
Follow-Up
- Share widely - Send to all engineering teams
- Track actions - Ensure preventive measures are implemented
- Review quarterly - Look for patterns across incidents
- Update runbooks - Incorporate learnings into documentation
Distribution
Who Gets the Report
- Engineering team
- Product team
- CEO (P1 incidents)
- Operations Manager (P1 and P2 incidents)
- Customer Support (if customer-facing)
Where to Store
- Shared drive: /incidents/YYYY/MM/ (see the path sketch below)
- Teams: Post summary in Incidents channel
- JIRA: Link to incident ticket
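For example, a report for an incident resolved on 2024-01-15 lives under /incidents/2024/01/. A small helper can build the path; note that the .md file name shown here is an assumed convention, not a documented rule:

```python
from datetime import datetime, timezone

def incident_report_path(incident_id: str, resolved_at: datetime) -> str:
    """Build the shared-drive path for an incident report."""
    # /incidents/YYYY/MM/<incident id>.md -- the file name is an assumed convention.
    return resolved_at.strftime(f"/incidents/%Y/%m/{incident_id}.md")

print(incident_report_path("INC-2024-001", datetime(2024, 1, 15, tzinfo=timezone.utc)))
# -> /incidents/2024/01/INC-2024-001.md
```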
Example Report
1. Incident Summary
Date & Time: 2024-01-15 10:05 UTC – 10:45 UTC
Reported By: Monitoring System / John Doe
Incident ID: INC-2024-001
Severity Level: P1
Status: Resolved
2. Impact
Affected Systems/Services: Payment API, Transaction Processing
Users Impacted: 100% of users unable to complete payments
Business Impact: $15,000 in lost transactions, 40 minutes downtime
3. Incident Timeline
| Time (UTC) | Event / Action | Responsible |
|------------|----------------|-------------|
| 10:05 | Payment API alerts - response time > 10s | Monitoring |
| 10:07 | On-call engineer paged | PagerDuty |
| 10:10 | Database connection pool exhaustion identified | John Doe |
| 10:15 | Database connections increased from 100 to 500 | John Doe |
| 10:20 | Partial recovery, 50% success rate | System |
| 10:30 | Application servers restarted | Jane Smith |
| 10:45 | Full service restored, monitoring normal | System |
4. Root Cause
Technical Cause: Database connection pool exhausted due to connection leak in new payment validation code deployed at 09:00 UTC
Contributing Factors:
- Load testing didn’t simulate production payment patterns
- Connection pool metrics not included in deployment checklist
- Gradual rollout not implemented for payment service
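To make the failure mode concrete, here is a minimal sketch of the leak pattern, assuming a SQLAlchemy-style pool. It is illustrative only, not the actual payment validation code:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # stand-in for the production database

def validate_payment_leaky(payment_id: int) -> bool:
    conn = engine.connect()  # connection checked out of the pool
    row = conn.execute(text("SELECT 1")).first()  # placeholder validation query
    if row is None:
        return False  # BUG: early return skips close(); the connection leaks
    conn.close()  # only reached on the happy path
    return True

def validate_payment_fixed(payment_id: int) -> bool:
    # The context manager returns the connection to the pool on every path,
    # including early returns and exceptions.
    with engine.connect() as conn:
        row = conn.execute(text("SELECT 1")).first()
        return row is not None
```

Under load, every request that hits the buggy early-return path permanently consumes a pool slot, so the pool drains until it is exhausted, matching the 10:10 finding in the timeline.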
5. Resolution
Actions Taken:
- Increased connection pool size as immediate mitigation
- Identified connection leak in payment validation module
- Rolled back to previous version
- Verified all connections properly closed
- Re-deployed fixed version with connection management
Verification: (spot-checked with the script below)
- Connection pool usage stable at 40%
- Payment success rate at 99.9%
- Response times < 200ms
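A small spot-check along these lines can back the latency claim; the health endpoint URL and thresholds are placeholders, not part of the report:

```python
import requests

# Hypothetical health endpoint; substitute the real payment API URL.
HEALTH_URL = "https://payments.internal.example/health"

resp = requests.get(HEALTH_URL, timeout=5)
resp.raise_for_status()
latency_ms = resp.elapsed.total_seconds() * 1000
assert latency_ms < 200, f"latency {latency_ms:.0f} ms exceeds the 200 ms target"
print(f"payment API healthy, latency {latency_ms:.0f} ms")
```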
6. Follow-Up / Preventive Actions
Short-Term Fixes:
- Add connection pool monitoring to dashboard
- Implement connection timeout (target: 48 hours; see the pool sketch below)
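A sketch of what the timeout could look like with a SQLAlchemy-style pool. The report does not name the database layer, so treat the API and the values as assumptions to be tuned:

```python
from sqlalchemy import create_engine

# Illustrative settings; pool sizes mirror the 100 -> 500 mitigation above.
engine = create_engine(
    "postgresql://payments@db/payments",  # hypothetical DSN
    pool_size=100,       # steady-state connections
    max_overflow=400,    # burst headroom up to 500 total
    pool_timeout=10,     # wait at most 10 s for a free connection, then fail fast
    pool_recycle=1800,   # recycle idle connections after 30 min
    pool_pre_ping=True,  # detect and replace dead connections before use
)
```

With pool_timeout set, a saturated pool produces a fast, visible error instead of requests hanging until the API times out.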
Long-Term Fixes:
- Implement connection pooling best practices guide
- Add connection leak detection to CI/CD pipeline (see the pytest sketch below)
- Implement canary deployments for payment service
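One way the leak detection could work in CI, sketched with pytest and the SQLAlchemy pool's checked-out counter; the engine fixture is assumed to exist in the project's test suite:

```python
import pytest

@pytest.fixture(autouse=True)
def assert_no_connection_leak(engine):
    """Fail any test that ends with connections still checked out of the pool."""
    yield  # run the test body
    leaked = engine.pool.checkedout()
    assert leaked == 0, f"{leaked} connection(s) leaked by this test"
```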
| Action Item | Owner | Deadline | Status |
|---------------------|------------|------------|-------------|
| Add pool monitoring | John Doe | 2024-01-20 | In Progress |
| Connection timeout | Jane Smith | 2024-01-22 | Pending |
| Canary deployment | Tech Lead | 2024-02-01 | Planned |
7. Lessons Learned
What went well:
- Monitoring detected issue quickly
- Team responded within SLA
- Rollback procedure worked smoothly
What could be improved:
- Need better load testing for payment flows
- Connection pool metrics should be part of standard monitoring
- Gradual rollout would have limited the impact