Database Runbooks Every SRE Team Should Maintain

A practical checklist of the database runbooks every SRE team should keep current for failover, storage, replication, locking, and restore validation.

A good database runbook is not a document you write once for compliance and forget. It is a working guide that helps an on-call engineer make safe decisions under pressure, especially when time, context, and staffing are limited. This checklist-based guide covers the database runbooks every SRE team should keep current, with practical guidance for failover, disk growth, replication lag, lock contention, and restore validation. The goal is simple: give you a reusable database operations checklist that shortens incident response, reduces avoidable mistakes, and stays useful as your architecture changes.

Overview

This article gives you a practical database runbook framework rather than a vendor-specific playbook. Whether you run PostgreSQL, MySQL, a managed service, or a mixed estate, the same operational truth applies: incidents become expensive when teams must reconstruct procedures from memory.

A strong database runbook should answer five questions quickly:

What problem are we trying to confirm? Symptoms, thresholds, and known false positives.
Who is allowed to act? Role, escalation path, and approval requirements.
What is the safest first action? Immediate containment steps before risky changes.
How do we know if the action worked? Clear success and rollback criteria.
What should be recorded? Timestamps, commands, incident notes, and follow-up tasks.

For SRE teams, a useful database incident runbook usually has the same structure:

Purpose and scope so engineers know when to use it.
Prerequisites such as access, credentials, dashboards, and maintenance windows.
Detection signals from alerts, logs, metrics, and user-facing symptoms.
Triage steps to distinguish root cause from downstream noise.
Execution steps with ordered actions and decision points.
Verification to confirm service health, data integrity, and client recovery.
Rollback or fallback path if the primary action fails.
Post-incident notes to capture fixes for future updates.

If your current documentation is long but rarely used, it may be missing the one trait runbooks need most: speed. During an incident, engineers need exact commands, links, thresholds, owners, and stop conditions. They do not need a textbook.

It also helps to separate runbooks into two categories:

Stabilization runbooks for immediate action during an incident.
Validation and recovery runbooks for proving the system is healthy again.

That split matters because many teams document failover but do not document how to verify that applications, replicas, backups, and write paths are actually back to normal.

Checklist by scenario

Below is a database reliability playbook for the scenarios most teams should document and rehearse. Each runbook should be short enough to use during an incident and specific enough to avoid improvisation.

1. Failover runbook

This is the runbook many teams think they have covered, but often it only describes the promotion step and skips the surrounding risk checks.

Include in the runbook:

Conditions that justify failover, such as primary unavailability, sustained write errors, storage failure, or node-level isolation.
How to distinguish database failure from network, DNS, proxy, or application pool issues.
Expected RPO and RTO assumptions for this service.
The authoritative source for current primary and replica topology.
Step-by-step promotion or managed failover procedure.
Application connection steps after failover, including connection pool resets and DNS expectations.
How to prevent accidental split-brain or writes to an old primary.
Post-failover verification: replication status, write acceptance, read latency, error rates, and backlog recovery.

What to record: time of failover decision, data loss window if any, instance IDs, commands executed, and any clients that required manual intervention.

If you rely on a managed platform, align this runbook with your provider's documented behavior and your own operational assumptions. The details often vary more than teams expect. For planning around managed environments, it helps to review Database-as-a-Service SLAs Compared: Backups, HA, RPO, and RTO Explained and Managed MySQL Services Compared: Replication, Backups, and Performance Limits.

2. Disk growth and storage exhaustion runbook

Storage incidents are often gradual until they become urgent. A practical database runbook should help an engineer decide whether the right response is cleanup, scale-up, throttling, or emergency traffic control.

Include in the runbook:

Where to check disk consumption by data files, indexes, WAL or binlogs, temp files, backups, and logs.
Thresholds for warning, urgent action, and emergency action.
Known causes of rapid growth: runaway logs, long transactions, unvacuumed tables, failed retention jobs, large migrations, replication backlog, or snapshot accumulation.
Safe immediate actions versus risky actions. For example, deleting files at the OS layer may be unsafe depending on the engine and file type.
Whether storage can be expanded online and what secondary effects to expect.
How to reduce write pressure or defer noncritical jobs.
Communication steps if customer-facing degradation is likely.

Useful verification checks: free space trend stabilizes, checkpoint or flush behavior normalizes, replication resumes, and latency returns toward baseline.

This runbook should link directly to your capacity dashboards. If the incident exposed poor forecasting, update your observability and planning process. For deeper instrumentation ideas, see Best Database Observability Tools for Query Performance and Capacity Planning.

3. Replication lag runbook

Replication lag is easy to detect and easy to mishandle. The wrong first action can make data divergence or application inconsistency worse.

Include in the runbook:

How lag is measured in your environment: time-based delay, queue size, byte backlog, or transaction apply delay.
Whether read traffic is served from replicas and what stale-read tolerance exists.
The top known causes in your system: heavy writes, long-running queries, lock waits, schema changes, network issues, underprovisioned replicas, or replica-specific maintenance.
Safe triage order: confirm source write rate, replica health, I/O saturation, apply worker state, lock contention, and long transactions.
Temporary mitigations such as rerouting reads, reducing nonessential jobs, or pausing expensive backfills.
Criteria for rebuilding a replica versus waiting for catch-up.

Critical note: the runbook should state clearly when engineers must stop using replicas for latency-sensitive or read-after-write-sensitive requests.

It is also useful to document any dependencies on proxies and poolers, since these can hide or amplify replica problems. Related reading: Best Database Connection Poolers and Proxies for Cloud Applications.

4. Lock contention and blocking runbook

Lock incidents often look like a general outage at first: rising latency, timeout spikes, and backed-up workers. A good runbook helps the on-call engineer move from symptoms to specific blockers quickly.

Include in the runbook:

How to identify blocking sessions, waiting queries, and long-lived transactions.
Dashboards or queries for active sessions, wait events, deadlocks, and transaction age.
Known risky application patterns: large transactions, migration jobs during peak traffic, missing indexes, broad updates, and retry storms.
Decision criteria for killing a session, terminating a query, throttling clients, or pausing a deployment.
Special caution for maintenance commands that may escalate locks.
How to confirm the blocker is gone and backlog is draining.

What this runbook should not assume: that killing the longest-running query is always safe. In many systems, the real question is which transaction is holding the lock that matters, and what client behavior will happen after termination.

If schema changes frequently trigger lock issues, connect this runbook to your change management process and schema auditing workflow. See Best Tools for Database Schema Drift Detection and Change Auditing and GitOps for Databases: What You Can Safely Automate and What Still Needs Guardrails.

5. Restore validation runbook

Backups are not the same as recovery. Every SRE database checklist should include a restore validation runbook because many teams discover gaps only when they urgently need data back.

Include in the runbook:

What kinds of restores you support: full restore, point-in-time recovery, table-level recovery, or logical export/import.
Where restore artifacts come from and how integrity is checked.
How to provision an isolated restore target safely.
How to validate restored data beyond successful job completion: row counts, schema version, application login, critical query checks, and sample business transactions.
How to compare restore time against expected recovery objectives.
How to document missing data windows, if applicable.
How to decommission temporary restore environments securely.

Best practice: make this runbook usable for planned drills, not just emergencies. A restore validation procedure that only exists in theory is a major reliability gap.

For backup assumptions and vendor limitations, keep a reference to Database Backup Tools and Managed Snapshots: What to Check Before You Rely on Them.

6. Access and credential emergency runbook

Not every database incident is a performance failure. Sometimes the issue is operational access: expired credentials, failed secret rotation, missing break-glass access, or a proxy that cannot authenticate.

Include in the runbook:

How to verify whether the issue is credential-related versus network or permission-related.
Break-glass access procedures and approval requirements.
How to rotate or reissue credentials safely.
Which applications or jobs depend on the affected secret.
How to invalidate old credentials and confirm recovery.

This is one of the most overlooked runbooks because access problems are often treated as one-off admin tasks. In reality, they can block incident response itself. Related reading: Secrets Management for Databases: Vault, Cloud-Native Options, and Rotation Tradeoffs.

7. Performance regression triage runbook

Even if it is not one of your formal emergency scenarios, keep a lightweight runbook for sudden latency or throughput degradation.

Include in the runbook:

Baseline metrics for query latency, CPU, memory, I/O, cache hit rate, and connection count.
How to compare current query patterns against known baselines.
Recent changes to check first: deployments, schema changes, background jobs, failovers, parameter changes, and traffic shape shifts.
Whether to scale vertically, add replicas, tune pool limits, or revert a release.

When this runbook identifies a need for repeatable benchmarking, teams should maintain current load test procedures as well. See Database Load Testing Tools: How to Benchmark Throughput, Latency, and Connection Limits.

What to double-check

This section gives you the cross-cutting checks that belong in almost every database operations checklist, regardless of the scenario.

Access works before the incident. Dashboards, shells, admin consoles, bastions, and break-glass credentials should be tested routinely.
Ownership is explicit. The runbook should name the responder role, escalation target, and required approver for risky actions.
Commands match current tooling. Old hostnames, deprecated CLIs, and renamed dashboards are common runbook failures.
Alert links are embedded. Engineers should not have to search for the right dashboard in the middle of an incident.
Success criteria are concrete. “Looks healthy” is not enough. Use measurable checks such as replica catch-up, error rate drop, or write confirmation.
Rollback steps exist. If a promotion, parameter change, or traffic shift fails, the runbook should say what to do next.
Application behavior is included. Database incidents often persist because connection pools, caches, or retry loops are not reset or tuned after the database recovers.
Compliance and logging needs are not forgotten. Some actions require audit trails, especially around privileged access or restored data.

One practical improvement is to put a short “stop and confirm” block before any destructive action. For example:

Confirm the target instance and environment.
Confirm whether customer traffic is still hitting the old primary.
Confirm the last known good backup or restore point.
Confirm whether a safer mitigation exists.

That small pause prevents many avoidable production mistakes.

Common mistakes

Most database incident runbooks fail in familiar ways. If you want your runbooks to be used, avoid these patterns.

Documenting only the happy path. Real incidents involve partial failure, missing access, or conflicting symptoms. Include branches for ambiguity.
Assuming one engine or one topology forever. Managed services, proxies, and replica layouts change. Your runbook should say what assumptions it depends on.
Leaving out verification. A failover is not complete when a button is clicked. It is complete when clients can read and write safely and follow-on systems recover.
Writing for experts only. Runbooks should help a tired secondary responder, not just the engineer who designed the system.
Ignoring application-side recovery. Pools, retries, leader election, and stale DNS can keep the outage going after the database is healthy.
Keeping runbooks outside the incident workflow. If they are buried in an old wiki or private folder, they will not be used.
Never testing restore procedures. Backup confidence without restore evidence is not reliability.
Failing to update after incidents. The freshest lesson is usually the most valuable runbook improvement.

Another common mistake is making runbooks too long. If a page takes ten minutes to scan, engineers will skip it. Put deep reference material below the fold, but keep the operational steps near the top.

When to revisit

Your database runbook should be treated as a living asset. The practical rule is simple: update it whenever the environment, tooling, or recovery assumptions change.

At minimum, revisit each runbook:

Before seasonal planning cycles when capacity, traffic patterns, and staffing plans are being reviewed.
When workflows or tools change such as new failover automation, observability platforms, backup systems, connection poolers, or secret rotation flows.
After every notable incident especially when responders had to improvise or bypass the documented path.
After topology changes such as moving to managed databases, adding regions, or changing replica strategy.
After major schema or application shifts that affect lock behavior, write volume, or restore complexity.
On a fixed review cadence such as quarterly for critical services and semiannually for lower-risk systems.

A simple maintenance loop works well:

Pick the top five database runbooks your team relies on.
Assign one owner and one reviewer for each.
Run a short tabletop or drill against one scenario each quarter.
Update commands, thresholds, links, and screenshots immediately after the exercise.
Track the last-reviewed date in the document header.

If you want a practical starting point, create these five first: failover, disk growth, replication lag, lock contention, and restore validation. Those cover a large share of the incidents that push teams into stressful, high-risk decisions. From there, add access emergencies, schema-change rollback, and performance regression triage.

The best database runbook is the one your team can trust at 3 a.m. Build it around observable signals, safe first actions, exact verification steps, and regular review. If your systems change, your runbooks should change with them.

Database Runbooks Every SRE Team Should Maintain

Overview

Checklist by scenario

1. Failover runbook

2. Disk growth and storage exhaustion runbook

3. Replication lag runbook

4. Lock contention and blocking runbook

5. Restore validation runbook

6. Access and credential emergency runbook

7. Performance regression triage runbook

What to double-check

Common mistakes

When to revisit

Related Topics

Datastore.cloud Editorial

Up Next

Database Access Governance: Tools for Temporary Access, Approval Flows, and Audit Logs

Multi-Region Database Patterns: Read Replicas, Active-Active, and Conflict Handling

Kubernetes Storage Classes for Stateful Databases: Performance and Risk Tradeoffs