Notion Template + Comms Scripts
Marketing Automation Incident Response Kit
Step-by-step playbook for diagnosing SFMC/Braze outages, communicating with stakeholders, and restoring automations without panic.
Why you need an incident response kit
Marketing automation failures do not announce themselves politely. A Journey Builder stall at 2 AM, an API credential rotation that silently breaks your data feed, or an AI copilot generating off-brand copy at scale — these situations demand a structured response, not a scramble through Slack threads trying to figure out who owns what. The cost of improvising is measured in missed sends, compliance exposure, damaged sender reputation, and eroded executive confidence.
This kit is a single command-center document that your team opens the moment something breaks. It contains severity classifications, decision trees for the most common incident types, pre-written stakeholder communications, pause and resume procedures, deliverability recovery steps, AI copilot incident protocols, post-incident review templates, and a quarterly audit process to keep everything current.
The goal is simple: cut triage time from hours to minutes, keep leadership and frontline teams aligned while you fix the issue, and capture learnings so every incident makes your operation stronger.
Incident classification and severity matrix
Not every issue deserves a war room. Misclassifying severity wastes time on low-impact problems and under-responds to critical ones. Use the matrix below as the single source of truth for how your team triages.
| Severity | Definition | Example Incidents | Response Time | Auto-Trigger Actions |
|---|---|---|---|---|
| Sev 1 — Critical | Customer data loss, regulatory breach, or complete platform outage | PII exposure in a send, SFMC platform-wide outage, consent data corruption | Immediate (< 15 min) | Page leadership + legal, pause all sends, open incident bridge |
| Sev 2 — High | Critical journeys failing, > 50% of sends stuck, or revenue-impacting degradation | Journey Builder stall affecting welcome series, API failures blocking transactional sends, Braze push queue backed up | < 30 min | Engage engineering bridge, send customer-facing status update, escalate to vendor support |
| Sev 3 — Medium | Degraded performance, single journey affected, or small audience impacted | Delayed data sync for one segment, intermittent API timeouts, AI subject-line latency | < 2 hours | Ops lead + CRM admins notified, monitor via dashboard, log for next standup |
| Sev 4 — Low | Cosmetic issues, minor data delays, or non-customer-facing bugs | Dashboard rendering errors, internal test sends failing, non-critical automation warnings | Next business day | Log in backlog, assign owner, fix during normal sprint |
Each incident card in the kit links to the correct severity play with the exact steps: who to page, what logs to pull, which AI guardrails to disable, and how to restore throttles.
Classification decision flow
When an alert fires or someone reports an issue, walk through these questions in order:
- Is customer data exposed or at risk of regulatory violation? If yes, Sev 1.
- Are revenue-generating sends (transactional, welcome, cart abandonment) blocked or failing? If yes, Sev 2.
- Is the issue isolated to a single journey or a non-revenue segment? If yes, Sev 3.
- Is the issue cosmetic or internal-only? If yes, Sev 4.
If you are unsure, escalate up one severity level. It is easier to downgrade than to explain why a Sev 2 sat unattended for four hours.
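The four questions above, plus the escalate-when-unsure rule, can be sketched as a small triage helper. The names and structure here are illustrative, not part of the kit itself:

```python
from enum import IntEnum

class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4

def classify(data_at_risk: bool, revenue_sends_blocked: bool,
             single_journey_only: bool, cosmetic_only: bool,
             unsure: bool = False) -> Severity:
    """Walk the decision questions in order; escalate one level when unsure."""
    if data_at_risk:
        return Severity.SEV1  # regulatory/data exposure always wins
    if revenue_sends_blocked:
        sev = Severity.SEV2
    elif single_journey_only:
        sev = Severity.SEV3
    elif cosmetic_only:
        sev = Severity.SEV4
    else:
        # No question answered "yes": default to medium and let the unsure rule apply.
        sev = Severity.SEV3
    if unsure:
        # "Escalate up one severity level" = lower number, floored at Sev 1.
        sev = Severity(max(sev - 1, Severity.SEV1))
    return sev
```

Encoding the flow this way also makes the triage logic testable, so a tabletop exercise can verify the team and the helper agree.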
Decision trees for top incidents
The kit includes decision trees for the twelve most common incident types; the six most frequent are excerpted below. Each tree tells you what to check first, what to do if that check fails, and when to escalate. Print these or bookmark them — you will not have time to read documentation from scratch when things break.
API failures (SFMC, Braze, Iterable)
- Check platform status page. If vendor confirms outage, skip to stakeholder comms and monitor.
- If status page is green, check your API credentials. Rotate or re-authenticate if expired.
- Verify IP allowlists and firewall rules have not changed.
- Check rate limits. If you are being throttled, reduce batch sizes and implement exponential backoff.
- If none of the above, capture error codes and request IDs, then open a vendor support ticket with severity level matching your internal classification.
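The throttling step above usually means retrying with exponential backoff and jitter. A minimal sketch, assuming a generic zero-argument request callable rather than any specific vendor SDK:

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a throttled API call with capped exponential backoff and full jitter.

    `request_fn` is any zero-argument callable that raises on a retryable
    error (e.g. HTTP 429/5xx). Names here are illustrative, not a vendor SDK.
    """
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error for the support ticket
            # Sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Jitter matters here: if every worker retries on the same schedule, you re-create the thundering herd that got you throttled in the first place.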
Data sync delays
- Check ETL job status in your orchestration tool (Airflow, dbt, Make, or Zapier).
- If the job failed, inspect logs for schema drift, permission errors, or timeout issues.
- If the job succeeded but data is stale, check the destination platform’s import queue.
- Verify webhook endpoints are accepting payloads (test with a manual POST).
- If latency exceeds your SLA by more than 2x, escalate to Sev 2 and notify downstream journey owners.
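The 2x-SLA escalation rule in the last step can be encoded as a freshness check, assuming you can read a last-updated timestamp from the destination platform:

```python
from datetime import datetime, timedelta, timezone

def freshness_status(last_updated, sla, now=None):
    """Compare sync lag against the SLA; > 2x SLA maps to the Sev 2 escalation rule."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_updated
    if lag <= sla:
        return "ok"
    if lag <= 2 * sla:
        return "degraded"   # within tolerance: log and monitor (Sev 3)
    return "escalate"       # > 2x SLA: raise to Sev 2, notify journey owners
```

Note this catches the "job succeeded but data is stale" case that pure job-failure alerting misses — the same gap Scenario A in the tabletop exercises is built around.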
Journey Builder stalls
- Open the journey in SFMC or Braze and check for contacts stuck in wait steps or decision splits.
- Review entry source status — is the data extension, API event, or segment still populating?
- Check for platform-side throttling or send limits that may be gating progression.
- If the journey references an external API call (content lookup, AI-generated copy), verify that endpoint is responding.
- If contacts are stuck and cannot be un-stuck, clone the journey, re-inject the stuck population, and document the workaround.
AI copilot misfires (off-brand copy, hallucinated content)
- Immediately pause any journey or automation using the AI-generated content.
- Pull the specific prompt, model version, and output from your prompt registry logs.
- Check guardrail coverage — did the QA layer catch the issue? If not, identify the gap.
- Replace the AI-generated content with a pre-approved fallback template.
- File a guardrail incident report (template included in the kit) and schedule a prompt review within 48 hours.
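The contain-and-fallback steps assume every AI block ships with a pre-approved static fallback. A sketch of that gate, where the validator callables and fallback mapping are hypothetical stand-ins for your own guardrail layer:

```python
def render_content(block_id, ai_output, fallbacks, validators):
    """Gate AI copy behind validators; fall back to pre-approved static copy.

    `fallbacks` maps block IDs to approved templates; `validators` are
    callables returning (passed, reason). All names are illustrative.
    """
    fallback = fallbacks[block_id]  # every AI block must have a fallback configured
    if ai_output is None:
        return fallback             # model unavailable: never ship a blank email
    for validate in validators:
        passed, reason = validate(ai_output)
        if not passed:
            # In production, log `reason` to the guardrail incident report here.
            return fallback
    return ai_output
```

The key design choice is that the fallback lookup happens unconditionally up front, so a missing fallback fails loudly at render time instead of silently sending an empty block.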
Deliverability drops
- Check your sender reputation dashboard (Sender Score, Google Postmaster Tools, Microsoft SNDS).
- Review recent send volumes for unexpected spikes that may have triggered ISP throttling.
- Inspect bounce logs for new block rules, authentication failures (SPF, DKIM, DMARC), or list quality issues.
- If reputation damage is confirmed, switch to the deliverability recovery protocol in the next section.
- Notify stakeholders that send volumes will be temporarily reduced during the warm-up period.
Consent or suppression failures
- Treat as Sev 1 immediately. Sending to opted-out contacts is a regulatory and brand risk.
- Pause all non-transactional sends.
- Audit the suppression sync pipeline end-to-end: source of truth, ETL job, platform-side list.
- Identify the exact population that was impacted and prepare a disclosure plan with legal.
- After remediation, add a synthetic subscriber test that validates suppression enforcement before every batch send.
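The synthetic subscriber test in the last step might look like this sketch, assuming you seed known opt-out addresses into your source of truth and can read the platform-side suppression list before each batch:

```python
def suppression_gate(batch_recipients, platform_suppression_list, synthetic_optouts):
    """Block a batch send unless every synthetic opted-out subscriber is suppressed.

    `synthetic_optouts` is a set of seeded opt-out addresses. If any appear in
    the outgoing batch, or are missing from the platform's suppression list,
    the sync pipeline is broken and the send must not proceed.
    """
    leaked = synthetic_optouts & set(batch_recipients)
    not_synced = synthetic_optouts - set(platform_suppression_list)
    if leaked or not_synced:
        raise RuntimeError(
            f"Suppression check failed: leaked={sorted(leaked)}, "
            f"missing from platform list={sorted(not_synced)}"
        )
    return True
```

Because the seeded addresses are yours, this check validates the entire pipeline end-to-end without ever risking a real opted-out contact.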
Stakeholder communication templates
Clear communication during an incident is as important as the technical fix. Use these templates verbatim or adapt them to your org’s tone. The kit includes ready-to-paste versions for Slack, email, and status page updates.
Initial notification (within 15 minutes of classification)
Subject: [Sev X] Marketing Automation Incident — [Short Description]
Body: We identified an issue affecting [specific system or journey]. Current impact: [describe what is broken and who is affected]. The ops team is actively investigating. Next update in [30/60] minutes. No action required from your team at this time. Incident lead: [Name].
Ongoing update (every 30-60 minutes for Sev 1-2)
Subject: [Sev X] Update — [Short Description]
Body: Status: [Investigating / Identified / Mitigating / Resolved]. What we know: [root cause or current hypothesis]. What we have done: [actions taken]. Next steps: [what happens next]. ETA to resolution: [estimate or “unknown, next update at HH:MM”]. Incident lead: [Name].
Resolution notification
Subject: [Sev X] Resolved — [Short Description]
Body: The issue affecting [system/journey] has been resolved as of [timestamp]. Root cause: [brief explanation]. Impact: [number of contacts affected, sends delayed, etc.]. Immediate remediation: [what was done]. A post-incident review is scheduled for [date]. Full report will be shared within [48/72] hours.
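If you post these templates programmatically (for example, from the incident-capture recipe into Slack), a small fill helper keeps wording consistent under pressure. The field names below are placeholders matching the bracketed slots above, not part of any vendor API:

```python
from string import Template

INITIAL = Template(
    "Subject: [Sev $sev] Marketing Automation Incident — $summary\n"
    "Body: We identified an issue affecting $system. Current impact: $impact. "
    "The ops team is actively investigating. Next update in $next_update minutes. "
    "No action required from your team at this time. Incident lead: $lead."
)

def initial_notification(sev, summary, system, impact, next_update, lead):
    """Fill the initial-notification template; substitute() raises on any missing field."""
    return INITIAL.substitute(sev=sev, summary=summary, system=system,
                              impact=impact, next_update=next_update, lead=lead)
```

Using `substitute` rather than `safe_substitute` is deliberate: a missing field should fail loudly before the message ships, not leave a literal `$impact` in a stakeholder update.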
Audience-specific talking points
| Audience | What They Care About | Key Message |
|---|---|---|
| Executive leadership | Revenue impact, brand risk, timeline | "We contained the impact to [X], no revenue loss / estimated [Y] impact. Root cause identified, fix deployed." |
| RevOps / Sales | Lead flow, pipeline data integrity | "Lead delivery resumed at [time]. No duplicate or missing records. If you see gaps, flag them by [deadline]." |
| Compliance / Legal | Regulatory exposure, data handling | "No PII exposed. Consent records intact. We are documenting the incident for audit trail purposes." |
| Customer Support | Customer-facing impact, talking points | "Customers may have experienced [delay/missing email]. Use this script if they contact us: [script]." |
Pause and resume procedures
Pausing automations is the single most important containment action, but doing it wrong creates a second incident. Follow these procedures to pause safely and resume without data loss.
Before you pause
- Identify which automations must continue running (transactional receipts, password resets, legally required communications).
- Document every automation you plan to pause, including the journey name, platform, entry source, and current population in-flight.
- Notify downstream teams (Support, Sales) that automated communications will be temporarily halted.
Pause checklist
| Step | SFMC | Braze | Iterable |
|---|---|---|---|
| Pause journey entry | Set Journey to “Paused” in Journey Builder | Stop Campaign via API or Dashboard | Pause Workflow in Workflows tab |
| Halt triggered sends | Pause Triggered Send Definition | Disable API-triggered Campaign | Pause Triggered Campaign |
| Stop data imports | Pause Automation Studio activities | Disable Segment Sync | Pause ETL connector |
| Disable AI content | Remove dynamic content blocks or switch to static fallback | Disable Liquid personalization referencing AI endpoints | Replace Handlebars helpers calling AI APIs |
| Confirm sends stopped | Check Tracking > Sends for zero activity | Check Message Activity for zero dispatches | Check Campaign Analytics for zero sends |
Resume checklist
- Verify the root cause is resolved and validated in a test environment.
- Re-enable data imports first and confirm data is flowing correctly for at least 15 minutes.
- Send a test to your synthetic subscriber list (internal seed list) to confirm content, personalization, and suppression are working.
- Resume journeys in order of business priority: transactional first, then revenue-critical (welcome, cart abandonment), then engagement-focused.
- Monitor send rates, bounce rates, and delivery metrics for the first hour. If anything looks abnormal, pause again and re-investigate.
- Send a “systems normal” notification to all stakeholders.
Handling in-flight contacts
Contacts who were mid-journey when you paused need special attention. Depending on the platform:
- SFMC: Contacts in wait steps will resume automatically when the journey is un-paused. Contacts in decision splits may need to be re-evaluated if underlying data changed during the outage.
- Braze: Canvas users in delay steps will continue on schedule after resume. Check for users who may have exited due to timeout settings.
- Iterable: Workflow participants will pick up where they left off. Audit the “waiting” count before and after resume to confirm no one was dropped.
Deliverability recovery protocol
If an incident damaged your sender reputation, do not resume full-volume sends immediately. ISPs have long memories and short fuses. Follow this warm-up protocol.
Assessment
- Pull Sender Score, Google Postmaster Tools reputation, and Microsoft SNDS data.
- Check for new blocklist entries (Spamhaus, Barracuda, SURBL).
- Review bounce categorization: are you seeing blocks (5xx) or deferrals (4xx)?
- Identify which sending IPs and domains are affected.
Recovery warm-up schedule
| Day | Volume (% of normal) | Audience | Action if bounce rate > 5% |
|---|---|---|---|
| 1 | 10% | Most engaged 30-day segment | Halt, re-check authentication, contact ISP relations |
| 2 | 20% | Most engaged 60-day segment | Reduce volume by half, investigate specific ISPs blocking |
| 3 | 35% | Engaged 90-day segment | Hold at current level for 24 additional hours |
| 4 | 50% | Engaged 90-day + re-engaged segment | Continue only if bounce rate < 2% |
| 5 | 75% | Full active segment minus disengaged | Monitor complaint rates closely |
| 6-7 | 100% | Full active audience | Resume normal cadence, continue monitoring for 2 weeks |
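The schedule above translates directly into a daily volume cap plus the halt rule from the last column. A sketch where the percentages mirror the table and the function names are illustrative:

```python
# (day, fraction of normal volume) pairs mirroring the warm-up table.
WARMUP = [(1, 0.10), (2, 0.20), (3, 0.35), (4, 0.50), (5, 0.75), (6, 1.00), (7, 1.00)]

def daily_send_cap(day, normal_daily_volume):
    """Look up the warm-up fraction for a recovery day and cap the send volume."""
    for d, fraction in WARMUP:
        if day == d:
            return round(normal_daily_volume * fraction)
    return normal_daily_volume  # past day 7: resume normal cadence

def should_halt(bounce_rate, threshold=0.05):
    """Matches the table's 'bounce rate > 5%' halt condition."""
    return bounce_rate > threshold
```

Wiring this cap into your send orchestration (rather than enforcing it by hand) removes the temptation to "just send the rest" on day 3.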
Authentication audit
Run this checklist after any deliverability incident:
- SPF record includes all sending IPs and services.
- DKIM keys are valid, not expired, and signing correctly (verify with a test send to mail-tester.com or a similar tool).
- DMARC policy is set to at least `p=quarantine` and you are receiving aggregate reports.
- BIMI record is intact (if applicable).
- Return-Path domain aligns with your From domain.
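The DMARC line of the checklist can be verified mechanically once you have fetched the TXT record published at `_dmarc.<yourdomain>` (for example with `dig TXT`). A minimal parser sketch:

```python
def check_dmarc(txt_record):
    """Validate a DMARC TXT record: policy at least quarantine, aggregate reports on.

    Returns (passed, reason). Pass in the raw TXT record string fetched from
    DNS; this function does not perform the lookup itself.
    """
    tags = {}
    for part in txt_record.split(";"):
        if "=" in part:
            key, _, value = part.strip().partition("=")
            tags[key.strip()] = value.strip()
    if tags.get("v") != "DMARC1":
        return False, "not a DMARC record"
    if tags.get("p") not in ("quarantine", "reject"):
        return False, f"policy too weak: p={tags.get('p')}"
    if "rua" not in tags:
        return False, "no aggregate report address (rua) configured"
    return True, "ok"
```

Running this against the live record during the quarterly audit catches the common failure mode of a DMARC policy quietly reverted to `p=none` during a DNS migration.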
AI copilot incident procedures
AI-generated content introduces a category of incidents that did not exist three years ago. Copilots writing subject lines, body copy, product recommendations, or dynamic content blocks can produce outputs that are off-brand, factually wrong, or in violation of regulatory requirements. These incidents require their own protocols.
Incident types
| Type | Description | Severity Default |
|---|---|---|
| Off-brand tone | AI output uses language that does not match brand voice guidelines | Sev 3 |
| Hallucinated content | AI fabricates product features, pricing, or offers that do not exist | Sev 2 |
| Regulatory violation | AI output includes claims that violate FTC, CAN-SPAM, GDPR, or industry-specific rules | Sev 1 |
| Prompt injection | External input manipulates AI output in unexpected ways | Sev 1 |
| Model unavailability | AI API is down, causing content generation to fail or fall back to empty templates | Sev 2 |
Response steps for AI content incidents
- Contain: Pause any active send using the AI-generated content. If the content already shipped, assess the blast radius (how many recipients, which segments, which channels).
- Capture: Log the exact prompt, model, model version, temperature setting, guardrail configuration, and output. Screenshot or export the full API response.
- Classify: Determine severity using the AI incident type table above.
- Communicate: Use the stakeholder templates, adding a section that explains the AI component in plain language for non-technical audiences.
- Correct: Replace AI content with a pre-approved static fallback. If the send already went out, determine whether a correction or retraction is needed.
- Close: Update your prompt registry with the failure case, add it to your guardrail test suite, and schedule a prompt review.
Guardrail checklist
After any AI content incident, verify these guardrails are in place and functioning:
- Human-in-the-loop approval gate exists before AI copy goes to production sends.
- Tone and compliance validators run on every AI output before it enters the send queue.
- Fallback content is configured for every AI-generated block so that model unavailability does not produce blank emails.
- Prompt versions are tracked in version control with change history and reviewers.
- AI QA results are logged with timestamps, model IDs, and pass/fail status.
Post-incident review template
Every Sev 1 and Sev 2 incident gets a formal post-incident review within 72 hours. Sev 3 incidents are reviewed in the next team standup. The review is not about blame — it is about building a better system.
Review agenda (60 minutes)
| Block | Duration | Content |
|---|---|---|
| Timeline reconstruction | 15 min | Walk through the incident from first alert to resolution, minute by minute |
| Root cause analysis | 15 min | Identify the contributing factors (not just the trigger) using the “5 Whys” method |
| Impact assessment | 10 min | Quantify: contacts affected, sends delayed/failed, revenue impact, reputation damage |
| Action items | 15 min | Define specific remediation tasks with owners, deadlines, and success criteria |
| Process improvements | 5 min | Identify changes to runbooks, monitoring, or staffing that would prevent recurrence |
Metrics to capture
- Time to detect (TTD): How long between the incident starting and someone noticing.
- Time to classify (TTC): How long between detection and correct severity assignment.
- Time to mitigate (TTM): How long between classification and containment (e.g., pausing sends).
- Time to resolve (TTR): How long between mitigation and full resolution.
- Blast radius: Number of contacts, sends, or journeys affected.
- Revenue impact: Estimated lost or delayed revenue from failed sends.
- Reputation impact: Change in Sender Score, blocklist additions, or complaint rate.
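Given milestone timestamps from the live incident log, the four duration metrics fall out of simple subtraction. A sketch, assuming you record the five milestones named below:

```python
from datetime import datetime

def incident_metrics(timeline):
    """Derive TTD/TTC/TTM/TTR (in minutes) from an incident timeline.

    `timeline` maps the milestone names "started", "detected", "classified",
    "mitigated", and "resolved" to datetimes, e.g. pulled from the live
    incident log. Milestone names are this sketch's convention.
    """
    def minutes(start, end):
        return (timeline[end] - timeline[start]).total_seconds() / 60
    return {
        "ttd_min": minutes("started", "detected"),     # time to detect
        "ttc_min": minutes("detected", "classified"),  # time to classify
        "ttm_min": minutes("classified", "mitigated"), # time to mitigate
        "ttr_min": minutes("mitigated", "resolved"),   # time to resolve
    }
```

Computing these from logged timestamps, rather than from memory during the review, keeps the retrospective honest about how long each phase really took.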
Post-incident report format
Use this structure for the written report that gets shared with stakeholders:
- Summary: One paragraph describing what happened, when, and the business impact.
- Timeline: Chronological log of events with timestamps.
- Root cause: What broke and why. Include contributing factors.
- Impact: Quantified metrics from the list above.
- Actions taken: What was done to resolve the incident.
- Action items: Table with task, owner, deadline, and status columns.
- Lessons learned: What the team will do differently next time.
Tabletop exercises
An incident response kit that nobody has practiced is just a document. Run tabletop exercises quarterly so your team builds muscle memory for real incidents. The kit includes three ready-to-run scenarios with facilitator notes, sample data, and curveballs.
Exercise format (60 minutes)
- Setup (5 min): Facilitator distributes the scenario card. Participants do not see it in advance.
- Initial response (15 min): Team classifies the incident, assigns roles, and drafts initial communications using the kit.
- Curveball (5 min): Facilitator introduces a complication (e.g., the vendor status page is wrong, a second system fails, legal asks for an immediate disclosure).
- Continued response (15 min): Team adjusts their plan and drafts updated communications.
- Debrief (20 min): Group reviews what went well, what was confusing, and what needs to change in the kit.
Sample scenarios
Scenario A — The silent data feed: Your SFMC data extension that powers the welcome journey has not updated in 6 hours, but no alerts fired because the monitoring only checks for job failures, not data freshness. New signups are entering the journey with stale profile data, receiving incorrect content.
Scenario B — The AI subject line gone wrong: Your AI copilot generated a subject line for a promotional send that contains a pricing claim your company does not actually offer. The send went out to 200,000 contacts two hours ago. Customer support is getting calls.
Scenario C — The cascading API failure: Your Braze-to-Snowflake sync fails, which causes your suppression list to go stale, which causes a re-engagement campaign to send to contacts who unsubscribed yesterday. Legal is asking questions.
Scoring rubric
After each exercise, score the team on these dimensions (1-5 scale):
| Dimension | What “5” Looks Like |
|---|---|
| Speed of classification | Correct severity assigned within 5 minutes |
| Role clarity | Everyone knew their job without discussion |
| Communication quality | Stakeholder messages were clear, accurate, and timely |
| Kit usage | Team used the kit effectively, did not improvise unnecessarily |
| Adaptability | Team handled the curveball without losing composure |
Track scores over time. If a dimension consistently scores below 3, that section of the kit needs revision or the team needs targeted training.
Tooling setup
The kit includes ready-to-import configurations for monitoring, alerting, and incident coordination. Set these up before your first real incident, not during one.
Monitoring dashboards
| Tool | What to Monitor | Alert Threshold |
|---|---|---|
| Datadog / Splunk | API latency, error rates, queue depth | Latency > 2x baseline for 5 min, error rate > 5%, queue depth > 10x normal |
| Google Postmaster Tools | Domain reputation, spam rate, authentication | Reputation drop to “Low” or “Bad”, spam rate > 0.3% |
| Platform-native (SFMC/Braze) | Send volumes, journey progression, bounce rates | Send volume drops > 50% vs. same-day-last-week, bounce rate > 5% |
| Uptime monitor (Pingdom, UptimeRobot) | API endpoints, webhook receivers, AI model endpoints | Any downtime or response time > 5 seconds |
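The thresholds in the table can be evaluated in one place before anything pages. A sketch with illustrative metric and baseline names; wire real values in from your observability tool's API:

```python
def evaluate_alerts(metrics, baseline):
    """Apply the monitoring table's thresholds; return the list of fired alerts.

    `metrics` holds current readings, `baseline` holds normal-operation values.
    Key names are this sketch's convention, not any tool's schema.
    """
    alerts = []
    if metrics["api_latency_ms"] > 2 * baseline["api_latency_ms"]:
        alerts.append("api_latency")          # latency > 2x baseline
    if metrics["error_rate"] > 0.05:
        alerts.append("error_rate")           # error rate > 5%
    if metrics["queue_depth"] > 10 * baseline["queue_depth"]:
        alerts.append("queue_depth")          # queue depth > 10x normal
    if metrics["spam_rate"] > 0.003:
        alerts.append("spam_rate")            # Postmaster spam rate > 0.3%
    if metrics["send_volume"] < 0.5 * baseline["send_volume_same_day_last_week"]:
        alerts.append("send_volume_drop")     # volume drop > 50% vs. same day last week
    if metrics["bounce_rate"] > 0.05:
        alerts.append("bounce_rate")          # bounce rate > 5%
    return alerts
```

Centralizing the thresholds also means the quarterly audit has exactly one place to update when send volumes or architecture change.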
Alerting and paging
- Configure PagerDuty, Opsgenie, or your paging tool with escalation policies that match the severity matrix.
- Sev 1 pages the on-call ops lead immediately and auto-escalates to the director if not acknowledged within 10 minutes.
- Sev 2 pages the on-call ops lead and posts to the incident Slack channel.
- Sev 3 and Sev 4 post to Slack only; no paging.
Incident coordination
- Dedicated Slack channel (e.g., `#incident-marketing-ops`) that stays quiet until an incident is declared.
- Zapier or Make recipe that creates an incident record the moment someone triggers the "panic button" — capturing timestamp, reporter, initial description, and severity.
- Shared Google Doc or Notion page that serves as the live incident log during active response.
- Video bridge link (Zoom, Google Meet) pre-configured and pinned in the incident channel so you never waste time creating one.
Quarterly audit process
Incident response kits decay. People change roles, platforms update their APIs, new automations get built without corresponding runbook entries, and AI models get swapped out. Run this audit every quarter to keep the kit current.
Audit checklist
| Item | Check | Owner |
|---|---|---|
| Stakeholder contact info | All names, phone numbers, Slack handles, and email addresses are current | Ops Lead |
| Escalation paths | Severity-to-person mapping reflects current org chart | Ops Lead |
| Decision trees | All twelve incident types still match your current stack and integrations | Senior CRM Admin |
| Communication templates | Templates reference correct product names, status page URLs, and legal contacts | Content + Legal |
| Pause/resume procedures | Platform-specific steps are accurate for current platform versions | Platform Admin |
| AI guardrail inventory | Prompt registry, model versions, and fallback templates are up to date | AI Ops Lead |
| Monitoring thresholds | Alert thresholds still make sense given current send volumes and architecture | Data Engineering |
| Tabletop exercise log | At least one exercise has been run since the last audit | Ops Lead |
| Post-incident action items | All open action items from previous incidents are resolved or have active owners | Ops Lead |
| Tool access | Everyone on the incident response roster has access to monitoring tools, paging systems, and platform admin | Ops Lead |
Audit output
The audit produces three deliverables:
- Updated kit: All sections revised with current information, pushed to the shared Notion template.
- Gap report: List of items that were out of date, who updated them, and any systemic issues (e.g., “We keep losing track of AI model version changes — we need a notification hook”).
- Next exercise scenario: Based on gaps found in the audit, draft a new tabletop scenario that tests the weakest area.
Scheduling
Block 90 minutes on the calendar at the start of each quarter. Invite the ops lead, senior CRM admin, a compliance representative, and anyone who has been on-call in the past quarter. The audit should feel routine, not burdensome. If it takes more than 90 minutes, that itself is a finding — the kit has grown too complex and needs simplification.
Kit highlights
- Decision tree covering the top 12 incidents (API failures, data delays, Journey Builder stalls, AI copy misfires, deliverability drops, consent failures, and more).
- Pre-written stakeholder comms for execs, RevOps, Compliance, and Support with audience-specific talking points.
- Checklist for pausing and resuming automations with guardrails for unsubscribes and transactional flows.
- Post-incident review template with metrics, contributing factors, and backlog items.
- Tabletop exercise scenarios with facilitator guides, curveballs, and scoring rubrics.
- Quarterly audit checklist so the kit stays current as your stack evolves.
Deployment checklist
- Copy the Notion template and pre-fill stakeholder contact info plus escalation paths.
- Import the monitoring dashboard configurations into Datadog, Splunk, or your observability tool.
- Set up the Zapier or Make incident-capture recipe and test it end-to-end.
- Configure paging escalation policies to match the severity matrix.
- Run the included tabletop exercise with your team within the first two weeks.
- Schedule the quarterly audit on the calendar now — not after the first incident.
- Connect the comms scripts to your status page, Slack channels, or paging tools using the provided recipes.
- After each real incident, complete the postmortem template and push action items into your ops backlog automatically.
How it works
Drop your info below and we’ll email the download link (along with any follow-up resources) straight to you.
To keep this resource exclusive, the download link is only available after you submit the form.