Engage Evolution

Notion Template + Comms Scripts

Marketing Automation Incident Response Kit

Step-by-step playbook for diagnosing SFMC/Braze outages, communicating with stakeholders, and restoring automations without panic.

Why you need an incident response kit

Marketing automation failures do not announce themselves politely. A Journey Builder stall at 2 AM, an API credential rotation that silently breaks your data feed, or an AI copilot generating off-brand copy at scale — these situations demand a structured response, not a scramble through Slack threads trying to figure out who owns what. The cost of improvising is measured in missed sends, compliance exposure, damaged sender reputation, and eroded executive confidence.

This kit is a single command-center document that your team opens the moment something breaks. It contains severity classifications, decision trees for the most common incident types, pre-written stakeholder communications, pause and resume procedures, deliverability recovery steps, AI copilot incident protocols, post-incident review templates, and a quarterly audit process to keep everything current.

The goal is simple: cut triage time from hours to minutes, keep leadership and frontline teams aligned while you fix the issue, and capture learnings so every incident makes your operation stronger.

Incident classification and severity matrix

Not every issue deserves a war room. Misclassifying severity wastes time on low-impact problems and under-responds to critical ones. Use the matrix below as the single source of truth for how your team triages.

| Severity | Definition | Example Incidents | Response Time | Auto-Trigger Actions |
| --- | --- | --- | --- | --- |
| Sev 1 — Critical | Customer data loss, regulatory breach, or complete platform outage | PII exposure in a send, SFMC platform-wide outage, consent data corruption | Immediate (< 15 min) | Page leadership + legal, pause all sends, open incident bridge |
| Sev 2 — High | Critical journeys failing, > 50% of sends stuck, or revenue-impacting degradation | Journey Builder stall affecting welcome series, API failures blocking transactional sends, Braze push queue backed up | < 30 min | Engage engineering bridge, send customer-facing status update, escalate to vendor support |
| Sev 3 — Medium | Degraded performance, single journey affected, or small audience impacted | Delayed data sync for one segment, intermittent API timeouts, AI subject-line latency | < 2 hours | Ops lead + CRM admins notified, monitor via dashboard, log for next standup |
| Sev 4 — Low | Cosmetic issues, minor data delays, or non-customer-facing bugs | Dashboard rendering errors, internal test sends failing, non-critical automation warnings | Next business day | Log in backlog, assign owner, fix during normal sprint |

Each incident card in the kit links to the correct severity play with the exact steps: who to page, what logs to pull, which AI guardrails to disable, and how to restore throttles.

Classification decision flow

When an alert fires or someone reports an issue, walk through these questions in order:

  1. Is customer data exposed or at risk of regulatory violation? If yes, Sev 1.
  2. Are revenue-generating sends (transactional, welcome, cart abandonment) blocked or failing? If yes, Sev 2.
  3. Is the issue isolated to a single journey or a non-revenue segment? If yes, Sev 3.
  4. Is the issue cosmetic or internal-only? If yes, Sev 4.

If you are unsure, escalate up one severity level. It is easier to downgrade than to explain why a Sev 2 sat unattended for four hours.
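
The four questions above reduce to a small first-yes-wins function. A minimal sketch; the function and parameter names are illustrative, not part of the kit:

```python
def classify_severity(
    data_exposed: bool,
    revenue_sends_blocked: bool,
    single_journey_only: bool,
    cosmetic_or_internal: bool,
) -> int:
    """Walk the classification questions in order; the first 'yes' wins."""
    if data_exposed:
        return 1
    if revenue_sends_blocked:
        return 2
    if single_journey_only:
        return 3
    if cosmetic_or_internal:
        return 4
    # No question answered yes: escalate up one level rather than
    # under-respond -- it is easier to downgrade later.
    return 2
```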

Decision trees for top incidents

The kit includes decision trees for the twelve most common incident types; the trees below are a representative sample. Each tree tells you what to check first, what to do if that check fails, and when to escalate. Print these or bookmark them — you will not have time to read documentation from scratch when things break.

API failures (SFMC, Braze, Iterable)

  1. Check platform status page. If vendor confirms outage, skip to stakeholder comms and monitor.
  2. If status page is green, check your API credentials. Rotate or re-authenticate if expired.
  3. Verify IP allowlists and firewall rules have not changed.
  4. Check rate limits. If you are being throttled, reduce batch sizes and implement exponential backoff.
  5. If none of the above, capture error codes and request IDs, then open a vendor support ticket with severity level matching your internal classification.
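
Step 4's exponential backoff can be sketched as a small retry wrapper. The `RuntimeError` stand-in for a rate-limit exception is an assumption; the real exception type depends on your SDK:

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
    """Retry a throttled API call with exponential backoff plus jitter.

    request_fn is any zero-argument callable that raises RuntimeError
    (stand-in for your client's rate-limit error) when throttled.
    """
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error for the ticket
            # Double the wait each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```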

Data sync delays

  1. Check ETL job status in your orchestration tool (Airflow, dbt, Make, or Zapier).
  2. If the job failed, inspect logs for schema drift, permission errors, or timeout issues.
  3. If the job succeeded but data is stale, check the destination platform’s import queue.
  4. Verify webhook endpoints are accepting payloads (test with a manual POST).
  5. If latency exceeds your SLA by more than 2x, escalate to Sev 2 and notify downstream journey owners.
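
Step 5's escalation rule (latency exceeding 2x the SLA) is straightforward to automate as a freshness gate. A minimal sketch, assuming UTC timestamps from your orchestration tool; the function name is illustrative:

```python
from datetime import datetime, timedelta, timezone

def freshness_status(last_updated: datetime, sla: timedelta, now=None) -> str:
    """Map sync staleness to the escalation rule: > 2x SLA means Sev 2."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_updated
    if lag > 2 * sla:
        return "sev2"  # escalate and notify downstream journey owners
    if lag > sla:
        return "warn"  # SLA missed but still within the escalation buffer
    return "ok"
```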

Journey Builder stalls

  1. Open the journey in SFMC or Braze and check for contacts stuck in wait steps or decision splits.
  2. Review entry source status — is the data extension, API event, or segment still populating?
  3. Check for platform-side throttling or send limits that may be gating progression.
  4. If the journey references an external API call (content lookup, AI-generated copy), verify that endpoint is responding.
  5. If contacts are stuck and cannot be un-stuck, clone the journey, re-inject the stuck population, and document the workaround.

AI copilot misfires (off-brand copy, hallucinated content)

  1. Immediately pause any journey or automation using the AI-generated content.
  2. Pull the specific prompt, model version, and output from your prompt registry logs.
  3. Check guardrail coverage — did the QA layer catch the issue? If not, identify the gap.
  4. Replace the AI-generated content with a pre-approved fallback template.
  5. File a guardrail incident report (template included in the kit) and schedule a prompt review within 48 hours.

Deliverability drops

  1. Check your sender reputation dashboard (Sender Score, Google Postmaster Tools, Microsoft SNDS).
  2. Review recent send volumes for unexpected spikes that may have triggered ISP throttling.
  3. Inspect bounce logs for new block rules, authentication failures (SPF, DKIM, DMARC), or list quality issues.
  4. If reputation damage is confirmed, switch to the deliverability recovery protocol in the next section.
  5. Notify stakeholders that send volumes will be temporarily reduced during the warm-up period.

Consent and suppression failures (sends to opted-out contacts)

  1. Treat as Sev 1 immediately. Sending to opted-out contacts is a regulatory and brand risk.
  2. Pause all non-transactional sends.
  3. Audit the suppression sync pipeline end-to-end: source of truth, ETL job, platform-side list.
  4. Identify the exact population that was impacted and prepare a disclosure plan with legal.
  5. After remediation, add a synthetic subscriber test that validates suppression enforcement before every batch send.
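
The synthetic-subscriber test in step 5 can be as simple as planting known opted-out seed addresses in the audience source and gating each batch before it ships. A sketch; all names are illustrative:

```python
def suppression_enforced(batch_recipients, suppression_list, synthetic_seeds):
    """Pre-send gate: if any suppressed address or planted opted-out seed
    survives into the send batch, suppression enforcement is broken.

    Returns (ok, leaked) so the caller can block the send and log offenders.
    """
    blocked = set(suppression_list) | set(synthetic_seeds)
    leaked = set(batch_recipients) & blocked
    return len(leaked) == 0, sorted(leaked)
```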

Stakeholder communication templates

Clear communication during an incident is as important as the technical fix. Use these templates verbatim or adapt them to your org’s tone. The kit includes ready-to-paste versions for Slack, email, and status page updates.

Initial notification (within 15 minutes of classification)

Subject: [Sev X] Marketing Automation Incident — [Short Description]

Body: We identified an issue affecting [specific system or journey]. Current impact: [describe what is broken and who is affected]. The ops team is actively investigating. Next update in [30/60] minutes. No action required from your team at this time. Incident lead: [Name].

Ongoing update (every 30-60 minutes for Sev 1-2)

Subject: [Sev X] Update — [Short Description]

Body: Status: [Investigating / Identified / Mitigating / Resolved]. What we know: [root cause or current hypothesis]. What we have done: [actions taken]. Next steps: [what happens next]. ETA to resolution: [estimate or “unknown, next update at HH:MM”]. Incident lead: [Name].

Resolution notification

Subject: [Sev X] Resolved — [Short Description]

Body: The issue affecting [system/journey] has been resolved as of [timestamp]. Root cause: [brief explanation]. Impact: [number of contacts affected, sends delayed, etc.]. Immediate remediation: [what was done]. A post-incident review is scheduled for [date]. Full report will be shared within [48/72] hours.

Audience-specific talking points

| Audience | What They Care About | Key Message |
| --- | --- | --- |
| Executive leadership | Revenue impact, brand risk, timeline | "We contained the impact to [X], no revenue loss / estimated [Y] impact. Root cause identified, fix deployed." |
| RevOps / Sales | Lead flow, pipeline data integrity | "Lead delivery resumed at [time]. No duplicate or missing records. If you see gaps, flag them by [deadline]." |
| Compliance / Legal | Regulatory exposure, data handling | "No PII exposed. Consent records intact. We are documenting the incident for audit trail purposes." |
| Customer Support | Customer-facing impact, talking points | "Customers may have experienced [delay/missing email]. Use this script if they contact us: [script]." |

Pause and resume procedures

Pausing automations is the single most important containment action, but doing it wrong creates a second incident. Follow these procedures to pause safely and resume without data loss.

Before you pause

  1. Identify which automations must continue running (transactional receipts, password resets, legally required communications).
  2. Document every automation you plan to pause, including the journey name, platform, entry source, and current population in-flight.
  3. Notify downstream teams (Support, Sales) that automated communications will be temporarily halted.

Pause checklist

| Step | SFMC | Braze | Iterable |
| --- | --- | --- | --- |
| Pause journey entry | Set Journey to "Paused" in Journey Builder | Stop Campaign via API or Dashboard | Pause Workflow in Workflows tab |
| Halt triggered sends | Pause Triggered Send Definition | Disable API-triggered Campaign | Pause Triggered Campaign |
| Stop data imports | Pause Automation Studio activities | Disable Segment Sync | Pause ETL connector |
| Disable AI content | Remove dynamic content blocks or switch to static fallback | Disable Liquid personalization referencing AI endpoints | Replace Handlebars helpers calling AI APIs |
| Confirm sends stopped | Check Tracking > Sends for zero activity | Check Message Activity for zero dispatches | Check Campaign Analytics for zero sends |

Resume checklist

  1. Verify the root cause is resolved and validated in a test environment.
  2. Re-enable data imports first and confirm data is flowing correctly for at least 15 minutes.
  3. Send a test to your synthetic subscriber list (internal seed list) to confirm content, personalization, and suppression are working.
  4. Resume journeys in order of business priority: transactional first, then revenue-critical (welcome, cart abandonment), then engagement-focused.
  5. Monitor send rates, bounce rates, and delivery metrics for the first hour. If anything looks abnormal, pause again and re-investigate.
  6. Send a “systems normal” notification to all stakeholders.

Handling in-flight contacts

Contacts who were mid-journey when you paused need special attention. Depending on the platform:

  • SFMC: Contacts in wait steps will resume automatically when the journey is un-paused. Contacts in decision splits may need to be re-evaluated if underlying data changed during the outage.
  • Braze: Canvas users in delay steps will continue on schedule after resume. Check for users who may have exited due to timeout settings.
  • Iterable: Workflow participants will pick up where they left off. Audit the “waiting” count before and after resume to confirm no one was dropped.

Deliverability recovery protocol

If an incident damaged your sender reputation, do not resume full-volume sends immediately. ISPs have long memories and short fuses. Follow this warm-up protocol.

Assessment

  1. Pull Sender Score, Google Postmaster Tools reputation, and Microsoft SNDS data.
  2. Check for new blocklist entries (Spamhaus, Barracuda, SURBL).
  3. Review bounce categorization: are you seeing blocks (5xx) or deferrals (4xx)?
  4. Identify which sending IPs and domains are affected.

Recovery warm-up schedule

| Day | Volume (% of normal) | Audience | Action if bounce rate > 5% |
| --- | --- | --- | --- |
| 1 | 10% | Most engaged 30-day segment | Halt, re-check authentication, contact ISP relations |
| 2 | 20% | Most engaged 60-day segment | Reduce volume by half, investigate specific ISPs blocking |
| 3 | 35% | Engaged 90-day segment | Hold at current level for 24 additional hours |
| 4 | 50% | Engaged 90-day + re-engaged segment | Continue only if bounce rate < 2% |
| 5 | 75% | Full active segment minus disengaged | Monitor complaint rates closely |
| 6-7 | 100% | Full active audience | Resume normal cadence, continue monitoring for 2 weeks |
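
The schedule above can be encoded directly, with the bounce-rate gate applied before each day's send. A sketch assuming one send batch per day; the names are illustrative:

```python
# Day-by-day warm-up schedule: (day, fraction of normal volume).
WARMUP = [(1, 0.10), (2, 0.20), (3, 0.35), (4, 0.50), (5, 0.75), (6, 1.00), (7, 1.00)]

def todays_send_cap(day: int, normal_daily_volume: int,
                    yesterday_bounce_rate: float) -> int:
    """Return today's maximum send volume during warm-up.

    A cap of 0 means halt: re-check authentication before continuing.
    """
    if yesterday_bounce_rate > 0.05:
        return 0  # bounce rate > 5%: stop and investigate per the table
    fraction = dict(WARMUP).get(day, 1.0)  # past day 7, back to full volume
    return int(normal_daily_volume * fraction)
```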

Authentication audit

Run this checklist after any deliverability incident:

  • SPF record includes all sending IPs and services.
  • DKIM keys are valid, not expired, and signing correctly (verify with a test send to mail-tester.com or a similar tool).
  • DMARC policy is set to at least p=quarantine and you are receiving aggregate reports.
  • BIMI record is intact (if applicable).
  • Return-Path domain aligns with your From domain.
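
The DMARC items on the checklist can be validated programmatically once you have the `_dmarc` TXT record in hand (via your DNS provider or a lookup tool). A minimal parser sketch; the function name is an assumption:

```python
def dmarc_policy_ok(txt_record: str) -> bool:
    """Check a DMARC TXT record for at least p=quarantine plus rua reporting.

    DMARC records are semicolon-separated tag=value pairs, e.g.
    "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com".
    """
    tags = dict(
        part.strip().split("=", 1)
        for part in txt_record.split(";")
        if "=" in part
    )
    # Policy must be quarantine or reject, and aggregate reports requested.
    return tags.get("p") in ("quarantine", "reject") and "rua" in tags
```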

AI copilot incident procedures

AI-generated content introduces a category of incidents that did not exist three years ago. Copilots writing subject lines, body copy, product recommendations, or dynamic content blocks can produce outputs that are off-brand, factually wrong, or in violation of regulatory requirements. These incidents require their own protocols.

Incident types

| Type | Description | Severity Default |
| --- | --- | --- |
| Off-brand tone | AI output uses language that does not match brand voice guidelines | Sev 3 |
| Hallucinated content | AI fabricates product features, pricing, or offers that do not exist | Sev 2 |
| Regulatory violation | AI output includes claims that violate FTC, CAN-SPAM, GDPR, or industry-specific rules | Sev 1 |
| Prompt injection | External input manipulates AI output in unexpected ways | Sev 1 |
| Model unavailability | AI API is down, causing content generation to fail or fall back to empty templates | Sev 2 |

Response steps for AI content incidents

  1. Contain: Pause any active send using the AI-generated content. If the content already shipped, assess the blast radius (how many recipients, which segments, which channels).
  2. Capture: Log the exact prompt, model, model version, temperature setting, guardrail configuration, and output. Screenshot or export the full API response.
  3. Classify: Determine severity using the AI incident type table above.
  4. Communicate: Use the stakeholder templates, adding a section that explains the AI component in plain language for non-technical audiences.
  5. Correct: Replace AI content with a pre-approved static fallback. If the send already went out, determine whether a correction or retraction is needed.
  6. Close: Update your prompt registry with the failure case, add it to your guardrail test suite, and schedule a prompt review.

Guardrail checklist

After any AI content incident, verify these guardrails are in place and functioning:

  • Human-in-the-loop approval gate exists before AI copy goes to production sends.
  • Tone and compliance validators run on every AI output before it enters the send queue.
  • Fallback content is configured for every AI-generated block so that model unavailability does not produce blank emails.
  • Prompt versions are tracked in version control with change history and reviewers.
  • AI QA results are logged with timestamps, model IDs, and pass/fail status.
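
A tone/compliance validator can start as simple pattern matching before graduating to model-based checks. A sketch; the banned-phrase patterns below are placeholders, not a real compliance list:

```python
import re

# Placeholder patterns -- replace with your legal team's approved list.
BANNED_CLAIMS = [
    r"\brisk[- ]free\b",
    r"\bguaranteed\b.*\bresults\b",
    r"\b100% off\b",
]

def passes_compliance(ai_output: str):
    """Return (ok, violations) for one AI-generated block before it
    enters the send queue; any hit blocks the content."""
    hits = [p for p in BANNED_CLAIMS if re.search(p, ai_output, re.IGNORECASE)]
    return (not hits, hits)
```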

Post-incident review template

Every Sev 1 and Sev 2 incident gets a formal post-incident review within 72 hours. Sev 3 incidents are reviewed in the next team standup. The review is not about blame — it is about building a better system.

Review agenda (60 minutes)

| Block | Duration | Content |
| --- | --- | --- |
| Timeline reconstruction | 15 min | Walk through the incident from first alert to resolution, minute by minute |
| Root cause analysis | 15 min | Identify the contributing factors (not just the trigger) using the "5 Whys" method |
| Impact assessment | 10 min | Quantify: contacts affected, sends delayed/failed, revenue impact, reputation damage |
| Action items | 15 min | Define specific remediation tasks with owners, deadlines, and success criteria |
| Process improvements | 5 min | Identify changes to runbooks, monitoring, or staffing that would prevent recurrence |

Metrics to capture

  • Time to detect (TTD): How long between the incident starting and someone noticing.
  • Time to classify (TTC): How long between detection and correct severity assignment.
  • Time to mitigate (TTM): How long between classification and containment (e.g., pausing sends).
  • Time to resolve (TTR): How long between mitigation and full resolution.
  • Blast radius: Number of contacts, sends, or journeys affected.
  • Revenue impact: Estimated lost or delayed revenue from failed sends.
  • Reputation impact: Change in Sender Score, blocklist additions, or complaint rate.
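
Given timestamps from the incident log, the four timing metrics fall out directly. A sketch; the function name is illustrative:

```python
from datetime import datetime, timedelta

def incident_metrics(started, detected, classified, mitigated, resolved):
    """Compute TTD/TTC/TTM/TTR as timedeltas from the incident timeline."""
    return {
        "ttd": detected - started,      # time to detect
        "ttc": classified - detected,   # time to classify
        "ttm": mitigated - classified,  # time to mitigate
        "ttr": resolved - mitigated,    # time to resolve
    }
```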

Post-incident report format

Use this structure for the written report that gets shared with stakeholders:

  1. Summary: One paragraph describing what happened, when, and the business impact.
  2. Timeline: Chronological log of events with timestamps.
  3. Root cause: What broke and why. Include contributing factors.
  4. Impact: Quantified metrics from the list above.
  5. Actions taken: What was done to resolve the incident.
  6. Action items: Table with task, owner, deadline, and status columns.
  7. Lessons learned: What the team will do differently next time.

Tabletop exercises

An incident response kit that nobody has practiced is just a document. Run tabletop exercises quarterly so your team builds muscle memory for real incidents. The kit includes three ready-to-run scenarios with facilitator notes, sample data, and curveballs.

Exercise format (60 minutes)

  1. Setup (5 min): Facilitator distributes the scenario card. Participants do not see it in advance.
  2. Initial response (15 min): Team classifies the incident, assigns roles, and drafts initial communications using the kit.
  3. Curveball (5 min): Facilitator introduces a complication (e.g., the vendor status page is wrong, a second system fails, legal asks for an immediate disclosure).
  4. Continued response (15 min): Team adjusts their plan and drafts updated communications.
  5. Debrief (20 min): Group reviews what went well, what was confusing, and what needs to change in the kit.

Sample scenarios

Scenario A — The silent data feed: Your SFMC data extension that powers the welcome journey has not updated in 6 hours, but no alerts fired because the monitoring only checks for job failures, not data freshness. New signups are entering the journey with stale profile data, receiving incorrect content.

Scenario B — The AI subject line gone wrong: Your AI copilot generated a subject line for a promotional send that contains a pricing claim your company does not actually offer. The send went out to 200,000 contacts two hours ago. Customer support is getting calls.

Scenario C — The cascading API failure: Your Braze-to-Snowflake sync fails, which causes your suppression list to go stale, which causes a re-engagement campaign to send to contacts who unsubscribed yesterday. Legal is asking questions.

Scoring rubric

After each exercise, score the team on these dimensions (1-5 scale):

| Dimension | What "5" Looks Like |
| --- | --- |
| Speed of classification | Correct severity assigned within 5 minutes |
| Role clarity | Everyone knew their job without discussion |
| Communication quality | Stakeholder messages were clear, accurate, and timely |
| Kit usage | Team used the kit effectively, did not improvise unnecessarily |
| Adaptability | Team handled the curveball without losing composure |

Track scores over time. If a dimension consistently scores below 3, that section of the kit needs revision or the team needs targeted training.

Tooling setup

The kit includes ready-to-import configurations for monitoring, alerting, and incident coordination. Set these up before your first real incident, not during one.

Monitoring dashboards

| Tool | What to Monitor | Alert Threshold |
| --- | --- | --- |
| Datadog / Splunk | API latency, error rates, queue depth | Latency > 2x baseline for 5 min, error rate > 5%, queue depth > 10x normal |
| Google Postmaster Tools | Domain reputation, spam rate, authentication | Reputation drop to "Low" or "Bad", spam rate > 0.3% |
| Platform-native (SFMC/Braze) | Send volumes, journey progression, bounce rates | Send volume drops > 50% vs. same-day-last-week, bounce rate > 5% |
| Uptime monitor (Pingdom, UptimeRobot) | API endpoints, webhook receivers, AI model endpoints | Any downtime or response time > 5 seconds |
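
A threshold like "latency > 2x baseline for 5 min" needs windowed evaluation so a single spike does not page anyone. A sketch, assuming one latency sample per minute; the names are illustrative:

```python
def latency_alert(samples_ms, baseline_ms, window=5):
    """Fire only when every sample in the trailing window exceeds 2x baseline.

    Requires a full window of breaching samples, so one-off spikes and
    freshly started monitors (too few samples) never page.
    """
    recent = samples_ms[-window:]
    return len(recent) == window and all(s > 2 * baseline_ms for s in recent)
```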

Alerting and paging

  • Configure PagerDuty, Opsgenie, or your paging tool with escalation policies that match the severity matrix.
  • Sev 1 pages the on-call ops lead immediately and auto-escalates to the director if not acknowledged within 10 minutes.
  • Sev 2 pages the on-call ops lead and posts to the incident Slack channel.
  • Sev 3 and Sev 4 post to Slack only; no paging.

Incident coordination

  • Dedicated Slack channel (e.g., #incident-marketing-ops) that stays quiet until an incident is declared.
  • Zapier or Make recipe that creates an incident record the moment someone triggers the “panic button” — capturing timestamp, reporter, initial description, and severity.
  • Shared Google Doc or Notion page that serves as the live incident log during active response.
  • Video bridge link (Zoom, Google Meet) pre-configured and pinned in the incident channel so you never waste time creating one.
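
The panic-button record can be modeled as a small structure whose fields match the capture list above (timestamp, reporter, initial description, severity). A sketch; the class name and field set are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentRecord:
    """The record your Zapier/Make recipe creates when the panic button fires."""
    reporter: str
    description: str
    severity: int  # 1-4, per the severity matrix
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```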

Quarterly audit process

Incident response kits decay. People change roles, platforms update their APIs, new automations get built without corresponding runbook entries, and AI models get swapped out. Run this audit every quarter to keep the kit current.

Audit checklist

| Item | Check | Owner |
| --- | --- | --- |
| Stakeholder contact info | All names, phone numbers, Slack handles, and email addresses are current | Ops Lead |
| Escalation paths | Severity-to-person mapping reflects current org chart | Ops Lead |
| Decision trees | All twelve incident types still match your current stack and integrations | Senior CRM Admin |
| Communication templates | Templates reference correct product names, status page URLs, and legal contacts | Content + Legal |
| Pause/resume procedures | Platform-specific steps are accurate for current platform versions | Platform Admin |
| AI guardrail inventory | Prompt registry, model versions, and fallback templates are up to date | AI Ops Lead |
| Monitoring thresholds | Alert thresholds still make sense given current send volumes and architecture | Data Engineering |
| Tabletop exercise log | At least one exercise has been run since the last audit | Ops Lead |
| Post-incident action items | All open action items from previous incidents are resolved or have active owners | Ops Lead |
| Tool access | Everyone on the incident response roster has access to monitoring tools, paging systems, and platform admin | Ops Lead |

Audit output

The audit produces three deliverables:

  1. Updated kit: All sections revised with current information, pushed to the shared Notion template.
  2. Gap report: List of items that were out of date, who updated them, and any systemic issues (e.g., “We keep losing track of AI model version changes — we need a notification hook”).
  3. Next exercise scenario: Based on gaps found in the audit, draft a new tabletop scenario that tests the weakest area.

Scheduling

Block 90 minutes on the calendar at the start of each quarter. Invite the ops lead, senior CRM admin, a compliance representative, and anyone who has been on-call in the past quarter. The audit should feel routine, not burdensome. If it takes more than 90 minutes, that itself is a finding — the kit has grown too complex and needs simplification.

Kit highlights

  • Decision tree covering the top 12 incidents (API failures, data delays, Journey Builder stalls, AI copy misfires, deliverability drops, consent failures, and more).
  • Pre-written stakeholder comms for execs, RevOps, Compliance, and Support with audience-specific talking points.
  • Checklist for pausing and resuming automations with guardrails for unsubscribes and transactional flows.
  • Post-incident review template with metrics, contributing factors, and backlog items.
  • Tabletop exercise scenarios with facilitator guides, curveballs, and scoring rubrics.
  • Quarterly audit checklist so the kit stays current as your stack evolves.

Deployment checklist

  1. Copy the Notion template and pre-fill stakeholder contact info plus escalation paths.
  2. Import the monitoring dashboard configurations into Datadog, Splunk, or your observability tool.
  3. Set up the Zapier or Make incident-capture recipe and test it end-to-end.
  4. Configure paging escalation policies to match the severity matrix.
  5. Run the included tabletop exercise with your team within the first two weeks.
  6. Schedule the quarterly audit on the calendar now — not after the first incident.
  7. Connect the comms scripts to your status page, Slack channels, or paging tools using the provided recipes.
  8. After each real incident, complete the postmortem template and push action items into your ops backlog automatically.

How it works

Drop your info below and we’ll email the download link (along with any follow-up resources) straight to you.

To keep this lead magnet exclusive, the download link is only available after you submit the form.