Notion Template + Comms Scripts
Marketing Automation Incident Response Kit
Step-by-step playbook for diagnosing SFMC/Braze outages, communicating with stakeholders, and restoring automations without panic.
Why you need an incident response kit
Marketing automation failures do not announce themselves politely. A Journey Builder stall at 2 AM, an API credential rotation that silently breaks your data feed, or an AI copilot generating off-brand copy at scale — these situations demand a structured response, not a scramble through Slack threads trying to figure out who owns what. The cost of improvising is measured in missed sends, compliance exposure, damaged sender reputation, and eroded executive confidence.
This kit is a single command-center document that your team opens the moment something breaks. It contains severity classifications, decision trees for the most common incident types, pre-written stakeholder communications, pause and resume procedures, deliverability recovery steps, AI copilot incident protocols, post-incident review templates, and a quarterly audit process to keep everything current.
The goal is simple: cut triage time from hours to minutes, keep leadership and frontline teams aligned while you fix the issue, and capture learnings so every incident makes your operation stronger.
Incident classification and severity matrix
Not every issue deserves a war room. Misclassifying severity wastes time on low-impact problems and under-responds to critical ones. Use the matrix below as the single source of truth for how your team triages.
| Severity | Definition | Example Incidents | Response Time | Auto-Trigger Actions |
|---|---|---|---|---|
| Sev 1 — Critical | Customer data loss, regulatory breach, or complete platform outage | PII exposure in a send, SFMC platform-wide outage, consent data corruption | Immediate (< 15 min) | Page leadership + legal, pause all sends, open incident bridge |
| Sev 2 — High | Critical journeys failing, > 50% of sends stuck, or revenue-impacting degradation | Journey Builder stall affecting welcome series, API failures blocking transactional sends, Braze push queue backed up | < 30 min | Engage engineering bridge, send customer-facing status update, escalate to vendor support |
| Sev 3 — Medium | Degraded performance, single journey affected, or small audience impacted | Delayed data sync for one segment, intermittent API timeouts, AI subject-line latency | < 2 hours | Ops lead + CRM admins notified, monitor via dashboard, log for next standup |
| Sev 4 — Low | Cosmetic issues, minor data delays, or non-customer-facing bugs | Dashboard rendering errors, internal test sends failing, non-critical automation warnings | Next business day | Log in backlog, assign owner, fix during normal sprint |
Each incident card in the kit links to the correct severity play with the exact steps: who to page, what logs to pull, which AI guardrails to disable, and how to restore throttles.
Classification decision flow
When an alert fires or someone reports an issue, walk through these questions in order:
- Is customer data exposed or at risk of regulatory violation? If yes, Sev 1.
- Are revenue-generating sends (transactional, welcome, cart abandonment) blocked or failing? If yes, Sev 2.
- Is the issue isolated to a single journey or a non-revenue segment? If yes, Sev 3.
- Is the issue cosmetic or internal-only? If yes, Sev 4.
If you are unsure, escalate up one severity level. It is easier to downgrade than to explain why a Sev 2 sat unattended for four hours.
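The four questions above, plus the escalate-when-unsure rule, can be sketched as a small triage helper. The names and structure here are illustrative, not part of the kit itself:

```python
from enum import IntEnum

class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4

def classify(data_at_risk: bool, revenue_sends_blocked: bool,
             single_journey_only: bool, cosmetic_only: bool,
             unsure: bool = False) -> Severity:
    """Walk the decision questions in order; escalate one level when unsure."""
    if data_at_risk:
        return Severity.SEV1  # regulatory/data exposure always wins
    if revenue_sends_blocked:
        sev = Severity.SEV2
    elif single_journey_only:
        sev = Severity.SEV3
    elif cosmetic_only:
        sev = Severity.SEV4
    else:
        # No question answered "yes": default to medium and let the unsure rule apply.
        sev = Severity.SEV3
    if unsure:
        # "Escalate up one severity level" = lower number, floored at Sev 1.
        sev = Severity(max(sev - 1, Severity.SEV1))
    return sev
```

Encoding the flow this way also makes the triage logic testable, so a tabletop exercise can verify the team and the helper agree.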
Decision trees for top incidents
The kit includes decision trees for the twelve most common incident types; the six most frequent are excerpted below. Each tree tells you what to check first, what to do if that check fails, and when to escalate. Print these or bookmark them — you will not have time to read documentation from scratch when things break.
API failures (SFMC, Braze, Iterable)
- Check platform status page. If vendor confirms outage, skip to stakeholder comms and monitor.
- If status page is green, check your API credentials. Rotate or re-authenticate if expired.
- Verify IP allowlists and firewall rules have not changed.
- Check rate limits. If you are being throttled, reduce batch sizes and implement exponential backoff.
- If none of the above, capture error codes and request IDs, then open a vendor support ticket with severity level matching your internal classification.
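The throttling step above usually means retrying with exponential backoff and jitter. A minimal sketch, assuming a generic zero-argument request callable rather than any specific vendor SDK:

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a throttled API call with capped exponential backoff and full jitter.

    `request_fn` is any zero-argument callable that raises on a retryable
    error (e.g. HTTP 429/5xx). Names here are illustrative, not a vendor SDK.
    """
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error for the support ticket
            # Sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Jitter matters here: if every worker retries on the same schedule, you re-create the thundering herd that got you throttled in the first place.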
Data sync delays
- Check ETL job status in your orchestration tool (Airflow, dbt, Make, or Zapier).
- If the job failed, inspect logs for schema drift, permission errors, or timeout issues.
- If the job succeeded but data is stale, check the destination platform’s import queue.
- Verify webhook endpoints are accepting payloads (test with a manual POST).
- If latency exceeds your SLA by more than 2x, escalate to Sev 2 and notify downstream journey owners.
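The 2x-SLA escalation rule in the last step can be encoded as a freshness check, assuming you can read a last-updated timestamp from the destination platform:

```python
from datetime import datetime, timedelta, timezone

def freshness_status(last_updated, sla, now=None):
    """Compare sync lag against the SLA; > 2x SLA maps to the Sev 2 escalation rule."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_updated
    if lag <= sla:
        return "ok"
    if lag <= 2 * sla:
        return "degraded"   # within tolerance: log and monitor (Sev 3)
    return "escalate"       # > 2x SLA: raise to Sev 2, notify journey owners
```

Note this catches the "job succeeded but data is stale" case that pure job-failure alerting misses — the same gap Scenario A in the tabletop exercises is built around.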
Journey Builder stalls
- Open the journey in SFMC or Braze and check for contacts stuck in wait steps or decision splits.
- Review entry source status — is the data extension, API event, or segment still populating?
- Check for platform-side throttling or send limits that may be gating progression.
- If the journey references an external API call (content lookup, AI-generated copy), verify that endpoint is responding.
- If contacts are stuck and cannot be un-stuck, clone the journey, re-inject the stuck population, and document the workaround.
AI copilot misfires (off-brand copy, hallucinated content)
- Immediately pause any journey or automation using the AI-generated content.
- Pull the specific prompt, model version, and output from your prompt registry logs.
- Check guardrail coverage — did the QA layer catch the issue? If not, identify the gap.
- Replace the AI-generated content with a pre-approved fallback template.
- File a guardrail incident report (template included in the kit) and schedule a prompt review within 48 hours.
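The contain-and-fallback steps assume every AI block ships with a pre-approved static fallback. A sketch of that gate, where the validator callables and fallback mapping are hypothetical stand-ins for your own guardrail layer:

```python
def render_content(block_id, ai_output, fallbacks, validators):
    """Gate AI copy behind validators; fall back to pre-approved static copy.

    `fallbacks` maps block IDs to approved templates; `validators` are
    callables returning (passed, reason). All names are illustrative.
    """
    fallback = fallbacks[block_id]  # every AI block must have a fallback configured
    if ai_output is None:
        return fallback             # model unavailable: never ship a blank email
    for validate in validators:
        passed, reason = validate(ai_output)
        if not passed:
            # In production, log `reason` to the guardrail incident report here.
            return fallback
    return ai_output
```

The key design choice is that the fallback lookup happens unconditionally up front, so a missing fallback fails loudly at render time instead of silently sending an empty block.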
Deliverability drops
- Check your sender reputation dashboard (Sender Score, Google Postmaster Tools, Microsoft SNDS).
- Review recent send volumes for unexpected spikes that may have triggered ISP throttling.
- Inspect bounce logs for new block rules, authentication failures (SPF, DKIM, DMARC), or list quality issues.
- If reputation damage is confirmed, switch to the deliverability recovery protocol in the next section.
- Notify stakeholders that send volumes will be temporarily reduced during the warm-up period.
Consent or suppression failures
- Treat as Sev 1 immediately. Sending to opted-out contacts is a regulatory and brand risk.
- Pause all non-transactional sends.
- Audit the suppression sync pipeline end-to-end: source of truth, ETL job, platform-side list.
- Identify the exact population that was impacted and prepare a disclosure plan with legal.
- After remediation, add a synthetic subscriber test that validates suppression enforcement before every batch send.
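The synthetic subscriber test in the last step might look like this sketch, assuming you seed known opt-out addresses into your source of truth and can read the platform-side suppression list before each batch:

```python
def suppression_gate(batch_recipients, platform_suppression_list, synthetic_optouts):
    """Block a batch send unless every synthetic opted-out subscriber is suppressed.

    `synthetic_optouts` is a set of seeded opt-out addresses. If any appear in
    the outgoing batch, or are missing from the platform's suppression list,
    the sync pipeline is broken and the send must not proceed.
    """
    leaked = synthetic_optouts & set(batch_recipients)
    not_synced = synthetic_optouts - set(platform_suppression_list)
    if leaked or not_synced:
        raise RuntimeError(
            f"Suppression check failed: leaked={sorted(leaked)}, "
            f"missing from platform list={sorted(not_synced)}"
        )
    return True
```

Because the seeded addresses are yours, this check validates the entire pipeline end-to-end without ever risking a real opted-out contact.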
Stakeholder communication templates
Clear communication during an incident is as important as the technical fix. Use these templates verbatim or adapt them to your org’s tone. The kit includes ready-to-paste versions for Slack, email, and status page updates.
Initial notification (within 15 minutes of classification)
Subject: [Sev X] Marketing Automation Incident — [Short Description]
Body: We identified an issue affecting [specific system or journey]. Current impact: [describe what is broken and who is affected]. The ops team is actively investigating. Next update in [30/60] minutes. No action required from your team at this time. Incident lead: [Name].
Ongoing update (every 30-60 minutes for Sev 1-2)
Subject: [Sev X] Update — [Short Description]
Body: Status: [Investigating / Identified / Mitigating / Resolved]. What we know: [root cause or current hypothesis]. What we have done: [actions taken]. Next steps: [what happens next]. ETA to resolution: [estimate or “unknown, next update at HH:MM”]. Incident lead: [Name].
Resolution notification
Subject: [Sev X] Resolved — [Short Description]
Body: The issue affecting [system/journey] has been resolved as of [timestamp]. Root cause: [brief explanation]. Impact: [number of contacts affected, sends delayed, etc.]. Immediate remediation: [what was done]. A post-incident review is scheduled for [date]. Full report will be shared within [48/72] hours.
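If you post these templates programmatically (for example, from the incident-capture recipe into Slack), a small fill helper keeps wording consistent under pressure. The field names below are placeholders matching the bracketed slots above, not part of any vendor API:

```python
from string import Template

INITIAL = Template(
    "Subject: [Sev $sev] Marketing Automation Incident — $summary\n"
    "Body: We identified an issue affecting $system. Current impact: $impact. "
    "The ops team is actively investigating. Next update in $next_update minutes. "
    "No action required from your team at this time. Incident lead: $lead."
)

def initial_notification(sev, summary, system, impact, next_update, lead):
    """Fill the initial-notification template; substitute() raises on any missing field."""
    return INITIAL.substitute(sev=sev, summary=summary, system=system,
                              impact=impact, next_update=next_update, lead=lead)
```

Using `substitute` rather than `safe_substitute` is deliberate: a missing field should fail loudly before the message ships, not leave a literal `$impact` in a stakeholder update.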
Audience-specific talking points
| Audience | What They Care About | Key Message |
|---|---|---|
| Executive leadership | Revenue impact, brand risk, timeline | "We contained the impact to [X], no revenue loss / estimated [Y] impact. Root cause identified, fix deployed." |
| RevOps / Sales | Lead flow, pipeline data integrity | "Lead delivery resumed at [time]. No duplicate or missing records. If you see gaps, flag them by [deadline]." |
| Compliance / Legal | Regulatory exposure, data handling | "No PII exposed. Consent records intact. We are documenting the incident for audit trail purposes." |
| Customer Support | Customer-facing impact, talking points | "Customers may have experienced [delay/missing email]. Use this script if they contact us: [script]." |
Pause and resume procedures
Pausing automations is the single most important containment action, but doing it wrong creates a second incident. Follow these procedures to pause safely and resume without data loss.
Before you pause
- Identify which automations must continue running (transactional receipts, password resets, legally required communications).
- Document every automation you plan to pause, including the journey name, platform, entry source, and current population in-flight.
- Notify downstream teams (Support, Sales) that automated communications will be temporarily halted.
Pause checklist
| Step | SFMC | Braze | Iterable |
|---|---|---|---|
| Pause journey entry | Set Journey to “Paused” in Journey Builder | Stop Campaign via API or Dashboard | Pause Workflow in Workflows tab |
| Halt triggered sends | Pause Triggered Send Definition | Disable API-triggered Campaign | Pause Triggered Campaign |
| Stop data imports | Pause Automation Studio activities | Disable Segment Sync | Pause ETL connector |
| Disable AI content | Remove dynamic content blocks or switch to static fallback | Disable Liquid personalization referencing AI endpoints | Replace Handlebars helpers calling AI APIs |
| Confirm sends stopped | Check Tracking > Sends for zero activity | Check Message Activity for zero dispatches | Check Campaign Analytics for zero sends |
Resume checklist
- Verify the root cause is resolved and validated in a test environment.
- Re-enable data imports first and confirm data is flowing correctly for at least 15 minutes.
- Send a test to your synthetic subscriber list (internal seed list) to confirm content, personalization, and suppression are working.
- Resume journeys in order of business priority: transactional first, then revenue-critical (welcome, cart abandonment), then engagement-focused.
- Monitor send rates, bounce rates, and delivery metrics for the first hour. If anything looks abnormal, pause again and re-investigate.
- Send a “systems normal” notification to all stakeholders.
Handling in-flight contacts
Contacts who were mid-journey when you paused need special attention. Depending on the platform:
- SFMC: Contacts in wait steps will resume automatically when the journey is un-paused. Contacts in decision splits may need to be re-evaluated if underlying data changed during the outage.
- Braze: Canvas users in delay steps will continue on schedule after resume. Check for users who may have exited due to timeout settings.
- Iterable: Workflow participants will pick up where they left off. Audit the “waiting” count before and after resume to confirm no one was dropped.
Deliverability recovery protocol
If an incident damaged your sender reputation, do not resume full-volume sends immediately. ISPs have long memories and short fuses. Follow this warm-up protocol.
Assessment
- Pull Sender Score, Google Postmaster Tools reputation, and Microsoft SNDS data.
- Check for new blocklist entries (Spamhaus, Barracuda, SURBL).
- Review bounce categorization: are you seeing blocks (5xx) or deferrals (4xx)?
- Identify which sending IPs and domains are affected.
Recovery warm-up schedule
| Day | Volume (% of normal) | Audience | Action if bounce rate > 5% |
|---|---|---|---|
| 1 | 10% | Most engaged 30-day segment | Halt, re-check authentication, contact ISP relations |
| 2 | 20% | Most engaged 60-day segment | Reduce volume by half, investigate specific ISPs blocking |
| 3 | 35% | Engaged 90-day segment | Hold at current level for 24 additional hours |
| 4 | 50% | Engaged 90-day + re-engaged segment | Continue only if bounce rate < 2% |
| 5 | 75% | Full active segment minus disengaged | Monitor complaint rates closely |
| 6-7 | 100% | Full active audience | Resume normal cadence, continue monitoring for 2 weeks |
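The schedule above translates directly into a daily volume cap plus the halt rule from the last column. A sketch where the percentages mirror the table and the function names are illustrative:

```python
# (day, fraction of normal volume) pairs mirroring the warm-up table.
WARMUP = [(1, 0.10), (2, 0.20), (3, 0.35), (4, 0.50), (5, 0.75), (6, 1.00), (7, 1.00)]

def daily_send_cap(day, normal_daily_volume):
    """Look up the warm-up fraction for a recovery day and cap the send volume."""
    for d, fraction in WARMUP:
        if day == d:
            return round(normal_daily_volume * fraction)
    return normal_daily_volume  # past day 7: resume normal cadence

def should_halt(bounce_rate, threshold=0.05):
    """Matches the table's 'bounce rate > 5%' halt condition."""
    return bounce_rate > threshold
```

Wiring this cap into your send orchestration (rather than enforcing it by hand) removes the temptation to "just send the rest" on day 3.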
Authentication audit
Run this checklist after any deliverability incident:
- SPF record includes all sending IPs and services.
- DKIM keys are valid, not expired, and signing correctly (verify with a test send to mail-tester.com or a similar tool).
- DMARC policy is set to at least `p=quarantine` and you are receiving aggregate reports.
- BIMI record is intact (if applicable).
- Return-Path domain aligns with your From domain.
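The DMARC line of the checklist can be verified mechanically once you have fetched the TXT record published at `_dmarc.<yourdomain>` (for example with `dig TXT`). A minimal parser sketch:

```python
def check_dmarc(txt_record):
    """Validate a DMARC TXT record: policy at least quarantine, aggregate reports on.

    Returns (passed, reason). Pass in the raw TXT record string fetched from
    DNS; this function does not perform the lookup itself.
    """
    tags = {}
    for part in txt_record.split(";"):
        if "=" in part:
            key, _, value = part.strip().partition("=")
            tags[key.strip()] = value.strip()
    if tags.get("v") != "DMARC1":
        return False, "not a DMARC record"
    if tags.get("p") not in ("quarantine", "reject"):
        return False, f"policy too weak: p={tags.get('p')}"
    if "rua" not in tags:
        return False, "no aggregate report address (rua) configured"
    return True, "ok"
```

Running this against the live record during the quarterly audit catches the common failure mode of a DMARC policy quietly reverted to `p=none` during a DNS migration.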
AI copilot incident procedures
AI-generated content introduces a category of incidents that did not exist three years ago. Copilots writing subject lines, body copy, product recommendations, or dynamic content blocks can produce outputs that are off-brand, factually wrong, or in violation of regulatory requirements. These incidents require their own protocols.
Incident types
| Type | Description | Severity Default |
|---|---|---|
| Off-brand tone | AI output uses language that does not match brand voice guidelines | Sev 3 |
| Hallucinated content | AI fabricates product features, pricing, or offers that do not exist | Sev 2 |
| Regulatory violation | AI output includes claims that violate FTC, CAN-SPAM, GDPR, or industry-specific rules | Sev 1 |
| Prompt injection | External input manipulates AI output in unexpected ways | Sev 1 |
| Model unavailability | AI API is down, causing content generation to fail or fall back to empty templates | Sev 2 |
Response steps for AI content incidents
- Contain: Pause any active send using the AI-generated content. If the content already shipped, assess the blast radius (how many recipients, which segments, which channels).
- Capture: Log the exact prompt, model, model version, temperature setting, guardrail configuration, and output. Screenshot or export the full API response.
- Classify: Determine severity using the AI incident type table above.
- Communicate: Use the stakeholder templates, adding a section that explains the AI component in plain language for non-technical audiences.
- Correct: Replace AI content with a pre-approved static fallback. If the send already went out, determine whether a correction or retraction is needed.
- Close: Update your prompt registry with the failure case, add it to your guardrail test suite, and schedule a prompt review.
Guardrail checklist
After any AI content incident, verify these guardrails are in place and functioning:
- Human-in-the-loop approval gate exists before AI copy goes to production sends.
- Tone and compliance validators run on every AI output before it enters the send queue.
- Fallback content is configured for every AI-generated block so that model unavailability does not produce blank emails.
- Prompt versions are tracked in version control with change history and reviewers.
- AI QA results are logged with timestamps, model IDs, and pass/fail status.
Post-incident review template
Every Sev 1 and Sev 2 incident gets a formal post-incident review within 72 hours. Sev 3 incidents are reviewed in the next team standup. The review is not about blame — it is about building a better system.
Review agenda (60 minutes)
| Block | Duration | Content |
|---|---|---|
| Timeline reconstruction | 15 min | Walk through the incident from first alert to resolution, minute by minute |
| Root cause analysis | 15 min | Identify the contributing factors (not just the trigger) using the “5 Whys” method |
| Impact assessment | 10 min | Quantify: contacts affected, sends delayed/failed, revenue impact, reputation damage |
| Action items | 15 min | Define specific remediation tasks with owners, deadlines, and success criteria |
| Process improvements | 5 min | Identify changes to runbooks, monitoring, or staffing that would prevent recurrence |
Metrics to capture
- Time to detect (TTD): How long between the incident starting and someone noticing.
- Time to classify (TTC): How long between detection and correct severity assignment.
- Time to mitigate (TTM): How long between classification and containment (e.g., pausing sends).
- Time to resolve (TTR): How long between mitigation and full resolution.
- Blast radius: Number of contacts, sends, or journeys affected.
- Revenue impact: Estimated lost or delayed revenue from failed sends.
- Reputation impact: Change in Sender Score, blocklist additions, or complaint rate.
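Given milestone timestamps from the live incident log, the four duration metrics fall out of simple subtraction. A sketch, assuming you record the five milestones named below:

```python
from datetime import datetime

def incident_metrics(timeline):
    """Derive TTD/TTC/TTM/TTR (in minutes) from an incident timeline.

    `timeline` maps the milestone names "started", "detected", "classified",
    "mitigated", and "resolved" to datetimes, e.g. pulled from the live
    incident log. Milestone names are this sketch's convention.
    """
    def minutes(start, end):
        return (timeline[end] - timeline[start]).total_seconds() / 60
    return {
        "ttd_min": minutes("started", "detected"),     # time to detect
        "ttc_min": minutes("detected", "classified"),  # time to classify
        "ttm_min": minutes("classified", "mitigated"), # time to mitigate
        "ttr_min": minutes("mitigated", "resolved"),   # time to resolve
    }
```

Computing these from logged timestamps, rather than from memory during the review, keeps the retrospective honest about how long each phase really took.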
Post-incident report format
Use this structure for the written report that gets shared with stakeholders:
- Summary: One paragraph describing what happened, when, and the business impact.
- Timeline: Chronological log of events with timestamps.
- Root cause: What broke and why. Include contributing factors.
- Impact: Quantified metrics from the list above.
- Actions taken: What was done to resolve the incident.
- Action items: Table with task, owner, deadline, and status columns.
- Lessons learned: What the team will do differently next time.
Tabletop exercises
An incident response kit that nobody has practiced is just a document. Run tabletop exercises quarterly so your team builds muscle memory for real incidents. The kit includes three ready-to-run scenarios with facilitator notes, sample data, and curveballs.
Exercise format (60 minutes)
- Setup (5 min): Facilitator distributes the scenario card. Participants do not see it in advance.
- Initial response (15 min): Team classifies the incident, assigns roles, and drafts initial communications using the kit.
- Curveball (5 min): Facilitator introduces a complication (e.g., the vendor status page is wrong, a second system fails, legal asks for an immediate disclosure).
- Continued response (15 min): Team adjusts their plan and drafts updated communications.
- Debrief (20 min): Group reviews what went well, what was confusing, and what needs to change in the kit.
Sample scenarios
Scenario A — The silent data feed: Your SFMC data extension that powers the welcome journey has not updated in 6 hours, but no alerts fired because the monitoring only checks for job failures, not data freshness. New signups are entering the journey with stale profile data, receiving incorrect content.
Scenario B — The AI subject line gone wrong: Your AI copilot generated a subject line for a promotional send that contains a pricing claim your company does not actually offer. The send went out to 200,000 contacts two hours ago. Customer support is getting calls.
Scenario C — The cascading API failure: Your Braze-to-Snowflake sync fails, which causes your suppression list to go stale, which causes a re-engagement campaign to send to contacts who unsubscribed yesterday. Legal is asking questions.
Scoring rubric
After each exercise, score the team on these dimensions (1-5 scale):
| Dimension | What “5” Looks Like |
|---|---|
| Speed of classification | Correct severity assigned within 5 minutes |
| Role clarity | Everyone knew their job without discussion |
| Communication quality | Stakeholder messages were clear, accurate, and timely |
| Kit usage | Team used the kit effectively, did not improvise unnecessarily |
| Adaptability | Team handled the curveball without losing composure |
Track scores over time. If a dimension consistently scores below 3, that section of the kit needs revision or the team needs targeted training.
Tooling setup
The kit includes ready-to-import configurations for monitoring, alerting, and incident coordination. Set these up before your first real incident, not during one.
Monitoring dashboards
| Tool | What to Monitor | Alert Threshold |
|---|---|---|
| Datadog / Splunk | API latency, error rates, queue depth | Latency > 2x baseline for 5 min, error rate > 5%, queue depth > 10x normal |
| Google Postmaster Tools | Domain reputation, spam rate, authentication | Reputation drop to “Low” or “Bad”, spam rate > 0.3% |
| Platform-native (SFMC/Braze) | Send volumes, journey progression, bounce rates | Send volume drops > 50% vs. same-day-last-week, bounce rate > 5% |
| Uptime monitor (Pingdom, UptimeRobot) | API endpoints, webhook receivers, AI model endpoints | Any downtime or response time > 5 seconds |
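The thresholds in the table can be evaluated in one place before anything pages. A sketch with illustrative metric and baseline names; wire real values in from your observability tool's API:

```python
def evaluate_alerts(metrics, baseline):
    """Apply the monitoring table's thresholds; return the list of fired alerts.

    `metrics` holds current readings, `baseline` holds normal-operation values.
    Key names are this sketch's convention, not any tool's schema.
    """
    alerts = []
    if metrics["api_latency_ms"] > 2 * baseline["api_latency_ms"]:
        alerts.append("api_latency")          # latency > 2x baseline
    if metrics["error_rate"] > 0.05:
        alerts.append("error_rate")           # error rate > 5%
    if metrics["queue_depth"] > 10 * baseline["queue_depth"]:
        alerts.append("queue_depth")          # queue depth > 10x normal
    if metrics["spam_rate"] > 0.003:
        alerts.append("spam_rate")            # Postmaster spam rate > 0.3%
    if metrics["send_volume"] < 0.5 * baseline["send_volume_same_day_last_week"]:
        alerts.append("send_volume_drop")     # volume drop > 50% vs. same day last week
    if metrics["bounce_rate"] > 0.05:
        alerts.append("bounce_rate")          # bounce rate > 5%
    return alerts
```

Centralizing the thresholds also means the quarterly audit has exactly one place to update when send volumes or architecture change.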
Alerting and paging
- Configure PagerDuty, Opsgenie, or your paging tool with escalation policies that match the severity matrix.
- Sev 1 pages the on-call ops lead immediately and auto-escalates to the director if not acknowledged within 10 minutes.
- Sev 2 pages the on-call ops lead and posts to the incident Slack channel.
- Sev 3 and Sev 4 post to Slack only; no paging.
Incident coordination
- Dedicated Slack channel (e.g., `#incident-marketing-ops`) that stays quiet until an incident is declared.
- Zapier or Make recipe that creates an incident record the moment someone triggers the "panic button" — capturing timestamp, reporter, initial description, and severity.
- Shared Google Doc or Notion page that serves as the live incident log during active response.
- Video bridge link (Zoom, Google Meet) pre-configured and pinned in the incident channel so you never waste time creating one.
Quarterly audit process
Incident response kits decay. People change roles, platforms update their APIs, new automations get built without corresponding runbook entries, and AI models get swapped out. Run this audit every quarter to keep the kit current.
Audit checklist
| Item | Check | Owner |
|---|---|---|
| Stakeholder contact info | All names, phone numbers, Slack handles, and email addresses are current | Ops Lead |
| Escalation paths | Severity-to-person mapping reflects current org chart | Ops Lead |
| Decision trees | All twelve incident types still match your current stack and integrations | Senior CRM Admin |
| Communication templates | Templates reference correct product names, status page URLs, and legal contacts | Content + Legal |
| Pause/resume procedures | Platform-specific steps are accurate for current platform versions | Platform Admin |
| AI guardrail inventory | Prompt registry, model versions, and fallback templates are up to date | AI Ops Lead |
| Monitoring thresholds | Alert thresholds still make sense given current send volumes and architecture | Data Engineering |
| Tabletop exercise log | At least one exercise has been run since the last audit | Ops Lead |
| Post-incident action items | All open action items from previous incidents are resolved or have active owners | Ops Lead |
| Tool access | Everyone on the incident response roster has access to monitoring tools, paging systems, and platform admin | Ops Lead |
Audit output
The audit produces three deliverables:
- Updated kit: All sections revised with current information, pushed to the shared Notion template.
- Gap report: List of items that were out of date, who updated them, and any systemic issues (e.g., “We keep losing track of AI model version changes — we need a notification hook”).
- Next exercise scenario: Based on gaps found in the audit, draft a new tabletop scenario that tests the weakest area.
Scheduling
Block 90 minutes on the calendar at the start of each quarter. Invite the ops lead, senior CRM admin, a compliance representative, and anyone who has been on-call in the past quarter. The audit should feel routine, not burdensome. If it takes more than 90 minutes, that itself is a finding — the kit has grown too complex and needs simplification.
Kit highlights
- Decision tree covering the top 12 incidents (API failures, data delays, Journey Builder stalls, AI copy misfires, deliverability drops, consent failures, and more).
- Pre-written stakeholder comms for execs, RevOps, Compliance, and Support with audience-specific talking points.
- Checklist for pausing and resuming automations with guardrails for unsubscribes and transactional flows.
- Post-incident review template with metrics, contributing factors, and backlog items.
- Tabletop exercise scenarios with facilitator guides, curveballs, and scoring rubrics.
- Quarterly audit checklist so the kit stays current as your stack evolves.
Deployment checklist
- Copy the Notion template and pre-fill stakeholder contact info plus escalation paths.
- Import the monitoring dashboard configurations into Datadog, Splunk, or your observability tool.
- Set up the Zapier or Make incident-capture recipe and test it end-to-end.
- Configure paging escalation policies to match the severity matrix.
- Run the included tabletop exercise with your team within the first two weeks.
- Schedule the quarterly audit on the calendar now — not after the first incident.
- Connect the comms scripts to your status page, Slack channels, or paging tools using the provided recipes.
- After each real incident, complete the postmortem template and push action items into your ops backlog automatically.
How it works
Drop your info below and we’ll email the download link (along with any follow-up resources) straight to you.
To keep this resource exclusive, the download link is only available after you submit the form.