Email Security: phishing, BEC, and the DMARC forwarder

Email Security deep-dive for Cloudflare One: MX inline vs API journaling, the DMARC forwarder/subdomain trap, homoglyph FP calibration, user-report → retract under 1h.


TL;DR

Email Security (Cloudflare Email Security, formerly Area 1, acquired 2022) blocks phishing, BEC, impersonation, and malware-bearing email. According to the Verizon 2024 DBIR, 68% of breaches involve a human element, most of it via email. The FBI IC3 2023 report puts BEC losses at $2.95B for the year, roughly $135K per complaint.

This post is written post-mortem style — pitfalls the CF docs do not mention:

  • The DMARC forwarder trap: vendor auto-forwarding mail → SPF alignment fails → legitimate email gets quarantined, and the rollout stalls at pct=10 for 12 months. How to handle it.
  • Subdomain alignment: mail.company.com (Mailchimp) vs company.com (corporate) — alignment failures drop legitimate marketing email.
  • Homoglyph detection FP calibration: catches 60% of BEC attempts but with 8% FP on legitimate lookalike domains. How to tune the threshold.
  • Retract window of 30 days: a new IoC is published 3 days after the initial miss → you can still retract. Ignoring this leaves a 3-day leak window.
  • Credential compromise hunt: post-phish, query Access logs for logins made with the stolen creds. Incident response in depth.

This closes a 20-part series. The final section has a reflection and an overall recipe.


Who this is for

  • Security engineers who own the email stack, evaluating vendors (CF Email Security vs Proofpoint vs Mimecast vs Microsoft Defender for O365).
  • Microsoft 365 / Google Workspace admins setting up DMARC enforcement.
  • CISOs post-BEC incident, looking for a playbook to prevent recurrence.

Prerequisites:


Email — the #1 threat vector, with numbers

Opening with numbers inline (not reference-only):

  • Verizon 2024 DBIR: 68% of breaches involve a human element; phishing is the top initial-access vector (ahead of CVE exploitation).
  • FBI IC3 2023 Annual Report: $2.95B in BEC losses, 21,832 complaints. Average $135K per complaint in 2023, up from $120K in 2022.
  • Microsoft Digital Defense Report 2024: 156,000 BEC attempts per day observed across the Microsoft email platform (scaled).
  • Proofpoint State of the Phish 2024: 71% of orgs experienced a successful phishing attack in the past year.

These are public reports, cited inline for credibility — not buried in a reference list at the bottom.

Implication: email security is not an optional layer. If your org has nothing beyond the native M365 / Google Workspace built-in filter, this is your largest exposure.


The four threat vectors — with FP calibration

Figure: four email threat vectors with their detection signals

1. Commodity phishing

  • Mass template, known campaign.
  • Block rate: 95-99% automated. The 1-5% that get through are novel variants.
  • FP rate: under 1%. Low ambiguity, easy to block.

2. BEC (Business Email Compromise)

  • Targeted, often no link or attachment. Text-only “CEO, please wire $50K to a vendor”.
  • Detection: NLP patterns (urgency, financial action request, display-name anomaly) + first-time-sender + DMARC alignment failures.
  • Calibration needed: “urgency + money + request” fires on legitimate finance-team email. FP rate 3-7% early.
  • Tuning: whitelist known vendor patterns (e.g., recurring “invoice from accounting@vendor.com”) → reduce FP.
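
The signals above can be combined into a toy score to see where the FP pressure comes from. This is only a sketch: the keyword lists, the equal weights, and the `bec_score` function are illustrative assumptions, not how the Cloudflare engine works.

```python
import re

# Hypothetical keyword lists and weights, chosen only for illustration.
URGENCY = re.compile(r"\b(urgent|immediately|asap|today)\b", re.I)
FINANCIAL = re.compile(r"\b(wire|transfer|invoice|payment|bank)\b", re.I)


def bec_score(body: str, first_time_sender: bool, dmarc_aligned: bool,
              whitelisted_sender: bool = False) -> int:
    """Return a 0-100 risk score from a few of the signals listed above."""
    if whitelisted_sender:      # known vendor pattern: skip scoring entirely
        return 0
    score = 0
    if URGENCY.search(body):
        score += 25
    if FINANCIAL.search(body):
        score += 25
    if first_time_sender:
        score += 25
    if not dmarc_aligned:
        score += 25
    return score


if __name__ == "__main__":
    msg = "Please wire $50K to the vendor today, this is urgent."
    print(bec_score(msg, first_time_sender=True, dmarc_aligned=False))   # 100
    print(bec_score(msg, first_time_sender=False, dmarc_aligned=True))   # 50
    print(bec_score(msg, False, True, whitelisted_sender=True))          # 0
```

The second call is the finance-team case: a known, aligned sender still scores 50 on wording alone, which is the false-positive pressure that whitelisting relieves.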

3. Impersonation (homoglyph / display-name)

  • comp4ny.com vs company.com — Levenshtein distance 1.
  • “John Smith CEO” as a display name from random@gmail.com.

FP calibration — this is where the tool actually becomes painful:

Threshold Levenshtein ≤ 2 catches:

  • True positive: 60% of BEC attempts.
  • False positive: 8% of legitimate email from similarly-named legitimate partners.

Real FP examples:

  • marketing@company-inc.com vs marketing@company.com (subsidiary).
  • support@apple.com vs support@app1e.com (phish) — OK, caught.
  • no-reply@github.com vs no-reply@gitlab.com (both legit tools): Levenshtein distance 2, so it sits inside the threshold and generates weekly alerts because both are recent senders.

Tuning approach: whitelist after two FPs from the same legitimate domain within 30 days. Review the whitelist quarterly for stale entries.
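
A minimal sketch of the threshold check, assuming plain Levenshtein distance on the sender domain. The function names, the `protected` set, and the `whitelist` set are hypothetical; a production engine would also handle Unicode homoglyphs and display names, which this does not.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_lookalike(sender_domain: str, protected: set[str],
                 whitelist: set[str], threshold: int = 2) -> bool:
    """Flag domains within `threshold` edits of a protected domain."""
    sender_domain = sender_domain.lower()
    if sender_domain in protected or sender_domain in whitelist:
        return False
    return any(levenshtein(sender_domain, p) <= threshold for p in protected)


if __name__ == "__main__":
    protected = {"company.com", "apple.com"}
    print(levenshtein("comp4ny.com", "company.com"))        # 1
    print(is_lookalike("app1e.com", protected, set()))      # True
    print(is_lookalike("company.com", protected, set()))    # False (exact match)
```

Adding a domain to `whitelist` after repeated false positives is the tuning step described above; lowering `threshold` to 1 cuts FPs further at the cost of missing two-character swaps.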

4. Malware payload

  • Attachment: Office macro, ISO with an EXE inside, HTML smuggling.
  • Weaponised link: time-of-click analysis (URL benign at delivery, malicious at click).
  • Detection: sandbox detonation + URL reputation + file-hash IOC.
  • FP: under 2% with a mature threat intel feed. Main FP source: a legitimate exec sharing a rare file format.

Deployment mode — opinion: API first

Figure: MX inline (pre-delivery proxy) vs API journaling (post-delivery retract)

Same pattern as CASB: documentation recommends “use both,” but the real question is which one to start with.

I pick API journaling first. Reasons:

1. Lower setup risk

MX inline = changing the DNS MX record. A misconfiguration can drop email entirely for 30+ minutes while the DNS TTL expires. Cautious enterprise rollout = 2-4 weeks.

API journaling = OAuth admin grant. Worst case on misconfiguration: the scan does not run — mail flow stays intact. Rollback is trivial.

2. Retract capability

The Email Security engine updates threat intel continuously. A malicious IoC is discovered two days after initial delivery. API mode = retract from the inbox retroactively. MX-only mode = the email is already in the inbox, untouchable.

Example: campaign A bypasses the engine in week 1. On day 3, CF threat intel is updated. API retracts 450 emails across 200 mailboxes in 10 minutes. MX-only? The email has been read, and some victims have already clicked the link.
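
The selection step behind a retroactive retract can be sketched in a few lines. This is an assumption-heavy illustration: `Delivered` and `select_for_retract` are hypothetical names, the 30-day window is taken from the TL;DR, and the actual removal call against the mailbox provider is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Delivered:
    message_id: str
    mailbox: str
    urls: frozenset[str]
    delivered_at: datetime


def select_for_retract(delivered: list[Delivered], new_iocs: set[str],
                       now: datetime, window_days: int = 30) -> list[Delivered]:
    """Return delivered messages that contain a newly flagged URL and are
    still inside the retract window."""
    cutoff = now - timedelta(days=window_days)
    return [m for m in delivered
            if m.delivered_at >= cutoff and m.urls & new_iocs]


if __name__ == "__main__":
    now = datetime(2024, 6, 4, 9, 0)
    bad = "http://bad.example/login"
    inbox = [
        Delivered("m1", "alice@company.com", frozenset({bad}), now - timedelta(days=3)),
        Delivered("m2", "bob@company.com", frozenset({"https://ok.example/"}),
                  now - timedelta(days=3)),
        Delivered("m3", "carol@company.com", frozenset({bad}), now - timedelta(days=45)),
    ]
    hits = select_for_retract(inbox, {bad}, now)
    print([m.message_id for m in hits])  # ['m1']
```

MX-only mode has no equivalent of this step, because nothing keeps an index of what already sits in the mailboxes.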

3. Coverage is fundamentally different

API mode sees email after delivery — including internal mail (employee → employee), shared mailboxes, and distribution lists. MX mode only sees inbound from the internet.

Internal phishing (compromised account → coworker) happens after the initial breach. API catches it; MX misses it.

When MX inline wins

  • Hard compliance requirement that “no malicious email shall reach any mailbox” — certain regulated industries (financial, defence).
  • Customers not on cloud email (on-prem Exchange). API journaling does not support on-prem.
  • Extreme latency sensitivity — MX inline scan is 100ms-1s in the SMTP session; acceptable for most, a deal-breaker for a trading desk.

Production: both, staged

  • Weeks 1-4: enable API journaling. Monitor retract events.
  • Weeks 5-8: plan the MX inline DNS cutover. Stage it for a Saturday (low email volume).
  • Weeks 9+: both active. Defence in depth.

DMARC — the forwarder and subdomain traps

Figure: the DMARC stack (SPF auth + DKIM signature + DMARC policy)

SPF + DKIM + DMARC are foundational. Most blog posts list the phases (none → quarantine → reject) and stop there. This is where the real pain starts.

The forwarder trap — big damage

Scenario: Alice sets up a Google forward from alice@company.com to alice@gmail.com (personal, to check from mobile). Mail flow:

  1. External sender vendor@partner.com sends to Alice.
  2. partner.com has SPF; SPF-alignment passes at the company.com MX.
  3. Google Workspace forwards to alice@gmail.com.
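  4. Gmail receives mail with “From: vendor@partner.com”, but the message now arrives from the forwarder’s servers. SPF is evaluated against the envelope sender: if the forwarder keeps the original envelope sender, its IPs are not in partner.com’s SPF record and SPF fails; if it rewrites the envelope sender to alice@company.com, SPF may pass, but for a domain that does not match the From: domain.
  5. Either way, SPF alignment fails. DMARC now rests on DKIM alone, which passes only if partner.com signs its mail and the signature survives forwarding unmodified.
  6. If partner.com publishes DMARC p=reject and DKIM does not pass: Gmail rejects legitimate mail.

Result: Alice’s forward rule breaks whenever the sender’s DMARC policy is strict and DKIM does not carry the message through.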

Bigger picture: many orgs have hundreds of inbox auto-forwards (HR recruitment, support distribution, personal convenience). Rolling out DMARC p=reject breaks all of them.

Solution options:

  • ARC (Authenticated Received Chain) — RFC 8617, the forwarder signs an ARC-Seal. The receiver verifies the full forward chain. Google + Yahoo + Microsoft support it. Gmail-as-forwarder supports ARC, which reduces the problem.
  • Skip forwarding, use IMAP pull — on the personal side, pull from the company mailbox instead of forwarding. Breaks convenience but is technically correct.
  • DMARC pct=90 permanently — skip the strict 100% to tolerate edge cases. Not ideal, but pragmatic.

Real org experience: I supported one enterprise where DMARC p=reject was deployed in week 16 → 85 forwarded-mail tickets on day one. Rolled back to p=quarantine. ARC adoption is now reviewed quarterly, tracking vendor readiness.

The subdomain alignment trap

Scenario: corporate domain is company.com. Marketing uses Mailchimp to send newsletters with From: news@company.com, while Mailchimp authenticates the mail as the subdomain mail.company.com.

The SPF (envelope sender) and DKIM (d=) domains are mail.company.com. DMARC compares them against the From: domain, company.com:

  • Relaxed mode (aspf=r, adkim=r): mail.company.com shares the organizational domain company.com. Pass.
  • Strict mode (aspf=s, adkim=s): mail.company.com ≠ company.com. Fail.

The default aspf=r, adkim=r is safe. Setting strict (the misguided “tighter = better”) breaks Mailchimp and every SaaS email tool.

Opinion: relaxed mode for both aspf and adkim unless there is a specific reason. Do not default to strict.
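
Both traps come down to the same alignment check, which can be sketched as follows. The helper names are hypothetical, and `org_domain` is a deliberate simplification (last two labels); real implementations consult the Public Suffix List.

```python
def org_domain(domain: str) -> str:
    """Naive organizational domain: the last two labels.
    Assumption: production code should consult the Public Suffix List."""
    return ".".join(domain.lower().split(".")[-2:])


def aligned(header_from: str, auth_domain: str, mode: str = "r") -> bool:
    """mode 'r' = relaxed (same organizational domain), 's' = strict (exact match)."""
    if mode == "s":
        return header_from.lower() == auth_domain.lower()
    return org_domain(header_from) == org_domain(auth_domain)


def dmarc_pass(header_from: str, spf_pass: bool, spf_domain: str,
               dkim_pass: bool, dkim_domain: str,
               aspf: str = "r", adkim: str = "r") -> bool:
    """DMARC passes when SPF or DKIM both passes and aligns with From:."""
    spf_ok = spf_pass and aligned(header_from, spf_domain, aspf)
    dkim_ok = dkim_pass and aligned(header_from, dkim_domain, adkim)
    return spf_ok or dkim_ok


if __name__ == "__main__":
    # Forwarder trap: SPF passes for the forwarder's domain but is unaligned,
    # and the original DKIM signature did not survive.
    print(dmarc_pass("partner.com", True, "company.com", False, "partner.com"))  # False
    # Same forward, but partner.com's DKIM signature survived.
    print(dmarc_pass("partner.com", True, "company.com", True, "partner.com"))   # True
    # Subdomain trap: authenticated as mail.company.com, From: company.com.
    print(dmarc_pass("company.com", True, "mail.company.com",
                     True, "mail.company.com"))                                  # True
    print(dmarc_pass("company.com", True, "mail.company.com",
                     True, "mail.company.com", aspf="s", adkim="s"))             # False
```

The last two calls differ only in the alignment mode, which is why switching to strict without auditing every SaaS sender breaks mail that was passing the day before.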

DMARC deployment timeline in the real world

Phase | Duration | Policy | Action
1. Audit | 4-8 weeks | p=none | Collect RUA, identify every sender
2. Fix | 4-12 weeks | p=none | Work with vendors to fix SPF/DKIM
3. Transition | 4-8 weeks | p=quarantine, pct=10→50→100 | Gradual, monitor spam folder
4. Enforce | ongoing | p=reject, pct=100 | Full enforce + ARC adoption

Total: 4-8 months for enterprises with vendor sprawl. Rushing = breaking legitimate mail.
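
To make the staged policies concrete, here is a small sketch that parses a DMARC TXT record and estimates how many failing messages the requested policy touches at a given pct. The function names and the `dmarc@company.com` report address are illustrative assumptions; the defaults for pct, aspf, and adkim follow RFC 7489.

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Parse a DMARC TXT record into a tag -> value dict, applying the
    RFC 7489 defaults for pct, aspf and adkim."""
    tags = {"pct": "100", "aspf": "r", "adkim": "r"}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record")
    return tags


def expected_enforced(failing_per_day: int, record: str) -> float:
    """Roughly how many failing messages per day the requested policy (p=)
    is applied to, given the pct sampling rate."""
    tags = parse_dmarc(record)
    if tags.get("p") == "none":
        return 0.0
    return failing_per_day * int(tags["pct"]) / 100


if __name__ == "__main__":
    stages = [
        "v=DMARC1; p=none; rua=mailto:dmarc@company.com",
        "v=DMARC1; p=quarantine; pct=10; rua=mailto:dmarc@company.com",
        "v=DMARC1; p=reject; pct=100; rua=mailto:dmarc@company.com",
    ]
    for record in stages:
        print(parse_dmarc(record)["p"], expected_enforced(200, record))
```

With 200 failing messages a day, the pct=10 stage exposes about 20 of them to quarantine, which is a manageable number of tickets to chase before moving to 50 and 100.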

2024 mandate: Gmail and Yahoo require bulk senders to have DMARC p=none at minimum, DKIM signed. Not optional.

RUA analysis pitfall

RUA reports flood in. 10-50 reports/day from receivers. Without a tool, unusable.

Tools I use:

  • Postmark DMARC Digests (free for small volume, paid beyond) — daily summary + identifies senders failing alignment.
  • Dmarcian (paid) — advanced, compliance reporting.
  • Cloudflare DMARC Management (in the Cloudflare ecosystem, integrates naturally).

Raw RUA XML is technically readable but practically not. Use a tool.
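
For a sense of what those tools do under the hood, this sketch totals DMARC failures per source IP from one aggregate report. The sample XML is made up (documentation IP ranges) and follows the RFC 7489 aggregate layout; real reports usually arrive gzip- or zip-compressed and may need unpacking first.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample report, trimmed to the fields used below.
SAMPLE_RUA = """<feedback>
  <record><row><source_ip>192.0.2.10</source_ip><count>120</count>
    <policy_evaluated><dkim>pass</dkim><spf>pass</spf></policy_evaluated></row></record>
  <record><row><source_ip>198.51.100.7</source_ip><count>15</count>
    <policy_evaluated><dkim>fail</dkim><spf>fail</spf></policy_evaluated></row></record>
  <record><row><source_ip>198.51.100.7</source_ip><count>5</count>
    <policy_evaluated><dkim>fail</dkim><spf>fail</spf></policy_evaluated></row></record>
  <record><row><source_ip>203.0.113.5</source_ip><count>8</count>
    <policy_evaluated><dkim>pass</dkim><spf>fail</spf></policy_evaluated></row></record>
</feedback>"""


def failing_sources(rua_xml: str) -> Counter:
    """Count messages per source IP where both DKIM and SPF failed DMARC."""
    root = ET.fromstring(rua_xml)
    failures: Counter = Counter()
    for record in root.iter("record"):
        row = record.find("row")
        evaluated = row.find("policy_evaluated")
        dkim_ok = evaluated.findtext("dkim") == "pass"
        spf_ok = evaluated.findtext("spf") == "pass"
        if not (dkim_ok or spf_ok):
            failures[row.findtext("source_ip")] += int(row.findtext("count"))
    return failures


if __name__ == "__main__":
    for ip, count in failing_sources(SAMPLE_RUA).most_common():
        print(ip, count)  # 198.51.100.7 20
```

The source with a DKIM pass and SPF fail is not counted, since one aligned pass is enough for DMARC; that is the typical signature of a forwarder or a SaaS sender with DKIM set up correctly.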


User reporting — workflow to SLA

Figure: user Phish Alert button → SOC triage → retract → IoC feed

The metrics I actually watch

Click-through rate (user receives phish → clicks): target under 3%. Industry baseline 5-10%. Good user training drops it to 2%.

Report rate (user receives phish → reports via the button): target over 25%. A sign of a healthy culture. Low report rate (under 10%) → user apathy, retrain.

Time-to-retract: detect → removed. Target under 1 hour for any campaign affecting more than 10 users.

Repeat-click victim: same user clicked phish more than once in 12 months. Target them for individual training, do not punish.
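
The first three metrics are simple ratios and a duration, so they are easy to compute per campaign and compare against the targets. The `Campaign` fields and the `campaign_metrics` function below are hypothetical names for illustration; the thresholds are the targets stated above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Campaign:
    recipients: int
    clicked: int
    reported: int
    time_to_retract: timedelta


def campaign_metrics(c: Campaign) -> dict:
    """Compute the rates and check them against the targets above."""
    click_rate = c.clicked / c.recipients
    report_rate = c.reported / c.recipients
    return {
        "click_rate": click_rate,
        "report_rate": report_rate,
        "click_ok": click_rate < 0.03,      # target: under 3%
        "report_ok": report_rate > 0.25,    # target: over 25%
        # The 1-hour target applies to campaigns hitting more than 10 users.
        "retract_ok": c.recipients <= 10 or c.time_to_retract < timedelta(hours=1),
    }


if __name__ == "__main__":
    result = campaign_metrics(Campaign(200, 4, 60, timedelta(minutes=40)))
    print(result["click_ok"], result["report_ok"], result["retract_ok"])  # True True True
```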

SOC automation — 3 tiers

Tier 0 — Fully automated (high-confidence IoC):

  • User report matches a known campaign signature.
  • Auto-retract from all mailboxes.
  • Auto-block the URL at Gateway DNS + Network (Parts 11-13).
  • Auto-notify victim users.
  • Duration: under 5 minutes end-to-end.

Tier 1 — SOAR playbook (medium confidence):

  • Novel report, signature not matched.
  • SOAR (Palo Alto XSOAR, Splunk Phantom) runs an enrichment playbook: URL reputation check, WHOIS, sample detonation.
  • If the verdict is malicious → auto-execute Tier 0 actions.
  • Otherwise escalate to Tier 2.

Tier 2 — Human analyst (low confidence / sophisticated):

  • Spear-phish, targeted executive.
  • A human reviews headers, content, and context.
  • Decision + manual actions.
  • Duration: 15 minutes - 4 hours.

A mature org: 80% of reports resolved in Tier 0-1, 20% escalated to Tier 2.
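
The tier routing can be written down as a small decision function. This is a sketch of the logic only: the `Tier` enum, the parameter names, and the ordering of the checks (known campaign first, then executive targeting) are my assumptions, and a real SOAR playbook would carry far more context.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    AUTO = 0    # Tier 0: fully automated
    SOAR = 1    # Tier 1: enrichment playbook
    HUMAN = 2   # Tier 2: analyst review


def route_report(matches_known_campaign: bool, targets_executive: bool,
                 soar_verdict: Optional[str] = None) -> Tier:
    """Pick the tier for a user-reported email.

    soar_verdict is None before enrichment has run; afterwards it is
    'malicious' or anything else, which is treated as inconclusive.
    """
    if matches_known_campaign:
        return Tier.AUTO
    if targets_executive:
        return Tier.HUMAN
    if soar_verdict is None:
        return Tier.SOAR
    return Tier.AUTO if soar_verdict == "malicious" else Tier.HUMAN


if __name__ == "__main__":
    print(route_report(True, False))                 # Tier.AUTO
    print(route_report(False, False))                # Tier.SOAR
    print(route_report(False, False, "malicious"))   # Tier.AUTO
    print(route_report(False, False, "unknown"))     # Tier.HUMAN
    print(route_report(False, True))                 # Tier.HUMAN
```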

Reward culture

A “thank the reporter” email with recognition counts more than punishing clickers. Leaderboard: top 10 reporters per quarter named, small prize. Cost $50/quarter, behaviour change is significant.

Anti-pattern: “you clicked a phish, attend mandatory training” — reinforces fear, users hide mistakes → more damage.


Incident response — the post-phish hunt

Figure: six-phase incident response (detect, triage, contain, investigate, communicate, post-mortem)

The standard six-phase playbook is covered in many places. I want to highlight the investigate phase — where most playbooks stay shallow.

Credential theft scenario

Alice clicks a phish and enters credentials. The attacker now has alice@company.com’s password plus potentially a session token.

Hunt steps (SOC playbook, grounded):

1. Access logs — unusual login from an unknown IP

// Sentinel/KQL (correlation from Part 14)
// Placeholder value: set this to the click timestamp for the incident.
let phish_click_time = datetime(2024-01-01T00:00:00Z);
Cloudflare_Access_CL
| where UserEmail == "alice@company.com"
| where TimeGenerated between (phish_click_time .. (phish_click_time + 24h))
| where Action == "allowed"
| project IP, GeoCountry, DeviceID, TimeGenerated
| distinct IP, GeoCountry

Compare the result against Alice’s historical IPs. A new IP + new country + within 1h of the phish click = treat the account as compromised.
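
The baseline comparison itself is a set difference. A minimal sketch, assuming login records have already been exported from the log platform; the `Login` type and `suspicious_logins` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Login:
    ip: str
    country: str
    at: datetime


def suspicious_logins(history: list[Login], recent: list[Login],
                      click_time: datetime,
                      window: timedelta = timedelta(hours=1)) -> list[Login]:
    """Logins after the click, inside the window, from an IP and a country
    never seen in the user's history."""
    known_ips = {entry.ip for entry in history}
    known_countries = {entry.country for entry in history}
    return [entry for entry in recent
            if click_time <= entry.at <= click_time + window
            and entry.ip not in known_ips
            and entry.country not in known_countries]


if __name__ == "__main__":
    click = datetime(2024, 6, 4, 10, 0)
    history = [Login("203.0.113.5", "VN", click - timedelta(days=7))]
    recent = [
        Login("203.0.113.5", "VN", click + timedelta(minutes=5)),    # usual IP
        Login("198.51.100.9", "RO", click + timedelta(minutes=20)),  # new IP + country
    ]
    print([entry.ip for entry in suspicious_logins(history, recent, click)])
    # ['198.51.100.9']
```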

2. MFA bypass check

The attacker may have captured the MFA code via the phish. Check whether the MFA step completed from the new IP.

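let phish_click_time = datetime(2024-01-01T00:00:00Z); // placeholder: the click timestamp
Cloudflare_Access_CL
| where UserEmail == "alice@company.com" and Authenticator != ""
| where TimeGenerated between (phish_click_time .. (phish_click_time + 2h))
| project TimeGenerated, IP, Authenticator, Result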

Login success with MFA from a new IP = credentials and MFA both compromised. Contain urgently.

3. Mailbox rule check

The attacker often sets an inbox rule to auto-forward alice@company.com → hacker@hotmail.com to maintain persistence.

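# Exchange Online PowerShell (run Connect-ExchangeOnline first)
Get-InboxRule -Mailbox alice@company.com |
    Where-Object { $_.ForwardTo -or $_.RedirectTo -or $_.ForwardAsAttachmentTo } |
    Select-Object Name, ForwardTo, RedirectTo, ForwardAsAttachmentTo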

Delete the rule. Revoke the session.

4. Data access audit

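Alice’s access in the 24h after the click:

let phish_click_time = datetime(2024-01-01T00:00:00Z); // placeholder: the click timestamp
Cloudflare_Gateway_HTTP_CL
| where UserEmail == "alice@company.com"
| where TimeGenerated between (phish_click_time .. (phish_click_time + 24h))
| where Host contains "salesforce" or Host contains "sharepoint"
    or Host contains "drive.google" or Host contains "github"
| summarize by Host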

Any sensitive systems accessed? Determine the data exposure scope.

5. Credential rotation

  • Reset password (invalidates the old session).
  • Rotate MFA device (re-enroll).
  • Revoke refresh tokens across all apps.
  • Rotate any shared secrets Alice had access to (API keys, service tokens per Part 6).

Breach notification

If data access is confirmed:

  • Internal: security team, Alice’s manager, department head, leadership.
  • External: customer / partner if their data was touched. Depends on contract plus regulation.
  • Regulatory: GDPR 72h, state-specific (CA 30-90 days), HIPAA 60 days, PCI varies.

Opinion: default to “notification required.” Get legal involved immediately. Worst case = over-notify. Under-notify = fines + reputational damage.

Post-mortem template

  • Which control failed? Email Security miss? User training gap? MFA bypass tech?
  • Which control worked? Where in the chain was the attack caught?
  • Gap analysis: how close to worst case?
  • Action items: tool tune, training content update, process change.

Do not skip the post-mortem. The same attack repeats if the root cause is not addressed.


Outbound DLP via email — brief

Part 19 covered DLP. Email outbound is the same engine, email-specific enforcement.

outbound_email_policy:
  name: "Block PII to external"
  action: quarantine
  condition:
    all:
      - dlp_profile: PII_strict
      - message.external_recipient: true
  quarantine:
    hold_time: 1h
    notify_sender: "Your email is held — contains customer PII. Security review."
    review_queue: dlp-review@company.com

Top use cases:

  • Accidental “reply all” with a customer list attached.
  • Salary info CC’d to external by mistake.
  • Source code emailed to a personal address pre-resignation.

Rollout: same staged approach as Part 19 — log → warn → block.


When Email Security is overkill

  1. Orgs of 10-20 people, Google Workspace / M365 native filter is enough. Built-in rules catch 85-90% of commodity. Invest in user training instead.

  2. Heavily regulated industry on on-prem Exchange with an existing Proofpoint. Switching is a big migration, not a clear win. Evaluate on a feature-by-feature basis, not as a platform decision.

  3. Budget-starved startup. Free-tier Google Safe Browsing + M365 ATP Plan 1 cover the baseline. Add dedicated Email Security when revenue / user count justifies it.


Lessons I will keep

  1. Cite numbers inline, not just in the reference list. “68% of breaches (Verizon 2024 DBIR)” reads more credible than “see ref 3.”
  2. The DMARC forwarder trap is the biggest operational risk, and teams underestimate the time it takes by about 3×.
  3. ARC adoption is the path forward — push vendors to adopt; Google and Yahoo are driving.
  4. User report metric matters more than click-through. High report rate = healthy culture.
  5. The post-phish hunt is where the SOC excels. Most teams respond with “retract email, done.” The real work is the credential hunt + the 24h access audit.
  6. Over-notify > under-notify for breach decisions. Get legal involved early.

Closing and series wrap-up

Email is the #1 attack surface of any org. Going by the Verizon 2024 DBIR and FBI IC3 loss figures, few other security layers offer a higher ROI. Email Security tool, DMARC deployment, user training, and an incident response playbook together = 95%+ of phishing blocked.

Production recipe:

  • API journaling + MX inline combined.
  • DMARC staged over 4-8 months to p=reject, with ARC.
  • Phish Alert button + recognition culture.
  • SOC automation in 3 tiers (auto, SOAR, human).
  • Post-phish hunt over a 24h window after every confirmed click.
  • Outbound DLP focused on PII / secret leakage.

Series wrap-up — 20 parts, 6 blocks

I started this series to answer “what even is Cloudflare One?” after talking with three security teams evaluating it and one team rolling it out. This is the 20-post answer.

Block | Parts | Main message
1. Foundation | 1-3 | Four-layer mental model, SASE/SSE/Zero Trust stripped of the marketing
2. Access | 4-7 | ZTNA replacing VPN: blast radius, identity-first, service-to-service, lifecycle
3. Connectivity | 8-10 | Edge-first path: Tunnel outbound-only, WARP per-device, Magic WAN per-site
4. Policy & Filtering | 11-13 | Three-tier SWG (DNS / Network / HTTP) with DoH bypass as the key gap
5. Observability & Ops | 14-16 | No observability = prevention only. Logs + DEX + posture
6. Advanced Security | 17-20 | The containment layer when prevention is uncertain: RBI, CASB, DLP, Email

Three meta-lessons that apply across the series:

  1. Staged rollout is not optional — DLP log→warn→block, DMARC none→quarantine→reject, tiered posture. Block-first = helpdesk storm + user bypass.
  2. FP calibration is the most important skill. DLP, CASB, Email Security, RBI all have high FP in week one. Tuning capacity = team success.
  3. User experience walks alongside security. DEX, education, exception processes with expiration. No one wins if users bypass.

If I had to summarise the series in one sentence:

Zero Trust is identity + device + network + data control + visibility combined. No single product covers 100%. A good tool integrates smoothly — Cloudflare One is a strong candidate, not the only one.

Thanks for reading this far. Feedback, corrections, and other real-world stories are welcome via contact.


References

In this series: