End-to-end logs pipeline: Logpush, R2, SIEM correlation

Logs deep dive for Cloudflare One: datasets, Logpush destinations (R2/S3/Splunk/Sentinel), cross-layer correlation, tiered retention, cost control, sample SIEM detection rules.

End-to-end Cloudflare One logs pipeline: Logpush from Access/Gateway/Tunnel into R2/S3/Splunk/Sentinel/Datadog, tiered hot/warm/cold retention strategy, and cross-layer SIEM correlation rules

TL;DR

Logs are the observability foundation of Cloudflare One. The native dashboard only keeps 30 days and makes cross-layer correlation hard. Production needs:

  • Logpush — streaming export to destinations (R2, S3, Splunk, Sentinel, Datadog, …).
  • Tiered retention — hot 30 days (CF), warm 1 year (R2/S3), cold 7 years (Glacier).
  • Cross-layer correlation — join DNS + Network + HTTP + Access logs by UserID to catch multi-stage attacks.
  • Detection rules in the SIEM.
  • Cost control — sampling, compression, lifecycle policy.

This post covers:

  • The datasets Cloudflare One exposes.
  • Logpush mechanics — batching, format, destinations.
  • R2 as a data lake — setup + query with Athena/DuckDB.
  • SIEM integration patterns — Splunk, Sentinel, Elastic.
  • Correlation rules for DoH bypass, credential stuffing, session hijack.
  • Cost and retention.

The thesis:

Logs aren’t an afterthought — they are the backbone of Zero Trust. Without a SIEM pipeline and cross-layer correlation, you have prevention but no detection. Attackers try every vector; you need to see all of them to detect a pattern.

This is Part 14 of the Cloudflare One Handbook and opens the Observability & Ops block (Parts 14–16).


Who this is for

  • Security engineers building SIEM integration.
  • SOC analysts writing cross-layer detection rules.
  • Platform engineers setting up a data-lake observability layer.

Recommended prior reading:

  • Any of Parts 11, 12, 13, which mention Logpush only in passing.
  • Part 9 — WARP (identity enrichment for logs).

After this post you will:

  • Know every Cloudflare One dataset and the important fields in each.
  • Be able to set up Logpush to R2 and a SIEM in the right format.
  • Write cross-layer correlation rules that detect attack chains.
  • Plan retention + cost without blowing the budget.

What this post does not cover

  • General Cloudflare logs outside Zero Trust (CDN, WAF, Workers — a different post).
  • Deep SIEM tuning (Splunk SPL, KQL optimisation) — brief mention only.
  • Compliance-specific reporting (PCI, HIPAA) — retention is mentioned, detailed frameworks are not.
  • Analytics Engine — separate post.

Concepts

  • Dataset — a log stream Cloudflare exposes; each dataset has a fixed schema (gateway_dns, gateway_network, gateway_http, access_requests, …).
  • Logpush — the Cloudflare service that pushes log batches to a destination over HTTP / S3.
  • Destination — where logs land: R2, S3, GCS, Azure Blob, Splunk HEC, Sumo Logic HTTP, Datadog, New Relic.
  • NDJSON — newline-delimited JSON, Logpush’s default format.
  • Logpull — legacy pull API (deprecated for most datasets).
  • SIEM — Security Information and Event Management: Splunk, Sentinel, Elastic, Sumo, Datadog.
  • Data lake — bulk storage (R2/S3) + a query engine (Athena, DuckDB, BigQuery).

Cloudflare One datasets

Cloudflare One log datasets map: Access Requests, Gateway DNS, Gateway Network, Gateway HTTP, Audit Logs, Device Posture, Zero Trust DEX — each dataset's schema and use.

Dataset | Content | Volume (enterprise, typical)
access_requests | Every ZTNA auth attempt: user, app, decision, identity | 10K–100K/day
gateway_dns | Every DNS query through Gateway | 10M–500M/day
gateway_network | L4 TCP/UDP connection events | 50M–1B/day
gateway_http | HTTP requests (when decryption is on) | 10M–500M/day
audit_logs | Admin / config changes | 100–10K/day
device_posture_results | Device posture check results | 1M–50M/day
zero_trust_dex_test_results | DEX (Digital Experience Monitoring) tests | 10K–1M/day
access_login | ZTNA login events (sessions) | 1K–100K/day
casb_findings | CASB scan findings on connected SaaS | 100–10K/day

Dataset characteristics

  • Access logs — lower volume, high value. Essential for forensics.
  • Gateway DNS — broad coverage, moderate volume. Baseline monitoring.
  • Gateway Network — highest volume. Expensive. Sample aggressively.
  • Gateway HTTP — medium-to-high volume. Depends on the decrypt scope.
  • Audit — must ship 100% (compliance).

Logpush mechanics

Logpush flow: Cloudflare edge generates events → batched every 5 minutes or 5MB → compressed (gzip) → pushed via HTTP/S3 to a destination → lands as NDJSON files partitioned by date/time.

Batch behaviour

  • Batch trigger: 5-minute tick OR 5 MB size (whichever first).
  • Max batch: 5 MB compressed → ~50 MB uncompressed.
  • Delivery: at-least-once (rare duplicates; the SIEM must dedupe by event ID).
  • Ordering: NOT guaranteed inside a batch; use the timestamp field.
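
The at-least-once and ordering caveats above can be handled in a small post-processing step. The sketch below is illustrative only: the field names EventID and Timestamp are placeholders (the real ID and timestamp fields differ per dataset), and it assumes timestamps are RFC 3339 strings in UTC with a uniform format.

```python
import json

def dedupe_and_sort(ndjson_text, id_field="EventID", ts_field="Timestamp"):
    """Drop repeated events from one NDJSON batch and return the rest in time order."""
    seen = set()
    events = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        event = json.loads(line)
        key = event.get(id_field)
        if key is not None:
            if key in seen:
                continue  # duplicate delivered by a retry
            seen.add(key)
        events.append(event)
    # Uniformly formatted RFC 3339 UTC strings sort correctly as plain strings
    events.sort(key=lambda e: e[ts_field])
    return events
```

Across batches the same idea applies, but the set of seen IDs has to live in the SIEM or another shared store rather than in memory.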

Format options

{
  "output_options": {
    "output_type": "ndjson",           // or "csv"
    "timestamp_format": "rfc3339",     // or "unix"
    "field_names": [],                 // empty = all
    "field_delimiter": ",",            // csv only
    "record_delimiter": "\n"
  }
}

Destination types

  • Object storage: R2, S3, GCS, Azure Blob. Path template: dataset/{DATE}/{TIME}.
  • HTTP webhook: generic — Splunk HEC, Datadog, Sumo, custom.
  • Native SIEM: Sentinel connector (Microsoft ecosystem), official Splunk app.

Path templating

r2://gateway-logs/{dataset}/year={YEAR}/month={MONTH}/day={DAY}/hour={HOUR}/{UUID}.json.gz

A Hive-partitioned layout → Athena/DuckDB queries are efficient. year/month/day partitioning is the standard.
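
To query a single hour, a reader has to rebuild the same prefix the path template produced. A minimal sketch, assuming the year=/month=/day=/hour= layout shown above and UTC timestamps; the function name partition_prefix is made up for illustration.

```python
from datetime import datetime, timezone

def partition_prefix(dataset, ts):
    """Return the Hive-style object prefix for the hour that contains ts."""
    ts = ts.astimezone(timezone.utc)  # partitions are assumed to be in UTC
    return f"{dataset}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"

prefix = partition_prefix("dns", datetime(2026, 5, 15, 9, 30, tzinfo=timezone.utc))
print(prefix)
```

Zero-padded month, day and hour values keep the prefixes in lexical order, which is what lets a query engine prune by range.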

Filter at source

{
  "filter": "{\"where\":{\"key\":\"Action\",\"operator\":\"eq\",\"value\":\"block\"}}"
}

Example: push only block events — reduces volume 80–90% for the DNS dataset.
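
The filter value is a JSON document serialised into a string, which is why the example is full of escaped quotes. Building it in code avoids hand-escaping. This is a sketch; the helper name logpush_filter is hypothetical, and only the where/key/operator/value shape shown above is assumed.

```python
import json

def logpush_filter(key, operator, value):
    """Serialise one filter condition into the string form the job expects."""
    condition = {"where": {"key": key, "operator": operator, "value": value}}
    return json.dumps(condition, separators=(",", ":"))

# The outer json.dumps adds the escaping when the job body is serialised
job_fragment = {"filter": logpush_filter("Action", "eq", "block")}
print(json.dumps(job_fragment))
```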


R2 as a data lake

Why R2 over S3

  • Egress free — querying R2 data costs no egress (S3 does).
  • Native Cloudflare integration — direct Logpush, no IAM friction.
  • Cheap storage ~$15/TB/month, comparable to S3, but free egress is the big win.

Setup

  1. Create an R2 bucket:
wrangler r2 bucket create gateway-logs
  2. Create an API token for Logpush:

Dashboard → R2 → Manage R2 API Tokens → create a token with Object Read & Write.

  3. Configure Logpush:
# The heredoc delimiter is unquoted so the shell expands ${ACCOUNT_ID}, ${R2_KEY} and ${R2_SECRET} in the body
curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/logpush/jobs" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data @- <<EOF
{
  "name": "gateway-dns-to-r2",
  "dataset": "gateway_dns",
  "destination_conf": "r2://gateway-logs/dns/year={YEAR}/month={MONTH}/day={DAY}/?account-id=${ACCOUNT_ID}&access-key-id=${R2_KEY}&secret-access-key=${R2_SECRET}",
  "output_options": {
    "output_type": "ndjson",
    "timestamp_format": "rfc3339"
  },
  "enabled": true
}
EOF
  4. Verify:
wrangler r2 object list gateway-logs --prefix dns/

Batches appear within 5–10 minutes.

Query with DuckDB (local)

-- Install the httpfs extension
INSTALL httpfs;
LOAD httpfs;

-- Configure R2 credentials
SET s3_endpoint='<account-id>.r2.cloudflarestorage.com';
SET s3_access_key_id='<r2-key>';
SET s3_secret_access_key='<r2-secret>';
SET s3_url_style='path';

-- Query
SELECT
  Email,
  COUNT(*) as blocked_count,
  ARRAY_AGG(DISTINCT DNSQuestion) as domains
FROM read_json('s3://gateway-logs/dns/year=2026/month=05/day=15/*.json.gz')
WHERE Action = 'block'
GROUP BY Email
ORDER BY blocked_count DESC
LIMIT 20;

Runs locally — no cluster required.

Query with Athena (AWS)

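R2 is S3-compatible, but Athena reads from Amazon S3 buckets and, as far as AWS documents, cannot be pointed at a custom endpoint. In practice the partitions to be queried are first copied or replicated into an S3 bucket. Pattern:

  1. A Glue Crawler scans the replicated partitions → catalog table.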
  2. Athena SQL runs over the catalog.
  3. Use CTAS (Create Table As Select) for an aggregate layer.

Cost: Athena at $5/TB scanned — partition pruning matters.
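
A rough calculation shows why pruning matters. The sketch uses the $5/TB price quoted above and a made-up volume of 5 GB of compressed logs per day; both the volume and the helper name are assumptions for illustration.

```python
ATHENA_USD_PER_TB = 5.0   # price quoted in the text
DAILY_BYTES = 5e9         # hypothetical: 5 GB of compressed logs per day

def scan_cost_usd(bytes_scanned):
    """Cost of one query that scans the given number of bytes."""
    return bytes_scanned / 1e12 * ATHENA_USD_PER_TB

one_day = scan_cost_usd(DAILY_BYTES)          # query pruned to a single day
full_year = scan_cost_usd(DAILY_BYTES * 365)  # same query with no partition filter
print(f"one day: ${one_day:.3f}, full year: ${full_year:.2f}")
```

The unpruned query costs 365 times more per run, and a dashboard that refreshes every few minutes multiplies that again.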


SIEM integration patterns

SIEM integration: Logpush → native connectors (Splunk HEC, Sentinel Data Connector, Datadog HTTP, Sumo HTTP Source). Correlation across datasets in the SIEM's query language. Alert routing to PagerDuty or a SOC workflow.

Splunk

Option A — HEC direct:

{
  "destination_conf": "splunk://hec.splunk.company.com/services/collector/raw?header_Authorization=Splunk%20<token>&channel=<uuid>&header_Content-Type=application%2Fjson&insecure-skip-verify=false",
  "dataset": "gateway_dns"
}

Option B — S3 → Splunk pull (cost-effective for high volume):

Logpush → R2 → Splunk SmartStore or a custom pull script.

SPL example — DoH bypass detection:

index=cloudflare sourcetype=gateway_network
| where match(SNI, "dns\\.google|dns\\.quad9\\.net|doh\\.opendns\\.com")
| where Action="block"
| stats count by UserID, Email, SNI
| where count > 5
| sort -count

Microsoft Sentinel

Native Cloudflare connector (in Content Hub):

  1. Install the “Cloudflare (using Azure Function)” connector.
  2. Provide a Cloudflare API token + account ID.
  3. Data flows to the Log Analytics workspace.

Alternative: Logpush → Sentinel via Event Hub.

KQL example — cross-layer correlation:

let timeframe = 1h;
Cloudflare_Gateway_DNS_CL
| where TimeGenerated > ago(timeframe)
| where Action_s == "block"
| where Categories_s contains "Malware"
| project UserID=UserID_s, blockedDomain=DNSQuestion_s, t_dns=TimeGenerated
| join kind=inner (
    Cloudflare_Gateway_Network_CL
    | where TimeGenerated > ago(timeframe)
    | where Action_s == "block"
    | project UserID=UserID_s, blockedIP=DestinationIP_s, t_net=TimeGenerated
) on UserID
| where t_net between (t_dns .. t_dns + 5m)
| project UserID, blockedDomain, blockedIP, t_dns, t_net

Detects: a user’s DNS lookup was blocked, and within 5 minutes they tried a direct IP — a multi-stage attempt.
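
The same join can be prototyped outside the SIEM, for example against NDJSON pulled from R2. A minimal in-memory sketch: the dictionary keys (UserID, t, domain, ip) are simplified stand-ins for the real schema, and the function name correlate is illustrative.

```python
from datetime import datetime, timedelta

def correlate(dns_blocks, net_blocks, window=timedelta(minutes=5)):
    """Pair each DNS block with network blocks by the same user inside the window."""
    by_user = {}
    for n in net_blocks:
        by_user.setdefault(n["UserID"], []).append(n)
    hits = []
    for d in dns_blocks:
        for n in by_user.get(d["UserID"], []):
            # The network block must follow the DNS block, no later than window
            if d["t"] <= n["t"] <= d["t"] + window:
                hits.append((d["UserID"], d["domain"], n["ip"]))
    return hits
```

Indexing the network events by user first keeps the work proportional to each user's own events rather than to the full cross product.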

Elastic / OpenSearch

Logstash pipeline:

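input {
  s3 {
    bucket => "gateway-logs"
    prefix => "dns/"                               # read only the gateway_dns objects
    endpoint => "https://<account-id>.r2.cloudflarestorage.com"
    access_key_id => "<r2-key>"
    secret_access_key => "<r2-secret>"
    codec => "json_lines"
    add_field => { "dataset" => "gateway_dns" }    # tag the dataset so the filter below can match
  }
}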

filter {
  if [dataset] == "gateway_dns" {
    mutate { add_field => { "[@metadata][index]" => "cf-gateway-dns-%{+YYYY.MM.dd}" } }
  }
}

output {
  elasticsearch {
    hosts => ["https://es.company.com:9200"]
    index => "%{[@metadata][index]}"
  }
}

Datadog

HTTP webhook destination:

datadog://http-intake.logs.datadoghq.com/api/v2/logs?header_DD-API-KEY=<key>&ddsource=cloudflare&service=gateway

Datadog parses JSON automatically and applies a pipeline for tagging.


Cross-layer correlation rules

Attack chain detection: a single request pattern is blocked at DNS, then tries a Network IP, then a direct HTTP path. Correlate by UserID across datasets inside a 15-minute window to detect the multi-stage attempt.

Rule 1 — DoH bypass attempt

Signal: the same user is blocked at DNS and also blocked for a DoH destination at Network within the same 15 minutes.

Threat model: malware first tries the system resolver → blocked → falls back to DoH → also blocked. A confirmed compromise attempt.

let win = 15m;
let dns_blocks = Cloudflare_Gateway_DNS_CL
  | where Action_s == "block" and Categories_s has "Malware"
  | project UserID=UserID_s, t1=TimeGenerated, dns_q=DNSQuestion_s;
let doh_blocks = Cloudflare_Gateway_Network_CL
  | where Action_s == "block" and PolicyName_s has "DoH"
  | project UserID=UserID_s, t2=TimeGenerated, sni=SNI_s;
dns_blocks
| join kind=inner doh_blocks on UserID
| where t2 between (t1 .. t1 + win)
| project UserID, dns_q, sni, t1, t2

Severity: high — open a ticket, isolate the device.

Rule 2 — Credential stuffing on Access

Signal: N failed Access logins against the same app in 10 minutes from different IPs.

Cloudflare_Access_CL
| where TimeGenerated > ago(10m)
| where Result_s == "blocked"
| summarize ip_count=dcount(IP_s), attempts=count() by App_s
| where attempts > 20 and ip_count > 5
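
For testing thresholds against exported logs before deploying the rule, the aggregation is easy to mirror in a few lines. A sketch with simplified keys (App, IP, Result) standing in for the real field names; the thresholds are the ones used in the query above.

```python
from collections import defaultdict

def stuffing_suspects(events, min_attempts=20, min_ips=5):
    """Return apps with many blocked logins spread over many source IPs."""
    attempts = defaultdict(int)
    ips = defaultdict(set)
    for e in events:
        if e["Result"] != "blocked":
            continue  # only failed logins count toward the signal
        attempts[e["App"]] += 1
        ips[e["App"]].add(e["IP"])
    return sorted(
        app for app in attempts
        if attempts[app] > min_attempts and len(ips[app]) > min_ips
    )
```

Requiring both conditions separates distributed stuffing from a single user who keeps mistyping a password from one address.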

Rule 3 — Session hijack indicator

Signal: the same UserID + SessionID appearing from different IPs and different countries within 5 minutes.

Cloudflare_Access_CL
| where TimeGenerated > ago(5m)
| summarize ip_list=make_set(IP_s), country_list=make_set(Country_s) by UserID_s, SessionID_s
| where array_length(ip_list) > 1 and array_length(country_list) > 1

Rule 4 — Data exfil burst

Signal: a user uploads > 1 GB through HTTP POST in 30 minutes to a destination that isn’t a corporate SaaS.

Cloudflare_Gateway_HTTP_CL
| where TimeGenerated > ago(30m)
| where Method_s == "POST" and ContentLength_d > 0
| where Host_s !in ("drive.google.com", "onedrive.live.com", "s3.company.com")
| summarize total_bytes=sum(ContentLength_d) by UserID_s, Email_s
| where total_bytes > 1073741824  // 1 GB

Rule 5 — Impossible travel

Signal: the same user logs in from two countries far enough apart that the implied travel speed exceeds 1,000 km/h.

Cloudflare_Access_CL
| where Result_s == "allowed"
| project UserID=UserID_s, Country_s, Latitude_d, Longitude_d, TimeGenerated
| sort by UserID asc, TimeGenerated asc
| extend prev_user=prev(UserID), prev_country=prev(Country_s), prev_time=prev(TimeGenerated),
         prev_lat=prev(Latitude_d), prev_lon=prev(Longitude_d)
// Compare only consecutive logins of the same user
| where prev_user == UserID
| extend time_diff_h = (TimeGenerated - prev_time) / 1h
| extend dist_km = geo_distance_2points(prev_lon, prev_lat, Longitude_d, Latitude_d) / 1000
| where time_diff_h > 0 and time_diff_h < 12 and dist_km / time_diff_h > 1000
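
The speed check behind this rule can be sketched in plain Python, which is handy for unit-testing the thresholds. The input shape (user, t in hours, lat, lon) is an assumption for illustration; distance uses the haversine formula with a mean Earth radius of 6,371 km.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def impossible_travel(logins, max_kmh=1000.0, max_gap_h=12.0):
    """Flag users whose consecutive logins imply a speed above max_kmh."""
    flagged = []
    last = {}  # most recent login seen per user
    for ev in sorted(logins, key=lambda e: (e["user"], e["t"])):
        prev = last.get(ev["user"])
        if prev is not None:
            gap_h = ev["t"] - prev["t"]
            if 0 < gap_h < max_gap_h:
                km = haversine_km(prev["lat"], prev["lon"], ev["lat"], ev["lon"])
                if km / gap_h > max_kmh:
                    flagged.append(ev["user"])
        last[ev["user"]] = ev
    return flagged
```

Keeping the previous login per user means one user's last event is never compared with the next user's first event.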

Retention strategy

Tiered storage

Tier | Location | Retention | Cost (/TB/mo) | Use
Hot | Cloudflare dashboard | 30d | included | dashboard query, debug
Warm | R2 | 1y | ~$15 | forensics, SIEM source
Cold | R2 archive / Glacier | 7y | ~$1–4 | compliance, litigation

R2 lifecycle policy

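{
  "rules": [
    {
      "id": "archive-after-90d",
      "prefix": "",
      "transitions": [
        { "days": 90, "storage_class": "INFREQUENT_ACCESS" }
      ]
    },
    {
      "id": "delete-after-7y",
      "prefix": "",
      "expiration": { "days": 2555 }
    }
  ]
}

The prefix is relative to the bucket, so an empty prefix covers every dataset stored under gateway-logs; use a value such as "dns/" to scope a rule to one dataset.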

Compliance minimums

  • PCI DSS 4.0: at least 12 months of audit log history, with the most recent 3 months immediately available for analysis (Requirement 10.5.1).
  • HIPAA: 6 years.
  • SOC 2: no fixed period is prescribed; 1 year for audit logs and 3–7 years for security events is common practice.
  • GDPR: minimum necessary, delete once the purpose is complete (usually < 1 year for detail, longer for aggregates).

Cost control

Volume reduction

1. Filter at Logpush — only ship events that matter:

"filter": "{\"where\":{\"key\":\"Action\",\"operator\":\"!eq\",\"value\":\"allow\"}}"

Skip allowed events → volume drops 80–90% for DNS/Network.

2. Sampling — random-sample allowed events for a baseline:

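Logpush output_options accepts a sample_rate (a fraction between 0 and 1, applied after the filter); check that it is available for the dataset in question. Where finer control is needed (per user, per category), ship everything to R2 and sample in the SIEM ingest pipeline.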

3. Compression — Logpush gzips batches by default. When pushing to an HTTP webhook, verify that the endpoint accepts gzip.

4. Dataset selection — not every dataset belongs in the SIEM:

  • gateway_dns blocked only → SIEM.
  • gateway_dns allowed → R2 raw (cold).
  • gateway_network blocked only → SIEM.
  • audit_logs 100% → SIEM (compliance).

Storage budget math

Enterprise, 1,000 users, 50M combined events/day:

  • NDJSON ~500 bytes/event → ~100 bytes compressed.
  • 25 GB/day uncompressed → 5 GB/day compressed.
  • Year: ~1.8 TB compressed.
  • R2: 1,825 GB × $0.015/GB/month = ~$27/month once a full year is stored. Still cheap.

SIEM ingest (Splunk): roughly $1,800/GB/year typical. Filter aggressively before shipping.
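
The storage estimate is simple enough to recompute from its stated inputs (50M events/day, ~500 bytes raw, ~100 bytes compressed, $0.015 per GB-month on R2). A sketch for plugging in other volumes; the constant names are illustrative.

```python
EVENTS_PER_DAY = 50_000_000
RAW_BYTES_PER_EVENT = 500
GZIP_BYTES_PER_EVENT = 100
R2_USD_PER_GB_MONTH = 0.015

raw_gb_day = EVENTS_PER_DAY * RAW_BYTES_PER_EVENT / 1e9     # uncompressed volume per day
gzip_gb_day = EVENTS_PER_DAY * GZIP_BYTES_PER_EVENT / 1e9   # what actually lands in R2
gzip_gb_year = gzip_gb_day * 365                            # steady state with 1-year retention
monthly_cost = gzip_gb_year * R2_USD_PER_GB_MONTH           # storage bill at steady state

print(f"{raw_gb_day:.0f} GB/day raw, {gzip_gb_day:.0f} GB/day gzip, "
      f"{gzip_gb_year:.0f} GB/year, ${monthly_cost:.2f}/month")
```

The monthly figure is the bill once a full year has accumulated; during the first year it ramps up from near zero.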

Compression ratio

NDJSON is repetitive (field names repeat) → gzip hits 5–10×. Parquet compresses better, but Logpush doesn’t output Parquet natively → convert at the R2 layer with a cron job.


Ongoing operations

Monitor Logpush health

# Check job status
curl "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/logpush/jobs" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" | jq '.result[] | {name, enabled, last_complete, last_error}'

Alert when last_complete is more than 30 minutes old: that usually means the job is stuck.
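
The staleness check can run on the parsed API response. A minimal sketch, assuming the result array has already been fetched and decoded, that last_complete is an RFC 3339 UTC string or null, and that a job which has never completed should also be reported; stale_jobs is a made-up helper name.

```python
from datetime import datetime, timedelta, timezone

def stale_jobs(jobs, now, max_age=timedelta(minutes=30)):
    """Return names of enabled jobs whose last successful push is too old."""
    stale = []
    for job in jobs:
        if not job.get("enabled"):
            continue  # disabled jobs are expected to be idle
        last = job.get("last_complete")
        if last is None:
            stale.append(job["name"])  # enabled but never completed
            continue
        done = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if now - done > max_age:
            stale.append(job["name"])
    return stale
```

Passing now in as a parameter, rather than reading the clock inside the function, keeps the check easy to test.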

Alerting patterns

  • Logpush job failed > 3 consecutive runs → page.
  • Log volume drops > 50% below baseline → investigate (job stuck? CF outage? misconfigured filter?).
  • SIEM ingestion lag > 15 minutes → SOC escalation.
  • Storage approaching quota → alert before the cap.
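
The volume-drop alert needs a baseline. One simple option, shown here as an assumption rather than a recommendation from the original, is the median of the previous days' event counts, which a single outage day does not distort much.

```python
from statistics import median

def volume_dropped(history, today, threshold=0.5):
    """True when today's event count is more than threshold below the median baseline."""
    baseline = median(history)
    return baseline > 0 and today < baseline * (1 - threshold)

last_week = [100, 110, 90, 105, 95, 100, 100]  # hypothetical daily counts, in millions
print(volume_dropped(last_week, 40), volume_dropped(last_week, 60))
```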

Weekly pipeline validation

  • Generate a synthetic event (a DNS query to a canary domain that policy blocks).
  • Verify it lands in: CF dashboard (5 min) → R2 (10 min) → SIEM (15 min).
  • Document in /runbooks/logs-pipeline-health-check.md.

Troubleshooting

“Logpush job fails intermittently”

  • Check the last_error field.
  • Common causes: destination quota, auth token expired, network issue.
  • CF auto-retries three times over 15 minutes; if all fail, the next batch retries.

“The SIEM isn’t getting new logs”

  1. Does the event exist in the dashboard? If not, upstream issue.
  2. Logpush job status: active, last_complete recent?
  3. Destination reachable? Test with curl.
  4. SIEM ingestion queue backlogged?
  5. Parser rule rejecting the event? Check the SIEM error log.

“Log volume suddenly 10× baseline”

  • New decrypt policy enabled → HTTP volume surge.
  • A new DNS location with heavy traffic.
  • An attack scenario (scan, DDoS).
  • Check the Logpush dashboard total-events/day trend.

“Athena query is expensive”

  • Partition pruning not working → check that partitions exist in the Glue catalog.
  • Full scan over a large prefix → add a WHERE year/month/day filter.
  • Convert NDJSON → Parquet for frequently-queried ranges.

“Duplicate events in the SIEM”

  • Logpush is at-least-once. CF retries can duplicate.
  • Dedupe by (Timestamp, EventID) in the SIEM.
  • Event-ID field varies per dataset — check the schema docs.

Checklist — production logs pipeline

Dataset coverage:

  • access_requests → SIEM 100%.
  • audit_logs → SIEM 100%.
  • gateway_dns blocks → SIEM, all events → R2.
  • gateway_network blocks → SIEM, all events → R2.
  • gateway_http blocks + sensitive paths → SIEM, all → R2.
  • device_posture_results → SIEM daily summary.
  • zero_trust_dex_test_results → SIEM.

Infrastructure:

  • R2 bucket created + lifecycle policy.
  • Logpush jobs configured per dataset.
  • Job-status monitoring + alerting.
  • Retention matches compliance (PCI 1y+, SOC 3y+, HIPAA 6y).

SIEM:

  • Native connector or HEC/webhook configured.
  • Parser/dashboard for each dataset.
  • Cross-layer correlation rules deployed.
  • Alert routing to SOC/PagerDuty.

Detection:

  • DoH bypass rule.
  • Credential stuffing rule.
  • Session hijack rule.
  • Data exfil rule.
  • Impossible travel rule.
  • Policy-specific detection (tenant-aware rules).

Operations:

  • Weekly synthetic-event verification.
  • Monthly log-volume review.
  • Quarterly detection-rule tuning.
  • Runbook for “pipeline down”.

Lessons from practice

  • Ship every dataset and enable every rule at once → ingestion cost explodes. Start lean: blocks-only + audit 100%, add allowed traffic when a detection use case emerges.
  • NDJSON → Parquet conversion saves 60–70% storage + query cost. Worth a cron job after 30 days.
  • Cross-layer correlation matters. Single-layer alerts are noisy; multi-layer joins cut false positives sharply.
  • Attackers target the log system. A “log volume drop” detection rule is critical — disabling logging is often the first sign.
  • R2 is the sweet spot for mid-volume. S3 is expensive on egress; BigQuery is overkill; Elastic ingest is expensive.
  • Tune the SIEM quarterly. New apps, new SaaS, new threats → detection rules have to update. Stale rules = false sense of security.
  • Compression + partitioning is not “nice to have” — at production scale, querying unpartitioned data costs hundreds of dollars per run.
  • Test synthetic events weekly. A pipeline runs silent until it breaks. A canary event is the only way to know it’s alive.

Summary

The logs pipeline is the foundation of the Observability & Ops block. Without it, Zero Trust is prevention-only — attackers try every vector and you don’t see the pattern.

Production recipe:

  • Logpush → R2 data lake, 1 year.
  • SIEM ingest: blocks + audit + high-value signals.
  • Cross-layer correlation is the superpower — join UserID across datasets.
  • Retention tiered: hot/warm/cold, aligned with compliance.
  • Cost control: filter at the source, don’t ship everything to the SIEM.

One line to remember:

Logs aren’t data — they’re proof of control. Without cross-layer correlation, Zero Trust is zero visibility.

Part 15 switches to DEX — Digital Experience Monitoring: measuring latency, WARP health, and app reachability from the end user’s perspective, to spot issues before the helpdesk ticket arrives.


References

In this series: