End-to-end logs pipeline: Logpush, R2, SIEM correlation

TL;DR

Logs are the observability foundation of Cloudflare One. The native dashboard only keeps 30 days and makes cross-layer correlation hard. Production needs:

Logpush — streaming export to destinations (R2, S3, Splunk, Sentinel, Datadog, …).
Tiered retention — hot 30 days (CF), warm 1 year (R2/S3), cold 7 years (Glacier).
Cross-layer correlation — join DNS + Network + HTTP + Access logs by UserID to catch multi-stage attacks.
Detection rules in the SIEM.
Cost control — sampling, compression, lifecycle policy.

This post covers:

The datasets Cloudflare One exposes.
Logpush mechanics — batching, format, destinations.
R2 as a data lake — setup + query with Athena/DuckDB.
SIEM integration patterns — Splunk, Sentinel, Elastic.
Correlation rules for DoH bypass, credential stuffing, session hijack.
Cost and retention.

The thesis:

Logs aren’t an afterthought — they are the backbone of Zero Trust. Without a SIEM pipeline and cross-layer correlation, you have prevention but no detection. Attackers try every vector; you need to see all of them to detect a pattern.

This is Part 14 of the Cloudflare One Handbook and opens the Observability & Ops block (Parts 14–16).

Who this is for

Security engineers building SIEM integration.
SOC analysts writing cross-layer detection rules.
Platform engineers setting up a data-lake observability layer.

What this post does not cover

General Cloudflare logs outside Zero Trust (CDN, WAF, Workers — a different post).
Deep SIEM tuning (Splunk SPL, KQL optimisation) — brief mention only.
Compliance-specific reporting (PCI, HIPAA) — retention is mentioned, detailed frameworks are not.
Analytics Engine — separate post.

Concepts

Dataset — a log stream Cloudflare exposes; each dataset has a fixed schema (gateway_dns, gateway_network, gateway_http, access_requests, …).
Logpush — the Cloudflare service that pushes log batches to a destination over HTTP / S3.
Destination — where logs land: R2, S3, GCS, Azure Blob, Splunk HEC, Sumo Logic HTTP, Datadog, New Relic.
NDJSON — newline-delimited JSON, Logpush’s default format.
Logpull — legacy pull API (deprecated for most datasets).
SIEM — Security Information and Event Management: Splunk, Sentinel, Elastic, Sumo, Datadog.
Data lake — bulk storage (R2/S3) + a query engine (Athena, DuckDB, BigQuery).

Cloudflare One datasets

Cloudflare One log datasets map: Access Requests, Gateway DNS, Gateway Network, Gateway HTTP, Audit Logs, Device Posture, Zero Trust DEX — each dataset's schema and use.

Dataset	Content	Volume (enterprise, typical)
`access_requests`	Every ZTNA auth attempt — user, app, decision, identity	10K–100K/day
`gateway_dns`	Every DNS query through Gateway	10M–500M/day
`gateway_network`	L4 TCP/UDP connection events	50M–1B/day
`gateway_http`	HTTP requests (when decryption is on)	10M–500M/day
`audit_logs`	Admin / config changes	100–10K/day
`device_posture_results`	Device posture check results	1M–50M/day
`zero_trust_dex_test_results`	DEX (Digital Experience Monitoring) tests	10K–1M/day
`access_login`	ZTNA login events (sessions)	1K–100K/day
`casb_findings`	CASB scan findings on connected SaaS	100–10K/day

Dataset characteristics

Access logs — lower volume, high value. Essential for forensics.
Gateway DNS — broad coverage, moderate volume. Baseline monitoring.
Gateway Network — highest volume. Expensive. Sample aggressively.
Gateway HTTP — medium-to-high volume. Depends on the decrypt scope.
Audit — must ship 100% (compliance).

Logpush mechanics

Logpush flow: Cloudflare edge generates events → batched every 5 minutes or 5MB → compressed (gzip) → pushed via HTTP/S3 to a destination → lands as NDJSON files partitioned by date/time.

Batch behaviour

Batch trigger: 5-minute tick OR 5 MB size (whichever first).
Max batch: 5 MB compressed → ~50 MB uncompressed.
Delivery: at-least-once (rare duplicates; the SIEM must dedupe by event ID).
Ordering: NOT guaranteed inside a batch; use the timestamp field.
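
Because delivery is at-least-once and unordered, a consumer usually has to dedupe and re-sort each batch itself. The following is a minimal sketch, not Logpush tooling: the field names `EventID` and `Datetime` are placeholders (the real ID and timestamp fields vary per dataset), and it assumes RFC 3339 timestamps in UTC with a uniform format, which sort correctly as plain strings.

```python
import json
from typing import Iterable


def dedupe_and_sort(lines: Iterable[str], id_field: str, ts_field: str) -> list[dict]:
    """Parse NDJSON lines, drop repeated event IDs, and sort by timestamp."""
    seen: set[str] = set()
    events: list[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        event = json.loads(line)
        key = str(event[id_field])
        if key in seen:
            continue  # a retried batch delivered this event again
        seen.add(key)
        events.append(event)
    # Assumption: uniform RFC 3339 UTC strings, so string order == time order.
    return sorted(events, key=lambda e: e[ts_field])


batch = [
    '{"EventID": "b", "Datetime": "2026-05-15T10:00:02Z"}',
    '{"EventID": "a", "Datetime": "2026-05-15T10:00:01Z"}',
    '{"EventID": "b", "Datetime": "2026-05-15T10:00:02Z"}',
]
ordered = dedupe_and_sort(batch, "EventID", "Datetime")
```

In a SIEM the same idea is normally expressed as a dedupe rule at ingest; the in-memory set here only works for a single batch or a small window.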

Format options

{
  "output_options": {
    "output_type": "ndjson",           // or "csv"
    "timestamp_format": "rfc3339",     // or "unix"
    "field_names": [],                 // empty = all
    "field_delimiter": ",",            // csv only
    "record_delimiter": "\n"
  }
}

Destination types

Object storage: R2, S3, GCS, Azure Blob. Path template: dataset/{DATE}/{TIME}.
HTTP webhook: generic — Splunk HEC, Datadog, Sumo, custom.
Native SIEM: Sentinel connector (Microsoft ecosystem), official Splunk app.

Path templating

r2://gateway-logs/{dataset}/year={YEAR}/month={MONTH}/day={DAY}/hour={HOUR}/{UUID}.json.gz

A Hive-partitioned layout → Athena/DuckDB queries are efficient. year/month/day partitioning is the standard.
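
To show how a reader of this layout narrows a query to a time range, here is a small sketch that expands a UTC interval into the hourly prefixes of the template above. The function name is hypothetical, and it assumes timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone


def hour_prefixes(dataset: str, start: datetime, end: datetime) -> list[str]:
    """Return one Hive-style prefix per hour that overlaps [start, end)."""
    # Assumption: start and end are timezone-aware; both are normalised to UTC.
    current = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end = end.astimezone(timezone.utc)
    prefixes = []
    while current < end:
        prefixes.append(
            f"{dataset}/year={current:%Y}/month={current:%m}"
            f"/day={current:%d}/hour={current:%H}/"
        )
        current += timedelta(hours=1)
    return prefixes


window = hour_prefixes(
    "gateway_dns",
    datetime(2026, 5, 15, 22, 30, tzinfo=timezone.utc),
    datetime(2026, 5, 16, 1, 0, tzinfo=timezone.utc),
)
```

Listing or querying only these prefixes is what keeps a scan from touching the rest of the bucket.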

Filter at source

{
  "filter": "{\"where\":{\"key\":\"Action\",\"operator\":\"eq\",\"value\":\"block\"}}"
}

Example: push only block events — reduces volume 80–90% for the DNS dataset.
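
The filter value is itself a JSON document serialised into a string, which is why the quotes are escaped. Hand-escaping is error-prone, so one option is to let a serialiser do it. A minimal sketch (the helper name is hypothetical):

```python
import json


def logpush_filter(key: str, operator: str, value: str) -> str:
    """Return the filter as a compact JSON string, ready to embed in a job body."""
    return json.dumps(
        {"where": {"key": key, "operator": operator, "value": value}},
        separators=(",", ":"),
    )


# The inner document is encoded once here and once more when the body is dumped.
job_patch = {"filter": logpush_filter("Action", "eq", "block")}
body = json.dumps(job_patch)
print(body)
```

The printed body contains the same backslash-escaped quotes as the hand-written example.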

R2 as a data lake

Why R2 over S3

Egress free — querying R2 data costs no egress (S3 does).
Native Cloudflare integration — direct Logpush, no IAM friction.
Cheap storage ~$15/TB/month, comparable to S3, but free egress is the big win.

Setup

Create an R2 bucket:

wrangler r2 bucket create gateway-logs

Create an API token for Logpush:

Dashboard → R2 → Manage R2 API Tokens → create a token with Object Read & Write.

Configure Logpush:

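# Unquoted heredoc delimiter so the shell expands ${ACCOUNT_ID}, ${R2_KEY}, and ${R2_SECRET} in the body.
curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/logpush/jobs" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data @- <<EOF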
{
  "name": "gateway-dns-to-r2",
  "dataset": "gateway_dns",
  "destination_conf": "r2://gateway-logs/dns/year={YEAR}/month={MONTH}/day={DAY}/?account-id=${ACCOUNT_ID}&access-key-id=${R2_KEY}&secret-access-key=${R2_SECRET}",
  "output_options": {
    "output_type": "ndjson",
    "timestamp_format": "rfc3339"
  },
  "enabled": true
}
EOF

Verify:

wrangler r2 object list gateway-logs --prefix dns/

Batches appear within 5–10 minutes.

Query with DuckDB (local)

-- Install the httpfs extension
INSTALL httpfs;
LOAD httpfs;

-- Configure R2 credentials
SET s3_endpoint='<account-id>.r2.cloudflarestorage.com';
SET s3_access_key_id='<r2-key>';
SET s3_secret_access_key='<r2-secret>';
SET s3_url_style='path';

-- Query
SELECT
  Email,
  COUNT(*) as blocked_count,
  ARRAY_AGG(DISTINCT DNSQuestion) as domains
FROM read_json('s3://gateway-logs/dns/year=2026/month=05/day=15/*.json.gz')
WHERE Action = 'block'
GROUP BY Email
ORDER BY blocked_count DESC
LIMIT 20;

Runs locally — no cluster required.

Query with Athena (AWS)

R2 is S3-compatible; Athena queries it directly if the bucket is public or a cross-account role is configured. Pattern:

A Glue Crawler scans R2 partitions → catalog table.
Athena SQL runs over the catalog.
Use CTAS (Create Table As Select) for an aggregate layer.

Cost: Athena at $5/TB scanned — partition pruning matters.
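
A back-of-envelope sketch of why pruning matters at that rate. The 2 TB yearly dataset is a made-up figure for illustration, and 1 TB is treated as 10^12 bytes to keep the arithmetic simple.

```python
PRICE_PER_TB_SCANNED_USD = 5.00  # the rate quoted above


def scan_cost_usd(bytes_scanned: float) -> float:
    """Cost of one query that scans the given number of bytes."""
    return bytes_scanned / 1e12 * PRICE_PER_TB_SCANNED_USD


YEAR_BYTES = 2e12            # hypothetical: one year of logs, 2 TB
DAY_BYTES = YEAR_BYTES / 365

full_scan = scan_cost_usd(YEAR_BYTES)  # no partition filter
one_day = scan_cost_usd(DAY_BYTES)     # WHERE year/month/day pins a single day
print(f"full scan: ${full_scan:.2f}, single day: ${one_day:.4f}")
```

Under these assumptions a query pinned to one day costs 1/365 of the unfiltered one, and a dashboard that refreshes many times a day multiplies the difference.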

SIEM integration patterns

SIEM integration: Logpush → native connectors (Splunk HEC, Sentinel Data Connector, Datadog HTTP, Sumo HTTP Source). Correlation across datasets in the SIEM's query language. Alert routing to PagerDuty or a SOC workflow.

Splunk

Option A — HEC direct:

{
  "destination_conf": "splunk://hec.splunk.company.com/services/collector/raw?header_Authorization=Splunk%20<token>&channel=<uuid>&header_Content-Type=application%2Fjson&insecure-skip-verify=false",
  "dataset": "gateway_dns"
}

Option B — S3 → Splunk pull (cost-effective for high volume):

Logpush → R2 → Splunk SmartStore or a custom pull script.

SPL example — DoH bypass detection:

index=cloudflare sourcetype=gateway_network
| where match(SNI, "dns\\.google|dns\\.quad9\\.net|doh\\.opendns\\.com")
| where Action="block"
| stats count by UserID, Email, SNI
| where count > 5
| sort -count

Microsoft Sentinel

Native Cloudflare connector (in Content Hub):

Install the “Cloudflare (using Azure Function)” connector.
Provide a Cloudflare API token + account ID.
Data flows to the Log Analytics workspace.

Alternative: Logpush → Sentinel via Event Hub.

KQL example — cross-layer correlation:

let timeframe = 1h;
Cloudflare_Gateway_DNS_CL
| where TimeGenerated > ago(timeframe)
| where Action_s == "block"
| where Categories_s contains "Malware"
| project UserID=UserID_s, blockedDomain=DNSQuestion_s, t_dns=TimeGenerated
| join kind=inner (
    Cloudflare_Gateway_Network_CL
    | where TimeGenerated > ago(timeframe)
    | where Action_s == "block"
    | project UserID=UserID_s, blockedIP=DestinationIP_s, t_net=TimeGenerated
) on UserID
| where t_net between (t_dns .. t_dns + 5m)
| project UserID, blockedDomain, blockedIP, t_dns, t_net

Detects: a user’s DNS lookup was blocked, and within 5 minutes they tried a direct IP — a multi-stage attempt.
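
The same window join can be run offline against records pulled from R2, for example when testing a rule before it goes into the SIEM. This is a sketch under assumptions: the dict keys (`UserID`, `time`, `domain`, `ip`) are simplified stand-ins for the real dataset fields, and `time` is already parsed into a `datetime`.

```python
from datetime import datetime, timedelta


def correlate(dns_blocks, net_blocks, window=timedelta(minutes=5)):
    """Pair each DNS block with network blocks by the same user inside the window."""
    by_user = {}
    for net in net_blocks:
        by_user.setdefault(net["UserID"], []).append(net)

    matches = []
    for dns in dns_blocks:
        for net in by_user.get(dns["UserID"], []):
            # Inclusive bounds, matching the KQL `between` operator.
            if dns["time"] <= net["time"] <= dns["time"] + window:
                matches.append((dns["UserID"], dns["domain"], net["ip"]))
    return matches


t0 = datetime(2026, 5, 15, 10, 0, 0)
dns = [{"UserID": "u1", "time": t0, "domain": "bad.example"}]
net = [
    {"UserID": "u1", "time": t0 + timedelta(minutes=3), "ip": "203.0.113.7"},
    {"UserID": "u1", "time": t0 + timedelta(minutes=9), "ip": "203.0.113.8"},
    {"UserID": "u2", "time": t0 + timedelta(minutes=1), "ip": "203.0.113.9"},
]
hits = correlate(dns, net)
```

Only the first network event qualifies: the second falls outside the 5-minute window and the third belongs to another user.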

Elastic / OpenSearch

Logstash pipeline:

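input {
  s3 {
    bucket => "gateway-logs"
    prefix => "dns/"                                   # one input block per dataset prefix
    endpoint => "https://<account-id>.r2.cloudflarestorage.com"
    access_key_id => "<r2-key>"
    secret_access_key => "<r2-secret>"
    codec => "json"                                    # one JSON object per line
    add_field => { "dataset" => "gateway_dns" }        # tag events so the filter below can match
  }
}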

filter {
  if [dataset] == "gateway_dns" {
    mutate { add_field => { "[@metadata][index]" => "cf-gateway-dns-%{+YYYY.MM.dd}" } }
  }
}

output {
  elasticsearch {
    hosts => ["https://es.company.com:9200"]
    index => "%{[@metadata][index]}"
  }
}

Datadog

HTTP webhook destination:

datadog://http-intake.logs.datadoghq.com/api/v2/logs?header_DD-API-KEY=<key>&ddsource=cloudflare&service=gateway

Datadog parses JSON automatically and applies a pipeline for tagging.

Cross-layer correlation rules

Attack chain detection: the same actor is blocked at DNS, then tries a direct IP at the Network layer, then a direct HTTP path. Correlate by UserID across datasets inside a 15-minute window to detect the multi-stage attempt.

Rule 1 — DoH bypass attempt

Signal: the same user is blocked at DNS and also blocked for a DoH destination at Network within the same 15 minutes.

Threat model: malware first tries the system resolver → blocked → falls back to DoH → also blocked. A confirmed compromise attempt.

let win = 15m;
let dns_blocks = Cloudflare_Gateway_DNS_CL
  | where Action_s == "block" and Categories_s has "Malware"
  | project UserID=UserID_s, t1=TimeGenerated, dns_q=DNSQuestion_s;
let doh_blocks = Cloudflare_Gateway_Network_CL
  | where Action_s == "block" and PolicyName_s has "DoH"
  | project UserID=UserID_s, t2=TimeGenerated, sni=SNI_s;
dns_blocks
| join kind=inner doh_blocks on UserID
| where t2 between (t1 .. t1 + win)
| project UserID, dns_q, sni, t1, t2

Severity: high — open a ticket, isolate the device.

Rule 2 — Credential stuffing on Access

Signal: N failed Access logins against the same app in 10 minutes from different IPs.

Cloudflare_Access_CL
| where TimeGenerated > ago(10m)
| where Result_s == "blocked"
| summarize ip_count=dcount(IP_s), attempts=count() by App_s
| where attempts > 20 and ip_count > 5

Rule 3 — Session hijack indicator

Signal: the same UserID + SessionID appearing from different IPs and different countries within 5 minutes.

Cloudflare_Access_CL
| where TimeGenerated > ago(5m)
| summarize ip_list=make_set(IP_s), country_list=make_set(Country_s) by UserID_s, SessionID_s
| where array_length(ip_list) > 1 and array_length(country_list) > 1

Rule 4 — Data exfil burst

Signal: a user uploads > 1 GB through HTTP POST in 30 minutes to a destination that isn’t a corporate SaaS.

Cloudflare_Gateway_HTTP_CL
| where TimeGenerated > ago(30m)
| where Method_s == "POST" and ContentLength_d > 0
| where Host_s !in ("drive.google.com", "onedrive.live.com", "s3.company.com")
| summarize total_bytes=sum(ContentLength_d) by UserID_s, Email_s
| where total_bytes > 1073741824  // 1 GB

Rule 5 — Impossible travel

Signal: the same user logs in from two countries far enough apart that the implied travel speed exceeds 1,000 km/h.

Cloudflare_Access_CL
| where Result_s == "allowed"
| project UserID=UserID_s, Country_s, Latitude_d, Longitude_d, TimeGenerated
| sort by UserID asc, TimeGenerated asc
| extend prev_user=prev(UserID), prev_country=prev(Country_s), prev_time=prev(TimeGenerated),
         prev_lat=prev(Latitude_d), prev_lon=prev(Longitude_d)
| where UserID == prev_user  // compare only consecutive logins of the same user
| extend time_diff_h = (TimeGenerated - prev_time) / 1h
| extend dist_km = geo_distance_2points(prev_lon, prev_lat, Longitude_d, Latitude_d) / 1000
| where time_diff_h > 0 and time_diff_h < 12 and dist_km / time_diff_h > 1000
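
For testing the rule outside the SIEM, the same check can be sketched in a few lines. This is an illustration, not the Sentinel implementation: it uses the haversine formula on a spherical Earth (radius 6,371 km) in place of `geo_distance_2points`, the record keys (`user`, `time`, `lat`, `lon`) are placeholders, and the previous login is tracked per user so rows from different users are never compared.

```python
import math
from datetime import datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def impossible_travel(logins, max_kmh=1000.0, max_gap_h=12.0):
    """Flag consecutive logins of one user whose implied speed exceeds max_kmh."""
    flagged = []
    last_by_user = {}
    for login in sorted(logins, key=lambda l: l["time"]):
        prev = last_by_user.get(login["user"])
        last_by_user[login["user"]] = login
        if prev is None:
            continue
        hours = (login["time"] - prev["time"]).total_seconds() / 3600
        if not 0 < hours < max_gap_h:
            continue
        km = haversine_km(prev["lat"], prev["lon"], login["lat"], login["lon"])
        if km / hours > max_kmh:
            flagged.append((login["user"], prev["time"], login["time"]))
    return flagged


logins = [
    {"user": "alice", "time": datetime(2026, 5, 15, 8), "lat": 51.5074, "lon": -0.1278},
    {"user": "bob", "time": datetime(2026, 5, 15, 9), "lat": 51.5074, "lon": -0.1278},
    {"user": "alice", "time": datetime(2026, 5, 15, 10), "lat": 40.7128, "lon": -74.0060},
    {"user": "bob", "time": datetime(2026, 5, 15, 11), "lat": 48.8566, "lon": 2.3522},
]
flagged = impossible_travel(logins)
```

Here alice appears in London and then New York two hours later and is flagged; bob's London to Paris trip in two hours is not.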

Retention strategy

Tiered storage

Tier	Location	Retention	Cost (/TB/mo)	Use
Hot	Cloudflare dashboard	30d	included	dashboard query, debug
Warm	R2	1y	~$15	forensics, SIEM source
Cold	R2 archive / Glacier	7y	~$1–4	compliance, litigation

R2 lifecycle policy

{
  "rules": [
    {
      "id": "archive-after-90d",
      "prefix": "gateway-logs/",
      "transitions": [
        { "days": 90, "storage_class": "INFREQUENT_ACCESS" }
      ]
    },
    {
      "id": "delete-after-7y",
      "prefix": "gateway-logs/",
      "expiration": { "days": 2555 }
    }
  ]
}

Compliance minimums

PCI DSS 4.0: 12 months minimum for cardholder-related audit logs, with the most recent 3 months immediately available for analysis.
HIPAA: 6 years.
SOC 2: 1 year for audit, 3–7 years for security events.
GDPR: minimum necessary, delete once the purpose is complete (usually < 1 year for detail, longer for aggregates).

Cost control

Volume reduction

1. Filter at Logpush — only ship events that matter:

"filter": "{\"where\":{\"key\":\"Action\",\"operator\":\"ne\",\"value\":\"allow\"}}"

Skip allowed events → volume drops 80–90% for DNS/Network.

2. Sampling — random-sample allowed events for a baseline:

Cloudflare doesn’t natively support sampling at Logpush → work around it by shipping everything and sampling at the SIEM ingest pipeline.
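
One way to sample at the ingest pipeline is a deterministic hash of the event ID: every non-allow event is kept, and allowed events are kept at a fixed rate. A minimal sketch, assuming each record carries some unique ID (the `EventID` name is a placeholder) and an `Action` field:

```python
import hashlib


def keep_event(event: dict, sample_rate: float = 0.01, id_field: str = "EventID") -> bool:
    """Keep every non-allow event; keep a deterministic sample of allowed ones."""
    if event.get("Action") != "allow":
        return True
    digest = hashlib.sha256(str(event[id_field]).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


events = [{"EventID": i, "Action": "allow"} for i in range(10_000)]
kept = sum(keep_event(e, sample_rate=0.1) for e in events)
print(f"kept {kept} of {len(events)} allowed events")
```

Hashing instead of calling a random generator means a duplicated event gets the same decision on every delivery, so sampling does not interfere with deduplication.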

3. Compression — Logpush gzip by default. When pushing to an HTTP webhook, verify the endpoint supports gzip.

4. Dataset selection — not every dataset belongs in the SIEM:

gateway_dns blocked only → SIEM.
gateway_dns allowed → R2 raw (cold).
gateway_network blocked only → SIEM.
audit_logs 100% → SIEM (compliance).

Storage budget math

Enterprise, 1,000 users, 50M combined events/day:

NDJSON ~500 bytes/event → ~100 bytes compressed.
25 GB/day uncompressed → 5 GB/day compressed.
Year: ~1.8 TB compressed.
R2: 1,800 GB × $0.015 = ~$27/month once a full year is stored. Still cheap.
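
The same arithmetic as a reusable sketch, so the estimate can be rerun for another tenant. The function name is hypothetical, and the example inputs are deliberately a different, smaller workload (10M events/day) rather than the one worked above.

```python
def storage_budget(events_per_day: int, raw_bytes_per_event: int,
                   compression_ratio: float, usd_per_gb_month: float,
                   retention_days: int = 365) -> dict:
    """Estimate daily volume, steady-state stored size, and monthly storage cost."""
    raw_gb_day = events_per_day * raw_bytes_per_event / 1e9
    stored_gb_day = raw_gb_day / compression_ratio
    steady_state_gb = stored_gb_day * retention_days  # size once retention is full
    return {
        "raw_gb_day": raw_gb_day,
        "stored_gb_day": stored_gb_day,
        "steady_state_gb": steady_state_gb,
        "usd_per_month": steady_state_gb * usd_per_gb_month,
    }


# Illustrative inputs: 10M events/day, 500 bytes each, 5x gzip, $0.015 per GB-month.
budget = storage_budget(10_000_000, 500, 5.0, 0.015)
print(budget)
```

The monthly figure is the cost at steady state; during the first year the bill grows linearly toward it.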

SIEM ingest (Splunk): roughly $1,800/GB/year typical. Filter aggressively before shipping.

Compression ratio

NDJSON is repetitive (field names repeat) → gzip hits 5–10×. Parquet compresses better, but Logpush doesn’t output Parquet natively → convert at the R2 layer with a cron job.
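
The effect of repeated field names is easy to measure. The sketch below gzips synthetic NDJSON; the records are invented and far more regular than real logs, so the ratio it prints is optimistic compared with the 5–10× range above.

```python
import gzip
import json


def gzip_ratio(records: list[dict]) -> float:
    """Uncompressed NDJSON size divided by its gzip-compressed size."""
    raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


# Synthetic records: field names repeat on every line, values vary a little.
sample = [
    {
        "Datetime": f"2026-05-15T10:{i // 60 % 60:02d}:{i % 60:02d}Z",
        "Email": f"user{i % 50}@example.com",
        "QueryName": f"host{i % 200}.example.com",
        "Action": "block" if i % 10 == 0 else "allow",
    }
    for i in range(5000)
]
print(f"gzip ratio: {gzip_ratio(sample):.1f}x")
```

Running the same function over a real batch downloaded from R2 gives the number to plug into the storage budget.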

Ongoing operations

Monitor Logpush health

# Check job status
curl "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/logpush/jobs" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" | jq '.result[] | {name, enabled, last_complete, last_error}'

Alert when last_complete is more than 30 minutes old: with 5-minute batches, that usually means the job is stuck.
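
The staleness check can be sketched as a function over the job list returned by the API call above. It assumes `last_complete` is an RFC 3339 string or null, and it only uses the `name`, `enabled`, and `last_complete` fields that the jq filter selects; fetching the JSON and sending the alert are left out.

```python
from datetime import datetime, timedelta, timezone


def stale_jobs(jobs: list[dict], now: datetime,
               max_age: timedelta = timedelta(minutes=30)) -> list[str]:
    """Return the names of enabled jobs whose last_complete is missing or too old."""
    stale = []
    for job in jobs:
        if not job.get("enabled"):
            continue  # disabled jobs are expected to be silent
        last = job.get("last_complete")
        if last is None:
            stale.append(job["name"])  # enabled but never completed
            continue
        completed = datetime.fromisoformat(last.replace("Z", "+00:00"))
        if now - completed > max_age:
            stale.append(job["name"])
    return stale


now = datetime(2026, 5, 15, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"name": "gateway-dns-to-r2", "enabled": True, "last_complete": "2026-05-15T11:55:00Z"},
    {"name": "gateway-http-to-r2", "enabled": True, "last_complete": "2026-05-15T10:45:00Z"},
    {"name": "old-job", "enabled": False, "last_complete": None},
    {"name": "new-job", "enabled": True, "last_complete": None},
]
print(stale_jobs(jobs, now))
```

Passing `now` in explicitly keeps the function testable; a scheduled check would call it with `datetime.now(timezone.utc)`.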

Alerting patterns

Logpush job failed > 3 consecutive runs → page.
Log volume drops > 50% below baseline → investigate (job stuck? CF outage? misconfigured filter?).
SIEM ingestion lag > 15 minutes → SOC escalation.
Storage approaching quota → alert before the cap.
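
The volume rule can be sketched as a comparison against a rolling baseline. Using the median of recent daily counts as the baseline is an assumption made here (it resists a single outlier day); the 50% drop threshold comes from the list above, and the surge multiplier is an illustrative default.

```python
from statistics import median
from typing import Optional


def volume_alert(history: list[int], today: int,
                 drop: float = 0.5, surge: float = 10.0) -> Optional[str]:
    """Return "drop", "surge", or None by comparing today to the median of history."""
    baseline = median(history)
    if baseline == 0:
        return None  # no meaningful baseline yet
    ratio = today / baseline
    if ratio < 1 - drop:
        return "drop"   # more than `drop` below baseline
    if ratio >= surge:
        return "surge"  # at least `surge` times baseline
    return None


last_week = [48_000_000, 51_000_000, 50_000_000, 49_000_000,
             52_000_000, 50_000_000, 47_000_000]
print(volume_alert(last_week, 20_000_000))
```

A "drop" result is the one worth paging on: a stuck job, a bad filter, and tampering all look the same from the volume side.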

Weekly pipeline validation

Generate a synthetic event (a DNS query to a canary domain that policy blocks).
Verify it lands in: CF dashboard (5 min) → R2 (10 min) → SIEM (15 min).
Document in /runbooks/logs-pipeline-health-check.md.

Troubleshooting

“Logpush job fails intermittently”

Check the last_error field.
Common causes: destination quota, auth token expired, network issue.
CF auto-retries three times over 15 minutes; if all fail, the next batch retries.

“The SIEM isn’t getting new logs”

Does the event exist in the dashboard? If not, upstream issue.
Logpush job status: active, last_complete recent?
Destination reachable? Test with curl.
SIEM ingestion queue backlogged?
Parser rule rejecting the event? Check the SIEM error log.

“Log volume suddenly 10× baseline”

New decrypt policy enabled → HTTP volume surge.
A new DNS location with heavy traffic.
An attack scenario (scan, DDoS).
Check the Logpush dashboard total-events/day trend.

“Athena query is expensive”

Partition pruning not working → check that partitions exist in the Glue catalog.
Full scan over a large prefix → add a WHERE year/month/day filter.
Convert NDJSON → Parquet for frequently-queried ranges.

“Duplicate events in the SIEM”

Logpush is at-least-once. CF retries can duplicate.
Dedupe by (Timestamp, EventID) in the SIEM.
Event-ID field varies per dataset — check the schema docs.

Checklist — production logs pipeline

Dataset coverage:

access_requests → SIEM 100%.
audit_logs → SIEM 100%.
gateway_dns blocks → SIEM, all events → R2.
gateway_network blocks → SIEM, all events → R2.
gateway_http blocks + sensitive paths → SIEM, all → R2.
device_posture_results → SIEM daily summary.
zero_trust_dex_test_results → SIEM.

Infrastructure:

R2 bucket created + lifecycle policy.
Logpush jobs configured per dataset.
Job-status monitoring + alerting.
Retention matches compliance (PCI 1y+, SOC 3y+, HIPAA 6y).

SIEM:

Native connector or HEC/webhook configured.
Parser/dashboard for each dataset.
Cross-layer correlation rules deployed.
Alert routing to SOC/PagerDuty.

Detection:

Operations:

Weekly synthetic-event verification.
Monthly log-volume review.
Quarterly detection-rule tuning.
Runbook for “pipeline down”.

Lessons from practice

Ship every dataset and enable every rule at once → ingestion cost explodes. Start lean: blocks-only + audit 100%, add allowed traffic when a detection use case emerges.
NDJSON → Parquet conversion saves 60–70% storage + query cost. Worth a cron job after 30 days.
Cross-layer correlation matters. Single-layer alerts are noisy; multi-layer joins cut false positives sharply.
Attackers target the log system. A “log volume drop” detection rule is critical: logging that suddenly goes quiet is often the first sign of an intrusion.
R2 is the sweet spot for mid-volume. S3 is expensive on egress; BigQuery is overkill; Elastic ingest is expensive.
Tune the SIEM quarterly. New apps, new SaaS, new threats → detection rules have to update. Stale rules = false sense of security.
Compression + partitioning is not “nice to have” — at production scale, querying unpartitioned data costs hundreds of dollars per run.
Test synthetic events weekly. A pipeline runs silent until it breaks. A canary event is the only way to know it’s alive.

Summary

The logs pipeline is the foundation of the Observability & Ops block. Without it, Zero Trust is prevention-only — attackers try every vector and you don’t see the pattern.

Production recipe:

Logpush → R2 data lake, 1 year.
SIEM ingest: blocks + audit + high-value signals.
Cross-layer correlation is the superpower — join UserID across datasets.
Retention tiered: hot/warm/cold, aligned with compliance.
Cost control: filter at the source, don’t ship everything to the SIEM.

One line to remember:

Logs aren’t data — they’re proof of control. Without cross-layer correlation, Zero Trust is zero visibility.

Part 15 switches to DEX — Digital Experience Monitoring: measuring latency, WARP health, and app reachability from the end user’s perspective, to spot issues before the helpdesk ticket arrives.

References

In this series:

← Part 13: Network policy L4
Next → Part 15: DEX — Digital Experience Monitoring
All parts: Cloudflare One Handbook series