Running CSPM across a dozen AWS Landing Zones

TL;DR

The engine is an enumerate → dispatch → scan → normalize pipeline: Prowler as the scanner, D1 as the metadata index, and R2 for raw artifacts (JSON, HTML reports, screenshots).

Landing Zone is the primary unit, not the individual account — each LZ owns its audit role, region scope, SLA, owning team.

Cap concurrency at ~3 accounts per Landing Zone: 10-way parallelism triggered frequent AWS ThrottlingException errors; dropping to 3 traded a little speed for reliable data.

Don’t dedupe too early — collapsing on finding_id hides the same misconfig in another region; the dedup key must be cloud_provider + landing_zone_id + account_id + region + service + resource_id + control_id.

Severity ≠ risk: the same “Critical public S3 bucket” can be Low (static CSS) or Critical (customer export with PII) — enrich findings with business context.

Most useful view is “Critical/High on production, public-facing, sensitive data, open >7 days, no owner” — not “all critical findings”.

Engine refreshes the account list every 15 minutes; the next direction is event-driven from AWS Organizations for near-real-time detection of new accounts/regions.

When an AWS footprint is small, assessing security posture is relatively simple: open the AWS Console, check a few important services, run some AWS CLI commands, or use a tool like Prowler to export a report.

But once an organisation runs dozens of AWS accounts, split into multiple Landing Zones by business unit, environment, region, and owning team, the problem is no longer “scan an account”.

The real question becomes:

How do we know what the overall AWS posture looks like, which account is the riskiest, which finding to fix first, and who owns remediation?

This post describes how I built an in-house CSPM engine that scans a dozen-plus AWS Landing Zones in parallel with Prowler, stores findings in Cloudflare D1, keeps artifacts in R2, and aggregates everything into a single dashboard for Security Operations.

Context

In an enterprise environment, workloads rarely sit in a single AWS account.

The AWS estate is typically organised across several dimensions:

Landing Zones by business unit or operational domain.
Accounts by workload, environment, or ownership.
Regions by latency, compliance, or deployment footprint.
Security baselines that differ between production, staging, sandbox, and shared services.
Owners split across the cloud team, app teams, platform team, and security team.

At small scale, running CSPM can be as simple as:

prowler aws

At enterprise scale, “what does posture look like today?” quickly fragments into narrower questions:

Which Landing Zone has the most critical findings?
Which account drifts furthest from the baseline?
Which findings repeat across multiple regions?
Which resources are public-facing?
Which findings touch sensitive data?
Which team owns each one?
Which findings have been open for many days without remediation?
Which compliance framework is failing most often?

Without a consistent data shape from day one, a CSPM dashboard devolves into a dumping ground for findings — plenty of alerts, but little operational signal.

Design goals

We did not set out to build an “in-house Wiz” on day one.

The initial goals were pragmatic:

Scan multiple AWS Landing Zones.
Scan many accounts and regions in parallel.
Normalise findings into a shared schema.
Query quickly by Landing Zone, account, region, severity, framework, and owner.
Separate metadata from artifacts to keep storage costs reasonable.
Build a dashboard that Security Operations actually uses daily.

The critical constraint was that the system had to serve two different needs:

Audit / compliance view: what is passing or failing which framework.
Operational risk view: which finding to fix first given the business context.

High-level architecture

At a high level, the CSPM engine is an enumerate → dispatch → scan → normalize pipeline that splits output between D1 (metadata index) and R2 (raw artifacts). The dashboard reads D1 and drills into R2 only when an investigator needs the evidence.

Figure: CSPM high-level pipeline. AWS Landing Zones flow through Enumerator → Dispatcher (~3 accounts in parallel) → Prowler workers → Normalizer; the output splits into the Cloudflare D1 metadata index and Cloudflare R2 raw artifacts, with the Security dashboard reading D1 and drilling into R2 on demand.

1. Enumerate: define the scan scope

The first stage determines what needs to be scanned.

Instead of hard-coding AWS accounts, the engine treats the Landing Zone as the primary operational unit. Each Landing Zone may have:

Its own SSO Start URL.
Its own list of accounts.
Its own audit role.
Its own region scope.
Its own owner.
Its own scan schedule.
Its own remediation SLA.

A minimal Landing Zone record:

{
  "landing_zone_id": "lz-retail-prod",
  "name": "Retail Production Landing Zone",
  "cloud": "aws",
  "sso_start_url": "https://example.awsapps.com/start",
  "audit_role_name": "SecurityAuditRole",
  "regions": ["ap-southeast-1", "us-east-1"],
  "owner": "retail-platform-team",
  "scan_enabled": true
}

From there, the engine enumerates the accounts and regions to scan. Each (account, region) pair is one scan unit:

Figure: enumerate structure. A Landing Zone fans out to accounts (A, B, C); each account further enumerates its regions (ap-southeast-1, us-east-1) to form scan units. There is no static account list; new accounts are picked up on the next enumerate pass.

This approach avoids a static list of accounts. When a new account is added to a Landing Zone, the engine picks it up on the next refresh.
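A minimal sketch of the fan-out, assuming the Landing Zone record shape shown above. The `ScanUnit` type, the `enumerate_scan_units` function, and the account IDs are hypothetical names and placeholders for illustration; in the real engine the account list would come from the latest inventory refresh.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ScanUnit:
    """One (account, region) pair inside a Landing Zone (hypothetical type)."""
    landing_zone_id: str
    account_id: str
    region: str


def enumerate_scan_units(landing_zone: dict, account_ids: list[str]) -> Iterator[ScanUnit]:
    """Yield one ScanUnit per (account, region) pair for an enabled Landing Zone."""
    if not landing_zone.get("scan_enabled", False):
        return  # disabled Landing Zones produce no scan units
    for account_id in account_ids:
        for region in landing_zone["regions"]:
            yield ScanUnit(landing_zone["landing_zone_id"], account_id, region)


landing_zone = {
    "landing_zone_id": "lz-retail-prod",
    "regions": ["ap-southeast-1", "us-east-1"],
    "scan_enabled": True,
}
# Placeholder account IDs, not real accounts.
units = list(enumerate_scan_units(landing_zone, ["111111111111", "222222222222"]))
```

Two accounts and two regions give four scan units. Because the units are derived on every pass, nothing has to be edited by hand when an account joins the Landing Zone.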

2. Dispatch: scan in parallel, but under control

A common mistake when scaling CSPM is assuming “more parallelism is always faster”.

In practice, scanning too many accounts simultaneously triggers AWS API throttling. Some service APIs have lower quotas than expected, particularly when the scanner calls many APIs across many regions.

The dispatcher is therefore designed around these constraints:

Each account-region pair is a scan unit.
Scans run in parallel.
Concurrency is capped per Landing Zone.
Retries use exponential backoff on throttling.
Each scan job has a timeout.
Job state is explicit: pending, running, succeeded, failed, partial.
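The retry rule can be sketched as below. This is an illustration under assumptions, not the engine's actual code: `ThrottledError` is a hypothetical exception the scan wrapper would raise when it sees throttling, and the delay values and the jitter strategy are arbitrary choices.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical marker raised when a scan hits AWS API throttling."""


def run_with_backoff(job, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call job(); on ThrottledError wait up to base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return job()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the dispatcher mark the job as failed
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # random jitter so retries do not align


attempts = {"count": 0}
waits = []


def flaky_scan():
    """Stand-in for a scan that is throttled twice before it succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("Rate exceeded")
    return "succeeded"


# Record the waits instead of sleeping so the example runs instantly.
result = run_with_backoff(flaky_scan, sleep=waits.append)
```

The upper bound of each wait doubles per attempt (1 s, 2 s, 4 s, ...) up to `max_delay`, which gives the throttled API time to recover instead of hammering it at a fixed interval.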

Dispatch logic, illustrated:

Figure: dispatcher states. Inside Landing Zone A with max_concurrent_accounts=3, three accounts are running and two are pending. When a running scan finishes, the scheduler promotes the next pending account.

The initial version scanned 10 accounts in parallel inside the same Landing Zone. On paper it was faster, but the data became unreliable because of frequent ThrottlingException errors.

After dropping the limit to around 3 concurrent accounts per Landing Zone, overall scan time grew slightly, but stability improved significantly.

In CSPM, fast scans are good. Fast scans with untrustworthy data make the dashboard worthless.
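One way to express the per-Landing-Zone cap is a semaphore around each account scan. The sketch below uses Python's asyncio and is an assumption about shape only; `dispatch_landing_zone`, `scan_account`, and the timeout value are hypothetical, and the fake scan stands in for a real Prowler run.

```python
import asyncio


async def dispatch_landing_zone(account_ids, scan_account,
                                max_concurrent_accounts=3, timeout_seconds=900.0):
    """Scan every account in one Landing Zone, never more than the cap at once."""
    semaphore = asyncio.Semaphore(max_concurrent_accounts)

    async def run_one(account_id):
        async with semaphore:  # pending accounts wait here until a slot frees up
            try:
                await asyncio.wait_for(scan_account(account_id), timeout=timeout_seconds)
                return account_id, "succeeded"
            except Exception:  # includes timeouts; a real engine would classify the error
                return account_id, "failed"

    results = await asyncio.gather(*(run_one(a) for a in account_ids))
    return dict(results)


async def _demo():
    running = peak = 0

    async def fake_scan(account_id):
        """Stand-in for a scan; tracks how many scans overlap."""
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1

    states = await dispatch_landing_zone([f"acct-{i}" for i in range(5)], fake_scan)
    return states, peak


states, peak = asyncio.run(_demo())
```

With five accounts and a cap of three, the demo never observes more than three scans in flight; the other two stay pending until a slot is released.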

3. Normalize: map findings into a shared schema

Prowler output is reasonably complete on its own. But aggregating findings from many accounts, regions, Landing Zones, and frameworks into a single dashboard requires a shared schema.

Findings are normalised into the following field groups:

Identity — finding_id, scan_id, landing_zone_id, account_id, account_name, region, cloud_provider
Resource — resource_id, resource_type, resource_name, service
Security — severity, status, risk_score, risk_reason, remediation
Compliance — framework, control_id, requirement_id
Ownership — owner_team, business_unit, environment
Time — first_seen_at, last_seen_at, resolved_at

The important rule: do not deduplicate too early.

The first version deduplicated by finding_id. It made the dashboard cleaner, but it also stripped away important context.

The same misconfiguration can show up in multiple regions. Dedup on finding_id alone collapses them into one row and the engineer only fixes one region — the other bucket stays public. The dedup key has to carry enough context:

Figure: dedup scope comparison. On the left (too narrow), dedup by finding_id hides the eu-west-1 bucket once ap-southeast-1 is listed; on the right (scope-aware), a composite key of cloud_provider + landing_zone_id + account_id + region + service + resource_id + control_id keeps both buckets visible with their own owner and remediation path.

With this key, every affected resource in every region stays a separate row, so findings remain accurate when the user drills down.
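A small sketch of the composite key in action. The field names come from the shared schema above; the sample findings, bucket names, and control ID are made-up placeholders, and keeping the most recently seen row per key is one possible policy rather than the only one.

```python
DEDUP_FIELDS = (
    "cloud_provider", "landing_zone_id", "account_id",
    "region", "service", "resource_id", "control_id",
)


def dedup_key(finding: dict) -> tuple:
    """Composite key: the same control on another region or resource is a different finding."""
    return tuple(finding[field] for field in DEDUP_FIELDS)


def dedupe(findings: list[dict]) -> list[dict]:
    """Keep the most recently seen finding per composite key."""
    latest = {}
    for finding in findings:
        key = dedup_key(finding)
        if key not in latest or finding["last_seen_at"] > latest[key]["last_seen_at"]:
            latest[key] = finding
    return list(latest.values())


base = {
    "finding_id": "public-bucket",  # placeholder: identical across regions
    "cloud_provider": "aws",
    "landing_zone_id": "lz-retail-prod",
    "account_id": "111111111111",
    "service": "s3",
    "control_id": "example_public_bucket_check",  # placeholder control name
}
findings = [
    {**base, "region": "ap-southeast-1", "resource_id": "bucket-a", "last_seen_at": "2026-04-27"},
    {**base, "region": "eu-west-1", "resource_id": "bucket-b", "last_seen_at": "2026-04-27"},
    {**base, "region": "eu-west-1", "resource_id": "bucket-b", "last_seen_at": "2026-04-28"},
]

by_finding_id = {f["finding_id"] for f in findings}
by_composite_key = dedupe(findings)
```

Grouping by `finding_id` leaves a single entry, while the composite key keeps one row per bucket and still folds the repeated eu-west-1 observation into its latest version.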

4. Render: a dashboard for Security Operations

A dashboard is not just for “looking at findings”. It has to help the security team make operational decisions.

The filters that matter:

Landing Zone
Account ID
Account name
Region
Severity
Risk score
Compliance framework
Service
Resource type
Owner team
Environment
First seen / last seen
Remediation status

The most useful view for Security Operations is usually not “all critical findings”, but:

Critical or High findings on production accounts, public-facing resources, with sensitive data classification, open for more than 7 days, and without an assigned owner.
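As a rough illustration, that view is a single query once the context lives next to the finding. The sketch runs against an in-memory SQLite database (D1 is SQLite-based) and assumes a hypothetical `findings_enriched` table: the `environment`, `public_facing`, and `data_classification` columns are not part of the minimum schema in this post, and the sample rows are invented.

```python
import sqlite3

PRIORITY_VIEW_SQL = """
SELECT id FROM findings_enriched
WHERE severity IN ('CRITICAL', 'HIGH')
  AND status = 'FAIL'
  AND environment = 'production'
  AND public_facing = 1
  AND data_classification = 'sensitive'
  AND owner_team IS NULL
  AND julianday(:now) - julianday(first_seen_at) > 7
ORDER BY first_seen_at
"""

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE findings_enriched (
  id TEXT PRIMARY KEY, severity TEXT, status TEXT, environment TEXT,
  public_facing INTEGER, data_classification TEXT, owner_team TEXT, first_seen_at TEXT
)""")
conn.executemany(
    "INSERT INTO findings_enriched VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        # Matches every condition: open 27 days, no owner.
        ("f-1", "CRITICAL", "FAIL", "production", 1, "sensitive", None, "2026-04-01"),
        # Already has an owner.
        ("f-2", "CRITICAL", "FAIL", "production", 1, "sensitive", "retail-platform-team", "2026-04-01"),
        # Open for only 3 days.
        ("f-3", "HIGH", "FAIL", "production", 1, "sensitive", None, "2026-04-25"),
        # Not a production account.
        ("f-4", "CRITICAL", "FAIL", "sandbox", 1, "sensitive", None, "2026-04-01"),
    ],
)
# "now" is passed in explicitly so the result does not depend on the wall clock.
priority_ids = [row[0] for row in conn.execute(PRIORITY_VIEW_SQL, {"now": "2026-04-28"})]
```

Only `f-1` survives all the conditions, which is the point: the list shrinks from "everything critical" to the findings nobody is handling yet.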

Two layers are kept distinct:

Layer	Meaning
Severity	Technical severity reported by the scanner
Risk score	Adjusted risk level after business context is applied

For example:

Finding	Severity	Business context	Risk
Public S3 bucket holding static CSS	Critical	No sensitive data	Low
Public S3 bucket holding customer export	Critical	Contains PII / customer data	Critical
Security group with SSH open internally	High	Private subnet only, ZTNA in front	Medium
IAM user with a stale access key	Medium	Holds admin in production	Critical

This distinction is what separates a scanner report from a CSPM dashboard that is actually usable.
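The adjustment can be sketched as a small rule function. The rules and flag names below are hypothetical and deliberately simplistic; they were picked only so that the four rows of the table above come out as shown, and a real scoring layer would need more inputs and a mapping to the integer `risk_score`.

```python
LEVELS = ["Low", "Medium", "High", "Critical"]


def risk_level(severity: str, *, sensitive_data: bool = False,
               privileged_in_production: bool = False,
               compensating_control: bool = False,
               public_by_design: bool = False) -> str:
    """Adjust scanner severity with business context (illustrative rules only)."""
    if sensitive_data or privileged_in_production:
        return "Critical"  # high blast radius overrides the scanner's rating
    if public_by_design:
        return "Low"  # exposure is intended and nothing sensitive is involved
    rank = LEVELS.index(severity)
    if compensating_control:
        rank = max(0, rank - 1)  # one step down when another control mitigates it
    return LEVELS[rank]


table = {
    "static CSS bucket": risk_level("Critical", public_by_design=True),
    "customer export bucket": risk_level("Critical", sensitive_data=True),
    "internal SSH security group": risk_level("High", compensating_control=True),
    "stale admin access key": risk_level("Medium", privileged_in_production=True),
}
```

The scanner severity is never overwritten; the adjusted level is stored alongside it, so the audit view and the operational view can each use the value they need.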

Why D1 and R2

Data splits into two categories:

Data	Location
Metadata, findings, scan status, index	Cloudflare D1
Raw Prowler JSON, HTML reports, exports, screenshots	Cloudflare R2

D1 for metadata

Findings are row-shaped data that is queried and filtered continuously.

Typical queries include:

SELECT *
FROM findings
WHERE landing_zone_id = ?
  AND severity IN ('HIGH', 'CRITICAL')
  AND status = 'FAIL'
ORDER BY risk_score DESC, last_seen_at DESC;

D1 fits the metadata layer because:

The data is structured.
Queries hit many dimensions.
The dashboard reads far more than it writes.
No separate database has to be operated.
It integrates cleanly with Cloudflare Workers.

R2 for artifacts

Raw scanner reports do not belong inside the database.

Artifacts such as:

Raw Prowler JSON
HTML reports
CSV exports
Evidence files
Screenshots
Debug logs

land in R2. D1 only keeps pointers:

{
  "scan_id": "scan_20260428_001",
  "raw_json_url": "r2://cspm-artifacts/prowler/raw/scan_20260428_001.json",
  "html_report_url": "r2://cspm-artifacts/prowler/html/scan_20260428_001.html"
}

This keeps the database small, queries fast, artifacts cheap to store, and lifecycle management straightforward.
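A tiny sketch of how such a pointer record could be derived from the scan ID, assuming the key layout in the example above. The `artifact_pointers` helper is hypothetical, and the `r2://` form is just the notation used in this post.

```python
def artifact_pointers(scan_id: str, bucket: str = "cspm-artifacts") -> dict:
    """Build the pointer record stored in D1 for one scan's artifacts in R2."""
    base = f"r2://{bucket}/prowler"
    return {
        "scan_id": scan_id,
        "raw_json_url": f"{base}/raw/{scan_id}.json",
        "html_report_url": f"{base}/html/{scan_id}.html",
    }


pointers = artifact_pointers("scan_20260428_001")
```

Deriving the keys from the scan ID keeps the layout predictable, which helps when applying lifecycle rules per prefix.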

Minimum schema

A minimum schema for the CSPM engine can start here:

CREATE TABLE landing_zones (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  cloud_provider TEXT NOT NULL,
  owner_team TEXT,
  scan_enabled INTEGER DEFAULT 1,
  created_at TEXT,
  updated_at TEXT
);

CREATE TABLE cloud_accounts (
  id TEXT PRIMARY KEY,
  landing_zone_id TEXT NOT NULL,
  account_id TEXT NOT NULL,
  account_name TEXT,
  environment TEXT,
  owner_team TEXT,
  status TEXT,
  created_at TEXT,
  updated_at TEXT,
  FOREIGN KEY (landing_zone_id) REFERENCES landing_zones(id)
);

CREATE TABLE scans (
  id TEXT PRIMARY KEY,
  landing_zone_id TEXT NOT NULL,
  account_id TEXT,
  region TEXT,
  scanner TEXT NOT NULL,
  status TEXT NOT NULL,
  started_at TEXT,
  finished_at TEXT,
  artifact_url TEXT,
  error_message TEXT,
  FOREIGN KEY (landing_zone_id) REFERENCES landing_zones(id)
);

CREATE TABLE findings (
  id TEXT PRIMARY KEY,
  scan_id TEXT NOT NULL,
  landing_zone_id TEXT NOT NULL,
  account_id TEXT NOT NULL,
  region TEXT,
  service TEXT,
  resource_id TEXT,
  resource_type TEXT,
  control_id TEXT,
  title TEXT,
  severity TEXT,
  status TEXT,
  risk_score INTEGER,
  risk_reason TEXT,
  framework TEXT,
  remediation TEXT,
  owner_team TEXT,
  first_seen_at TEXT,
  last_seen_at TEXT,
  resolved_at TEXT,
  FOREIGN KEY (scan_id) REFERENCES scans(id)
);

This is not the final shape, but it is enough to support the core use cases:

Scan tracking
Finding inventory
Dashboard filters
Risk prioritisation
Owner assignment
Historical trending
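For historical trending, one option is to let the database enforce the finding scope and update `last_seen_at` on every rescan. The sketch below is an assumption layered on the schema above, not part of it: it uses a trimmed `findings` table, a hypothetical unique index named `findings_scope`, and SQLite's `ON CONFLICT ... DO UPDATE` (available from SQLite 3.24). `cloud_provider` is left out of the index because the table above reaches it through `landing_zone_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE findings (
  id TEXT PRIMARY KEY,
  scan_id TEXT NOT NULL,
  landing_zone_id TEXT NOT NULL,
  account_id TEXT NOT NULL,
  region TEXT NOT NULL,
  service TEXT NOT NULL,
  resource_id TEXT NOT NULL,
  control_id TEXT NOT NULL,
  status TEXT,
  first_seen_at TEXT,
  last_seen_at TEXT
);
CREATE UNIQUE INDEX findings_scope
  ON findings (landing_zone_id, account_id, region, service, resource_id, control_id);
""")

UPSERT_SQL = """
INSERT INTO findings (id, scan_id, landing_zone_id, account_id, region, service,
                      resource_id, control_id, status, first_seen_at, last_seen_at)
VALUES (:id, :scan_id, :landing_zone_id, :account_id, :region, :service,
        :resource_id, :control_id, :status, :seen_at, :seen_at)
ON CONFLICT (landing_zone_id, account_id, region, service, resource_id, control_id)
DO UPDATE SET scan_id = excluded.scan_id,
              status = excluded.status,
              last_seen_at = excluded.last_seen_at
"""

scope = {
    "landing_zone_id": "lz-retail-prod", "account_id": "111111111111",
    "region": "eu-west-1", "service": "s3", "resource_id": "bucket-b",
    "control_id": "example_public_bucket_check",  # placeholder control name
    "status": "FAIL",
}
# The same finding reported by two consecutive scans.
conn.execute(UPSERT_SQL, {**scope, "id": "f-1", "scan_id": "scan_1", "seen_at": "2026-04-27"})
conn.execute(UPSERT_SQL, {**scope, "id": "f-2", "scan_id": "scan_2", "seen_at": "2026-04-28"})

rows = conn.execute(
    "SELECT id, scan_id, first_seen_at, last_seen_at FROM findings"
).fetchall()
```

The second scan updates the existing row instead of adding a new one: `first_seen_at` keeps the original date while `last_seen_at` and `scan_id` move forward, which is what an "open for more than N days" filter relies on.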

Three lessons

1. Do not deduplicate findings too early

Deduplication cleans up the dashboard, but done wrong it strips context.

Findings must be tracked at the right scope: Landing Zone + account + region + service + resource + control, never by control name or finding ID alone.

In cloud security, the same issue on a different account, region, or resource has a completely different remediation impact.

2. Parallel scans need rate control

Parallel scans are mandatory to scale.

Parallel does not mean firing API calls without limit.

The necessary controls:

Concurrency limit per Landing Zone
Concurrency limit per account
Retry with exponential backoff
Timeouts per job
Partial-result handling
Explicit scan status
Error classification

A good CSPM engine does not only know that a scan succeeded. It also knows where a scan failed and why: a credential issue, throttling, a disabled region, or a missing permission.
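Error classification can start as a plain lookup. The sketch below is illustrative: the category names and the `classify_scan_error` function are hypothetical, the error-code sets are a few common AWS codes rather than an exhaustive list, and the disabled-region check assumes the engine already knows which regions are enabled for the account.

```python
THROTTLING = {"Throttling", "ThrottlingException", "RequestLimitExceeded", "TooManyRequestsException"}
CREDENTIALS = {"ExpiredToken", "ExpiredTokenException", "InvalidClientTokenId"}
PERMISSION = {"AccessDenied", "AccessDeniedException", "UnauthorizedOperation"}


def classify_scan_error(error_code: str, region: str, enabled_regions: set) -> str:
    """Map a failed scan to a coarse category the dashboard can filter on."""
    if region not in enabled_regions:
        return "region_disabled"  # checked first: the error code is unreliable here
    if error_code in THROTTLING:
        return "throttling"  # retry with backoff
    if error_code in CREDENTIALS:
        return "credentials"  # refresh the session before retrying
    if error_code in PERMISSION:
        return "missing_permission"  # audit role needs fixing; retrying will not help
    return "unknown"


enabled = {"ap-southeast-1", "us-east-1"}
examples = [
    classify_scan_error("ThrottlingException", "us-east-1", enabled),
    classify_scan_error("ExpiredToken", "us-east-1", enabled),
    classify_scan_error("AccessDenied", "ap-southeast-1", enabled),
    classify_scan_error("AccessDenied", "me-south-1", enabled),
    classify_scan_error("InternalFailure", "us-east-1", enabled),
]
```

The categories matter because they lead to different actions: throttling is retried, expired credentials are refreshed, and a missing permission is routed to whoever owns the audit role.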

3. Severity is not risk

Scanner severity is a technical signal. Risk is the result of combining that signal with business context.

For the dashboard to be useful, findings need enrichment with information such as:

Is the resource public-facing?
Does it hold sensitive data?
Production or non-production?
Is there internet exposure?
Is there a compensating control?
Which team owns it?
How long has the finding been open?
Is there a realistic exploit path?

Without a risk-scoring layer, the dashboard drowns in critical/high findings without giving the team a clear order of operations.

What this produced

After rolling out this design, the result was a centralised CSPM dashboard with these baseline capabilities:

Posture view per Landing Zone.
Drill down by account, region, and service.
Filters by severity, risk score, framework, and owner.
Scan status tracking.
Raw evidence preserved for audit.
Remediation ordered by business risk rather than purely by technical severity.

More importantly, the security team no longer has to open individual AWS Consoles or scroll through separate reports to answer:

Where is the AWS estate most at risk right now?

What is still open

The engine currently refreshes the account inventory on a periodic schedule (every 15 minutes).

That is acceptable for audit and posture review, but insufficient for near-real-time detection when a new Landing Zone, account, or region is added.

The next direction is a more event-driven model:

Detect new accounts from AWS Organizations events.
Trigger scans when a new account or region appears.
Make the ingestion pipeline idempotent.
Separate scan orchestration from finding ingestion.
Multi-cloud support on the same schema: AWS, Azure, GCP.
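As a sketch of the first item, a handler could pull the new account ID out of the event that AWS Organizations emits when account creation completes. This is not implemented in the engine yet. The event shape assumed here (a CloudTrail-delivered `CreateAccountResult` event with a `createAccountStatus` object) should be verified against the events actually received; `new_account_id` and the sample payload are hypothetical.

```python
from typing import Optional


def new_account_id(event: dict) -> Optional[str]:
    """Return the account ID if the event reports a successfully created account."""
    if event.get("source") != "aws.organizations":
        return None
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateAccountResult":
        return None
    status = detail.get("serviceEventDetails", {}).get("createAccountStatus", {})
    if status.get("state") != "SUCCEEDED":
        return None  # failed or still in progress: nothing to scan
    return status.get("accountId")


# Trimmed sample payload with a placeholder account ID.
sample_event = {
    "source": "aws.organizations",
    "detail-type": "AWS Service Event via CloudTrail",
    "detail": {
        "eventName": "CreateAccountResult",
        "serviceEventDetails": {
            "createAccountStatus": {"state": "SUCCEEDED", "accountId": "333333333333"}
        },
    },
}
created = new_account_id(sample_event)
ignored = new_account_id({"source": "aws.ec2", "detail": {}})
```

The returned ID would feed the same enumerate step as the periodic refresh, so the event path only shortens the delay and does not replace the scheduled pass.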

Summary

When building CSPM for an enterprise environment, the problem is not primarily about picking a scanner.

Prowler, ScoutSuite, Steampipe, and other open-source scanners can all produce findings. The harder layer is operational:

How is the scan scope defined?
How is parallelism controlled?
How are findings normalised?
Where do metadata and artifacts live?
What is the dedup logic?
How are owners assigned?
What drives remediation priority?
Does the dashboard help the security team decide?

The approach that worked:

Treat the Landing Zone as the primary operational unit.
Use Prowler as the initial scanner.
Keep metadata in D1.
Keep raw artifacts in R2.
Normalise findings into a shared schema.
Layer business-aware risk scoring on top of severity.

This is not a complete CSPM in the sense of a commercial product. It is a solid foundation for an internal CSPM platform that can grow by account, region, Landing Zone, and eventually multi-cloud.
