Five lessons on D1 schema design, learned the hard way

TL;DR

Lessons from running D1 with ~50 million rows for six months — SQLite at the edge has pressure curves SQLite on a laptop never sees.

Composite primary keys like (account_id, region, resource_id, control_id) beat surrogate id TEXT PRIMARY KEY when the natural key is stable — saves storage and removes the duplicate unique index.

Indexes follow the query pattern, not intuition — severity alone is low-cardinality; place it after landing_zone_id, status, account_id inside a composite index.

For free-text search use SQLite fts5 instead of LIKE '%keyword%' — the leading wildcard kills index usage past a few million rows; Vietnamese text needs Unicode normalization + diacritic stripping.

Sharding is the last resort — column-prune (push raw JSON/HTML/evidence to R2), archive > 90 days data, then build finding_summary_daily denormalized tables before splitting tables.

Time travel ≠ backup: run wrangler d1 export --remote daily to R2 with 30d/12w/12mo retention, and actually test restore — an untested backup is just a feeling of safety.

Cloudflare D1 is easy to start with.

You create a database, write a few SQL statements, deploy a Worker, and everything runs smoothly. Because D1 sits on top of SQLite, the first impression is usually “this is just SQLite.”

Once D1 goes to production — especially with tens of millions of rows — that impression breaks down quickly:

D1 is SQLite, but SQLite at the edge has pressures that are very different from SQLite on a laptop or in a traditional backend VM.

This post collects five lessons after around six months of operating D1 with roughly 50 million rows. Some come from benchmarks. Others come from dashboards crawling, queries scanning too many rows, or migrations that looked simple and destabilised the system.

Context

D1 powers a few metadata-index workloads here:

Findings from a CSPM scanner.
Scan status.
Resource inventory.
Mapping between account, landing zone, owner, and framework.
Dashboard filters for Security Operations.
A light portion of the audit trail.

This isn’t a heavy OLTP banking core, and it isn’t an analytics warehouse either. But the data grows steadily, query patterns repeat, and the dashboard has to respond fast.

Early on, the schema looked a lot like something you’d build on Postgres or MySQL:

CREATE TABLE findings (
  id TEXT PRIMARY KEY,
  account_id TEXT,
  finding_id TEXT,
  region TEXT,
  severity TEXT,
  status TEXT,
  created_at TEXT
);

It looks fine at a glance. But as data grew from millions to tens of millions of rows, small schema decisions started to cost real latency.

1. A composite primary key often beats a surrogate ID

The first mistake was using a surrogate ID on nearly every table.

For example:

id TEXT PRIMARY KEY

…and then adding the columns that actually identify the row:

account_id TEXT
finding_id TEXT
region TEXT
resource_id TEXT
control_id TEXT

It’s familiar, easy to write code against, easy to debug. Against D1 at scale, it introduces two problems.

First, you’re carrying a id column that isn’t tied to any business meaning.

Second, the real natural key still needs its own unique index to prevent duplicates:

CREATE UNIQUE INDEX idx_findings_unique
ON findings(account_id, region, resource_id, control_id);

So the same logical identifier is maintained twice — once as the surrogate PK, once as a secondary unique index.

After a few rounds of measuring storage and query patterns, we moved tables with stable natural keys over to a composite primary key:

CREATE TABLE findings (
  account_id TEXT NOT NULL,
  region TEXT NOT NULL,
  resource_id TEXT NOT NULL,
  control_id TEXT NOT NULL,

  landing_zone_id TEXT NOT NULL,
  service TEXT,
  severity TEXT,
  status TEXT,
  risk_score INTEGER,
  first_seen_at TEXT,
  last_seen_at TEXT,

  PRIMARY KEY (account_id, region, resource_id, control_id)
);

For CSPM-style findings, the natural key is obvious:

account_id + region + resource_id + control_id

Or in some cases:

landing_zone_id + account_id + region + finding_id

When the natural key is stable, using it as the primary key:

Reduces storage — no extra surrogate column.
Removes the need for a secondary unique index.
Keeps prefix-key queries efficient.
Lowers the risk of duplicate findings.
Makes the schema match the domain model.

Surrogate IDs stay useful when the data has no stable natural key, for example:

User-generated content.
Event logs where duplicate content is legitimate.
Records that need identity independent of business keys.
Entities whose business keys can change.

The rule is simple:

If the data already has a clear, stable natural key, don’t reflexively add a surrogate ID.

2. Only index columns that are actually used in `WHERE`, `JOIN`, or `ORDER BY`

When a dashboard slows down, the reflex is to add an index.

Query slow on severity? Add an index.

Slow on status? Add an index.

Slow on region? Add an index.

A few weeks later, the schema is littered with single-column indexes:

CREATE INDEX idx_findings_severity ON findings(severity);
CREATE INDEX idx_findings_status ON findings(status);
CREATE INDEX idx_findings_region ON findings(region);
CREATE INDEX idx_findings_account ON findings(account_id);
CREATE INDEX idx_findings_framework ON findings(framework);

It feels safe. It isn’t optimal.

Every index has a cost:

Higher storage.
Slower inserts and updates.
Heavier migrations.
May not be picked by the query planner if selectivity is low.
Doesn’t handle multi-condition queries well.

The dashboard rarely runs:

SELECT *
FROM findings
WHERE severity = 'CRITICAL';

The real query usually looks like:

SELECT *
FROM findings
WHERE landing_zone_id = ?
  AND account_id = ?
  AND severity IN ('HIGH', 'CRITICAL')
  AND status = 'FAIL'
ORDER BY risk_score DESC, last_seen_at DESC
LIMIT 100;

For that query, a composite index that matches the pattern beats several single-column indexes every time.

For example:

CREATE INDEX idx_findings_dashboard
ON findings (
  landing_zone_id,
  account_id,
  status,
  severity,
  risk_score,
  last_seen_at
);

Or, when the dashboard filters by landing zone first:

CREATE INDEX idx_findings_lz_status_severity
ON findings (
  landing_zone_id,
  status,
  severity,
  last_seen_at
);

The rule we apply:

Case	Treatment
Column appears often in `WHERE`	Consider an index
Query uses several stable conditions together	Prefer a composite index
Column has low cardinality	Don’t rush to add a single-column index
Column is display-only, not filtered	No index
Column rarely used, not on a critical path	Skip for now
Column on the main dashboard	Measure, then index deliberately

A good example of a column not to index by itself is severity.

Severity usually has only a handful of values:

CRITICAL, HIGH, MEDIUM, LOW, INFORMATIONAL

Indexing severity alone still forces the database to scan many rows because selectivity is low. Place severity after landing_zone_id, status, and account_id in a composite index, and the effect is much better.

The lesson:

Indexes should follow the query pattern — not intuition.

3. FTS is still very much worth it for free-text search

Sooner or later, every dashboard grows a search box.

The quickest first step is LIKE:

SELECT *
FROM findings
WHERE title LIKE '%public bucket%'
   OR resource_id LIKE '%public bucket%'
   OR remediation LIKE '%public bucket%';

That’s fine on small datasets. Once the table grows past a few hundred thousand or a few million rows, %keyword% queries slow down noticeably.

The reason: a wildcard at the start of the string prevents conventional indexes from helping.

For free-text search, FTS is a better fit.

Create an FTS table:

CREATE VIRTUAL TABLE findings_fts
USING fts5(
  finding_pk,
  title,
  resource_id,
  remediation,
  content
);

When ingesting a finding, write to the FTS table too:

INSERT INTO findings_fts (
  finding_pk,
  title,
  resource_id,
  remediation,
  content
)
VALUES (?, ?, ?, ?, ?);

On search:

SELECT finding_pk
FROM findings_fts
WHERE findings_fts MATCH ?
LIMIT 50;

Then join back to the main table:

SELECT f.*
FROM findings f
JOIN findings_fts s
  ON s.finding_pk = f.finding_pk
WHERE findings_fts MATCH ?
LIMIT 50;

FTS suits:

Searching titles.
Resource names.
Remediation text.
Control descriptions.
Owner or text-like metadata.
Cross-field search on the dashboard.

Vietnamese text needs extra care. SQLite FTS doesn’t ship a proper Vietnamese stemmer. If the data contains a lot of Vietnamese, preprocess before writing to the FTS table:

Normalise Unicode.
Lowercase.
Strip or normalise diacritics when the use case allows.
Normalise whitespace.
Tokenise consistently.
Optionally store a search-normalised field alongside the original.

A simple example:

"Máy chủ cấu hình sai Security Group"

can be normalised to:

"may chu cau hinh sai security group"

Depending on the use case, storing both the original (for display) and the normalised form (for search) is a cheap win.

The lesson:

For free-text search, don’t keep extending the life of LIKE '%keyword%'. If search is a primary feature, design FTS in from the start.

4. Sharding is the last resort, not the first reflex

When a table grows fast, the reflex is to shard.

For example:

findings_2026_01
findings_2026_02
findings_2026_03

Or:

findings_lz_a
findings_lz_b
findings_lz_c

Sharding may become necessary eventually. In practice, it’s the last option.

Before sharding, three things come first.

4.1. Column pruning

Not every piece of data belongs in D1.

Early on, we tended to dump everything into the findings table:

raw_json TEXT,
html_report TEXT,
evidence TEXT,
debug_log TEXT

Rows ballooned quickly.

The fix was clean separation:

Data	Location
Fields used for filter/query	D1
Raw JSON reports	R2
HTML reports	R2
Screenshots / evidence files	R2
Debug logs	R2
Metadata pointers	D1

D1 keeps only pointers:

artifact_url TEXT,
raw_json_url TEXT,
evidence_url TEXT

Smaller D1, faster queries, sensible storage allocation.

The rule:

D1 holds metadata and indexes. Big artifacts live outside.

4.2. Archive aggressively

Not all old data needs to live in the hot table.

A daily-ops dashboard typically queries:

Current findings.
Newly detected findings.
Findings that aren’t resolved.
Findings from the last 7, 30, or 90 days.

Older data can move to archive.

For example:

Data	Location
Active findings	D1 hot table
Findings ≤ 90 days	D1 or summary table
Findings > 90 days	R2 archive
Raw historical data	R2 (Parquet / JSONL)

The lifecycle can look like this:

0 – 90 days:
  D1 hot table

> 90 days:
  Export to R2
  Remove from the hot table or move to an archive table

> 1 year:
  Retain aggregates + artifacts required for audit

For long-range analytics, R2 with Parquet is usually a better match than D1.

4.3. Denormalise with a summary table

A dashboard shouldn’t always query the findings table directly.

To show counts by severity:

SELECT severity, COUNT(*)
FROM findings
WHERE landing_zone_id = ?
GROUP BY severity;

Fine on small data. If every dashboard runs aggregates against the full table, the cost climbs fast.

Pre-compute summaries instead:

CREATE TABLE finding_summary_daily (
  date TEXT NOT NULL,
  landing_zone_id TEXT NOT NULL,
  account_id TEXT,
  region TEXT,
  severity TEXT NOT NULL,
  status TEXT NOT NULL,
  count INTEGER NOT NULL,

  PRIMARY KEY (
    date,
    landing_zone_id,
    account_id,
    region,
    severity,
    status
  )
);

The overview dashboard reads the summary table instead of scanning findings:

SELECT severity, SUM(count) AS total
FROM finding_summary_daily
WHERE landing_zone_id = ?
  AND date >= ?
GROUP BY severity;

This is especially useful for:

Daily trends.
Counts by severity.
Counts by framework.
Counts by account.
SLA dashboards.
Executive summaries.

The lesson:

Before sharding, shrink rows, archive cold data, and pre-compute aggregates.

5. Backups are your job

D1’s time travel is genuinely useful for a specific class of operational mishaps:

A bad UPDATE.
An accidental delete of a batch of records.
Needing to revert to a recent state.
Debugging data before/after a migration.

Time travel is not a full disaster-recovery strategy.

A real backup strategy still needs periodic exports to independent storage.

A minimal script:

wrangler d1 export <DATABASE_NAME> --remote --output backup.sql

Upload the backup to R2:

wrangler r2 object put cspm-d1-backups/$(date +%F)/backup.sql --file backup.sql

A minimal backup schedule:

Frequency	Retention
Daily	30 days
Weekly	12 weeks
Monthly	12 months

For production, add:

Backup before any large migration.
Backup before a batch update/delete.
Periodic restore tests.
Checksum or size validation after the backup.
Separate credentials for database writes and backup deletion.
Store backups in a bucket with its own lifecycle and access policy.

The critical part is testing the restore.

A backup that has never been restored successfully is a feeling of safety, not safety.

A restore-test runbook:

1. Export the production D1.
2. Upload the backup to R2.
3. Create a fresh D1 staging database.
4. Import the backup into staging.
5. Run sanity checks:
   - row count
   - sample query
   - dashboard smoke test
6. Log the restore-test result.

The lesson:

Time travel handles small recent errors. Backups let you survive a major incident. Neither replaces the other.

Reference schema after tuning

After several iterations, the schema settled into a shape that separates hot metadata, artifact pointers, and summaries.

The findings table:

CREATE TABLE findings (
  landing_zone_id TEXT NOT NULL,
  account_id TEXT NOT NULL,
  region TEXT NOT NULL,
  resource_id TEXT NOT NULL,
  control_id TEXT NOT NULL,

  account_name TEXT,
  service TEXT,
  resource_type TEXT,
  title TEXT,
  severity TEXT NOT NULL,
  status TEXT NOT NULL,
  risk_score INTEGER,
  owner_team TEXT,
  framework TEXT,

  artifact_url TEXT,
  first_seen_at TEXT NOT NULL,
  last_seen_at TEXT NOT NULL,
  resolved_at TEXT,

  PRIMARY KEY (
    landing_zone_id,
    account_id,
    region,
    resource_id,
    control_id
  )
);

Index for the main dashboard:

CREATE INDEX idx_findings_lz_status_severity
ON findings (
  landing_zone_id,
  status,
  severity,
  last_seen_at
);

Index for owner view:

CREATE INDEX idx_findings_owner_status
ON findings (
  owner_team,
  status,
  risk_score,
  last_seen_at
);

FTS for search:

CREATE VIRTUAL TABLE findings_fts
USING fts5(
  finding_key,
  title,
  resource_id,
  service,
  framework,
  content
);

Summary table:

CREATE TABLE finding_summary_daily (
  date TEXT NOT NULL,
  landing_zone_id TEXT NOT NULL,
  account_id TEXT,
  region TEXT,
  severity TEXT NOT NULL,
  status TEXT NOT NULL,
  count INTEGER NOT NULL,

  PRIMARY KEY (
    date,
    landing_zone_id,
    account_id,
    region,
    severity,
    status
  )
);

The operating principles we ended up with

After some time in production, the schema-design rules settled around these:

Principle	Meaning
Natural key first, surrogate ID second	Only add a surrogate ID when no stable business key exists
Indexes follow the query pattern	No intuition-based indexing
Composite indexes beat several single ones	Especially for multi-condition dashboard filters
D1 for metadata, R2 for artifacts	Avoids row bloat
FTS for free-text search	Don’t use `LIKE '%keyword%'` on large tables
Summary tables for dashboard aggregates	Don’t aggregate over the hot table too often
Archive cold data	A smaller hot table is always better
Backups independent of time travel	DR requires actual exports and tested restores

Takeaway

D1 is powerful when used for what it’s good at.

It fits metadata, indexes, dashboard queries, light transactional workloads, and anything that runs close to Cloudflare Workers. But treating it like a traditional database and dumping everything into one large table brings you quickly to limits on row scanning, storage, indexes, and migrations.

The five things that stood out most:

Composite primary keys usually beat surrogate IDs when the natural key is stable.
Indexes must track the query pattern, especially for the dashboard’s main filters.
FTS should be used for free-text search instead of stretching LIKE further.
Sharding is the last step — column-prune, archive, and denormalise first.
Backups are a mandatory operational practice, not an afterthought on top of time travel.

Closing

Starting on D1 doesn’t require optimising everything on day one.

Some decisions can wait until the data is big enough to measure. Two should happen early:

Design the primary key and indexes around real query patterns.
Set up backups to R2 from day one.

A good schema doesn’t make a system scale infinitely by itself, but it spares you a lot of unnecessary pain when the data finally starts growing.

The most important lesson for us was:

Don’t design the schema as if the data will stay small. But don’t over-engineer as if you already have a billion-row problem either. Measure first, optimise after, and always keep a rollback path.

Five lessons on D1 schema design, learned the hard way

Context

1. A composite primary key often beats a surrogate ID

2. Only index columns that are actually used in `WHERE`, `JOIN`, or `ORDER BY`

3. FTS is still very much worth it for free-text search

4. Sharding is the last resort, not the first reflex

4.1. Column pruning

4.2. Archive aggressively

4.3. Denormalise with a summary table

5. Backups are your job

Reference schema after tuning

The operating principles we ended up with

Takeaway

Closing

Mentions from the web

Ask the blog

Sources

Context

1. A composite primary key often beats a surrogate ID

2. Only index columns that are actually used in WHERE, JOIN, or ORDER BY

3. FTS is still very much worth it for free-text search

4. Sharding is the last resort, not the first reflex

4.1. Column pruning

4.2. Archive aggressively

4.3. Denormalise with a summary table

5. Backups are your job

Reference schema after tuning

The operating principles we ended up with

Takeaway

Closing

Related reading

Migrating AWS/Vercel to Cloudflare: a real playbook

Cloudflare Developer Platform cost model: tiers vs AWS

Worker security: secrets, CSP, Bot Management, Turnstile

2. Only index columns that are actually used in `WHERE`, `JOIN`, or `ORDER BY`