TL;DR
- Lessons from running D1 with ~50 million rows for six months — SQLite at the edge has pressure curves SQLite on a laptop never sees.
- Composite primary keys like
(account_id, region, resource_id, control_id)beat surrogateid TEXT PRIMARY KEYwhen the natural key is stable — saves storage and removes the duplicate unique index.- Indexes follow the query pattern, not intuition —
severityalone is low-cardinality; place it afterlanding_zone_id,status,account_idinside a composite index.- For free-text search use SQLite
fts5instead ofLIKE '%keyword%'— the leading wildcard kills index usage past a few million rows; Vietnamese text needs Unicode normalization + diacritic stripping.- Sharding is the last resort — column-prune (push raw JSON/HTML/evidence to R2), archive
> 90 daysdata, then buildfinding_summary_dailydenormalized tables before splitting tables.- Time travel ≠ backup: run
wrangler d1 export --remotedaily to R2 with 30d/12w/12mo retention, and actually test restore — an untested backup is just a feeling of safety.
Cloudflare D1 is easy to start with.
You create a database, write a few SQL statements, deploy a Worker, and everything runs smoothly. Because D1 sits on top of SQLite, the first impression is usually “this is just SQLite.”
Once D1 goes to production — especially with tens of millions of rows — that impression breaks down quickly:
D1 is SQLite, but SQLite at the edge has pressures that are very different from SQLite on a laptop or in a traditional backend VM.
This post collects five lessons after around six months of operating D1 with roughly 50 million rows. Some come from benchmarks. Others come from dashboards crawling, queries scanning too many rows, or migrations that looked simple and destabilised the system.
Context
D1 powers a few metadata-index workloads here:
- Findings from a CSPM scanner.
- Scan status.
- Resource inventory.
- Mapping between account, landing zone, owner, and framework.
- Dashboard filters for Security Operations.
- A light portion of the audit trail.
This isn’t a heavy OLTP banking core, and it isn’t an analytics warehouse either. But the data grows steadily, query patterns repeat, and the dashboard has to respond fast.
Early on, the schema looked a lot like something you’d build on Postgres or MySQL:
CREATE TABLE findings (
id TEXT PRIMARY KEY,
account_id TEXT,
finding_id TEXT,
region TEXT,
severity TEXT,
status TEXT,
created_at TEXT
);
It looks fine at a glance. But as data grew from millions to tens of millions of rows, small schema decisions started to cost real latency.
1. A composite primary key often beats a surrogate ID
The first mistake was using a surrogate ID on nearly every table.
For example:
id TEXT PRIMARY KEY
…and then adding the columns that actually identify the row:
account_id TEXT
finding_id TEXT
region TEXT
resource_id TEXT
control_id TEXT
It’s familiar, easy to write code against, easy to debug. Against D1 at scale, it introduces two problems.
First, you’re carrying a id column that isn’t tied to any business meaning.
Second, the real natural key still needs its own unique index to prevent duplicates:
CREATE UNIQUE INDEX idx_findings_unique
ON findings(account_id, region, resource_id, control_id);
So the same logical identifier is maintained twice — once as the surrogate PK, once as a secondary unique index.
After a few rounds of measuring storage and query patterns, we moved tables with stable natural keys over to a composite primary key:
CREATE TABLE findings (
account_id TEXT NOT NULL,
region TEXT NOT NULL,
resource_id TEXT NOT NULL,
control_id TEXT NOT NULL,
landing_zone_id TEXT NOT NULL,
service TEXT,
severity TEXT,
status TEXT,
risk_score INTEGER,
first_seen_at TEXT,
last_seen_at TEXT,
PRIMARY KEY (account_id, region, resource_id, control_id)
);
For CSPM-style findings, the natural key is obvious:
account_id + region + resource_id + control_id
Or in some cases:
landing_zone_id + account_id + region + finding_id
When the natural key is stable, using it as the primary key:
- Reduces storage — no extra surrogate column.
- Removes the need for a secondary unique index.
- Keeps prefix-key queries efficient.
- Lowers the risk of duplicate findings.
- Makes the schema match the domain model.
Surrogate IDs stay useful when the data has no stable natural key, for example:
- User-generated content.
- Event logs where duplicate content is legitimate.
- Records that need identity independent of business keys.
- Entities whose business keys can change.
The rule is simple:
If the data already has a clear, stable natural key, don’t reflexively add a surrogate ID.
2. Only index columns that are actually used in WHERE, JOIN, or ORDER BY
When a dashboard slows down, the reflex is to add an index.
Query slow on severity? Add an index.
Slow on status? Add an index.
Slow on region? Add an index.
A few weeks later, the schema is littered with single-column indexes:
CREATE INDEX idx_findings_severity ON findings(severity);
CREATE INDEX idx_findings_status ON findings(status);
CREATE INDEX idx_findings_region ON findings(region);
CREATE INDEX idx_findings_account ON findings(account_id);
CREATE INDEX idx_findings_framework ON findings(framework);
It feels safe. It isn’t optimal.
Every index has a cost:
- Higher storage.
- Slower inserts and updates.
- Heavier migrations.
- May not be picked by the query planner if selectivity is low.
- Doesn’t handle multi-condition queries well.
The dashboard rarely runs:
SELECT *
FROM findings
WHERE severity = 'CRITICAL';
The real query usually looks like:
SELECT *
FROM findings
WHERE landing_zone_id = ?
AND account_id = ?
AND severity IN ('HIGH', 'CRITICAL')
AND status = 'FAIL'
ORDER BY risk_score DESC, last_seen_at DESC
LIMIT 100;
For that query, a composite index that matches the pattern beats several single-column indexes every time.
For example:
CREATE INDEX idx_findings_dashboard
ON findings (
landing_zone_id,
account_id,
status,
severity,
risk_score,
last_seen_at
);
Or, when the dashboard filters by landing zone first:
CREATE INDEX idx_findings_lz_status_severity
ON findings (
landing_zone_id,
status,
severity,
last_seen_at
);
The rule we apply:
| Case | Treatment |
|---|---|
Column appears often in WHERE | Consider an index |
| Query uses several stable conditions together | Prefer a composite index |
| Column has low cardinality | Don’t rush to add a single-column index |
| Column is display-only, not filtered | No index |
| Column rarely used, not on a critical path | Skip for now |
| Column on the main dashboard | Measure, then index deliberately |
A good example of a column not to index by itself is severity.
Severity usually has only a handful of values:
CRITICAL, HIGH, MEDIUM, LOW, INFORMATIONAL
Indexing severity alone still forces the database to scan many rows because selectivity is low. Place severity after landing_zone_id, status, and account_id in a composite index, and the effect is much better.
The lesson:
Indexes should follow the query pattern — not intuition.
3. FTS is still very much worth it for free-text search
Sooner or later, every dashboard grows a search box.
The quickest first step is LIKE:
SELECT *
FROM findings
WHERE title LIKE '%public bucket%'
OR resource_id LIKE '%public bucket%'
OR remediation LIKE '%public bucket%';
That’s fine on small datasets. Once the table grows past a few hundred thousand or a few million rows, %keyword% queries slow down noticeably.
The reason: a wildcard at the start of the string prevents conventional indexes from helping.
For free-text search, FTS is a better fit.
Create an FTS table:
CREATE VIRTUAL TABLE findings_fts
USING fts5(
finding_pk,
title,
resource_id,
remediation,
content
);
When ingesting a finding, write to the FTS table too:
INSERT INTO findings_fts (
finding_pk,
title,
resource_id,
remediation,
content
)
VALUES (?, ?, ?, ?, ?);
On search:
SELECT finding_pk
FROM findings_fts
WHERE findings_fts MATCH ?
LIMIT 50;
Then join back to the main table:
SELECT f.*
FROM findings f
JOIN findings_fts s
ON s.finding_pk = f.finding_pk
WHERE findings_fts MATCH ?
LIMIT 50;
FTS suits:
- Searching titles.
- Resource names.
- Remediation text.
- Control descriptions.
- Owner or text-like metadata.
- Cross-field search on the dashboard.
Vietnamese text needs extra care. SQLite FTS doesn’t ship a proper Vietnamese stemmer. If the data contains a lot of Vietnamese, preprocess before writing to the FTS table:
- Normalise Unicode.
- Lowercase.
- Strip or normalise diacritics when the use case allows.
- Normalise whitespace.
- Tokenise consistently.
- Optionally store a search-normalised field alongside the original.
A simple example:
"Máy chủ cấu hình sai Security Group"
can be normalised to:
"may chu cau hinh sai security group"
Depending on the use case, storing both the original (for display) and the normalised form (for search) is a cheap win.
The lesson:
For free-text search, don’t keep extending the life of
LIKE '%keyword%'. If search is a primary feature, design FTS in from the start.
4. Sharding is the last resort, not the first reflex
When a table grows fast, the reflex is to shard.
For example:
findings_2026_01
findings_2026_02
findings_2026_03
Or:
findings_lz_a
findings_lz_b
findings_lz_c
Sharding may become necessary eventually. In practice, it’s the last option.
Before sharding, three things come first.
4.1. Column pruning
Not every piece of data belongs in D1.
Early on, we tended to dump everything into the findings table:
raw_json TEXT,
html_report TEXT,
evidence TEXT,
debug_log TEXT
Rows ballooned quickly.
The fix was clean separation:
| Data | Location |
|---|---|
| Fields used for filter/query | D1 |
| Raw JSON reports | R2 |
| HTML reports | R2 |
| Screenshots / evidence files | R2 |
| Debug logs | R2 |
| Metadata pointers | D1 |
D1 keeps only pointers:
artifact_url TEXT,
raw_json_url TEXT,
evidence_url TEXT
Smaller D1, faster queries, sensible storage allocation.
The rule:
D1 holds metadata and indexes. Big artifacts live outside.
4.2. Archive aggressively
Not all old data needs to live in the hot table.
A daily-ops dashboard typically queries:
- Current findings.
- Newly detected findings.
- Findings that aren’t resolved.
- Findings from the last 7, 30, or 90 days.
Older data can move to archive.
For example:
| Data | Location |
|---|---|
| Active findings | D1 hot table |
| Findings ≤ 90 days | D1 or summary table |
| Findings > 90 days | R2 archive |
| Raw historical data | R2 (Parquet / JSONL) |
The lifecycle can look like this:
0 – 90 days:
D1 hot table
> 90 days:
Export to R2
Remove from the hot table or move to an archive table
> 1 year:
Retain aggregates + artifacts required for audit
For long-range analytics, R2 with Parquet is usually a better match than D1.
4.3. Denormalise with a summary table
A dashboard shouldn’t always query the findings table directly.
To show counts by severity:
SELECT severity, COUNT(*)
FROM findings
WHERE landing_zone_id = ?
GROUP BY severity;
Fine on small data. If every dashboard runs aggregates against the full table, the cost climbs fast.
Pre-compute summaries instead:
CREATE TABLE finding_summary_daily (
date TEXT NOT NULL,
landing_zone_id TEXT NOT NULL,
account_id TEXT,
region TEXT,
severity TEXT NOT NULL,
status TEXT NOT NULL,
count INTEGER NOT NULL,
PRIMARY KEY (
date,
landing_zone_id,
account_id,
region,
severity,
status
)
);
The overview dashboard reads the summary table instead of scanning findings:
SELECT severity, SUM(count) AS total
FROM finding_summary_daily
WHERE landing_zone_id = ?
AND date >= ?
GROUP BY severity;
This is especially useful for:
- Daily trends.
- Counts by severity.
- Counts by framework.
- Counts by account.
- SLA dashboards.
- Executive summaries.
The lesson:
Before sharding, shrink rows, archive cold data, and pre-compute aggregates.
5. Backups are your job
D1’s time travel is genuinely useful for a specific class of operational mishaps:
- A bad
UPDATE. - An accidental delete of a batch of records.
- Needing to revert to a recent state.
- Debugging data before/after a migration.
Time travel is not a full disaster-recovery strategy.
A real backup strategy still needs periodic exports to independent storage.
A minimal script:
wrangler d1 export <DATABASE_NAME> --remote --output backup.sql
Upload the backup to R2:
wrangler r2 object put cspm-d1-backups/$(date +%F)/backup.sql --file backup.sql
A minimal backup schedule:
| Frequency | Retention |
|---|---|
| Daily | 30 days |
| Weekly | 12 weeks |
| Monthly | 12 months |
For production, add:
- Backup before any large migration.
- Backup before a batch update/delete.
- Periodic restore tests.
- Checksum or size validation after the backup.
- Separate credentials for database writes and backup deletion.
- Store backups in a bucket with its own lifecycle and access policy.
The critical part is testing the restore.
A backup that has never been restored successfully is a feeling of safety, not safety.
A restore-test runbook:
1. Export the production D1.
2. Upload the backup to R2.
3. Create a fresh D1 staging database.
4. Import the backup into staging.
5. Run sanity checks:
- row count
- sample query
- dashboard smoke test
6. Log the restore-test result.
The lesson:
Time travel handles small recent errors. Backups let you survive a major incident. Neither replaces the other.
Reference schema after tuning
After several iterations, the schema settled into a shape that separates hot metadata, artifact pointers, and summaries.
The findings table:
CREATE TABLE findings (
landing_zone_id TEXT NOT NULL,
account_id TEXT NOT NULL,
region TEXT NOT NULL,
resource_id TEXT NOT NULL,
control_id TEXT NOT NULL,
account_name TEXT,
service TEXT,
resource_type TEXT,
title TEXT,
severity TEXT NOT NULL,
status TEXT NOT NULL,
risk_score INTEGER,
owner_team TEXT,
framework TEXT,
artifact_url TEXT,
first_seen_at TEXT NOT NULL,
last_seen_at TEXT NOT NULL,
resolved_at TEXT,
PRIMARY KEY (
landing_zone_id,
account_id,
region,
resource_id,
control_id
)
);
Index for the main dashboard:
CREATE INDEX idx_findings_lz_status_severity
ON findings (
landing_zone_id,
status,
severity,
last_seen_at
);
Index for owner view:
CREATE INDEX idx_findings_owner_status
ON findings (
owner_team,
status,
risk_score,
last_seen_at
);
FTS for search:
CREATE VIRTUAL TABLE findings_fts
USING fts5(
finding_key,
title,
resource_id,
service,
framework,
content
);
Summary table:
CREATE TABLE finding_summary_daily (
date TEXT NOT NULL,
landing_zone_id TEXT NOT NULL,
account_id TEXT,
region TEXT,
severity TEXT NOT NULL,
status TEXT NOT NULL,
count INTEGER NOT NULL,
PRIMARY KEY (
date,
landing_zone_id,
account_id,
region,
severity,
status
)
);
The operating principles we ended up with
After some time in production, the schema-design rules settled around these:
| Principle | Meaning |
|---|---|
| Natural key first, surrogate ID second | Only add a surrogate ID when no stable business key exists |
| Indexes follow the query pattern | No intuition-based indexing |
| Composite indexes beat several single ones | Especially for multi-condition dashboard filters |
| D1 for metadata, R2 for artifacts | Avoids row bloat |
| FTS for free-text search | Don’t use LIKE '%keyword%' on large tables |
| Summary tables for dashboard aggregates | Don’t aggregate over the hot table too often |
| Archive cold data | A smaller hot table is always better |
| Backups independent of time travel | DR requires actual exports and tested restores |
Takeaway
D1 is powerful when used for what it’s good at.
It fits metadata, indexes, dashboard queries, light transactional workloads, and anything that runs close to Cloudflare Workers. But treating it like a traditional database and dumping everything into one large table brings you quickly to limits on row scanning, storage, indexes, and migrations.
The five things that stood out most:
- Composite primary keys usually beat surrogate IDs when the natural key is stable.
- Indexes must track the query pattern, especially for the dashboard’s main filters.
- FTS should be used for free-text search instead of stretching
LIKEfurther. - Sharding is the last step — column-prune, archive, and denormalise first.
- Backups are a mandatory operational practice, not an afterthought on top of time travel.
Closing
Starting on D1 doesn’t require optimising everything on day one.
Some decisions can wait until the data is big enough to measure. Two should happen early:
- Design the primary key and indexes around real query patterns.
- Set up backups to R2 from day one.
A good schema doesn’t make a system scale infinitely by itself, but it spares you a lot of unnecessary pain when the data finally starts growing.
The most important lesson for us was:
Don’t design the schema as if the data will stay small. But don’t over-engineer as if you already have a billion-row problem either. Measure first, optimise after, and always keep a rollback path.