Notes from three months of rolling out Zero Trust

What actually worked, what didn’t live up to expectations, and the operational lessons from rolling out Cloudflare Zero Trust across an organisation of thousands.


TL;DR

  • After 3 months rolling out Cloudflare Zero Trust across thousands of users (Access + Tunnel + WARP + Gateway + SIEM pipeline), the headline lesson: Zero Trust is an identity project before it’s a network project.
  • Tunnel + Access produced value fastest — control plane moved from “can the user enter the network” to “this user, this device, this context, this app”; no inbound ports needed.
  • Roll out device posture in 6 stages: Observe → Report → Notify → Pilot enforce → Progressive enforce → Full enforce — flipping a block rule on day one blows up the ticket queue.
  • Use structured group names like ZTNA.Platform.CICD.Admin / ZTNA.Vendor.AppX.Support, not Allow IT — longer names, much faster audits and incidents.
  • Avoid big-bang migration — DNS + routing + IdP groups + posture + browser cache + legacy firewall rules tangled together is untraceable; cut over by service domain.
  • 3 integrations worth doing first: IdP (user/group lifecycle, joiner/mover/leaver), endpoint posture (baseline + dashboard), and shipping Access/Tunnel/Gateway logs to the SIEM on day one.
  • Defer: Browser Isolation, DLP, SaaS posture, advanced egress — they mostly add complexity until identity/posture/logs are stable.

This post is for security engineers, IT leads, or infrastructure teams preparing to roll out Zero Trust across an organisation of thousands of users.

It isn’t an explainer on what Zero Trust is.

There are already too many of those.

It’s the less-discussed part: after you’ve picked a platform, bought the licenses, drawn the architecture diagram, and run the workshops — what actually happens when Zero Trust meets day-two operations?

After three months of Cloudflare Zero Trust, the most pointed takeaway was:

The hardest part of Zero Trust isn’t configuration. It’s turning access control into an operational capability with owners, process, logs, audit, and the ability to scale.

Put differently, this isn’t a VPN-replacement project.

It’s a project that rewrites how an organisation grants permissions, controls devices, publishes internal applications, and observes access behaviour.

Context

The project kicked off around October 2025, targeting an organisation of thousands of users.

The initial goals were clear:

  • Reduce reliance on the traditional VPN for internal systems and admin planes.
  • Move application access to a model driven by identity, group, and context.
  • Apply device posture before granting access to sensitive resources.
  • Pipe Access logs, Gateway logs, and Tunnel events into the SIEM.
  • Standardise permissions around groups instead of manual per-user handling.

The primary stack was Cloudflare Zero Trust:

  • Cloudflare Access for application-level access control.
  • Cloudflare Tunnel to publish internal services without inbound ports.
  • Cloudflare WARP for device enrollment, private routing, and posture.
  • Cloudflare Gateway for DNS, HTTP, and network policy plus logging.
  • SIEM pipeline for monitoring, investigation, and audit.

That sounds like a purely technical problem.

In practice, the technical part was only half the story.

The other half was identity hygiene, group ownership, exception handling, approval workflow, user communication, and the trust of teams living inside the old system every day.

First lesson: Zero Trust is an identity project before it’s a network project

Talking about Zero Trust often starts with network questions:

  • Which Tunnel does this app go through?
  • Which virtual network is this route on?
  • Private IP or public hostname?
  • Do we need split tunnel?
  • Do we need WARP?

Those questions matter.

The more important question is:

Who is this user, what group are they in, what device are they on, and do they actually need access to this system?

If the identity layer isn’t clean, every policy on top of it is just automation sitting on untrusted data.

Take a very simple policy:

Allow users in ZTNA.Platform.CICD.Admin to access CI/CD Admin Portal

Clear enough on paper.

In practice, it drags a string of questions behind it:

  • Who owns this group?
  • Who can add members?
  • What happens when a user changes teams?
  • Are contractors sitting in the same group as employees?
  • Is the group about a department, a role, or an access right?
  • Is there periodic review?
  • Are there groups created “temporarily” that never die?

At tens of users, memory and chat messages cover this.

At thousands, they don’t.

One group with wrong membership opens the door to several applications. One group without an owner delays revocation. One unclear naming convention turns an audit into collective guessing.
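
The questions above can be turned into a recurring check. The sketch below is a minimal illustration, assuming groups can be exported from the IdP into a simple record; the `Group` fields, the 90-day review window, and the `hygiene_findings` helper are all hypothetical and not part of any IdP or Cloudflare API.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


# Hypothetical record shape for a group exported from the IdP.
@dataclass
class Group:
    name: str
    owner: Optional[str]
    last_reviewed: Optional[date]
    member_types: set[str] = field(default_factory=set)  # e.g. {"employee", "contractor"}


def hygiene_findings(group: Group, today: date, max_review_age_days: int = 90) -> list[str]:
    """Return findings for one group; an empty list means nothing was flagged."""
    findings = []
    if not group.owner:
        findings.append("no owner")
    if group.last_reviewed is None:
        findings.append("never reviewed")
    elif today - group.last_reviewed > timedelta(days=max_review_age_days):
        findings.append("review overdue")
    # Contractors and employees sharing one group is flagged, not forbidden.
    if {"employee", "contractor"} <= group.member_types:
        findings.append("contractors mixed with employees")
    return findings
```

Run on a schedule, a check like this surfaces ownerless or unreviewed groups before an audit does.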

So the first lesson is simple:

Don’t start Zero Trust with policy. Start with the identity model.

User lifecycle, group lifecycle, naming convention, owners, approvers, joiner/mover/leaver processes — these have to be clean before Access policy can behave correctly.

Not the glamorous part.

But it’s the foundation.

What worked best: Tunnel changes how internal apps are published

Cloudflare Tunnel was one of the fastest sources of value.

For internal web admin portals, dashboards, CI/CD, monitoring, Git, artifact repositories, internal wikis, or ops tools, Tunnel fundamentally changed the access model.

The old model looked like:

  1. User turns on the VPN.
  2. VPN hands out a route.
  3. User hits an IP or internal DNS name.
  4. Firewall allows by subnet.
  5. Audit relies on scattered VPN, firewall, and app logs.

It works, but it has several weak points:

  • Access is tied to the network rather than identity.
  • Once on the VPN, the footprint usually exceeds what’s needed.
  • Hard to say a user should reach app A but not app B.
  • Hard to enforce posture per application.
  • Hard to produce a clean audit trail at the application layer.

Moving services behind Tunnel + Access changes the experience meaningfully.

Users don’t think about routes, subnets, or VPN profiles. They open a URL, log in via SSO, pass policy, and reach the app.

The security team gains more control:

  • No inbound ports.
  • No service directly exposed to the Internet.
  • Policy tied to identity, group, posture, and context.
  • Per-application audit logs.
  • Faster revocation by group.
  • Different conditions for production, non-production, vendors, or privileged access.

The important point isn’t “Tunnel replaces VPN”.

The important point is that Tunnel shifts the control plane from network-centric access to identity-aware application access.

That’s a large shift.

VPN answers:

Can this user enter this network?

Zero Trust Access answers:

Can this user, from this device, under these conditions, enter this application?

They sound similar. Architecturally, they’re different worlds.
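
To make the contrast concrete, here is a small sketch of the two questions as functions. It is a conceptual model only, not how Cloudflare Access evaluates policy; the `Request` fields, the `APP_GROUPS` mapping, and the app names are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user_groups: frozenset[str]
    device_compliant: bool
    on_vpn_subnet: bool
    app: str


# Hypothetical mapping from an application to the group allowed to reach it.
APP_GROUPS = {
    "cicd-admin": "ZTNA.Platform.CICD.Admin",
    "siem": "ZTNA.Security.SIEM.Analyst",
}


def vpn_allows(req: Request) -> bool:
    """Network-centric check: being on the subnet is enough, whatever the app."""
    return req.on_vpn_subnet


def zero_trust_allows(req: Request) -> bool:
    """Identity-aware check: group, device state, and target app all matter."""
    required_group = APP_GROUPS.get(req.app)
    if required_group is None:
        return False  # unknown application: default deny
    return required_group in req.user_groups and req.device_compliant
```

A CI/CD admin on the VPN subnet passes `vpn_allows` for every app, including the SIEM. The same request fails `zero_trust_allows` for the SIEM because the group does not match.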

Device posture: the part that keeps Zero Trust from becoming “VPN with SSO”

A common mistake is treating SSO as “enough” Zero Trust.

With SSO alone, you authenticate the user. You haven’t evaluated the device.

In many situations, the risk isn’t the wrong identity; it’s a non-compliant endpoint:

  • The laptop isn’t company-managed.
  • EDR isn’t running.
  • OS is too old.
  • Disk isn’t encrypted.
  • WARP isn’t enrolled.
  • The device stopped being compliant but kept its session.
  • Correct user, wrong endpoint for production access.

Device posture adds that layer.

A sensible baseline usually includes:

  • Managed device.
  • EDR/AV running.
  • Minimum OS version.
  • Disk encryption.
  • Domain-joined or MDM-enrolled.
  • WARP enrolled.
  • A certificate or device-identity signal.

Posture is a double-edged sword.

Enforce too early and the ticket queue explodes. Observe forever and posture never reduces real risk.

A better shape is a staged rollout:

  1. Observe — collect posture signals without blocking.
  2. Report — dashboards by user, team, OS, device state.
  3. Notify — tell users/teams that aren’t meeting the bar.
  4. Pilot enforce — apply the policy to a small group or a low-risk app.
  5. Progressive enforce — broaden by business unit or application tier.
  6. Full enforce — apply to sensitive systems and production access.
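
One way to reason about the six stages is as a single function that decides what happens when a device fails posture. This is a simplified sketch: the `Stage` enum mirrors the list above, while the `user_in_pilot` and `unit_enforced` flags and the three outcomes ("log", "notify", "block") are assumptions for illustration, not Cloudflare settings.

```python
from enum import IntEnum


class Stage(IntEnum):
    OBSERVE = 1
    REPORT = 2
    NOTIFY = 3
    PILOT_ENFORCE = 4
    PROGRESSIVE_ENFORCE = 5
    FULL_ENFORCE = 6


def action_on_failure(stage: Stage, user_in_pilot: bool, unit_enforced: bool) -> str:
    """Decide what happens when a device fails a posture check at a given stage."""
    if stage <= Stage.REPORT:
        return "log"  # collect and chart signals, never block
    if stage == Stage.NOTIFY:
        return "notify"
    if stage == Stage.PILOT_ENFORCE:
        return "block" if user_in_pilot else "notify"
    if stage == Stage.PROGRESSIVE_ENFORCE:
        # Pilot users stay enforced; other users follow their business unit.
        return "block" if (user_in_pilot or unit_enforced) else "notify"
    return "block"
```

Writing the stages down this way makes it explicit that nobody is blocked before the pilot stage, and that everyone outside the pilot is still only notified.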

The lesson:

Posture isn’t a technical checkbox. It’s an endpoint-behaviour change programme.

Making posture work requires dashboards, communication, an exception process, and a clear timeline.

You can’t flip a block rule and hope users figure it out.

What didn’t work: expecting users to request access perfectly

On paper, access requests sound clean.

User needs permission → requests → owner approves → security audits → everything logged.

In reality, when the process is slow or unclear, users route around it.

They DM someone they know. They ask for broader access than they need to avoid re-requesting. They reuse a shared account. They go back to the old VPN. Or they escalate because “it’s urgent”.

The problem isn’t a lack of user discipline.

The problem is that the permissioning process can’t keep up with operations.

At thousands of users, access requests can’t be held together by a few people remembering to process them manually. They have to become a designed workflow.

Minimum components:

  • A named app owner.
  • A named approver.
  • A defined SLA.
  • Clear criteria for granting access.
  • Expiry on permissions.
  • An emergency-access path.
  • An audit trail on every decision.
  • A revocation process for role changes or when the need ends.

Without these, approval becomes a bottleneck.

When security control becomes a bottleneck, users create workarounds.

In security, workarounds are often worse than the original problem.
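
Several of the minimum components (named approver, SLA, expiry, audit trail) fit into a small data model. The sketch below is one possible shape, assuming requests are tracked in your own ticketing or workflow tool; `AccessRequest` and the helper functions are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AccessRequest:
    user: str
    group: str
    requested_at: datetime
    approver: str
    approved_at: Optional[datetime] = None
    expires_at: Optional[datetime] = None


def approve(req: AccessRequest, now: datetime, ttl: timedelta) -> None:
    """Record the approval time and give the grant a hard expiry."""
    req.approved_at = now
    req.expires_at = now + ttl


def sla_breached(req: AccessRequest, now: datetime, sla: timedelta) -> bool:
    """True if the decision took, or is still taking, longer than the SLA."""
    decided_at = req.approved_at or now
    return decided_at - req.requested_at > sla


def is_active(req: AccessRequest, now: datetime) -> bool:
    """A grant is active only after approval and before expiry."""
    return req.approved_at is not None and req.expires_at is not None and now < req.expires_at
```

Because every grant carries an expiry, revocation becomes the default outcome rather than a task someone has to remember.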

The fastest way to accumulate technical debt: over-broad policy

Early on, it’s tempting to ship a big policy like:

Allow all employees

Easy to roll out. Good demo. Few tickets. Little pushback.

Debt accumulates fast.

When every employee has access to an app, audit questions get harder:

  • Who actually needs this access?
  • Which teams use it?
  • Can contractors reach it?
  • Do users who changed teams still have access?
  • Are prod and non-prod permissions mixed?
  • During an incident, which group do we revoke?

Broad policy doesn’t hurt on day one.

It hurts when the organisation needs to audit, offboard, respond to incidents, or split permissions by environment.

Design groups and policies around access domains from the start.

For example:

ZTNA.Platform.CICD.Admin
ZTNA.Platform.CICD.ReadOnly
ZTNA.Security.SIEM.Analyst
ZTNA.Security.SIEM.Admin
ZTNA.Production.Monitoring.ReadOnly
ZTNA.Vendor.AppX.Support
ZTNA.Data.Analytics.User

Longer names, much clearer meaning.
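
A convention like this is easy to enforce mechanically. The sketch below assumes the pattern is `ZTNA.<Domain>.<System>.<Role>`, inferred from the examples above; the segment labels (`domain`, `system`, `role`) and the `parse_group` helper are my own naming, not an official scheme.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: four dot-separated segments, the first always "ZTNA".
GROUP_PATTERN = re.compile(
    r"^ZTNA\.(?P<domain>[A-Za-z0-9]+)\.(?P<system>[A-Za-z0-9]+)\.(?P<role>[A-Za-z0-9]+)$"
)


class GroupName(NamedTuple):
    domain: str
    system: str
    role: str


def parse_group(name: str) -> Optional[GroupName]:
    """Split a group name into its parts, or return None if it breaks the convention."""
    match = GROUP_PATTERN.match(name)
    if match is None:
        return None
    return GroupName(match["domain"], match["system"], match["role"])
```

With this in place, `parse_group("Allow IT")` returns `None`, so a non-conforming group can be rejected at creation time instead of being discovered during an audit.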

A good policy isn’t just about allow/block. It has to let you answer quickly:

Who can access what, in which role, from what device, under what conditions, and who approved that permission?

That’s a policy you can audit.

What to avoid: a big-bang migration

An expensive lesson: don’t attempt a big-bang migration.

We once tried to cut over many services from VPN to Tunnel in a single window. On paper, the plan seemed reasonable: prep ahead, test ahead, pick a low-usage slot, keep a rollback.

Reality has too many variables:

  • DNS.
  • Routing.
  • IdP groups.
  • Access policies.
  • Posture checks.
  • Browser behaviour.
  • Service dependencies.
  • Legacy firewall rules.
  • User habits.
  • Break-glass paths.
  • Rollback coordination.

When something breaks, the cause is hard to pin down.

A user who can’t reach an app could be hitting a wrong group, a failed posture check, an un-propagated DNS record, a service behind Tunnel that’s broken, a cached browser entry, or a route conflict.

When several services cut over at once, troubleshooting becomes chaos fast.

Break migration down by service domain:

  1. Low-risk internal dashboards.
  2. Monitoring.
  3. CI/CD.
  4. Git and artifact repositories.
  5. Admin portals.
  6. Production admin plane.
  7. Complex legacy apps.

Each batch needs:

  • Pilot users.
  • An app owner.
  • Test cases.
  • Success criteria.
  • A rollback path.
  • A log dashboard.
  • Communication before and after the cutover.
  • A post-migration review.
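
The per-batch checklist above can act as a gate before each cutover. This is a deliberately small sketch; the item keys and both helper functions are hypothetical, and in practice the values would come from your change-management records.

```python
# Checklist items taken from the list above, as hypothetical keys.
REQUIRED_ITEMS = (
    "pilot_users",
    "app_owner",
    "test_cases",
    "success_criteria",
    "rollback_path",
    "log_dashboard",
    "communication",
    "post_migration_review",
)


def missing_items(batch: dict[str, bool]) -> list[str]:
    """List checklist items that are absent or not yet done, in checklist order."""
    return [item for item in REQUIRED_ITEMS if not batch.get(item, False)]


def ready_to_cut_over(batch: dict[str, bool]) -> bool:
    """A batch is ready only when nothing on the checklist is missing."""
    return not missing_items(batch)
```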

Incremental cutover looks slower, but in practice it tends to finish sooner: fewer rollbacks, less lost trust, easier troubleshooting.

In infrastructure migration, user trust is a real asset.

Burn it once, and the next migration is much harder.

The three integrations worth doing first

If you only get three parts right from the start, pick these.

1. Identity Provider

The IdP is the control plane of Zero Trust.

Every policy starts with identity. A dirty identity layer yields untrustworthy policy.

Get these right early:

  • User lifecycle.
  • Group lifecycle.
  • Group owners.
  • Naming convention.
  • Joiner/mover/leaver process.
  • Contractor/vendor identity.
  • MFA policy.
  • Privileged group review.
  • Break-glass identity.

This isn’t supporting infrastructure.

It’s the core of the architecture.

2. Endpoint posture

Posture distinguishes:

Correct user.

from:

Correct user on a device that shouldn’t be reaching this resource.

It matters most for production access, the admin plane, and sensitive apps.

You don’t have to enforce everything on day one. You do need a baseline, a dashboard, and a clear enforcement roadmap.

Without that, Zero Trust tends to stall at “SSO in front of an internal app”.

3. SIEM log pipeline

Access control without logging is hard to operate as a real security control.

At a minimum, ship these to the SIEM:

  • Access decisions.
  • User identity.
  • Application.
  • Source IP.
  • Device posture result.
  • Policy matched.
  • Tunnel events.
  • Gateway DNS events.
  • Gateway HTTP events.
  • Gateway network events.
  • Block/allow decisions.
  • Admin activity.

No logs, no detection.

No detection, no investigation.

No investigation, no trust during an incident.

Build the log pipeline on day one, not after rollout.
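
Agreeing on a normalised event shape early makes the pipeline easier to build. The sketch below shows one possible shape covering the access-decision fields listed above, serialised as one JSON object per line. The `AccessEvent` field names are assumptions for illustration and do not reflect Cloudflare's actual log schema; map them from whatever your log export delivers.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AccessEvent:
    timestamp: str        # ISO 8601, UTC
    user: str
    application: str
    source_ip: str
    posture_passed: bool
    policy: str           # name of the policy that matched
    decision: str         # "allow" or "block"


def to_siem_line(event: AccessEvent) -> str:
    """Serialise one event as a single JSON line, rejecting unknown decisions."""
    if event.decision not in ("allow", "block"):
        raise ValueError(f"unexpected decision: {event.decision!r}")
    return json.dumps(asdict(event), sort_keys=True)
```

Validating at the edge of the pipeline keeps malformed events out of the SIEM, where they are much harder to spot.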

Capabilities to defer

Some features are seductive but shouldn’t go first:

  • Browser Isolation.
  • DLP.
  • SaaS security control.
  • Remote browser rendering.
  • Advanced egress policy.
  • Fine-grained per-request risk scoring.
  • Complex data-protection workflows.

They have value.

But while the IdP is still messy, posture is still unstable, logs aren’t in the SIEM, and approval has no SLA, adding these mostly adds complexity.

A better sequence:

  1. Identity foundation.
  2. Access policy model.
  3. Tunnel / private-app publishing.
  4. Device enrollment.
  5. Device posture.
  6. SIEM logging.
  7. Approval workflow.
  8. Fine-grained segmentation.
  9. DLP, Isolation, and advanced controls.

Zero Trust should be built as a platform, not assembled from unrelated features.

A handful of practical notes

1. Naming conventions matter more than you’d think

Group, app, policy, and rule names need to be readable by whoever comes next.

Not:

Allow IT

Better:

Allow-Platform-Team-to-CICD-Admin-Portal-With-Managed-Device

Longer, but audits and incidents go much faster.

2. The exception process has to exist from day one

No system enforces 100% with zero exceptions.

There will always be vendors, legacy devices, emergency access, older apps, OS upgrades in flight, or specific business cases.

The point isn’t to have no exceptions.

The point is that every exception has:

  • An owner.
  • A documented reason.
  • An expiry.
  • An approval.
  • A log.
  • A periodic review.
  • A plan to resolve.

Exceptions without management become permanent bypasses.

Permanent bypasses are where Zero Trust starts losing meaning.
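
A periodic review of exceptions can be partly automated. The following is a minimal sketch, assuming exceptions are kept in a register you control; `PostureException` and `needs_attention` are hypothetical names, and only three of the attributes listed above (owner, reason, expiry) are checked.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PostureException:
    subject: str            # user, device, or vendor the exception applies to
    owner: Optional[str]
    reason: str
    expires: Optional[date]


def needs_attention(exc: PostureException, today: date) -> Optional[str]:
    """Return why an exception needs review, or None if it is still in order."""
    if not exc.owner:
        return "missing owner"
    if not exc.reason.strip():
        return "missing reason"
    if exc.expires is None:
        return "no expiry"
    if exc.expires <= today:
        return "expired"
    return None
```

Anything this check flags is a candidate for either renewal with a fresh approval or removal.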

3. Break-glass isn’t a sign of failure

Zero Trust doesn’t eliminate the need for break-glass.

It does raise the bar for how break-glass is designed.

A sound break-glass model:

  • A separate account.
  • Strong MFA.
  • Minimum-viable permissions.
  • Time-limited use.
  • Full logging.
  • Alerts on use.
  • Post-use review.
  • Not used for day-to-day ops.

Without a designed break-glass, the ops team invents one.

Organic workarounds are almost always worse.

4. Communication determines user acceptance

Security teams tend to focus on policy, posture, and logs.

Users care about very practical questions:

  • How do I reach the old app?
  • Who do I contact when I’m blocked?
  • Why did this work before and not now?
  • What do I do to get my device compliant?
  • How long does an access request take?
  • Will this affect today’s work?

Without clear communication, users see Zero Trust as an obstacle.

With clear communication, they see it as a safer, more controlled, less VPN-dependent way of working.

Rolling out Zero Trust isn’t about enabling policies.

It’s about changing the user’s access experience.

What three months made clear

After three months, a few conclusions stand out.

First, Zero Trust is an identity project before it is a network project. Network matters, but identity is where policy actually starts.

Second, device posture is what turns SSO into a context-aware security control. Without posture, you know who the user is but not whether the device is trustworthy.

Third, SIEM logging has to ship with the rollout, not after it. Without logs, it’s hard to prove effectiveness, investigate incidents, or operate policy at scale.

Fourth, approval workflow is part of the architecture. Too slow, users build workarounds. Too broad, security loses control.

Fifth, incremental migration proved safer than big bang. The more services and users involved, the more critical it is to keep each rollout small enough to debug and to preserve user trust.

Closing

The first three months of Zero Trust typically feel slow.

Not because the team isn’t working — but because most of the effort is foundational:

  • Clean the identity.
  • Standardise groups.
  • Identify owners.
  • Design the policy model.
  • Enrol devices.
  • Collect posture.
  • Ship logs to the SIEM.
  • Build the approval workflow.
  • Explain the new access model to users.

None of that produces a “new feature” every day.

All of it decides whether the system can operate at thousands of users.

Once the foundation is stable, rollout speed jumps.

New applications no longer require a dedicated VPN route, dedicated firewall rules, a custom access guide, or a one-off exception. They fit into the model:

  • The app has an owner.
  • The group has a purpose.
  • The policy has conditions.
  • The device has posture.
  • Access has logs.
  • Exceptions have expiry.
  • Permissions are auditable.

That’s when Zero Trust starts producing real value.

Not because it’s newer than VPN.

Because it forces the organisation to operate access in a more controlled, more conditional, more measurable, and more enterprise-appropriate way.