AWS IAM Access Key rotation: Lambda + Secrets Manager

An AWS-native solution for rotating, disabling, and deleting IAM access keys on policy — the multi-account architecture, trade-offs, and what operating it actually takes.

TL;DR

  • AWS-native stack: Lambda + EventBridge + Secrets Manager + SES + StackSets — no DynamoDB, no Step Functions, IAM is the source of truth.
  • State machine ACTIVE → ROTATED → DISABLED → DELETED with defaults RotationPeriod=90, InactivePeriod=100, RecoveryGracePeriod=10 days — preserves a rollback window when a pipeline breaks.
  • Always start with DryRunFlag=True and run it ~2 weeks to surface CI pipelines still hard-coding old keys.
  • Member-account role gets a narrow set of IAM actions (iam:ListUsers, iam:ListAccessKeys, iam:CreateAccessKey, iam:UpdateAccessKey, iam:DeleteAccessKey, iam:GetGroup) plus Secrets Manager access scoped by the Account_*_User_*_AccessKey prefix: no iam:*, no iam:PassRole.
  • The exemption group is a potential backdoor: enable a CloudTrail alarm on iam:AddUserToGroup for IAMKeyRotationExemptionGroup and review it quarterly.
  • Rotation is not a substitute for removing IAM users — set a quarterly KPI on reducing IAM user count, not just average key age.
  • Secrets Manager $0.40/secret/month adds up at thousand-user scale; consider deleting secrets once owners have fully migrated.

IAM access keys are one of the most leak-prone artifacts in AWS: they end up in developer dotfiles, CI runners, Docker images, Postman workspaces, Slack DMs. When an organisation has dozens of accounts and hundreds of IAM users, nobody rotates them by hand on a schedule. This post describes an AWS-native solution (Lambda + EventBridge + Secrets Manager + SES + CloudFormation StackSets) that automates rotate → disable → delete against a policy — and, more importantly, the trade-offs to understand before deploying it to production. Reference implementation: vanhoangkha/aws-iam-access-key-auto-rotation.

Context

A typical AWS Organization looks like this:

  • 20–100 AWS accounts, managed through Control Tower or a home-grown landing zone.
  • Most workloads have moved to IAM roles (EC2 instance profile, IRSA, Lambda execution role). A residual set of stubborn IAM users remains: legacy CI/CD service users, third-party SaaS that needs programmatic access, vendor tools with no OIDC support, and developers using access keys locally because setting up an SSO profile feels like a tax.
  • Each account averages 5–20 IAM users with active access keys. Average key age: over 180 days.
  • Compliance frameworks (ISO 27001, SOC 2, PCI-DSS, CIS AWS Benchmark 1.14) all require periodic credential rotation — usually 90 days.

The goal of this solution: bring the “average key age” number below 90 days without human intervention, and preserve a grace window for rollback when something breaks.

Problem

Rotating access keys by hand hits three concrete pain points:

  1. Nobody knows where a key is being used. Deleting it breaks a CI pipeline → rollback → nobody dares delete again. Result: three-year-old keys still active.
  2. The work spreads across many accounts. An administrator has to assume a role into each account, list users, create new keys, hand them to owners, wait for confirmation, and only then disable the old keys. Does not scale.
  3. No audit trail. When an auditor asks “who rotated this key, when, and was the owner notified?” — there is no structured log to answer from.

Half-solutions fall short:

  • IAM Access Analyzer → unused access finder: only detects unused keys, does not rotate them.
  • Secrets Manager rotation for RDS/Aurora: built-in, but there is no equivalent built-in rotation template for IAM access keys — a Lambda has to be written.
  • Remove IAM users entirely and move to IAM Identity Center (SSO) + IAM roles: the right direction, but the migration takes quarters and the problem needs a bridge during that time.

This solution fills that gap: it assumes IAM users still exist, and automates key lifecycle until they can be removed.

Architecture

IAM access key auto-rotation architecture: an EventBridge daily cron in the security/audit account triggers an inventory Lambda that lists accounts from Organizations and fans out work to a rotation engine Lambda. The rotation engine assumes a role into each member account, rotates IAM keys, stores new keys in Secrets Manager, and SES notifies the owner and security ops.

| Component | Purpose | Technology |
| --- | --- | --- |
| Account inventory | Lists every account in the Organization; fans out work | Lambda (Python 3.13) + Organizations API |
| Rotation engine | Decides rotate/disable/delete from key age | Lambda (Python 3.13) + IAM API |
| Key storage | Stores new keys encrypted after creation | AWS Secrets Manager (KMS CMK) |
| Notifier | Emails owner + admin before and after each action | Amazon SES with an HTML template |
| Scheduler | Daily trigger | EventBridge rule (cron) |
| Cross-account access | Assumes role into member accounts | IAM role (ExecutionRole) deployed via StackSet |
| Exemption | Excludes specific service users | IAM group IAMKeyRotationExemptionGroup |
| Audit | Logs every action | CloudWatch Logs + CloudTrail (default) |
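
The fan-out step between the two Lambdas can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the payload keys (account, email, name) are taken from the test invocation shown later in this post, while the function names, the filter on Status == "ACTIVE", and the use of asynchronous invocation are assumptions.

```python
import json


def build_payloads(accounts):
    """Turn Organizations account records into per-account rotation payloads.

    `accounts` is a list of dicts shaped like the `Accounts` entries returned
    by organizations.list_accounts (Id, Name, Email, Status).
    """
    return [
        {"account": a["Id"], "email": a["Email"], "name": a["Name"]}
        for a in accounts
        if a.get("Status") == "ACTIVE"  # assumption: skip suspended accounts
    ]


def fan_out(lambda_client, function_name, accounts):
    """Invoke the rotation function once per active account, asynchronously."""
    payloads = build_payloads(accounts)
    for payload in payloads:
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # fire-and-forget, one invocation per account
            Payload=json.dumps(payload).encode("utf-8"),
        )
    return len(payloads)
```

Here lambda_client would be boto3.client("lambda") in a real deployment; passing it in as an argument keeps the sketch testable without AWS credentials.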

Lifecycle state machine

A key passes through four states — ACTIVE → ROTATED → DISABLED → DELETED — controlled by the RotationPeriod, InactivePeriod, and RecoveryGracePeriod parameters.

Access key lifecycle: starts ACTIVE, becomes ROTATED at the RotationPeriod (default 90 days) with a new key stored in Secrets Manager, then DISABLED (still re-enablable) after the InactivePeriod, then permanently DELETED once the grace window closes.

Three parameters are tunable through CloudFormation:

  • RotationPeriod = 90 days
  • InactivePeriod = 100 days
  • RecoveryGracePeriod = 10 days (the window between disable and delete)

The 10-day window between “disable” and “delete” is the safety net — if a user still has the old key hard-coded somewhere, their pipeline will fail on day 100 rather than silently work until the key disappears permanently. Re-enabling a disabled key is much faster than recreating the user.
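
The thresholds above can be expressed as a small decision function. This is a sketch under stated assumptions rather than the engine's real logic: the Policy class and decide() are hypothetical names, and the rule that only an already-disabled key is ever deleted is an assumption chosen so that a key is never removed without first passing through the disabled state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    rotation_period: int = 90        # days until a replacement key is created
    inactive_period: int = 100       # days until the old key is disabled
    recovery_grace_period: int = 10  # days between disable and delete

    @property
    def delete_after(self) -> int:
        # With the defaults, deletion happens at day 100 + 10 = 110.
        return self.inactive_period + self.recovery_grace_period


def decide(age_days: int, status: str, policy: Policy = Policy()) -> str:
    """Return the action for one key.

    `status` uses IAM's own values: "Active" or "Inactive".
    """
    if status == "Inactive":
        # Assumption: only keys that are already disabled get deleted.
        return "DELETE" if age_days >= policy.delete_after else "NONE"
    if age_days >= policy.inactive_period:
        return "DISABLE"
    if age_days >= policy.rotation_period:
        # The caller must still check that a newer key does not already
        # exist: IAM allows at most two access keys per user.
        return "ROTATE"
    return "NONE"
```

For example, decide(95, "Active") returns "ROTATE", decide(137, "Active") returns "DISABLE", and decide(110, "Inactive") returns "DELETE".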

Deployment

The entire solution deploys through CloudFormation. The repo has four main templates:

  • ASA-iam-key-auto-rotation-and-notifier-solution.yaml — the core template, deployed into the security/audit account
  • ASA-iam-key-auto-rotation-iam-assumed-roles.yaml — the cross-account role, deployed via StackSet to every member account
  • ASA-iam-key-auto-rotation-list-accounts-role.yaml — the role that reads the Organizations API, deployed into the management account
  • ASA-iam-key-auto-rotation-vpc-endpoints.yaml — optional, if the Lambda runs inside a VPC

1. Prepare the SES identity

SES in sandbox mode only sends to verified addresses. Request production access before using it in production.

aws ses verify-email-identity \
  --email-address security-ops@example.com \
  --region us-east-1

2. Upload artifacts to S3

Lambda packages and templates are pulled from S3, not inlined. Create a bucket in the security account:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export BUCKET_NAME="asa-iam-rotation-${AWS_ACCOUNT_ID}-us-east-1"

aws s3 mb "s3://$BUCKET_NAME" --region us-east-1
aws s3 cp Lambda/    "s3://$BUCKET_NAME/asa/asa-iam-rotation/Lambda/"   --recursive
aws s3 cp template/  "s3://$BUCKET_NAME/asa/asa-iam-rotation/Template/" --recursive

3. Deploy the execution role to every member account

Use a CloudFormation StackSet with the service-managed permission model (requires Organizations and trusted access for CloudFormation enabled):

aws cloudformation create-stack-set \
  --stack-set-name asa-iam-rotation-member-role \
  --template-body file://CloudFormation/ASA-iam-key-auto-rotation-iam-assumed-roles.yaml \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
  --capabilities CAPABILITY_NAMED_IAM

aws cloudformation create-stack-instances \
  --stack-set-name asa-iam-rotation-member-role \
  --deployment-targets OrganizationalUnitIds=ou-xxxx-yyyyyyyy \
  --regions us-east-1

This role has the minimum policy needed: iam:ListUsers, iam:ListAccessKeys, iam:CreateAccessKey, iam:UpdateAccessKey, iam:DeleteAccessKey, iam:GetGroup, and secretsmanager:CreateSecret / PutSecretValue scoped by prefix. No iam:*.
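
As an illustration of what that policy might look like, the sketch below builds the policy document as a Python dict. The action list comes from the paragraph above; the function name, the Sid values, the "aws" partition, and the exact ARN pattern are assumptions, and the real template may scope resources differently.

```python
import json


def member_role_policy(account_id: str) -> dict:
    """Build a least-privilege policy document for the member-account role."""
    # Secrets Manager appends a random suffix to secret ARNs, so the
    # pattern needs a trailing wildcard to match.
    secret_arn = (
        f"arn:aws:secretsmanager:*:{account_id}:secret:"
        f"Account_{account_id}_User_*_AccessKey*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "KeyLifecycle",
                "Effect": "Allow",
                "Action": [
                    "iam:ListUsers",
                    "iam:ListAccessKeys",
                    "iam:CreateAccessKey",
                    "iam:UpdateAccessKey",
                    "iam:DeleteAccessKey",
                    "iam:GetGroup",
                ],
                # Simplification: a production policy could narrow this
                # to user and group ARNs in the account.
                "Resource": "*",
            },
            {
                "Sid": "ScopedSecrets",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:CreateSecret",
                    "secretsmanager:PutSecretValue",
                ],
                "Resource": secret_arn,
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(member_role_policy("222222222222"), indent=2))
```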

4. Deploy the core stack — keep DryRun on

aws cloudformation deploy \
  --template-file CloudFormation/ASA-iam-key-auto-rotation-and-notifier-solution.yaml \
  --stack-name iam-key-auto-rotation \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
    S3BucketName="$BUCKET_NAME" \
    S3BucketPrefix="asa/asa-iam-rotation" \
    AdminEmailAddress="security-ops@example.com" \
    AWSOrgID="o-xxxxxxxxxx" \
    OrgListAccount="111111111111" \
    DryRunFlag="True" \
    RotationPeriod="90" \
    InactivePeriod="100"

DryRunFlag=True is required for the first run. In this mode the Lambda will:

  • Enumerate users and keys
  • Compute key ages
  • Email simulated actions that would be taken
  • Create, modify, or delete nothing in IAM

This is the window to discover which keys exceed the threshold, which users will be affected, and whether owner emails are verified.
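
The dry-run behaviour amounts to a guard in front of every IAM write call. A minimal sketch, assuming a hypothetical helper named disable_key; the actual Lambda may structure this differently, but iam.update_access_key with Status="Inactive" is the boto3 call that disables a key.

```python
def disable_key(iam, user_name, access_key_id, dry_run=True):
    """Disable one access key, or only report the intent when dry_run is set.

    `iam` is a boto3 IAM client (or any object with the same method).
    """
    if dry_run:
        # No IAM write happens on this path; the message can go into the
        # simulated-action email instead.
        return f"DRY RUN: would disable {access_key_id} for {user_name}"
    iam.update_access_key(
        UserName=user_name,
        AccessKeyId=access_key_id,
        Status="Inactive",
    )
    return f"disabled {access_key_id} for {user_name}"
```

Defaulting dry_run to True mirrors the template's default: a caller has to opt in to the destructive path explicitly.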

5. Test a specific account

Invoke directly instead of waiting for the EventBridge cron:

aws lambda invoke \
  --function-name ASA-IAM-Access-Key-Rotation-Function \
  --payload '{"account":"222222222222","email":"owner@example.com","name":"prod-workload"}' \
  --cli-binary-format raw-in-base64-out \
  /tmp/out.json

cat /tmp/out.json

Check the function’s CloudWatch Logs for the rotation decision:

[INFO] User: deploy-bot
[INFO] Oldest active key: AKIA...  age=137d
[INFO] Decision: DISABLE (age >= InactivePeriod=100)
[INFO] DryRun=True → skipping iam:UpdateAccessKey

6. Flip to enforcement

Once the DryRun output has been reviewed:

aws cloudformation update-stack \
  --stack-name iam-key-auto-rotation \
  --use-previous-template \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=DryRunFlag,ParameterValue=False \
    ParameterKey=S3BucketName,UsePreviousValue=true \
    ParameterKey=S3BucketPrefix,UsePreviousValue=true \
    ParameterKey=AdminEmailAddress,UsePreviousValue=true \
    ParameterKey=AWSOrgID,UsePreviousValue=true \
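    ParameterKey=OrgListAccount,UsePreviousValue=true \
    ParameterKey=RotationPeriod,UsePreviousValue=true \
    ParameterKey=InactivePeriod,UsePreviousValue=true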

7. Retrieve the newly created key

New keys are stored in the member account:

aws secretsmanager get-secret-value \
  --secret-id Account_222222222222_User_deploy-bot_AccessKey \
  --query SecretString --output text

The secret is encrypted with KMS. Reading it requires both secretsmanager:GetSecretValue on the secret and kms:Decrypt on the corresponding CMK, so access is typically limited to the owner themselves or a Lambda in that account.
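
The same lookup from Python might look like the sketch below. The secret naming scheme is the one shown in the command above; the helper names are hypothetical, and the layout of the SecretString is deliberately not assumed, so the value is returned as-is.

```python
def secret_name(account_id: str, user_name: str) -> str:
    """Build the secret name used for a user's rotated key."""
    return f"Account_{account_id}_User_{user_name}_AccessKey"


def read_new_key(secrets_client, account_id: str, user_name: str) -> str:
    """Fetch the stored key material for one user.

    `secrets_client` is a boto3 Secrets Manager client created with
    credentials for the member account. The returned string is sensitive:
    never print or log it.
    """
    response = secrets_client.get_secret_value(
        SecretId=secret_name(account_id, user_name)
    )
    return response["SecretString"]
```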

Security considerations

A credential-rotation solution is itself a high-value target. An attacker who takes over the rotation Lambda can create new keys for any IAM user, delete keys in use (DoS), or read the secret holding the freshly created key. The review checklist for this solution:

  • Least privilege on the ExecutionRole. The member-account role has a narrow IAM write scope and Secrets Manager access scoped to the Account_*_User_*_AccessKey prefix. No iam:PassRole, iam:AttachUserPolicy, iam:CreateUser. Where possible, further restrict the role's trust policy with an aws:PrincipalArn condition so only the security-account Lambda role can assume it.
  • Never log the secret into CloudWatch. Audit the Lambda code for stray print() calls on access-key secrets. CloudWatch log groups often have long retention and may be forwarded to a SIEM — a leak there is a real leak.
  • KMS CMK rather than AWS-managed key. A CMK provides access logs (CloudTrail kms:Decrypt events) and policies that control who can re-share the secret.
  • Do not let the exemption group become a backdoor. IAMKeyRotationExemptionGroup is a convenience, but it is also a way around rotation. Enable a CloudTrail alarm on iam:AddUserToGroup when the target is this group, and include it in the periodic compliance review.
  • VPC endpoints for paranoid environments. The vpc-endpoints.yaml template provisions endpoints for IAM, Secrets Manager, STS, SES, and CloudWatch Logs — so a Lambda inside a VPC doesn’t need egress to the internet. Useful in accounts with egress restrictions; overkill for simpler ones.
  • The grace period is mandatory. Setting RecoveryGracePeriod to 0 (disable and delete on the same day) turns every misconfiguration into downtime. Ten days is a reasonable default; some production environments set 14–30 days.
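
The CloudTrail alarm on the exemption group from the checklist can be approximated with an EventBridge rule. The sketch below shows one possible event pattern plus an equivalent in-code check; the function names are hypothetical, and it assumes CloudTrail management events are enabled. Because IAM is a global service, its CloudTrail events are normally delivered in us-east-1, so the rule would likely need to live there.

```python
EXEMPTION_GROUP = "IAMKeyRotationExemptionGroup"


def exemption_event_pattern(group_name: str = EXEMPTION_GROUP) -> dict:
    """EventBridge pattern matching AddUserToGroup calls on the exemption group."""
    return {
        "source": ["aws.iam"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["iam.amazonaws.com"],
            "eventName": ["AddUserToGroup"],
            "requestParameters": {"groupName": [group_name]},
        },
    }


def is_exemption_change(event: dict, group_name: str = EXEMPTION_GROUP) -> bool:
    """Same check in code, e.g. inside a Lambda that receives all IAM events."""
    detail = event.get("detail") or {}
    params = detail.get("requestParameters") or {}
    return (
        detail.get("eventName") == "AddUserToGroup"
        and params.get("groupName") == group_name
    )
```

The pattern dict can be serialised with json.dumps and passed as EventPattern when creating the rule, with an SNS topic or the security-ops mailbox as the target.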

What this solution does not protect against:

  • Keys leaked through non-AWS channels (public GitHub repos, Slack, screenshots). GitHub secret scanning and the Access Analyzer unused finder need to run in parallel.
  • Keys compromised between rotations. With the defaults, a leaked key stays usable until it is disabled at InactivePeriod (day 100), not just until the replacement is created on day 90. That is a trade-off, not a mitigation.
  • IAM users nobody knows about (shadow users created by former administrators). That is an IAM inventory and access-review problem, not a rotation one.

Operations

The cron runs daily, not in real time. Observability setup:

Mandatory CloudWatch alarms:

  • Rotation Lambda error count > 0 in 24 hours → page on-call.
  • Duration > 50% of timeout → the fan-out is stalling, likely IAM API throttling.
  • Throttles metric > 0 → increase reserved concurrency or split into smaller batches.
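
The first alarm in the list could be defined along these lines. This is a sketch: the helper name, the alarm name, and the SNS topic used for paging are assumptions, while the namespace, metric, and dimension are the standard ones Lambda publishes to CloudWatch.

```python
def errors_alarm_kwargs(function_name: str, sns_topic_arn: str) -> dict:
    """Arguments for cloudwatch.put_metric_alarm: any error in 24h fires."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 86400,            # one 24-hour window, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # A day with no invocations at all produces no data points; that
        # case is covered by the "did not run today" runbook entry instead.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **errors_alarm_kwargs("ASA-IAM-Access-Key-Rotation-Function", topic_arn)
#   )
```

The Throttles alarm follows the same shape with MetricName set to "Throttles".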

Periodic reports:

  • Weekly: list users with keys past RotationPeriod that are still in the exemption group → review whether the exemption is still justified.
  • Monthly: report on keys rotated / disabled / deleted → feed into the compliance dashboard.
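
The weekly exemption report reduces to a filter over key metadata. A minimal sketch, assuming the key records look like the AccessKeyMetadata entries returned by iam.list_access_keys (UserName, AccessKeyId, CreateDate) and that the set of exempt user names has already been read from the group; the function name is hypothetical.

```python
from datetime import datetime, timezone


def overdue_exempt_keys(keys, exempt_users, rotation_period=90, now=None):
    """List (user, key id, age in days) for exempt users with overdue keys.

    `keys` is an iterable of dicts with UserName, AccessKeyId, and a
    timezone-aware CreateDate. Oldest keys come first.
    """
    now = now or datetime.now(timezone.utc)
    report = []
    for key in keys:
        age_days = (now - key["CreateDate"]).days
        if key["UserName"] in exempt_users and age_days > rotation_period:
            report.append((key["UserName"], key["AccessKeyId"], age_days))
    return sorted(report, key=lambda row: row[2], reverse=True)
```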

Incident runbook:

  • “Pipeline X broke after the key was disabled” → re-enable the old key with iam:UpdateAccessKey Status=Active, hand over the new key from Secrets Manager to the owner, and push for migration before RecoveryGracePeriod expires.
  • “The rotation Lambda did not run today” → check the EventBridge rule state and the Lambda concurrency quota. The solution is idempotent: skipping a day breaks nothing, keys just rotate a day late.
  • “SES bounces” → a high bounce rate damages the SES reputation; verify owner email addresses before enforcement, and fall back to a team alias.

Do not treat rotation as fire-and-forget. The value of this solution is that it turns every key into an observable event. If nobody reads the emails and nobody watches the dashboard, keys still rotate, but during an incident there is no way to trace who touched what.

Trade-offs

| Decision | Options | Chosen | Why |
| --- | --- | --- | --- |
| Source of truth | DynamoDB state table vs. read IAM directly every run | Read IAM directly | Less infrastructure; IAM is the real source of truth, no sync needed |
| New key storage | Secrets Manager vs. SSM Parameter Store | Secrets Manager | Has a rotation API, built-in audit, per-secret KMS, acceptable cost |
| Scheduler | EventBridge cron vs. Step Functions | EventBridge + fan-out Lambda | Simpler; no state machine needed for a single-step process |
| Notification | SES vs. SNS email vs. Slack webhook | SES | Custom HTML templates, no subscription required, suitable for sensitive content |
| Multi-account | STS AssumeRole vs. per-account Lambda | AssumeRole from the security account | One code path, one log group, one alarm; much easier to operate than N copies |
| DryRun default | Default True vs. default False | Default True | Safer; forces the administrator to read the report before enforcement |
| Exemption | Tag on IAM user vs. IAM group | IAM group | Groups produce a clearer audit trail when membership changes; tags are easier to edit unnoticed |
| Language | Python 3.13 vs. TypeScript | Python 3.13 | boto3 is the most complete; AWS sample code is available in Python |

Lessons

After deploying and running this for a while, the things worth doing differently next time:

  • Run DryRun for two weeks, not two days. The first run enforced too early and uncovered ~7 CI pipelines still using old keys that nobody remembered. The rollback conversation with the team was worse than the wait would have been.
  • Rotation is better than no rotation, but it is not a substitute for removing IAM users. Set a quarterly KPI on reducing the number of IAM users, not just average key age. The former is real progress.
  • Keep the exemption group as small as possible. The rule to enforce: every user in the exemption group must have a ticket justifying it and an expiry date; auto-remove them each quarter and force owners to re-justify.
  • Email is not enough. Owners ignore internal email as a rule. Integrate Slack DMs or auto-created Jira tickets when a key crosses RotationPeriod - 14 — then people act.
  • Measure cost from day one. Lambda + Secrets Manager is nearly free at small scale, but for an organisation with thousands of IAM users, Secrets Manager ($0.40/secret/month) adds up. Consider a lifecycle that deletes secrets once the owner has fully migrated to the new key.
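
To put the storage cost in numbers, the arithmetic is simply secrets × price. The sketch below uses the $0.40/secret/month figure quoted above and ignores per-request API charges; the function name and the one-secret-per-user assumption are illustrative.

```python
SECRET_PRICE_USD_PER_MONTH = 0.40  # figure quoted in this post


def monthly_secret_cost(secret_count: int,
                        price: float = SECRET_PRICE_USD_PER_MONTH) -> float:
    """Monthly Secrets Manager storage cost in USD, excluding API calls."""
    return round(secret_count * price, 2)


if __name__ == "__main__":
    # Assuming one stored secret per IAM user.
    for users in (50, 500, 2000):
        print(f"{users} users: ${monthly_secret_cost(users):.2f}/month")
```

At 50 users the cost is $20 a month; at 2,000 users it is $800 a month, or $9,600 a year, which is why deleting secrets after migration is worth automating.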

References