This page covers Amazon Web Services workload hardening across the compute surfaces that decide whether an attacker who lands code execution on a single instance can pivot to credentials, sibling workloads, or the AWS control plane. Scope is EC2 (instance metadata and remote access), Amazon ECR (container image supply chain), Amazon Inspector (vulnerability assessment), AWS Lambda (function-level least privilege and secrets handling), Amazon EKS (Kubernetes workload identity and control-plane posture), EC2 Image Builder (golden machine images), and AWS Systems Manager Patch Manager (post-deployment patch hygiene). Cross-cutting principles — image hardening, runtime protection, supply-chain integrity, secrets management — are explained in the General Workloads page sections on runtime security and supply chain; this page maps the principles to AWS primitives.
One canonical-content cross-link to flag at the top, because authoring this page in isolation would otherwise duplicate roughly 1,500 words of canonical material: secrets management for AWS Lambda is documented on the General IAM page (secrets management section), not here. Control aws-work-05 applies the Phase 4 canonical-content rule (one canonical treatment per cross-cutting topic): it covers Lambda execution-role least privilege and function-URL auth, and cross-links to general/iam.html for the Secrets Manager + KMS reference architecture rather than re-authoring it. The same pattern will recur on aws/data.html, where encryption-in-transit cross-links to aws/network.html.
Two anti-conflation callouts up front, because both pairs get confused in design reviews. First: SSM Session Manager replaces SSH; bastion hosts are legacy. The default reflex of "stand up a bastion in a public subnet, allow port 22 from corporate IPs, jump from there" is obsolete (covered as aws-work-02): Session Manager exposes no public ports, requires no inbound network path, integrates with IAM for per-user authorisation, and can write a full session log to S3 and CloudWatch Logs, optionally KMS-encrypted, once session logging is configured. Engineers who insist on bastions are reproducing 2014's threat model; pick Session Manager and retire the bastion. Second: EKS Pod Identity (launched November 2023) is the preferred workload-identity mechanism; IRSA is legacy-but-supported. Pod Identity removes the per-cluster OIDC-provider and trust-policy edits that IRSA required, scales to many clusters without OIDC-provider sprawl, and is the path AWS is investing in (covered as aws-work-06). IRSA continues to work and existing IRSA deployments need not migrate urgently, but new clusters should default to Pod Identity. The same Pod-Identity-vs-IRSA choice was made on the IAM page for aws-iam-06; this page maintains alignment.
Order matters. Controls 01–02 are foundational invariants for every EC2 instance: IMDSv2 mandatory (the SSRF-to-credentials kill-chain mitigation) and SSM as the remote-access plane. Controls 03–04 close the container and vulnerability-assessment loop: ECR scan-on-push at build time, Amazon Inspector for continuous EC2 / ECR / Lambda assessment. Control 05 hardens Lambda functions. Control 06 hardens EKS. Control 07 establishes golden-AMI provenance via EC2 Image Builder. Control 08 handles ongoing patch hygiene via Systems Manager Patch Manager. The page is structured so a reader can skim 01–02 for the everyday EC2 baseline, then dip into 03–08 by service area as needed. Equivalence callouts at the bottom of each control point to the matching control on the Azure, GCP, and OCI sibling pages so a reader can compare modelling across providers, and the compliance-frameworks page describes why each control row carries the same seven framework columns.
aws-work-01-imdsv2-mandatory (CRITICAL, PREVENTIVE)
Configure every EC2 instance with IMDSv2 token-required and hop-limit = 1, and pin the requirement with an organisation-level SCP that denies ec2:RunInstances when ec2:MetadataHttpTokens is not required. IMDSv1 is the unauthenticated, GET-only Instance Metadata Service that any local process — including a web server reflected through an SSRF bug — can call to retrieve the instance role's temporary credentials (Amazon EC2 — IMDSv2 enforcement and hop limit (accessed 2026-05)). IMDSv2 turns the call into a two-step session-token handshake (PUT to obtain a token, GET with the token header) that an SSRF reflection cannot perform because most SSRF payloads can only emit GETs. Hop-limit = 1 means the IMDS response packet has a TTL that decrements to zero after one hop — so a container in the host network namespace can reach it, but a forwarded HTTP request from a non-co-resident attacker cannot. PITFALL 5: hop-limit must be 1 for non-container workloads; the only legitimate reason to raise it to 2 is ECS-on-EC2, where the agent forwards the request through one virtual hop before reaching the IMDS — never raise hop-limit to 2 for general workloads "in case some app needs it", because that is exactly the attacker's wish.
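The two-step handshake described above can be sketched as plain request descriptors. This is a minimal illustration in TypeScript, assuming the standard link-local IMDS address and the documented token headers; the HttpRequestSpec type and both helper names are hypothetical and exist only for this example.

```typescript
// Sketch of the IMDSv2 session-token handshake as plain request descriptors.
// HttpRequestSpec and both helper names are illustrative, not an AWS SDK API.
interface HttpRequestSpec {
  method: "PUT" | "GET";
  url: string;
  headers: Record<string, string>;
}

const IMDS_BASE = "http://169.254.169.254";

// Step 1: a PUT obtains a session token. A GET-only SSRF reflection cannot issue this.
function imdsTokenRequest(ttlSeconds: number): HttpRequestSpec {
  if (!Number.isInteger(ttlSeconds) || ttlSeconds < 1 || ttlSeconds > 21600) {
    throw new RangeError("IMDSv2 token TTL must be between 1 and 21600 seconds");
  }
  return {
    method: "PUT",
    url: `${IMDS_BASE}/latest/api/token`,
    headers: { "X-aws-ec2-metadata-token-ttl-seconds": String(ttlSeconds) },
  };
}

// Step 2: every metadata GET must carry the token returned by step 1.
function imdsMetadataRequest(token: string, path: string): HttpRequestSpec {
  return {
    method: "GET",
    url: `${IMDS_BASE}/latest/meta-data/${path}`,
    headers: { "X-aws-ec2-metadata-token": token },
  };
}
```

Because the first step is a PUT with a custom header, a reflected GET from an SSRF bug never obtains the token that the second step needs.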
Remediation — AWS CLI
# Enforce IMDSv2 on an existing instance (hop-limit=1 = non-container workload default).
aws ec2 modify-instance-metadata-options \
--instance-id i-0abc123def4567890 \
--http-tokens required \
--http-put-response-hop-limit 1 \
--http-endpoint enabled
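# Account-level default for the current Region (repeat per Region): new instances launched after this call default to IMDSv2 unless the launch request or launch template sets other values.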
aws ec2 modify-instance-metadata-defaults \
--http-tokens required \
--http-put-response-hop-limit 1 \
--http-endpoint enabled
# Audit: list instances still allowing IMDSv1.
aws ec2 describe-instances \
--filters Name=metadata-options.http-tokens,Values=optional \
--query 'Reservations[].Instances[].[InstanceId,Tags[?Key==`Name`].Value|[0]]' \
--output table
Remediation — CloudFormation
AWSTemplateFormatVersion: '2010-09-09'
Description: EC2 launch template mandating IMDSv2 (HttpTokens=required) on every instance launched from it.
Resources:
  ImdsV2LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: imdsv2-mandatory
      LaunchTemplateData:
        MetadataOptions:
          HttpTokens: required
          HttpEndpoint: enabled
          # Hop limit 1 matches the non-container default stated in this control.
          HttpPutResponseHopLimit: 1
          InstanceMetadataTags: enabled
Remediation — AWS CDK (TypeScript)
import * as cdk from 'aws-cdk-lib';
import { aws_ec2 as ec2 } from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class ImdsV2LaunchTemplateStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    new ec2.CfnLaunchTemplate(this, 'ImdsV2Lt', {
      launchTemplateName: 'imdsv2-mandatory',
      launchTemplateData: {
        metadataOptions: {
          httpTokens: 'required',
          httpEndpoint: 'enabled',
          // Hop limit 1 matches the non-container default stated in this control.
          httpPutResponseHopLimit: 1,
          instanceMetadataTags: 'enabled',
        },
      },
    });
  }
}
Compliance mapping
Framework | Control reference
CIS AWS Foundations v3.0.0 | 4.x (verify)
CIS Microsoft Azure Foundations v3.0.0 | n/a
CIS GCP Foundation v4.0.0 | n/a
CIS OCI Foundation v2.0.0 | n/a
NIST SP 800-53 rev5 | AC-3; CM-7; SC-8
ISO/IEC 27001:2022 | A.8.20; A.8.25
ISO/IEC 27017:2015 | CLD.9.5.1
Log signals
CloudTrail ec2:ModifyInstanceMetadataOptions events where requestParameters.httpTokens resolves to optional or where requestParameters.httpEndpoint remains enabled with httpPutResponseHopLimit above 1 — the IMDSv1 fallback re-opens the SSRF pivot from compromised containers and SDK clients into the instance role.
CloudTrail ec2:RunInstances events whose requestParameters.metadataOptions.httpTokens is optional (default for older AMIs and launch-templates) — surfaces fleet drift at launch time rather than at modification time.
Config rule ec2-imdsv2-check evaluating NON_COMPLIANT against production-tagged instances — backstop signal for instances that pre-date the CloudTrail event-window or were modified via the console with the event captured outside the working window.
Query
fields @timestamp, eventName, requestParameters.instanceId, requestParameters.httpTokens, requestParameters.httpPutResponseHopLimit, requestParameters.metadataOptions.httpTokens, userIdentity.arn
| filter eventSource = "ec2.amazonaws.com" and eventName in ["ModifyInstanceMetadataOptions","RunInstances"]
| filter requestParameters.httpTokens = "optional" or requestParameters.metadataOptions.httpTokens = "optional" or requestParameters.httpPutResponseHopLimit > 1
| sort @timestamp desc
| limit 100
The CloudWatch Logs Insights query covers both at-modification and at-launch paths; for fleets that rely on auto-scaling launch templates, also confirm the launch template itself is hardened by diffing the output of aws ec2 describe-launch-template-versions against the IaC definition.
Alert threshold
Any httpTokens=optional on a production instance — page immediately; the instance role's credentials are now reachable via IMDSv1 from any compromised process inside the OS.
RunInstances launching with httpTokens=optional from a launch-template not in the IaC allow-list — high-priority ticket; the new launch indicates either a manual launch outside the auto-scaling flow or a stale launch-template.
httpPutResponseHopLimit above 1 on a non-container instance: page. The hop-limit increase has no legitimate use case outside container hosts (ECS-on-EC2 or EKS nodes) and signals that someone wants container-level access to IMDS.
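The three thresholds above reduce to a small decision function. The sketch below is one possible encoding, not the page's own tooling; the ImdsEventSummary shape, its field names, and the precedence (page outranks ticket) are assumptions made for illustration.

```typescript
// Hypothetical summary of a CloudTrail event, already enriched with fleet metadata.
type ImdsAlert = "page" | "ticket" | "none";

interface ImdsEventSummary {
  eventName: "ModifyInstanceMetadataOptions" | "RunInstances";
  httpTokens?: "optional" | "required";
  hopLimit?: number;
  production: boolean;
  containerHost: boolean;
  launchTemplateAllowListed?: boolean;
}

function classifyImdsEvent(summary: ImdsEventSummary): ImdsAlert {
  // IMDSv1 reachable on a production instance: page immediately.
  if (summary.httpTokens === "optional" && summary.production) return "page";
  // Hop limit raised on a host that does not run containers: page.
  const hopLimit = summary.hopLimit === undefined ? 1 : summary.hopLimit;
  if (hopLimit > 1 && !summary.containerHost) return "page";
  // Launch with IMDSv1 allowed from a template outside the IaC allow-list: ticket.
  if (
    summary.eventName === "RunInstances" &&
    summary.httpTokens === "optional" &&
    summary.launchTemplateAllowListed === false
  ) {
    return "ticket";
  }
  return "none";
}
```

Evaluating the page conditions first means a production launch with httpTokens=optional pages even when the launch template is also off the allow-list.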
Initial response
Re-enforce IMDSv2 with aws ec2 modify-instance-metadata-options --instance-id {id} --http-tokens required --http-put-response-hop-limit 1; for launch-templates, increment the version with the hardened metadata-options and set the new version as default.
Pivot to CloudTrail sts:AssumeRole events for the instance's IAM role during the relaxation window and identify any token use from outside the instance's expected workload — pod or container processes accessing the role via IMDSv1 are the canonical abuse pattern.
Open an incident per general/ir.html if any unexpected role use is found; revoke the instance role's active sessions by attaching an inline deny policy conditioned on aws:TokenIssueTime (the IAM console's "Revoke active sessions" action does this), then follow up with the credential-rotation playbook on any downstream resources the role had access to.
Replace SSH with AWS Systems Manager Session Manager on every EC2 instance; delete the bastion fleet. Session Manager establishes interactive shells through the SSM control plane, requires no inbound network path to the instance, authenticates via IAM, and can write full session transcripts to S3 and CloudWatch Logs with optional KMS encryption once session logging is configured (AWS Systems Manager — Session Manager (accessed 2026-05)). The architectural shift is the entire point: bastion hosts with public IPs and port 22 ingress are obsolete. Session Manager is zero-trust (no implicit network reachability), records every StartSession call in CloudTrail, and has no public attack surface. Instances need only the AmazonSSMManagedInstanceCore managed policy on their instance profile, outbound HTTPS reachability to the Systems Manager endpoints (directly or through VPC endpoints), and an SSM Agent (pre-installed on AWS-provided AMIs for Amazon Linux 2, AL2023, Ubuntu Server 18.04+, and Windows Server 2016+). Severity HIGH PREVENTIVE because the control eliminates an entire category of internet-exposed SSH brute-force attacks and credential-stuffing campaigns against bastion fleets; CRITICAL is reserved for SSRF-class single-step exploitation paths (e.g. aws-work-01).
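IAM-based per-user authorisation for Session Manager can be illustrated with a tag-scoped policy. The sketch below assumes instances carry a tag that marks them as SSM-accessible; the function name, the tag key and value, and the PolicyDocument types are hypothetical. The session resource pattern uses ${aws:username}, which resolves for IAM users only, so federated roles would need a different pattern.

```typescript
// Minimal policy-document types for this example only.
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
  Condition?: Record<string, Record<string, string[]>>;
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Builds a policy that allows starting sessions only on instances carrying the given tag,
// and lets a user resume or terminate only their own sessions.
function sessionManagerPolicy(tagKey: string, tagValue: string): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["ssm:StartSession"],
        Resource: ["arn:aws:ec2:*:*:instance/*"],
        Condition: { StringLike: { [`ssm:resourceTag/${tagKey}`]: [tagValue] } },
      },
      {
        Effect: "Allow",
        Action: ["ssm:TerminateSession", "ssm:ResumeSession"],
        // Literal IAM policy variable, not a TypeScript template placeholder.
        Resource: ["arn:aws:ssm:*:*:session/${aws:username}-*"],
      },
    ],
  };
}
```

Serialising the result with JSON.stringify gives a document that can be attached to a group or permission set, after narrowing the instance ARN to the accounts and Regions in scope.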
Log signals
CloudTrail ec2:RunInstances events whose requestParameters.keyName is non-empty on instances tagged for SSM-only access: attaching an SSH key pair to an instance creates the SSH access path even if the security group denies port 22 today (a later SG edit re-opens it without further alert).
VPC Flow Logs ACCEPT records on TCP 22 to any instance's primary ENI — the SSM-only posture means the steady-state SSH traffic to production instances is exactly zero, so any flow is high-signal.
CloudTrail ssm:UpdateInstanceInformation failures returning the instance as ConnectionLost while the instance is still running per ec2:DescribeInstances — indicates the SSM agent has been disabled or the instance role's AmazonSSMManagedInstanceCore attachment has been removed, breaking the documented access path.
The CloudWatch Logs Insights query catches the at-launch path; for the in-flight path, run a parallel query on ec2:AuthorizeSecurityGroupIngress filtered to port-22 introductions and join the results against the org's SSM-only tag in a downstream alert-correlation step.
Alert threshold
Any RunInstances with a non-empty keyName in production — page immediately; the org's launch templates set keyName=null as policy and a non-null value is a deliberate deviation.
Inbound TCP 22 flow to a production instance — page; cross-reference the source IP against corporate-egress CIDRs and against any active SSM Session Manager session in progress (Session Manager does not use port 22 so a concurrent SSH flow is high-confidence unauthorized).
SSM ConnectionLost persisting for more than 15 minutes while the instance shows healthy in EC2 — high-priority ticket; the documented access path is broken and the operator must restore it before the next maintenance window.
Initial response
Terminate the SSH path: revoke any port-22 ingress rule on the instance's security-group, and if a key-pair is attached at launch time, remove the public key from ~/.ssh/authorized_keys via an SSM Run Command before rebooting to ensure the key does not persist in the cloud-init data.
Re-attach AmazonSSMManagedInstanceCore to the instance role if the SSM agent is reporting ConnectionLost; confirm the agent reconnects by starting a Session Manager session as a smoke test.
Pull VPC Flow Logs for the exposure window and enumerate every inbound port-22 flow that succeeded; open an incident per general/ir.html for any flow from outside the corporate egress CIDRs and rotate credentials reachable from the instance's IAM role.
Enable enhanced scanning (Inspector v2 powered) on every Amazon ECR repository, set image_tag_mutability = IMMUTABLE, and gate the deployment pipeline so that any image with CRITICAL findings is blocked from production promotion (Amazon ECR User Guide — image scanning configuration (accessed 2026-05)). Enhanced scanning provides continuous CVE assessment, package-vulnerability detail for both OS and application-language ecosystems (npm, PyPI, RubyGems, Go modules, Maven, NuGet), and integrates findings into Amazon Inspector and Security Hub. The IMMUTABLE tag policy means an attacker who somehow lands push permission to myrepo:latest cannot overwrite an already-scanned tag with a back-doored image — the deployment pipeline pulls the same SHA256-pinned image that was scanned. The DETECTIVE typology is deliberate: scanning surfaces unsafe state, the build-pipeline gate is the PREVENTIVE pair (deployment denied on CRITICAL findings).
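The promotion gate can be sketched as a pure function over the severity counts that a scan reports. This is an illustration under assumptions: the function name, the fail-closed handling of an incomplete scan, and the configurable blockOn list are not from the source, which specifies only that CRITICAL findings block promotion.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW" | "INFORMATIONAL" | "UNDEFINED";
type SeverityCounts = Partial<Record<Severity, number>>;

interface PromotionDecision {
  allowed: boolean;
  reasons: string[];
}

// Decides whether an image may be promoted, given its scan state and finding counts.
function promotionDecision(
  scanComplete: boolean,
  counts: SeverityCounts,
  blockOn: Severity[] = ["CRITICAL"],
): PromotionDecision {
  // Fail closed: an image without a completed scan is never promoted.
  if (!scanComplete) return { allowed: false, reasons: ["scan not complete"] };
  const reasons = blockOn
    .filter((sev) => (counts[sev] || 0) > 0)
    .map((sev) => `${counts[sev]} ${sev} finding(s)`);
  return { allowed: reasons.length === 0, reasons };
}
```

A pipeline step would feed this the severity counts for the exact image digest being promoted, so that the decision and the deployed artefact refer to the same SHA256.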
Log signals
CloudTrail ecr:PutRegistryScanningConfiguration events where the requestParameters.scanType shifts from ENHANCED to BASIC or where the requestParameters.rules array removes the canonical scan-on-push rule for production repository name patterns.
CloudTrail ecr:PutImageScanningConfiguration per-repository where requestParameters.imageScanningConfiguration.scanOnPush flips to false — silently disables scanning for one repository without touching the registry-wide configuration.
Inspector finding-export volume from ECR-source findings drops to zero on a repository that historically reports findings — passive signal that scan results stopped arriving even if the configuration looks correct.
The CloudWatch Logs Insights query targets the two regression paths simultaneously; the scanType=BASIC case is more severe because it strips Inspector-managed CVE feeds and leaves only the older basic-scan engine.
Alert threshold
Any PutRegistryScanningConfiguration shifting to BASIC in production — page immediately; the registry-wide downgrade affects every repository and removes Inspector CVE coverage instantly.
Per-repository scanOnPush=false change — high-priority ticket; the repository's image-push gate is now bypassed and any vulnerability scanning happens (if at all) on a delayed registry scan rather than at the push gate.
Inspector finding-stream volume below 10% of trailing-7-day baseline for an ECR repository with active pushes — informational; promote to incident if the divergence persists for 24 hours and the scan configuration appears nominally correct (indicates an Inspector-side issue or service-link misconfiguration).
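The 10% baseline comparison above is simple arithmetic, sketched below. The function name, the use of a mean over the trailing daily counts as the baseline, and the hoursBelowThreshold input are assumptions; the source does not say how the baseline is computed.

```typescript
type VolumeSignal = "ok" | "informational" | "incident";

// Compares today's finding count with 10% of the trailing daily mean.
function findingVolumeSignal(
  trailingDailyCounts: number[],
  todayCount: number,
  pushesToday: number,
  hoursBelowThreshold: number,
): VolumeSignal {
  // No history or no pushes: a quiet finding stream is expected.
  if (trailingDailyCounts.length === 0 || pushesToday === 0) return "ok";
  const total = trailingDailyCounts.reduce((sum, n) => sum + n, 0);
  const baseline = total / trailingDailyCounts.length;
  // todayCount * 10 >= baseline is the same test as todayCount >= 10% of baseline,
  // written without a fractional multiplier.
  if (baseline === 0 || todayCount * 10 >= baseline) return "ok";
  return hoursBelowThreshold >= 24 ? "incident" : "informational";
}
```

With a trailing week of [40, 50, 60, 50, 40, 60, 50] the mean is 50, so the threshold is 5 findings: a day with 4 findings and active pushes is informational, and becomes an incident once the shortfall has lasted 24 hours.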
Initial response
Restore the scanning configuration with aws ecr put-registry-scanning-configuration --scan-type ENHANCED --rules file://canonical-rules.json; for per-repository overrides, re-set scanOnPush=true via put-image-scanning-configuration.
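Rescan the images pushed during the disabled window. With enhanced scanning restored, Amazon Inspector rescans in-scope images on its own; aws ecr start-image-scan per image digest applies to repositories on basic scanning. Either way, confirm that findings exist for every digest pushed in the window; this surfaces any CVE that would have blocked the push if scanning had been active.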
Open an incident via general/ir.html if any image with a Critical or High CVE was pushed during the window and has since been deployed; the corresponding workload's runtime exposure needs to be evaluated and a patched image promoted ahead of normal release cadence.
Enable Amazon Inspector organisation-wide with the delegated-administrator pattern, covering EC2 instances, ECR repositories, and Lambda functions; route findings into AWS Security Hub for unified triage (Amazon Inspector User Guide — EC2/ECR/Lambda scanning (accessed 2026-05)). A naming caveat that matters for accuracy in audit reports: the product is Amazon Inspector (current name; the older "AWS Inspector" branding is no longer correct), and the "v2" qualifier was dropped in 2024 — the previous "Amazon Inspector v2" is now simply "Amazon Inspector". Inspector continuously assesses EC2 (agent-based and agentless modes), ECR images (the same scanning surfaced in aws-work-03), and Lambda functions (package and code scanning). Severity HIGH DETECTIVE because Inspector surfaces unsafe state — CVEs, misconfigurations, public-network-path findings — but is not itself the preventive gate; the build-pipeline integration and the IMDSv2/SCP combination on aws-work-01 are the preventive pairs.
Remediation — AWS CLI
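# Designate the delegated administrator (from the Organizations management account).
aws inspector2 enable-delegated-admin-account \
  --delegated-admin-account-id 222233334444
# From the delegated admin: enable Inspector for member accounts, all resource types.
# --account-ids takes explicit 12-digit account IDs; there is no ALL_MEMBERS keyword.
aws inspector2 enable \
  --account-ids 111122223333 222233334444 \
  --resource-types EC2 ECR LAMBDA LAMBDA_CODE
# Auto-enable the same scan types for accounts that join the organisation later.
aws inspector2 update-organization-configuration \
  --auto-enable ec2=true,ecr=true,lambda=true,lambdaCode=true
# Verify status.
aws inspector2 batch-get-account-status \
  --account-ids 111122223333 222233334444 \
  --query 'accounts[].[accountId,state.status]'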
Remediation — CloudFormation
AWSTemplateFormatVersion: '2010-09-09'
Description: Amazon Inspector findings filter matching HIGH severity. A filter does not enable scanning; enablement happens through the CLI calls above.
Resources:
  HighSeverityFilter:
    Type: AWS::InspectorV2::Filter
    Properties:
      Name: high-severity-findings
      FilterAction: NONE
      FilterCriteria:
        Severity:
          - Comparison: EQUALS
            Value: HIGH
Compliance mapping
Framework | Control reference
CIS AWS Foundations v3.0.0 | (best-practices)
CIS Microsoft Azure Foundations v3.0.0 | n/a
CIS GCP Foundation v4.0.0 | n/a
CIS OCI Foundation v2.0.0 | n/a
NIST SP 800-53 rev5 | RA-5; SI-4
ISO/IEC 27001:2022 | A.8.8
ISO/IEC 27017:2015 | CLD.12.4.5
Log signals
CloudTrail inspector2:Disable events where the requestParameters.resourceTypes includes EC2, ECR, or LAMBDA — turns off scanning for the corresponding resource family at the org level.
CloudTrail inspector2:DisassociateMember on member accounts — peels accounts out of the org aggregation; the finding stream from those accounts stops flowing to the delegated-administrator account from that point.
CloudTrail inspector2:UpdateOrganizationConfiguration where autoEnable for any resource type flips to false — leaves current coverage intact but breaks the onboarding posture for new accounts and resources.
Run the CloudWatch Logs Insights query against the delegated-administrator's CloudTrail log group; Inspector org-management events route through that account and the delegated-admin context is the canonical source of truth for org-level Inspector state.
Alert threshold
Any Disable affecting a resource type in production — page immediately; CVE-feed coverage stops flowing for that resource family at the moment of the disable and re-enabling triggers a full re-scan which takes hours to complete.
DisassociateMember on more than one account in a 24-hour window: page. Single-account events may reflect legitimate account closures, but multi-account events indicate a sweep.
UpdateOrganizationConfiguration with autoEnable=false for any resource type — high-priority ticket within one business hour; the downside surfaces over weeks as new resources / accounts arrive uncovered.
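The multi-account window check above can be sketched as follows. The MemberRemoval shape and function name are hypothetical, and the input is assumed to be member-removal events already extracted from CloudTrail with timestamps in epoch milliseconds.

```typescript
interface MemberRemoval {
  accountId: string;
  epochMs: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true when any 24-hour window contains removals for more than one distinct account.
function isMemberRemovalSweep(events: MemberRemoval[]): boolean {
  return events.some((first) => {
    // Collect the distinct accounts removed within 24 hours starting at this event.
    const accounts = new Set(
      events
        .filter((e) => e.epochMs >= first.epochMs && e.epochMs - first.epochMs < DAY_MS)
        .map((e) => e.accountId),
    );
    return accounts.size > 1;
  });
}
```

The quadratic scan is deliberate for readability; member-removal events are rare enough that a sliding-window optimisation is unlikely to matter.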
Initial response
Re-enable scanning with aws inspector2 enable --resource-types EC2 ECR LAMBDA --account-ids {accounts}; verify via batch-get-account-status that all three resource types report ENABLED for every member account.
Restore autoEnable=true via aws inspector2 update-organization-configuration and confirm new-account enrolment by creating a sandbox test account and verifying it auto-enrolls within one hour.
Open an incident via general/ir.html. Amazon Inspector does not expose a general on-demand rescan call; re-enabling a resource type starts a fresh assessment on its own. For the gap-of-coverage window, wait for coverage to return to its pre-incident level and inventory any Critical / High findings that surface in the post-restoration sweep, since these were latent during the gap.
Every AWS Lambda function gets its own least-privileged execution role (no shared "lambda-default" role, no *:* wildcards); its function URL (where present) is configured with AuthType = AWS_IAM; its secrets are pulled at runtime from AWS Secrets Manager via KMS-encrypted references rather than baked into environment variables; and production functions carry a reserved_concurrent_executions ceiling that caps blast-radius cost during anomalous traffic (AWS Lambda Developer Guide — execution-role least privilege (accessed 2026-05)). The canonical secrets-management reference architecture (Secrets Manager rotation, KMS key policies, runtime fetching) is documented on the General IAM page (§secrets management), and this control intentionally cross-links rather than re-authoring per the Phase 4 canonical-content rule. Severity HIGH PREVENTIVE because Lambda functions inherit AWS-managed isolation but their execution-role permissions are the blast radius if the function is compromised; a function with s3:* on * is functionally an account-wide read-write key to every bucket, in the hands of any attacker who exploits a code bug.
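A pre-deployment lint for the wildcard grants this control forbids might look like the sketch below. The IamStatement shape covers only the fields the check needs, the function names are hypothetical, and the check is deliberately narrow: it flags Allow statements that pair a service-wide or global action wildcard with Resource "*", and ignores NotAction, conditions, and partial wildcards.

```typescript
interface IamStatement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}

// IAM accepts either a single string or a list; normalise to a list.
function toList(value: string | string[]): string[] {
  return Array.isArray(value) ? value : [value];
}

// Returns one message per Allow statement that grants "*" or "service:*" on every resource.
function findWildcardGrants(statements: IamStatement[]): string[] {
  const problems: string[] = [];
  statements.forEach((stmt, index) => {
    if (stmt.Effect !== "Allow") return;
    const broadActions = toList(stmt.Action).filter((a) => a === "*" || a.endsWith(":*"));
    const anyResource = toList(stmt.Resource).some((r) => r === "*");
    if (broadActions.length > 0 && anyResource) {
      problems.push(`statement ${index}: ${broadActions.join(", ")} on *`);
    }
  });
  return problems;
}
```

Running this over each execution role's policy statements in CI gives an early signal before the CloudTrail-based detection further down ever has to fire.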
Remediation — AWS CLI
# Update an existing function to use a least-priv role and a Secrets Manager reference.
aws lambda update-function-configuration \
  --function-name order-processor \
  --role arn:aws:iam::111122223333:role/lambda-order-processor-role \
  --environment 'Variables={SECRET_ARN=arn:aws:secretsmanager:eu-west-1:111122223333:secret:db/order-xyz}'
# Reserved concurrency is set through its own API call, not update-function-configuration.
aws lambda put-function-concurrency \
  --function-name order-processor \
  --reserved-concurrent-executions 50
# Require IAM-signed requests on the function URL.
aws lambda update-function-url-config \
  --function-name order-processor \
  --auth-type AWS_IAM
# Audit: list functions whose role is the legacy lambda_basic_execution role.
aws lambda list-functions \
--query 'Functions[?contains(Role,`lambda_basic_execution`)].[FunctionName,Role]' \
--output table
Log signals
CloudTrail lambda:CreateFunction or lambda:UpdateFunctionConfiguration where the resolved execution-role's attached-policy list includes any *FullAccess, AdministratorAccess, or iam:* policy: the function inherits the entire policy surface, defeating the least-privilege intent.
CloudTrail iam:AttachRolePolicy targeting a Lambda execution role with a managed-policy ARN outside the org's curated allow-list — silently widens an existing function's permissions without touching the function itself.
CloudTrail lambda:UpdateFunctionConfiguration setting KmsKeyArn to null, or lambda:UpdateFunctionUrlConfig changing AuthType away from AWS_IAM: adjacent posture regressions that often accompany permission widening.
Query
fields @timestamp, eventName, requestParameters.functionName, requestParameters.role, requestParameters.policyArn, requestParameters.roleName, userIdentity.arn
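# Lambda CloudTrail event names carry an API-version suffix (for example CreateFunction20150331), so match on the prefix.
| filter eventSource in ["lambda.amazonaws.com","iam.amazonaws.com"] and (eventName like /^CreateFunction/ or eventName like /^UpdateFunctionConfiguration/ or eventName = "AttachRolePolicy")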
| filter requestParameters.policyArn like /FullAccess$/ or requestParameters.policyArn like /AdministratorAccess$/ or requestParameters.policyArn like /:policy\/iam-/
| sort @timestamp desc
| limit 100
The CloudWatch Logs Insights query covers both function-side and role-side mutations; the policyArn regex catches the most common over-privileging idioms and should be tuned against the org's managed-policy allow-list maintained as a TSV in the IaC repository.
Alert threshold
Any Lambda execution role acquiring an *FullAccess or AdministratorAccess policy in production — page immediately; the function's request-driven invocation model means any caller-reachable trigger now has effective use of the over-privileged role.
A function deployed with KmsKeyArn=null when environment variables contain non-empty values — high-priority ticket; environment variables are visible to anyone with lambda:GetFunctionConfiguration when not CMK-encrypted and frequently contain secrets that should have been in Secrets Manager instead.
Function-URL AuthType=NONE on a production function — page; the function is now Internet-reachable without IAM-signed requests and the invocation surface is anyone who knows the URL.
Initial response
Detach the over-privileged policy with aws iam detach-role-policy and re-attach the function's IaC-canonical scoped policy; if the function was created from scratch with the over-privileged role, delete the function (idempotent re-deploy will recreate it with the correct role from IaC).
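Inventory the function's activity during the over-privilege window: pull the CloudTrail events issued under the execution role's sessions to enumerate every AWS API call the function made, and use the function's CloudWatch Logs invocation stream for application-level context. Any call outside the function's documented action set is a candidate abuse trace.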
Open an incident via general/ir.html if the function's invocation stream includes calls to S3 objects, KMS keys, or IAM principals outside its expected scope; rotate any credentials the function may have touched via the credential-rotation playbook (aws-ir-06-credential-rotation-playbook).
Harden Amazon EKS clusters along four orthogonal axes: (a) workload identity via EKS Pod Identity (launched November 2023, preferred over IRSA per Phase 5 IAM precedent on aws-iam-06; IRSA remains legacy-but-supported for existing clusters); (b) Pod Security Admission with the restricted profile enforced on every workload namespace; (c) private control-plane endpoint with public access disabled; (d) control-plane audit logs shipped to CloudWatch (Amazon EKS — Pod Identity (accessed 2026-05)). Pod Identity decouples the Kubernetes ServiceAccount-to-IAM-role mapping from the per-cluster OIDC-provider step IRSA required; one EKS-side association replaces the cluster-specific IRSA trust-policy edit, which is the property that makes Pod Identity scale across many clusters without OIDC-provider sprawl. Scope is deliberately bounded: this control covers (a)-(d) above; service-mesh integration, admission-controller frameworks beyond PSA, and runtime-detection agents are out of scope for this page and will be covered as v2 supplementary topics in a future release.
Remediation — AWS CLI
# Create a Pod Identity association: Kubernetes SA <-> IAM role mapping.
aws eks create-pod-identity-association \
--cluster-name prod-cluster \
--namespace orders \
--service-account order-processor-sa \
--role-arn arn:aws:iam::111122223333:role/eks-order-processor
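# Update cluster: private endpoint only. EKS accepts one update type per call,
# so endpoint access and logging go in separate calls.
aws eks update-cluster-config \
  --name prod-cluster \
  --resources-vpc-config endpointPrivateAccess=true,endpointPublicAccess=false
aws eks wait cluster-active --name prod-cluster
# Update cluster: full control-plane logging.
aws eks update-cluster-config \
  --name prod-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'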
# Enforce Pod Security Admission restricted profile via namespace labels.
kubectl label namespace orders \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/audit=restricted \
pod-security.kubernetes.io/warn=restricted
Remediation — Terraform
# Terraform AWS provider ~> 5.0
# Source: AWS docs (accessed 2026-05)
resource "aws_eks_cluster" "prod" {
name = "prod-cluster"
role_arn = aws_iam_role.cluster.arn
version = "1.30"
vpc_config {
subnet_ids = aws_subnet.private[*].id
endpoint_private_access = true
endpoint_public_access = false # Private control plane only
}
enabled_cluster_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
encryption_config {
provider { key_arn = aws_kms_key.eks.arn }
resources = ["secrets"]
}
}
# Pod Identity association (preferred over IRSA for new clusters).
resource "aws_eks_pod_identity_association" "order_processor" {
cluster_name = aws_eks_cluster.prod.name
namespace = "orders"
service_account = "order-processor-sa"
role_arn = aws_iam_role.order_processor_pod.arn
}
resource "aws_iam_role" "order_processor_pod" {
name = "eks-order-processor"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = ["sts:AssumeRole", "sts:TagSession"]
Principal = { Service = "pods.eks.amazonaws.com" }
}]
})
}
Remediation — CloudFormation
AWSTemplateFormatVersion: '2010-09-09'
Description: EKS Pod Identity addon (installs the agent that mints scoped credentials per pod).
Parameters:
  ClusterName:
    Type: String
Resources:
  PodIdentityAgent:
    Type: AWS::EKS::Addon
    Properties:
      ClusterName: !Ref ClusterName
      AddonName: eks-pod-identity-agent
      ResolveConflicts: OVERWRITE
Compliance mapping
Framework | Control reference
CIS AWS Foundations v3.0.0 | n/a (post-v3.0.0)
CIS Microsoft Azure Foundations v3.0.0 | n/a
CIS GCP Foundation v4.0.0 | n/a
CIS OCI Foundation v2.0.0 | n/a
NIST SP 800-53 rev5 | AC-3; AC-6; SC-7
ISO/IEC 27001:2022 | A.5.15; A.8.20
ISO/IEC 27017:2015 | CLD.9.5.1
Log signals
EKS audit-log events showing a pod requesting a service-account token that maps via Pod Identity to an IAM role whose policy graph includes iam:*, sts:AssumeRole on broad principals, or any *FullAccess managed policy — surfaces over-privileged pod-identity bindings even when the binding itself was created legitimately.
CloudTrail sts:AssumeRoleWithWebIdentity (legacy IRSA path) on a cluster that should have migrated to Pod Identity — indicates either an un-migrated workload or a deliberate downgrade to the older path that lacks the EKS-managed credential injection guard rails.
CloudTrail use of a pod-identity-bound role's session credentials from outside the cluster's documented egress IP ranges — the SDK-side use should always trace back to the cluster's NAT-gateway egress IP set, so any other source IP is high-confidence credential leakage.
Query
fields @timestamp, eventName, requestParameters.roleArn, requestParameters.roleSessionName, sourceIPAddress, userIdentity.sessionContext.sessionIssuer.arn
| filter eventSource = "sts.amazonaws.com" and eventName = "AssumeRole"
| parse requestParameters.roleArn /arn:aws:iam::(?<acct>\d+):role\/(?<role>.+)/
| filter role like /^eks-pod-/
| stats count() as n by sourceIPAddress, role
| sort n desc
| limit 50
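The CloudWatch Logs Insights query aggregates AssumeRole calls per source IP per role. The role filter assumes pod-identity roles share an eks-pod- name prefix, so adjust it to the naming in use (the example role above is named eks-order-processor and would not match). The canonical posture has at most a handful of egress IPs per cluster, so any unfamiliar source IP with one or more calls is the actionable anomaly.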
Alert threshold
Any pod-identity-bound role accessed from a source-IP outside the cluster's egress IP set — page immediately and treat as confirmed credential leakage until proven otherwise; the role's session credentials should never be observed outside the cluster network.
A new pod-identity binding to an IAM role whose effective policy graph includes iam:* — page; the binding makes that role's blast radius the entire IAM control plane and the legitimate use cases for such bindings are extremely rare.
Legacy IRSA AssumeRoleWithWebIdentity traffic on a cluster post-migration — informational; the migration completion criteria should drive this to zero and any residual traffic indicates an un-migrated workload.
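Checking a CloudTrail sourceIPAddress against the cluster's egress set is a CIDR-membership test, sketched below for IPv4. The function names and the three-way verdict are assumptions; CloudTrail can also record an AWS service name or an IPv6 address in that field, which this sketch reports as not-ipv4 rather than guessing.

```typescript
type SourceVerdict = "inside" | "outside" | "not-ipv4";

const IPV4_PATTERN = /^\d{1,3}(\.\d{1,3}){3}$/;

// Converts a dotted-quad address to an unsigned 32-bit integer; throws on octets above 255.
function ipv4ToInt(ip: string): number {
  const octets = ip.split(".").map(Number);
  if (octets.length !== 4 || octets.some((o) => !Number.isInteger(o) || o < 0 || o > 255)) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  // >>> 0 keeps the result unsigned after the 32-bit bitwise operations.
  return ((octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]) >>> 0;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, prefixText] = cidr.split("/");
  const prefix = Number(prefixText);
  if (!Number.isInteger(prefix) || prefix < 0 || prefix > 32) {
    throw new Error(`bad CIDR prefix: ${cidr}`);
  }
  // A shift by 32 is a no-op in JavaScript, so /0 needs an explicit zero mask.
  const mask = prefix === 0 ? 0 : (0xffffffff << (32 - prefix)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

function classifySourceIp(sourceIp: string, egressCidrs: string[]): SourceVerdict {
  if (!IPV4_PATTERN.test(sourceIp)) return "not-ipv4";
  return egressCidrs.some((cidr) => inCidr(sourceIp, cidr)) ? "inside" : "outside";
}
```

An "outside" verdict corresponds to the page-immediately threshold above; a "not-ipv4" verdict needs a separate decision about which AWS service principals are expected callers.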
Initial response
Revoke the binding via aws eks delete-pod-identity-association if the role over-privilege was the regression, or detach the over-privileged policy from the role if the binding itself was legitimate; force a pod restart in the affected namespace so the in-memory token cache invalidates.
For source-IP leakage, immediately revoke active sessions by attaching an inline deny-all policy conditioned on aws:TokenIssueTime to the role (the AWSRevokeOlderSessions pattern, applied with aws iam put-role-policy), and add the temporary scoped-deny SCP from aws-iam-08-scp-deny-list; this stops further use within seconds while the IAM-engineering team works on a permanent role replacement.
Open an incident per general/ir.html; cross-reference every sts:AssumeRole call against the cluster's egress-IP set during the prior 24 hours and treat any unaccounted-for source as a candidate active compromise.
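Session revocation for a leaked role is commonly implemented as an inline policy that denies every action to sessions whose token was issued before a cut-off time. The following is a minimal sketch of building that policy document; the function name is hypothetical, and the cut-off is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone


def revoke_older_sessions_policy(cutoff):
    """Build a deny-all policy for sessions whose token was issued before `cutoff`.

    `cutoff` is assumed to be timezone-aware; it is normalised to UTC.
    """
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # Sessions issued after the cut-off are unaffected, so workloads
                # recover as soon as they obtain fresh credentials.
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
            }
        ],
    }
```

Serialised with json.dumps, the result can be supplied as the policy document for aws iam put-role-policy on the affected role. Because an attacker who still holds a path to the role can obtain fresh sessions, the policy is only a stop-gap alongside the binding or role replacement described above.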
Build EC2 instances from golden AMIs produced by an EC2 Image Builder pipeline whose ImageRecipe applies the CIS Amazon Linux 2023 (or equivalent) hardening components, installs the SSM Agent, and distributes the resulting AMI to every workload region via a DistributionConfiguration (EC2 Image Builder User Guide (accessed 2026-05)). The pipeline rebuilds on a weekly cadence and on a CVE-driven trigger (a CRITICAL CVE in the base image fires the build via an EventBridge rule). Severity MEDIUM PREVENTIVE because golden AMIs reduce ongoing patch surface but are not by themselves the patch-management story: aws-work-08 (Patch Manager) handles post-deployment patches; golden AMIs handle initial-state hygiene. The pair (golden AMI + Patch Manager) is the AWS-native equivalent of the "immutable infrastructure with periodic re-baking" pattern.
AWSTemplateFormatVersion: '2010-09-09'
Description: EC2 Image Builder pipeline producing a hardened golden AMI on a weekly cadence.
Parameters:
  RecipeArn:
    Type: String
  InfrastructureConfigArn:
    Type: String
  DistributionConfigArn:
    Type: String
Resources:
  GoldenAmiPipeline:
    Type: AWS::ImageBuilder::ImagePipeline
    Properties:
      Name: golden-ami-weekly
      ImageRecipeArn: !Ref RecipeArn
      InfrastructureConfigurationArn: !Ref InfrastructureConfigArn
      DistributionConfigurationArn: !Ref DistributionConfigArn
      Schedule:
        ScheduleExpression: cron(0 6 ? * SUN *)
        PipelineExecutionStartCondition: EXPRESSION_MATCH_AND_DEPENDENCY_UPDATES_AVAILABLE
      Status: ENABLED
Compliance mapping
CIS AWS Foundations v3.0.0: (best-practices)
CIS Microsoft Azure Foundations v3.0.0: n/a
CIS GCP Foundation v4.0.0: n/a
CIS OCI Foundation v2.0.0: n/a
NIST SP 800-53 rev5: CM-2; SI-2; SA-10
ISO/IEC 27001:2022: A.8.9; A.8.32
ISO/IEC 27017:2015: CLD.12.4.5
Log signals
CloudTrail ec2:RunInstances events whose requestParameters.imageId resolves to an AMI not in the canonical golden-AMI set tracked by Image Builder pipelines — production launches outside the golden-AMI allow-list are by definition unauthorized.
CloudTrail imagebuilder:UpdateImagePipeline events where the imageRecipeArn changes or where buildComponentArn entries are removed — silently alters the golden-AMI build outcome without touching the AMI distribution.
CloudTrail imagebuilder:CancelImageCreation or pipeline-execution failures — disrupts the cadence of fresh golden AMIs, leading to fleet drift toward stale AMIs that miss recent CVE patches.
Query
fields @timestamp, eventName, requestParameters.imageId, requestParameters.imagePipelineArn, requestParameters.containerRecipeArn, requestParameters.imageRecipeArn, userIdentity.arn
| filter eventSource in ["ec2.amazonaws.com","imagebuilder.amazonaws.com"] and eventName in ["RunInstances","UpdateImagePipeline","CancelImageCreation","DeleteImagePipeline"]
| sort @timestamp desc
| limit 100
For the RunInstances variant, post-process the CloudWatch Logs Insights output against the org's golden-AMI ID allow-list (maintained in SSM Parameter Store under /golden-amis/{family}/latest) to surface the launches that fell outside the curated set.
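The allow-list post-processing might look like the sketch below. It assumes the query rows have been flattened into dicts keyed by the field names used in the query, and that the golden-AMI IDs have already been read from Parameter Store; the function name and the AMI IDs in the usage example are hypothetical.

```python
def launches_outside_allow_list(rows, golden_ami_ids):
    """Return RunInstances rows whose imageId is not in the golden-AMI allow-list."""
    allowed = set(golden_ami_ids)
    return [
        row
        for row in rows
        if row.get("eventName") == "RunInstances"
        and row.get("requestParameters.imageId") not in allowed
    ]


# Usage with illustrative data only.
sample_rows = [
    {"eventName": "RunInstances", "requestParameters.imageId": "ami-0aaa", "userIdentity.arn": "arn:example:ci"},
    {"eventName": "RunInstances", "requestParameters.imageId": "ami-0zzz", "userIdentity.arn": "arn:example:dev"},
    {"eventName": "UpdateImagePipeline", "userIdentity.arn": "arn:example:ci"},
]
offenders = launches_outside_allow_list(sample_rows, ["ami-0aaa", "ami-0bbb"])
```

A RunInstances row with no imageId field is also returned, which errs on the side of review for launches that resolve their AMI indirectly (for example through a launch template).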
Alert threshold
Any production launch from an AMI ID outside the golden-AMI allow-list — page immediately; the launched instance lacks the org's baked-in hardening (CloudWatch agent, SSM agent, CIS-Level-1 configuration) and represents a fleet posture deviation.
UpdateImagePipeline changing the recipe ARN outside a tracked change-management ticket — high-priority ticket within one business hour; the recipe defines which CIS controls bake into every downstream AMI.
Image Builder pipeline-execution failure persisting beyond two consecutive scheduled runs — informational; the alert should escalate to high if the pipeline has not produced a fresh AMI in 14 days, since the org's patch SLA depends on the cadence.
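The escalation rule in the third threshold can be expressed as a small classifier. This sketch reads "beyond two consecutive scheduled runs" as two or more consecutive failures and "not produced a fresh AMI in 14 days" as 14 days or more since the last successful build; both readings, the function name, and the level labels are assumptions to adjust to local policy.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=14)  # patch-SLA bound taken from the threshold text
FAILURE_STREAK = 2                # consecutive failed scheduled runs


def pipeline_alert_level(consecutive_failures, last_successful_build, now):
    """Classify pipeline health as "high", "informational", or "none"."""
    # Staleness outranks the failure streak because the patch SLA depends on it.
    if now - last_successful_build >= STALE_AFTER:
        return "high"
    if consecutive_failures >= FAILURE_STREAK:
        return "informational"
    return "none"
```

Checking staleness first means a pipeline that has been silently disabled (no failures, but no builds either) still escalates.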
Initial response
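For unauthorised AMI launches, immediately isolate the instance per the EC2 isolation playbook (aws-ir-05-isolation-playbook-ec2): replace its security groups with a containment SG that has no inbound or outbound rules (security groups are allow-only, so containment means removing rules, not adding a deny) and snapshot the EBS volumes before any further triage.
Restore the Image Builder pipeline's recipe from IaC with aws imagebuilder update-image-pipeline --image-pipeline-arn {arn} --image-recipe-arn {canonical-arn} --infrastructure-configuration-arn {infra-arn} (the update call does not support selective updates, so every required property must be supplied); trigger an immediate one-shot build via start-image-pipeline-execution to refresh the golden AMI.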
Open an incident via general/ir.html; the unauthorised AMI may have been deliberately crafted with a baked-in backdoor, so the snapshot from step 1 should be forensically imaged and the AMI itself analysed for any deviation from the public Amazon Linux 2023 / Ubuntu base manifests.
Enable AWS Systems Manager Patch Manager on every EC2 instance: define a patch baseline (the approved-patch ruleset per OS family), define a maintenance window (the time-of-day envelope when patches are applied), attach instances via tag-based patch groups, and surface patch-compliance metrics to AWS Config (AWS Systems Manager — Patch Manager (accessed 2026-05)). The DETECTIVE typology is deliberate: Patch Manager reports the compliance state of every managed instance against the baseline, and the maintenance-window automation is the remediation pair. Severity MEDIUM because the control is operational hygiene rather than single-step exploitation prevention; pairs with aws-work-04 (Inspector flags the CVEs that the baseline must include) and aws-work-07 (golden AMIs handle initial-state hygiene; Patch Manager handles steady-state drift).
Log signals
CloudTrail ssm:DeleteMaintenanceWindow events targeting the canonical patch-window resources — destroys the scheduled patch cadence; instances stay on their current patch baseline until manual intervention.
CloudTrail ssm:DeregisterTargetFromMaintenanceWindow where the deregistered target is the production instance-fleet tag — peels instances out of the patch scope one population at a time.
SSM DescribePatchGroupState output where InstancesWithFailedPatches + InstancesWithMissingPatches is non-zero on a patch-group whose maintenance-window has been disabled or removed: a passive signal of patch drift, which is the measurable outcome of the regression.
Pair the CloudWatch Logs Insights query with a daily completeness check that runs aws ssm describe-maintenance-windows and asserts each canonical window is present, enabled, and has the expected target registration count.
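The daily completeness check could be structured as below. This is a sketch over already-parsed JSON: it assumes the describe-maintenance-windows output (with its WindowIdentities list) has been loaded into a dict, and that per-window target counts have been gathered separately, since that call does not return them. The function name and the shape of the two lookup dicts are hypothetical.

```python
def check_windows(describe_output, expected_targets, target_counts):
    """Return a list of human-readable problems; an empty list means the check passed.

    expected_targets: {window name: expected number of registered targets}
    target_counts:    {window id: observed number of registered targets}
    """
    by_name = {w["Name"]: w for w in describe_output.get("WindowIdentities", [])}
    problems = []
    for name, expected in expected_targets.items():
        window = by_name.get(name)
        if window is None:
            problems.append(f"{name}: missing")
            continue
        if not window.get("Enabled", False):
            problems.append(f"{name}: disabled")
        observed = target_counts.get(window["WindowId"], 0)
        if observed != expected:
            problems.append(f"{name}: {observed} targets registered, expected {expected}")
    return problems
```

A non-empty result would raise the same alerts as the CloudTrail signals, which covers the case where the delete or deregister event itself was missed.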
Alert threshold
Any DeleteMaintenanceWindow in production — page immediately; the patch cadence breaks at the moment of the delete and the CVE exposure window grows linearly with time.
DeregisterTargetFromMaintenanceWindow removing more than 10% of the fleet — high-priority ticket; an explicit operator choice that should map to a tracked change ticket, not a one-off CLI invocation.
Patch compliance below 95% for any patch-group for more than 14 days — informational hygiene ticket; promote to incident if the affected patch-group includes internet-facing workloads and the missing patches include Critical CVEs.
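The compliance threshold and its promotion rule can be written down as a short decision function. In this sketch the caller supplies the compliant and total instance counts (how "compliant" is derived from the patch-group state is left as an assumption), and the function name and return labels are hypothetical.

```python
def patch_compliance_alert(compliant, total, days_below, internet_facing, critical_missing):
    """Return "none", "hygiene-ticket", or "incident" for one patch-group."""
    if total == 0:
        return "none"  # an empty patch-group has nothing to measure
    ratio = compliant / total
    # The alert needs both conditions: below 95% and for more than 14 days.
    if ratio >= 0.95 or days_below <= 14:
        return "none"
    if internet_facing and critical_missing:
        return "incident"
    return "hygiene-ticket"
```

For example, a group at 90 of 100 compliant instances for 20 days yields a hygiene ticket, and the same group is promoted to an incident once it is internet-facing and missing Critical patches.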
Initial response
Restore the maintenance-window from IaC with aws ssm create-maintenance-window using the canonical schedule expression, then re-register patch-baseline targets via register-target-with-maintenance-window and register-task-with-maintenance-window.
Trigger a one-shot patch sweep with aws ssm send-command --document-name AWS-RunPatchBaseline --parameters Operation=Install, targeted at the affected patch-group, to close the gap created by the missed scheduled window (AWS-RunPatchBaseline is a Command document and runs through Run Command).
Open an incident via general/ir.html if any production instance has missed two or more scheduled patch cycles; the workload's exposed CVE list should be enumerated via aws inspector2 list-findings filtered on the instance ARN, and any active-exploit CVE should drive immediate remediation rather than waiting for the next maintenance window.