
Getting Started with Amazon ECS

Graham King
Senior DevOps & Cloud Infrastructure Engineer. Founder of Luna, a cloud-based vulnerability management platform.

I’ve been running production workloads on ECS for a while now and wanted to share some practical lessons. These are things I’ve picked up along the way that aren’t always obvious from the AWS docs.

Why ECS Over EKS?

EKS is great if you need the full Kubernetes ecosystem, but it comes with overhead. The control plane alone costs ~$75/month before you run a single container. ECS has no control plane cost, tighter AWS integration, and is simpler to operate for most workloads.

If your team isn’t already invested in Kubernetes tooling, ECS is worth considering.

Cluster Setup

ECS has traditionally offered two launch types: Fargate (serverless) and EC2 (self-managed instances). A newer third option, Managed Instances, sits between the two: you get EC2 pricing while ECS manages the instance lifecycle for you.

For cost-sensitive workloads, Managed Instances with Graviton ARM64 instances (t4g, c7g families) are hard to beat. ARM instances are roughly 20% cheaper than their x86 equivalents for comparable performance.

aws ecs create-cluster --cluster-name my-app-prod

If you’re using Managed Instances, you’ll configure a capacity provider that specifies instance types, architecture, and networking:

aws ecs create-capacity-provider \
  --name my-app-capacity \
  --auto-scaling-group-provider "..." \
  --instance-types t4g.micro,t4g.small,t4g.medium

Task Definitions

A task definition is basically a blueprint for your containers. You specify the image, CPU, memory, port mappings, environment variables, and logging.

Here’s a stripped-down example:

{
  "family": "my-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["EC2"],
  "runtimePlatform": {
    "cpuArchitecture": "ARM64",
    "operatingSystemFamily": "LINUX"
  },
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.eu-west-2.amazonaws.com/my-api:latest",
      "portMappings": [
        {
          "containerPort": 8081,
          "protocol": "tcp"
        }
      ],
      "healthCheck": {
        "command": ["CMD", "/app/api", "health"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 30
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-app",
          "awslogs-region": "eu-west-2",
          "awslogs-stream-prefix": "api"
        }
      }
    }
  ]
}

A few things worth noting:

  • awsvpc network mode gives each task its own ENI and private IP. Required for Managed Instances and Fargate, and generally the right choice.
  • Health checks are important. Without them ECS can’t tell if your container is actually working. Build a health endpoint into your app.
  • startPeriod gives your container time to boot before ECS starts checking health. Set this high enough or you’ll get stuck in a restart loop.

Sidecar Pattern

ECS supports multiple containers in a single task definition. A common use case is running a sidecar alongside your main application.

For example, you can run a Cloudflare Tunnel sidecar next to your API container. The tunnel connects outbound to Cloudflare, so your API never needs a public IP or a load balancer:

{
  "name": "cloudflared",
  "image": "cloudflare/cloudflared:2024.12.2",
  "command": ["tunnel", "--no-autoupdate", "run"],
  "essential": true,
  "dependsOn": [
    {
      "containerName": "api",
      "condition": "HEALTHY"
    }
  ],
  "secrets": [
    {
      "name": "TUNNEL_TOKEN",
      "valueFrom": "/my-app/cloudflare-tunnel-token"
    }
  ]
}

The dependsOn with HEALTHY condition means the sidecar waits for the API container to pass its health check before starting. This prevents the tunnel from routing traffic to a container that isn’t ready.

This pattern eliminates the need for an ALB or NLB entirely, saving ~$16/month plus data processing costs.

Secrets Management

Don’t bake secrets into your images or pass them as plain environment variables. Use AWS Systems Manager Parameter Store:

aws ssm put-parameter \
  --name "/my-app/prod/database-url" \
  --type SecureString \
  --value "mongodb+srv://..."

Then reference them in your task definition:

"secrets": [
  {
    "name": "DATABASE_URL",
    "valueFrom": "/my-app/prod/database-url"
  }
]

ECS pulls these at task launch time. Your execution role needs ssm:GetParameters permission scoped to your parameter prefix.

IAM Roles

ECS uses two roles per task:

Execution role is what ECS itself uses to pull images from ECR and fetch secrets from SSM:

{
  "Effect": "Allow",
  "Action": [
    "ssm:GetParameters",
    "ssm:GetParameter"
  ],
  "Resource": "arn:aws:ssm:*:*:parameter/my-app/*"
}

Task role is what your application code uses at runtime. Scope this tightly. If your app needs S3 access, only grant it to specific buckets. If it needs to launch other ECS tasks, restrict it to specific task definitions.
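As a sketch of that kind of scoping (the bucket name is hypothetical), a task-role statement granting access to a single S3 bucket looks like:

```json
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": "arn:aws:s3:::my-app-uploads/*"
}
```

If the app only reads, drop `s3:PutObject` too; the point is that the task role should name the exact buckets and actions the code uses, nothing broader.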

Networking

Put your ECS tasks in private subnets. They don’t need public IPs. For outbound access (pulling images, calling external APIs) you need NAT.

A NAT Gateway costs ~$32/month. For non-production or low-traffic workloads, a NAT instance on a t4g.micro is around $3/month and works fine. It’s a single point of failure, but for many use cases that’s an acceptable trade-off.
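If you go the NAT instance route, the setup is small. A sketch (the interface name, VPC CIDR, and instance ID are placeholders for illustration): enable forwarding, masquerade private-subnet traffic, and disable the EC2 source/destination check, since a NAT box forwards traffic that isn't addressed to it:

```shell
# On the NAT instance itself:
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -o ens5 -s 10.0.0.0/16 -j MASQUERADE

# From your workstation: allow the instance to forward traffic
# not addressed to its own ENI.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --no-source-dest-check
```

You'll also need a route in the private subnet's route table sending 0.0.0.0/0 to the NAT instance.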

Internet
   ↑
Public subnet:  NAT instance (t4g.micro)
   ↑
Private subnet: ECS tasks (no public IP)

Container Images

Use minimal base images. Chainguard images (cgr.dev/chainguard/static) ship with zero known CVEs and are significantly smaller than Alpine or Debian-based images. Smaller images mean faster pulls and less attack surface.

Build for ARM64 if you’re running Graviton instances:

FROM cgr.dev/chainguard/go:latest AS builder
ENV CGO_ENABLED=0 GOOS=linux GOARCH=arm64
WORKDIR /app
COPY . .
RUN go build -o api .

FROM cgr.dev/chainguard/static:latest
COPY --from=builder /app/api /app/api
USER nonroot
ENTRYPOINT ["/app/api"]
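If you build on an x86 workstation, docker buildx can cross-build and push the ARM64 image in one step (the repository URL matches the task definition above):

```shell
docker buildx build \
  --platform linux/arm64 \
  -t 123456789.dkr.ecr.eu-west-2.amazonaws.com/my-api:latest \
  --push .
```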

Cost Optimisation for Dev and Staging

When you’re developing or running staging environments, there are a few things that can keep costs down without sacrificing the architecture:

  • NAT instance over NAT Gateway saves ~$29/month
  • Graviton ARM64 instances are 20% cheaper than x86
  • ECR lifecycle policies to keep only the last N images so storage doesn’t creep up
  • Scale to zero outside working hours
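For the ECR lifecycle point, a minimal policy that expires everything beyond the most recent images (the count of 10 is illustrative) looks like this:

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep only the last 10 images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
```

Apply it with `aws ecr put-lifecycle-policy --repository-name my-api --lifecycle-policy-text file://policy.json`.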

These are sensible for non-production environments where uptime isn’t critical.

Production Considerations

Production needs more thought though. The architecture patterns in this post (private subnets, IAM least-privilege, SSM secrets, health checks, Chainguard images) are all production-grade. But you’ll want to make different infrastructure choices when availability and reliability matter:

  • NAT Gateway over NAT instance. The ~$32/month is worth it for managed high availability. A NAT instance is a single point of failure and will drop all outbound connectivity if it goes down.
  • Multi-AZ everything. Run tasks across at least two availability zones with multiple desired count for your services.
  • Auto-scaling. Configure ECS service auto-scaling based on CPU, memory, or custom CloudWatch metrics rather than running a fixed task count.
  • Load balancer if needed. The Cloudflare Tunnel sidecar pattern works well, but if you need internal service-to-service routing or weighted deployments, an ALB gives you more control.
  • Monitoring and alerting. CloudWatch alarms on task health, Container Insights for resource utilisation, and alerting on service events like task failures and deployment rollbacks.
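To sketch the auto-scaling point, target tracking on average CPU for an ECS service is two CLI calls (cluster and service names are placeholders; targets and limits are illustrative):

```shell
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-app-prod/api \
  --min-capacity 2 \
  --max-capacity 10

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-app-prod/api \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }'
```

With target tracking, ECS adds tasks when average CPU climbs above the target and removes them as it falls, within the min/max bounds.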

At Luna, our production environment uses the full production-grade setup with multi-AZ, managed capacity providers, proper auto-scaling, and comprehensive monitoring. The cost-optimised patterns I mentioned earlier run in our dev and staging environments where the trade-offs make sense.

Wrapping Up

ECS is a solid choice for running containers on AWS without the complexity of Kubernetes. The core architecture stays the same across environments: private networking, secrets management, minimal images, health checks. What changes is how much you invest in redundancy and managed services based on your availability requirements.

I use ECS to run Luna, a vulnerability management platform, handling everything from the API to on-demand scanning tasks across both staging and production.
