
From Zero to Production-Style Kubernetes on AWS with K3s and GitOps

How I built a production-style Kubernetes platform on AWS EC2 using K3s, Terraform, Argo CD, ingress-nginx, cert-manager, Prometheus, Grafana, and Loki, including a real ingress incident and recovery.

February 19, 2026 · 3 min read

I built this project to simulate how a real platform team operates: declarative infrastructure, GitOps-based delivery, automated HTTPS, full observability, and incident-driven improvement.

This is not a toy cluster: it is a cost-conscious, production-style Kubernetes platform running on AWS EC2 using K3s.

Figure: Platform architecture (AWS EC2 + K3s + Argo CD + ingress + observability).

🏗 What I Built

This platform combines:

  • Terraform-provisioned AWS EC2 infrastructure
  • K3s Kubernetes cluster
  • Argo CD (App-of-Apps GitOps model)
  • ingress-nginx for HTTP/HTTPS routing
  • cert-manager + Let’s Encrypt for automated TLS
  • Prometheus + Alertmanager + Grafana for metrics
  • Loki + Promtail for centralized logging

The repository is organized by responsibility:

  • infra/terraform/aws-k3s → Infrastructure provisioning
  • platform/ → Cluster platform services (GitOps managed)
  • apps/ → Application workloads
  • envs/dev → Development environment
  • envs/prod → Production environment

This structure keeps infrastructure, platform services, and workloads cleanly separated while supporting a multi-environment strategy.

🔁 GitOps Delivery Model

The deployment workflow is deterministic and drift-resistant:

  1. I push changes to GitHub.
  2. Argo CD detects that the live cluster state no longer matches the desired state in Git.
  3. Argo CD reconciles the desired state into the cluster.
  4. With automated sync and self-heal enabled, manual changes are reverted unless they are committed to Git.

Figure: Argo CD applications synced and healthy.

This enforces Git as the single source of truth and prevents configuration drift.
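The reconcile behavior described above can be illustrated with a small model. This is only a sketch of the idea, not how Argo CD is implemented: the `diff` and `reconcile` functions, the dictionary representation of state, and the resource name `ingress-nginx-svc` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical state model: resource name -> spec.
State = Dict[str, dict]


def diff(desired: State, live: State) -> List[Tuple[str, str]]:
    """Return the (action, name) pairs needed to make live match desired."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            actions.append(("prune", name))
    return sorted(actions)


def reconcile(desired: State, live: State) -> State:
    """Apply every action so that the live state equals the desired state."""
    for action, name in diff(desired, live):
        if action == "prune":
            del live[name]
        else:
            live[name] = dict(desired[name])
    return live


# Git and the cluster start out in agreement.
git = {"ingress-nginx-svc": {"type": "NodePort"}}
cluster = {"ingress-nginx-svc": {"type": "NodePort"}}

# A manual hotfix changes the cluster but not Git.
cluster["ingress-nginx-svc"]["type"] = "LoadBalancer"
print(diff(git, cluster))   # [('update', 'ingress-nginx-svc')]

# The next reconcile restores what Git declares, undoing the hotfix.
reconcile(git, cluster)
print(cluster)              # {'ingress-nginx-svc': {'type': 'NodePort'}}
```

The manual change survives only until the next reconcile, which is why a fix has to land in Git to stick.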

🌐 Ingress and TLS Automation

I use ingress-nginx to expose services and cert-manager to automate certificate issuance and renewal via Let’s Encrypt.

Figure: Ingress-NGINX deployment managed through GitOps.
Figure: Kube Prometheus Stack managed through GitOps.
Figure: cert-manager handling certificate issuance.
Figure: TLS validation proof from the live endpoint.

The result is fully automated HTTPS without manual certificate management.
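One way to confirm that renewal keeps working is to check the certificate served by the live endpoint. The sketch below uses only the Python standard library. The function names are hypothetical and `platform.example.com` is a placeholder for the real hostname.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate served by host:port.

    The default context verifies the chain and the hostname, so a
    certificate mismatch raises ssl.SSLCertVerificationError here.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with a fixed timestamp, 30 days before expiry.
not_after = "Jan  5 09:34:43 2018 GMT"
print(days_until_expiry(not_after, now=1515144883 - 30 * 86400))  # 30.0

# Against a live endpoint (placeholder hostname, needs network access):
#   days_until_expiry(fetch_not_after("platform.example.com"))
```

A check like this could run on a schedule and alert when the remaining lifetime drops below a chosen threshold.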

📊 Observability Stack

Operational visibility is implemented through:

  • Prometheus – scraping cluster and workload metrics
  • Grafana – dashboards and monitoring visualization
  • Loki – centralized log aggregation
  • Promtail – log shipping from workloads

Figure: Grafana dashboard for platform monitoring.
Figure: Centralized logs with Loki in Grafana.

This enables metric-based monitoring and log-based troubleshooting from a single interface.
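The same metrics are also available outside Grafana through the Prometheus HTTP API. The sketch below lists scrape targets that are currently down by evaluating the `up` metric via `/api/v1/query`. The base URL, function names, and sample payload are made up for illustration. The payload follows the documented response shape but is not output from this cluster.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(payload: dict) -> List[Dict[str, str]]:
    """Return the label sets of series whose `up` value is 0."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [
        series["metric"]
        for series in payload["data"]["result"]
        if float(series["value"][1]) == 0.0
    ]


def fetch(base_url: str, promql: str, timeout: float = 5.0) -> dict:
    """Run an instant query against a reachable Prometheus server."""
    with urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)


# Illustrative payload only (hypothetical jobs and instances).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "node-exporter", "instance": "10.0.0.5:9100"},
             "value": [1700000000, "1"]},
            {"metric": {"job": "loki", "instance": "10.0.0.5:3100"},
             "value": [1700000000, "0"]},
        ],
    },
}
print(down_targets(sample))  # [{'job': 'loki', 'instance': '10.0.0.5:3100'}]
```

With network access to the server, `down_targets(fetch("http://prometheus.example:9090", "up"))` would return the live equivalent.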

🔥 Real Incident That Improved the Platform

While migrating ingress-nginx to full GitOps management, I experienced a production-style outage.

Symptoms

  • TLS certificate mismatch
  • Public endpoint unreachable

Root Causes

  • K3s default Traefik was still serving traffic.
  • ingress-nginx was configured as NodePort.
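  • After disabling Traefik, nothing was listening on host ports 80/443: a NodePort service is only exposed in the NodePort range (30000-32767 by default), so the public endpoint became unreachable.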
  • A manual fix temporarily worked but was reverted by Argo CD because Git still defined the old state.

Durable Fix

  • Permanently disable Traefik in K3s configuration.
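  • Change the ingress-nginx Service type in Git to LoadBalancer, which on K3s lets the bundled ServiceLB controller (enabled by default) expose ports 80/443 on the node.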
  • Allow Argo CD to reconcile the corrected desired state.

This incident reinforced the core GitOps principle:

If it is not in Git, it is not a durable fix.
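The durable fix can also be captured as a small guard that fails when either half of it regresses. This is a sketch under assumptions: `ingress_findings` is a hypothetical helper, the K3s configuration is assumed to be loaded from its config file into a dictionary (where the `disable` list corresponds to the `--disable` flag), and the Service manifest is assumed to be available as a dictionary, for example from rendered JSON.

```python
from typing import List


def ingress_findings(k3s_config: dict, ingress_service: dict) -> List[str]:
    """Return human-readable problems with the ingress setup, if any."""
    findings = []
    # K3s deploys Traefik unless it is listed under `disable`.
    if "traefik" not in k3s_config.get("disable", []):
        findings.append("K3s still deploys Traefik: add 'traefik' to 'disable'")
    # A Service without an explicit type defaults to ClusterIP.
    svc_type = ingress_service.get("spec", {}).get("type", "ClusterIP")
    if svc_type != "LoadBalancer":
        findings.append(f"ingress-nginx Service is {svc_type}, expected LoadBalancer")
    return findings


# State during the incident: Traefik not disabled, Service still NodePort.
before = ingress_findings({}, {"spec": {"type": "NodePort"}})
print(len(before))  # 2

# State after the durable fix has been committed.
after = ingress_findings(
    {"disable": ["traefik"]},
    {"spec": {"type": "LoadBalancer"}},
)
print(after)  # []
```

Running a guard like this in CI would catch the old state before Argo CD has a chance to re-apply it.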

⚖️ Current Trade-offs

To balance realism and cost:

  • Single-node cluster (cost-efficient, not HA)
  • Some bootstrap steps remain manual (initial K3s + Argo CD installation)
  • Environment overlays are evolving

These are deliberate engineering trade-offs, not oversights.

🔭 What’s Next

Planned improvements:

  • Add CI validation:
    • yamllint
    • terraform validate
    • Kubernetes schema validation
  • Fully GitOps-manage cert-manager installation
  • Expand environment overlays for workloads
  • Add policy enforcement and security hardening
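As a rough idea of what the planned CI validation could look like, the sketch below assembles the three checks into one runner. The directory names come from the repository layout described earlier. `kubeconform` is an assumed choice for Kubernetes schema validation, since no tool has been picked yet, and the exact flags should be checked against the installed tool versions.

```python
import shutil
import subprocess
from typing import List


def validation_commands(terraform_dir: str, manifest_dirs: List[str]) -> List[List[str]]:
    """Build the validation commands in the order they should run."""
    return [
        ["yamllint", *manifest_dirs],
        # `validate` needs an initialized working directory; skip the backend.
        ["terraform", f"-chdir={terraform_dir}", "init", "-backend=false", "-input=false"],
        ["terraform", f"-chdir={terraform_dir}", "validate"],
        # Assumed tool; custom resources may need extra schemas or
        # -ignore-missing-schemas to pass.
        ["kubeconform", "-strict", "-summary", *manifest_dirs],
    ]


def run_all(commands: List[List[str]]) -> int:
    """Run each command and return how many failed or were missing."""
    failures = 0
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            print(f"missing tool: {cmd[0]}")
            failures += 1
            continue
        print("$", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            failures += 1
    return failures


commands = validation_commands("infra/terraform/aws-k3s", ["platform", "apps", "envs"])
for cmd in commands:
    print(" ".join(cmd))
```

In a pipeline, the job would call `run_all(commands)` and exit non-zero when it returns anything other than 0.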

🧠 What This Project Demonstrates

  • Infrastructure as Code with Terraform
  • Kubernetes cluster operations (K3s)
  • GitOps architecture with Argo CD
  • Ingress and TLS automation
  • Observability integration (metrics + logs)
  • Incident debugging and architectural correction
  • Production-style operational discipline

🏁 Final Thoughts

This project helped me practice the parts of Kubernetes work that matter most in production:

  • Platform design
  • Delivery workflows
  • Failure modes
  • Observability
  • Recovery through correct architecture

Building the platform was valuable.

Debugging it under real constraints is what made it production-grade.
