From Zero to Production-Style Kubernetes on AWS with K3s and GitOps
How I built a production-style Kubernetes platform on AWS EC2 using K3s, Terraform, Argo CD, ingress-nginx, cert-manager, Prometheus, Grafana, and Loki, including a real ingress incident and recovery.
February 19, 2026
I built this project to simulate how a real platform team operates: declarative infrastructure, GitOps-based delivery, automated HTTPS, full observability, and incident-driven improvement.
This is not a toy cluster: it is a cost-conscious, production-style Kubernetes platform running on AWS EC2 using K3s.
This structure keeps infrastructure, platform, and workloads cleanly separated while supporting multi-environment strategy.
GitOps Delivery Model
The deployment workflow is deterministic and drift-resistant:
1. I push changes to GitHub.
2. Argo CD detects configuration drift.
3. Argo CD reconciles the desired state into the cluster.
4. Manual changes are automatically reverted unless committed to Git.
[Screenshot: Argo CD applications synced and healthy]
This enforces Git as the single source of truth and prevents configuration drift.
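This reconciliation behavior comes from Argo CD's automated sync policy. A minimal Application sketch illustrates it; the repo URL, app name, and paths here are hypothetical, not taken from the project:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingress-nginx            # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-gitops  # hypothetical repo
    targetRevision: main
    path: platform/ingress-nginx                         # assumed layout
  destination:
    server: https://kubernetes.default.svc
    namespace: ingress-nginx
  syncPolicy:
    automated:
      prune: true      # delete cluster resources removed from Git
      selfHeal: true   # revert manual in-cluster changes back to Git state
```

`selfHeal: true` is the setting that reverts manual kubectl edits, which matters again in the incident described below.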
Ingress and TLS Automation
I use ingress-nginx to expose services and cert-manager to automate certificate issuance and renewal via Let's Encrypt.
[Screenshot: Ingress-NGINX deployment managed through GitOps]
[Screenshot: Kube Prometheus Stack managed through GitOps]
[Screenshot: cert-manager handling certificate issuance]
[Screenshot: TLS validation proof from the live endpoint]
The result is fully automated HTTPS without manual certificate management.
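The automation typically hinges on a ClusterIssuer plus an annotation on each Ingress. A sketch assuming the ACME HTTP-01 solver; the email and secret name are placeholders, not values from the project:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                 # hypothetical contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key     # ACME account key storage
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx          # solved via ingress-nginx
```

An Ingress then opts in with the `cert-manager.io/cluster-issuer: letsencrypt-prod` annotation and a `tls:` block, and cert-manager creates and renews the certificate on its own.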
Observability Stack
Operational visibility is implemented through:
Prometheus - scraping cluster and workload metrics
Grafana - dashboards and monitoring visualization
Loki - centralized log aggregation
Promtail - log shipping from workloads
[Screenshot: Grafana dashboard for platform monitoring]
[Screenshot: Centralized logs with Loki in Grafana]
This enables metric-based monitoring and log-based troubleshooting from a single interface.
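Wiring logs into the same interface as metrics is usually a small values change. A sketch of hypothetical kube-prometheus-stack Helm values registering Loki as an extra Grafana datasource (the service URL assumes Loki runs in a `monitoring` namespace):

```yaml
# Hypothetical Helm values for kube-prometheus-stack:
# register Loki as an additional Grafana datasource so
# metrics and logs are queryable from one interface.
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki.monitoring.svc.cluster.local:3100  # assumed service DNS
      access: proxy
```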
Real Incident That Improved the Platform
While migrating ingress-nginx to full GitOps management, I experienced a production-style outage.
Symptoms
TLS certificate mismatch
Public endpoint unreachable
Root Causes
K3s default Traefik was still serving traffic.
ingress-nginx was configured as NodePort.
After disabling Traefik, no component owned host ports 80/443, so inbound traffic had nowhere to land.
A manual fix temporarily worked but was reverted by Argo CD because Git still defined the old state.
Durable Fix
Permanently disable Traefik in K3s configuration.
Change ingress-nginx service type in Git to LoadBalancer.
Allow Argo CD to reconcile the corrected desired state.
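The durable fix lives in two declarative places. First, the K3s server configuration, so Traefik stays disabled across restarts and upgrades (a sketch of `/etc/rancher/k3s/config.yaml`):

```yaml
# /etc/rancher/k3s/config.yaml -- K3s server configuration
disable:
  - traefik    # stop K3s from deploying its bundled Traefik ingress
```

Second, the ingress-nginx values committed to Git; the field path follows the upstream ingress-nginx Helm chart:

```yaml
controller:
  service:
    type: LoadBalancer  # K3s's built-in servicelb then binds host ports 80/443
```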
This incident reinforced the core GitOps principle:
If it is not in Git, it is not a durable fix.
Current Trade-offs
To balance realism and cost:
Single-node cluster (cost-efficient, not HA)
Some bootstrap steps remain manual (initial K3s + Argo CD installation)
Environment overlays are evolving
These are deliberate engineering trade-offs, not oversights.
What's Next
Planned improvements:
Add CI validation:
yamllint
terraform validate
Kubernetes schema validation
Fully GitOps-manage cert-manager installation
Expand environment overlays for workloads
Add policy enforcement and security hardening
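The CI validation items above could be gated in one pipeline. A minimal GitHub Actions sketch; the workflow path, Terraform directory, and manifest paths are assumptions, and kubeconform stands in for "Kubernetes schema validation":

```yaml
# .github/workflows/validate.yaml (hypothetical)
name: validate
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: yamllint
        run: pip install yamllint && yamllint .
      - name: terraform validate
        run: |
          cd infra                    # assumed Terraform directory
          terraform init -backend=false
          terraform validate
      - name: kubernetes schema validation
        run: |
          go install github.com/yannh/kubeconform/cmd/kubeconform@latest
          ~/go/bin/kubeconform -strict platform/   # assumed manifest path
```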
What This Project Demonstrates
Infrastructure as Code with Terraform
Kubernetes cluster operations (K3s)
GitOps architecture with Argo CD
Ingress and TLS automation
Observability integration (metrics + logs)
Incident debugging and architectural correction
Production-style operational discipline
Final Thoughts
This project helped me practice the parts of Kubernetes work that matter most in production:
Platform design
Delivery workflows
Failure modes
Observability
Recovery through correct architecture
Building the platform was valuable.
Debugging it under real constraints is what made it production-grade.