Senior Site Reliability Engineer
We are looking for an experienced Site Reliability Engineer to ensure the reliability, scalability, and performance of large-scale cloud-based web applications. You will work closely with software development, cloud operations, and platform teams to build and maintain resilient infrastructure and improve system stability.
Key Responsibilities:
Design and maintain monitoring, alerting, and incident response systems to ensure high availability
Collaborate closely with engineering, product, and architecture teams
Build and manage AWS cloud infrastructure using Infrastructure-as-Code tools (e.g., Terraform, Pulumi)
Operate and optimize Kubernetes environments (e.g., EKS)
Develop and maintain containerized applications using Docker
Improve CI/CD pipelines and drive automation across deployment processes
Implement and manage observability tools (logging, metrics, tracing)
Participate in incident management, postmortems, and reliability improvements
Support capacity planning, disaster recovery, and system scaling
Contribute to security, compliance, and operational best practices
Develop automation and AI-driven solutions for monitoring and incident prevention
Requirements:
5+ years of experience in SRE, DevOps, or similar roles
Strong experience with AWS cloud services and Infrastructure-as-Code tools
Hands-on experience with Kubernetes and containerized environments
Proficiency in Docker and CI/CD pipelines (e.g., GitHub Actions)
Solid understanding of databases (e.g., PostgreSQL, Amazon RDS) and SQL
Knowledge of networking concepts (e.g., VPC, DNS) and troubleshooting tools such as dig and traceroute
Strong Linux/Unix administration skills
Experience with observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace)
Familiarity with automation and AI-based solutions for infrastructure operations
Strong problem-solving and incident management skills