Senior DevOps / SRE (Platform Reliability Engineer)
We are looking for a Senior DevOps / Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and security of our platform and cloud infrastructure. You will play a key role in building and operating cloud-native systems, improving observability, automating operations, implementing SRE best practices (SLOs/SLIs), and helping development teams deliver highly available services.
Key Responsibilities
Design, implement, and maintain highly available and scalable infrastructure on AWS.
Own and improve the reliability of production systems using SRE principles (SLOs, SLIs, error budgets).
Build and manage CI/CD pipelines to support fast and safe software delivery.
Develop and maintain Infrastructure as Code (IaC) using tools such as Terraform, Ansible, and CloudFormation.
Manage and optimize container platforms and tooling (Kubernetes, Docker, Helm).
Implement and maintain monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, Datadog, Splunk).
Lead incident response, perform root cause analysis, and write postmortems to drive continuous improvement.
Improve system performance, capacity planning, scaling strategies, and disaster recovery processes.
Collaborate closely with development teams to improve deployment strategies and system resilience.
Implement security best practices (IAM, secret management, vulnerability scanning, patching).
Define operational standards, runbooks, documentation, and best practices for platform reliability.
Participate in the on-call rotation and provide senior-level support for critical production issues.
Key Requirements
5+ years of experience in DevOps / SRE / Cloud Infrastructure / Platform Engineering.
Strong expertise in Linux systems administration and troubleshooting.
Proven experience with Kubernetes in production environments.
Strong experience with CI/CD tools (GitLab CI, Jenkins, GitHub Actions, Azure DevOps).
Solid knowledge of Infrastructure as Code (Terraform highly preferred).
Experience with cloud platforms: AWS, Azure, or Google Cloud.
Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, reverse proxies).
Experience with observability tools: monitoring, metrics, logging, tracing.
Strong scripting skills (Bash, Python, or similar).
Nice to Have
Experience with additional cloud platforms (Azure, GCP).