Site Reliability Engineer
Role Overview
We are looking for a skilled and proactive Site Reliability Engineer with an observability focus to implement, automate, and support enterprise-grade observability and monitoring solutions across cloud and application platforms. The ideal candidate has strong AWS infrastructure knowledge, hands-on automation skills, and experience building reliable monitoring and alerting ecosystems for modern distributed applications.
The role involves working closely with Platform Engineering, Data Engineering, and Application teams to develop observability solutions that improve operational visibility, reliability, incident detection, and platform performance.
Main Responsibilities
· Design, implement, and maintain observability solutions for cloud-native and distributed systems.
· Build monitoring, logging, alerting, and dashboarding solutions across infrastructure and applications.
· Develop automation scripts and tooling using Python.
· Implement and maintain Infrastructure as Code (IaC) using Terraform.
· Build and support CI/CD pipelines using Jenkins and Git-based workflows.
· Configure and optimize monitoring for AWS services, Kubernetes workloads, APIs, databases, and applications.
· Create actionable alerts and operational dashboards to improve incident response and system reliability.
· Work with engineering teams to onboard applications into observability platforms.
· Support troubleshooting, root cause analysis, and performance optimization initiatives.
· Ensure observability standards, governance, and best practices are followed across projects.
Key Requirements
· Strong hands-on experience with Amazon Web Services (AWS).
· Solid Python development/scripting experience.
· Strong experience with Terraform.
· Experience building and maintaining CI/CD pipelines using Jenkins.
· Experience with Elasticsearch / ELK Stack, including building queries.
· Experience monitoring data platforms is preferred.
· Experience with Linux systems and shell scripting.
· Understanding of monitoring, logging, and alerting concepts.
· Experience working in Agile/DevOps environments.
Nice to Have Skills
Experience with any of the following is highly desirable:
· Snowflake
· Databricks
· dbt
· Matillion
· Grafana
· New Relic
· Datadog
· Prometheus
· Elasticsearch / ELK Stack
NOTES: We are looking for an engineer who loves to build. This is a highly technical role: 90% of the job is hands-on coding in Python and Terraform.