Introduction & Summary
We are seeking a dedicated Site Reliability Engineer (SRE) to join our team. The ideal candidate combines a strong technical background with operational discipline in keeping critical systems reliable, available, and performant. You will play a key role in monitoring, troubleshooting, and resolving issues, and will apply your observability expertise to incident management.
Your core duties will include:
- Monitoring production systems and services using observability tools.
- Responding to incidents, alerts, and outages in real time.
- Participating in a rotating on-call schedule.
- Designing, implementing, and maintaining observability solutions.
- Collaborating with development and infrastructure teams to ensure system reliability.
- Automating operational tasks and documenting procedures.
- Conducting post-incident reviews and proposing monitoring enhancements.
Required qualifications:
- Bachelor's degree in Information Technology, Computer Science, or a related field.
- 2-5 years of experience in cloud and operations engineering.
- Proficiency with Azure services; AWS and GCP experience is a plus.
- Hands-on experience with Infrastructure-as-Code (IaC) tools like Terraform.
- Strong scripting skills in Python, Bash, or PowerShell.
- Familiarity with GitLab CI/CD pipelines integrated with Azure.
- Proficiency in monitoring and logging tools.
Preferred qualifications:
- Master's degree or relevant certifications.
This position involves a 24/7 shift rotation to ensure continuous system reliability and performance. The role emphasizes proactive monitoring and efficient incident response in a collaborative environment.