Introduction & Summary:
We are seeking a highly skilled Site Reliability Engineer to develop, test, and maintain robust software solutions that enhance the stability and reliability of our systems. The ideal candidate is proficient in at least one programming language such as Python, has a solid understanding of Agile methodologies, and has experience with a cloud platform such as Microsoft Azure or GCP. Excellent problem-solving capabilities, strong communication skills, and a collaborative mindset are equally important.
Main Responsibilities:
The Site Reliability Engineer will be responsible for ensuring the smooth operation and reliability of our infrastructure. Key responsibilities include:
- Designing, building, and maintaining scalable infrastructure.
- Leading incident response efforts for critical issues.
- Developing automation tools to enhance system reliability.
- Analyzing system performance metrics to identify bottlenecks.
- Ensuring systems are secure and compliant with industry standards.
- Identifying opportunities for process improvements.
- Creating comprehensive documentation of processes.
- Developing monitoring based on service level indicators and objectives (SLIs/SLOs).
Key Requirements:
- Proficiency in one or more programming/scripting languages, such as Python.
- Solid understanding of Agile development methodologies.
- Experience with operations, incident management, and problem management.
- Good knowledge of a cloud service provider such as Microsoft Azure or GCP.
- Experience building CI/CD workflows with GitHub Actions.
- Experience setting up observability with tools such as Splunk or Grafana.
- Familiarity with version control systems like Git.
- Strong problem-solving skills and eagerness to learn.
- Excellent communication and teamwork skills.
Nice to Have:
- Experience working in a DevOps culture.
- Knowledge of container orchestration tools like Kubernetes.
- Certifications in cloud services.
- Familiarity with security practices in DevOps.
Other Details:
This position offers remote working flexibility and is open to candidates with a strong background in engineering and system reliability. Successful applicants will work on innovative projects aimed at improving system performance and security.