Site Reliability Engineer
Job Overview:
We are seeking an experienced Site Reliability Engineer (SRE) to support the monitoring, maintenance, and onboarding of microservices in production environments. The role focuses on improving the reliability, scalability, and performance of services and on automating service deployment processes, while ensuring compliance with security and operational standards.
Key Responsibilities:
Service Onboarding Automation:
Assist teams with the deployment of new microservices and automate gaps in the onboarding process.
Manage the end-to-end onboarding process and coordinate with relevant stakeholders.
Monitoring and Logging:
Set up monitoring and logging for all onboarded services.
Ensure comprehensive metrics collection and implement alerting mechanisms.
Scalability and Load Testing:
Conduct load testing to ensure services scale as needed.
Implement auto-scaling mechanisms for production environments.
Documentation and Knowledge Transfer:
Create detailed documentation of onboarding processes, configurations, and best practices.
Conduct training sessions for internal teams to ensure knowledge transfer.
Reliability and Performance Optimization:
Maintain high uptime and implement failover mechanisms.
Optimize services for minimal latency and high throughput through regular performance tuning.
Security and Compliance:
Ensure all services meet security standards and regulatory requirements.
Maintain audit trails and implement security best practices for deployment and monitoring.
Post-Onboarding Support:
Provide ongoing support to address post-onboarding issues.
Monitor services continuously to maintain reliability and performance.
Requirements:
3+ years of hands-on software development experience in an object-oriented programming language such as C#, C++, or Java
5+ years of experience with cloud deployment and configuration tools, including scripting and configuration platforms
Hands-on experience with system architecture, API design, and distributed systems
Experience designing, deploying, and maintaining CI/CD pipelines to automate application builds, tests, and deployments
Proficiency in managing platform infrastructure as code (IaC) using Terraform
Proven experience as a Site Reliability Engineer or in a similar role in a microservices environment
Strong knowledge of service deployment automation, monitoring, and logging
Experience with load testing, auto-scaling, and performance optimization
Understanding of security best practices and regulatory compliance for production services
Excellent documentation and knowledge-sharing skills
Ability to work collaboratively with development teams and other stakeholders
Familiarity with microservices architecture and related infrastructure tools