Site Reliability Engineer
We are looking for a Site Reliability Engineer to ensure the reliability, scalability, and operational excellence of a production-grade AI platform running across Azure and AWS environments. You will work closely with AI and fullstack engineers to automate deployments, improve observability, optimize infrastructure costs, and support highly available LLM-powered services at scale.
Your responsibilities:
Own the reliability, scalability, and performance of platform services running on Azure Container Apps and AWS ECS.
Build and maintain CI/CD pipelines using GitHub Actions for automated testing, deployment, and release management.
Manage infrastructure as code using Terraform, Bicep, or ARM templates across Azure and AWS environments.
Implement and maintain monitoring, alerting, logging, and observability solutions (New Relic, Langfuse, CloudWatch).
Configure and manage Azure Service Bus, Blob Storage, Key Vault, and containerized environments.
Ensure security best practices, including secret management, vulnerability scanning, and container image hardening.
Implement auto-scaling, load balancing, and cost optimization strategies for AI and LLM workloads.
Support incident response processes and create operational runbooks for production services.
Collaborate with AI engineers to optimize LLM API usage, reduce latency, and control token consumption costs.
We are looking for you if you have:
3–5+ years of experience in SRE, DevOps, or platform engineering roles.
Strong hands-on experience with Microsoft Azure, including Container Apps, Service Bus, Key Vault, Blob Storage, and Azure OpenAI resources.
Practical knowledge of AWS services such as ECS, S3, Aurora, and CloudWatch.
Experience with Infrastructure as Code tools: Terraform, Bicep, or ARM templates.
Experience designing and maintaining CI/CD pipelines (GitHub Actions preferred).
Strong understanding of Docker and container orchestration; Kubernetes experience is a strong advantage.
Experience with monitoring and observability platforms such as New Relic or equivalent tools.
Familiarity with security best practices, including secrets management, vulnerability remediation, and image scanning.
Scripting and automation skills in Python and/or Bash.
Daily use of AI-powered development tools such as Cursor, Claude Code, or GitHub Copilot.
Fluent English communication skills, both spoken and written.
We offer:
Participation in interesting and demanding projects
Flexible working hours
A great, non-corporate atmosphere
Stable employment conditions (contract of employment or B2B contract)
Opportunities for development and promotion
Attractive package of benefits
Work model: remote or hybrid (2 days per week from the office)
We reserve the right to contact only selected candidates.

Transition Technologies MS
Transition Technologies MS is a company specializing in providing advanced IT solutions and software development services. It focuses on innovative technologies to support business digital transformation.