Design, implement, and maintain scalable, high-availability infrastructure for GenAI applications hosted primarily in Microsoft Azure (with exposure to AWS and GCP)
Develop and manage CI/CD pipelines (including infrastructure-as-code) using tools like Azure DevOps to support automated provisioning, deployment, and monitoring
Ensure robust performance, reliability, and security of cloud-based GenAI environments through proactive system monitoring, backup planning, and disaster recovery strategies
Collaborate with cross-functional teams (engineers, developers, and data scientists) to integrate and deploy GenAI models and services smoothly
Lead troubleshooting and resolution of infrastructure and deployment issues, applying DevOps best practices and modern monitoring tools (e.g., Datadog, Azure Application Insights, Nagios)
Promote a team-oriented, product-focused culture by contributing to design discussions, sharing knowledge, and encouraging continuous learning in cloud and GenAI technologies
Ensure compliance with IT governance, cloud usage policies, and security protocols across all cloud services (PaaS, IaaS, SaaS)
Requirements:
Minimum of 5 years of hands-on experience with cloud platforms, preferably Microsoft Azure, with a solid understanding of AWS and GCP
Proficient in containerization and orchestration using tools such as Docker, Kubernetes, AKS, EKS, and similar services
Skilled in infrastructure as code, ideally with Terraform, and familiar with tools like Ansible, Chef, or AWS CloudFormation
Strong background in Linux system administration and scripting (e.g., Bash, Python, PowerShell)
Solid experience working with PaaS components and Azure services such as Function Apps, Web Apps, API Management, Cognitive Search, and SQL Server
Familiar with Git-based version control and with build and CI/CD tooling, including Azure DevOps, GitHub Actions, Maven, and Docker registries
Capable of implementing CI/CD best practices, with a focus on automation, shift-left testing strategies, and secure release management
Competent in debugging infrastructure and application-level issues using logs and tools such as Azure Monitor, Datadog, and the Elastic Stack
Exposure to server virtualization, cloud networking, storage provisioning, and scalable architecture design
Agile mindset with practical experience in Scrum environments and a strong preference for automation and reducing operational toil
Developer-oriented perspective, with an understanding of performance optimization, zero-downtime deployment strategies (blue-green, canary), and disaster recovery planning
Comfortable working in large enterprise environments, collaborating with cross-functional teams, and adhering to compliance standards (e.g., SOX, SOC 2)
Excellent communication skills and a clear focus on business outcomes, system reliability, and innovation