Site Reliability Engineer/Architect (SRE)
Remote, Kraków +4 Locations
EPAM Systems
We are seeking an experienced and accomplished Site Reliability Engineer/Architect (SRE) to join our dynamic, fast-paced team.
In this pivotal leadership role, you will be entrusted with architecting and implementing advanced SRE practices to ensure the reliability, scalability, and efficiency of our Generative AI (GenAI) enablement platform for enterprise use cases. The position offers a unique opportunity to work with cutting-edge technologies, collaborate with peers to drive technical excellence, and shape the operational strategy for an enterprise-grade, multi-cloud platform.
Responsibilities
Define and implement SRE principles, frameworks, and methodologies to ensure platform reliability and stability
Establish Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to create measurable reliability goals aligned with business objectives
Collaborate effectively with stakeholders, including senior leadership, to align the SRE vision with the overall technical and organizational strategy
Architect resilient systems by adopting innovative practices such as canary deployments, shadow traffic, and testing in production environments
Ensure uninterrupted operational reliability for a multi-cloud, multi-tenant enterprise platform
Optimize incident response practices and tools to ensure efficiency and effectiveness, implementing automated solutions where appropriate
Implement robust logging, tracing, and monitoring systems to provide real-time insight, detect faults, and optimize performance proactively
Collaborate with engineering teams to integrate observability frameworks into platform components, improving deployment and runtime confidence
Spearhead automation initiatives to reduce manual operational tasks and improve system scalability
Foster a strong culture of operational excellence through thought leadership and mentorship, promoting an SRE-first mindset within all teams
Collaborate with engineering and product teams to craft scalable designs with reliability embedded throughout the software development lifecycle
Build partnerships with Director-level leadership to conceptualize, prioritize, and deliver on long-term SRE goals
Requirements
A minimum of 7 years of professional experience in site reliability engineering, software engineering, or DevOps roles
Strong coding skills in languages such as Python, Go, or Java, with the ability to implement solutions to algorithmic challenges
Proven expertise in designing and managing multi-cloud environments (e.g., AWS, Azure, GCP), distributed systems, and multi-tenant architectures
Knowledge of CI/CD pipelines, microservices, and container tooling such as Docker, Kubernetes, and Helm
Background in monitoring and observability tools like Prometheus, Grafana, OpenTelemetry, or Dynatrace
Competency in incident management and production troubleshooting using tools like PagerDuty or similar
Solid understanding of modern SRE concepts, including SLIs, SLOs, fault injection, and canary releases
Familiarity with security best practices for cloud-native architectures and multi-tenant platforms
Nice to have
Knowledge of cloud platforms such as AWS, Azure, or GCP, with experience applying multi-cloud strategies
Background in the fundamentals of Generative AI technologies and related workflows
We offer
We gather like-minded people:
Engineering community of industry professionals
Friendly team and enjoyable working environment
Flexible schedule and opportunity to work remotely within Poland
Chance to work abroad for up to 60 days annually
Business-driven relocation opportunities
We provide growth opportunities:
Outstanding career roadmap
Leadership development, career advising, soft skills, and well-being programs
Certification (GCP, Azure, AWS)
Unlimited access to LinkedIn Learning, getAbstract, and A Cloud Guru
English classes
We cover it all:
Stable income (Employment Contract or B2B)
Participation in the Employee Stock Purchase Plan
Benefits package (health insurance, multisport, shopping vouchers)
Strategically located offices featuring entertainment and relaxation zones, table tennis and football, free snacks, fantastic coffee, and more
Referral bonuses
Corporate, social and well-being events
Please note:
The set of bonuses might vary based on the role you apply for – specifics will be discussed with our recruiter during the general interview
We will contact only selected candidates
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.