Site Reliability Engineer

DevOps

Wrocław (+1 location)

Grid Dynamics Poland

Full-time

Permanent

Mid

Hybrid

Job description

We are looking for a Site Reliability Engineer to join a high-stakes global tech ecosystem and drive the delivery of a critical enterprise platform migration to the cloud.

Your core mission will be to architect, build, and productionize the observability and cost intelligence (FinOps) layer for a massive, multi-year financial platform transformation. You will take end-to-end ownership of the cloud platform layer, giving internal stakeholders full visibility into platform behavior, performance, and infrastructure spend. Working alongside a nearshore team of senior engineers, you will solve highly complex architectural challenges in a production-grade, distributed system.

Responsibilities:

End-to-End Infrastructure & FinOps Ownership: Architect and implement a cloud usage and cost attribution dashboard, providing detailed per-pod and per-service cost breakdown using cloud billing APIs and internal FinOps hubs.
Advanced Observability & Tracing: Instrument end-to-end distributed tracing using OpenTelemetry, configuring collectors within Kubernetes environments, exporting traces to cloud monitoring systems, and tracking RED (rate, errors, duration) metrics.
Performance Engineering & Stress Testing: Write custom tooling from scratch to deliver database performance monitoring, load testing, and trend analysis for critical underlying storage layers.
Monitoring & Alerting Automation: Build and deploy scalable production monitoring, custom alerting policies, and SLO tracking for containerized and serverless services.
Infrastructure as Code: Independently manage, write, and apply infrastructure modifications using Terraform, working within established enterprise repository standards, modules, and environment state management.
Cross-Language Codebase Extension: Read, debug, and extend existing platform code across a diverse stack including Kotlin, Java, and Python to seamlessly integrate technical metrics without disrupting business logic.
Quality & Release Assurance: Implement rigorous unit testing with high code coverage for all newly developed monitoring tools to comply with strict enterprise quality gates and sign-offs.

Minimum requirements:

Experience: 4 to 6 years of professional software or DevOps engineering experience, including at least 2 years of hands-on cloud infrastructure management in production.
Advanced Cloud Infrastructure: Deep operational proficiency with Google Cloud Platform (GCP), specifically with managing and configuring workload-level alerting on Google Kubernetes Engine (GKE) and Cloud Run.
Observability & OpenTelemetry: Proven track record of building observability solutions in distributed systems, using OpenTelemetry (both auto and manual instrumentation) alongside distributed tracing and profiling tools.
Strong Automation Scripting: Intermediate-to-advanced fluency in Python for writing custom test tooling, metrics integration scripts, and backend automation from scratch.
Solid Infrastructure as Code: Strong proficiency in Terraform, including experience with multi-environment setups, workspaces, and corporate module standards.
Polyglot & JVM Familiarity: Practical ability to read, understand, and modify existing backend codebases written in Kotlin and Java.
Crucial Non-Technical Skills: Extreme technical autonomy to resolve blockers independently, rapid onboarding skills into large unfamiliar codebases, and fluent written English for async alignment and pull requests.
Process Alignment: Ability to thrive in a highly regulated enterprise environment with strict peer reviews, robust documentation requirements, and formal deployment procedures.

Would be a plus:

Domain Knowledge: Previous experience working within financial services, fintech, investment banking, or other highly regulated industries.
Enterprise Streaming Tools: Working knowledge of cloud messaging systems (such as Cloud Pub/Sub) used for inter-service communication.
Advanced Storage Engines: Familiarity with high-throughput distributed database architectures, such as Google Cloud Bigtable.
Systems Languages Awareness: Ability to read or debug foundational code written in low-level systems languages like Rust or C++ during multi-stack production deployments.

We offer:

Opportunity to work on bleeding-edge projects
Work with a highly motivated and dedicated team
Competitive salary
Flexible schedule
Benefits package: medical insurance, sports
Corporate social events
Professional development opportunities
Well-equipped office

About us:

Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization and customer experience. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.

Tech stack

English
GCP: regular
Terraform: regular
Python: regular
Java: regular
