Site Reliability Engineer
We’re looking for a seasoned Site Reliability Engineer to support a high‑performance, mission‑critical risk and analytics platform used across global trading and finance environments. You’ll play a key role in ensuring the stability, scalability, and observability of complex distributed systems running across hybrid cloud infrastructure.
In this role, you’ll take ownership of production reliability: driving incident response, conducting root‑cause analysis, improving monitoring capabilities, and delivering automation that reduces operational toil. You’ll work closely with development teams, platform engineers, and service management leads to strengthen resilience, refine processes, and enhance the engineering culture around availability and performance.
This is a hands‑on technical position suited to someone who thrives in high‑throughput environments, communicates clearly, and enjoys solving deep engineering problems in real time.
Core Responsibilities
Maintain and improve the reliability, uptime, and performance of distributed applications.
Lead incident response, triage complex issues, coordinate recoveries, and deliver structured post‑incident reviews.
Enhance observability by designing and evolving monitoring, alerting, logging, and tracing frameworks.
Drive continuous improvement across automation, deployment processes, and service stability.
Collaborate with cross‑functional teams to influence architecture, design, and operational standards.
Support CI/CD pipelines, environment configuration, and vulnerability remediation.
Contribute to a knowledge‑driven culture through documentation, tooling, and best‑practice adoption.
Required Skills & Experience
Strong Java background with proven experience supporting or developing distributed systems.
Observability tooling expertise (Grafana, Prometheus, Loki, OpenTelemetry or similar).
Hands‑on experience with hybrid cloud environments, ideally with GCP or another major cloud provider.
CI/CD and automation experience (e.g., Jenkins, Ansible).
Solid understanding of Linux, RDBMS fundamentals, and job schedulers (e.g., Control‑M or equivalents).
Strong analytical mindset with a methodical approach to troubleshooting.
Excellent communication skills and comfort working in Agile teams.