Site Reliability Engineer

8 514 USD net per month - B2B
DevOps

Gdańsk +4 locations

Hard Rock Digital

Full-time
B2B
Senior
Remote

Tech stack

  • Docker (advanced)

  • AWS (regular)

  • DevOps (regular)

  • Grafana (regular)

  • Ansible (regular)

  • PromQL (regular)

  • Terraform (regular)

  • Kubernetes (regular)

  • Java (regular)

  • Python (regular)

Job description

Location: Poland only, fully remote

Job Type: B2B, full time

 

Overview

Hard Rock Digital is a team focused on becoming the best online sportsbook, casino, and social gaming company in the world. We care about each customer's interaction, experience, behaviour, and insight and strive to ensure we’re always acting authentically.

 

Rooted in the kindred spirits of the Seminole Tribe of Florida, the new Hard Rock Digital taps a brand known all over the world as the leader in gaming, entertainment, and hospitality. We’re taking that foundation of success and bringing it to the digital space.


What’s the position?

We are looking for a skilled Site Reliability Engineer (SRE) to maintain and improve the reliability, scalability, and performance of our Java-based application. You will be responsible for managing and monitoring the applications and infrastructure, using the Grafana stack (Grafana, Loki, Prometheus) to ensure a high level of observability, and implementing robust monitoring, alerting, and logging solutions.

 

Key Responsibilities:

Application Reliability & Performance:

  • Ensure the availability, reliability, and performance of a high-traffic Java-based application in a distributed environment.

  • Troubleshoot and resolve complex issues in production and non-production environments.

  • Participate in both pre- and post-deployment performance testing and monitoring efforts to improve application performance.

  • Optimize Java application performance, ensuring efficient resource utilization and scaling.

Monitoring & Observability:

  • Deploy and manage the Grafana stack (Grafana, Prometheus, Loki) to provide real-time monitoring, logging, and alerting.

  • Implement and refine observability strategies to enhance application and infrastructure visibility.

  • Create and maintain dashboards, alerts, and logs for comprehensive monitoring of system health and performance.

Incident Management & Root Cause Analysis:

  • Support the operations team’s incident response efforts, participate in post-mortems, and identify root causes of issues to prevent recurrence.

  • Document and share lessons learned from incidents, contributing to a culture of continuous improvement.

Collaboration & Cross-functional Support:

  • Work closely with developers, architects, and other engineers to design and implement solutions that improve application reliability.

  • Collaborate closely with DevOps and NOC teams to support the application platform.

  • Communicate SRE practices and principles to technical and non-technical stakeholders.

  • Provide feedback and insights on application performance, potential improvements, and observability metrics.


Requirements


What are we looking for?

The ideal candidate will have:

  • Degree in computer science or a related field, or equivalent work experience

  • 2-3 years in SRE, DevOps, or similar infrastructure roles

  • Experience managing large-scale, high-availability production systems

  • Track record of incident response and post-mortem processes

  • Experience with capacity planning and performance optimization

  • 1+ years hands-on experience managing production Kubernetes clusters

  • Deep understanding of k8s architecture, networking, storage, and security

  • Experience with cluster scaling (Karpenter), upgrades, and multi-cluster management

  • Proficiency with kubectl, Helm, and Kubernetes operators

  • Container orchestration and troubleshooting knowledge

  • Expertise with the Grafana stack for dashboards, alerting, and visualization

  • Hands-on experience with Grafana Alloy for telemetry data collection

  • Proficiency in PromQL

  • Experience with Loki for log aggregation and analysis

  • Experience building comprehensive monitoring and alerting strategies

  • Hands-on experience managing Java-based applications in large-scale, distributed environments, with a focus on JVM tuning and application optimization

  • Cloud platform expertise (AWS, GCP, or Azure)

  • Familiarity with infrastructure as code (IaC) tools such as Terraform/Terragrunt or Ansible

  • ArgoCD proficiency for GitOps workflows and continuous deployment

  • Scripting abilities in Bash, Python, or Go

  • Experience with CI/CD pipelines and automation tools

  • Configuration management and deployment automation

  • Strong troubleshooting skills, with a proactive approach to diagnosing and resolving performance bottlenecks

  • Proven experience in on-call rotations, incident response, and root cause analysis

  • Strong communication skills (both written and verbal), positive attitude, and ability to receive constructive feedback


Published: 17.11.2025
