Site Reliability Engineer

Antal Sp. z o.o.

6 528 - 8 704 USD Net/month - B2B

Type of work

Full-time

Experience

Mid

Employment Type

B2B

Operating mode

Hybrid

Tech stack

Jira/Confluence: regular
PM: regular
problem management: regular
Java: regular

Job description

Grow Your Career with Us!

If you’re looking for a career that will help you stand out, join us and fulfill your potential. Whether you aim to reach the top or simply explore an exciting new direction, we offer opportunities, support, and rewards that will take you further.

Technologies We Use

Java SE
Spring Boot
Spring Cloud
Apache Beam
Apache Flink
GCP
Redis
REST APIs
Ansible
Jenkins

Our Work Culture

We invest heavily in an Agile culture, adopting DevOps processes, CI/CD pipelines, and cloud technologies. We plan to establish a new development team in Krakow in 2023 as part of a long-term strategy to develop and support our platform in Europe.

This is an exciting opportunity to join a team in its early stages and make a key contribution.

Your Responsibilities

Manage application support operations, focusing on resiliency, availability, and the monitoring of system health and performance.
Coordinate resolution of production incidents, conducting post-mortem/RCA to identify root causes and improve processes.
Investigate, triage, and resolve production incidents with a focus on technical signals and root cause analysis.
Document post-incident recovery steps, contribute to process improvements, identify deviations, and build a Knowledge Base.
Actively participate in the service management community, engaging in Incident Management, Problem Management, and Service Delivery.
Define and deliver tactical and strategic service improvements across the technical and process landscape.
Apply SRE principles to continuously improve platform reliability, capacity, and performance, reducing toil and enhancing observability.
Develop observability tools and techniques for monitoring, alerting, incident detection, response, capacity management, and release safety.

What You Need to Succeed in This Role

4+ years of experience in developing and supporting distributed systems written in Java.
Experience with Disaster Recovery methods and processes.
A methodical approach to troubleshooting and strong problem-solving skills.
Experience with application lifecycle management tooling: Jira/Confluence, Ansible, vulnerability remediation, and CI/CD automation.
Experience implementing and managing logging, monitoring, and alerting frameworks for hybrid cloud, using tools such as Geneos, Grafana, InfluxDB, Splunk, or Loki.
Understanding of relational databases (RDBMS), cloud technology, Unix/Linux, and job scheduling tools (e.g., Control-M or AutoSys).
Ability to lead technical conversations with various technical support groups.
Excellent communication skills and experience working with Agile methodologies.

Join us and grow your career in a dynamic and innovative environment!
