Platform/Site Reliability Engineer

DevOps

Poland (Remote)

DCV Technologies

Full-time
B2B
Senior
Remote

Tech stack

  • AWS: advanced
  • Azure: advanced
  • Python: advanced
  • Ansible: advanced
  • Docker: advanced
  • SRE: advanced
  • Agile: advanced
  • Kubernetes: advanced
  • Terraform: regular
  • API: regular

Job description

Platform/Site Reliability Engineer (SRE)


📌 Remote from Bulgaria and Poland

  • B2B Contract


The Platform Reliability Engineer is responsible for ensuring the reliability, performance, and availability of our critical platforms: Kong (API Management), Solace (Messaging), Mulesoft (iPaaS), and Informatica (ETL).

This role applies Site Reliability Engineering (SRE) principles — including automation, monitoring, and continuous improvement — to proactively identify and resolve potential issues, optimize platform performance, and collaborate with cross-functional teams to deliver exceptional service reliability.

This role requires a deep understanding of distributed systems, cloud technologies, and a passion for building resilient and scalable platforms.

The consultant will work closely with various platform teams in the Integration space and report directly to the Enterprise Integration Manager.


Platform Reliability & Performance (SRE Focus)

  • Ensure the reliability and availability of the Kong, Solace, Mulesoft, and Informatica platforms, applying SRE principles of automation, monitoring, and continuous improvement.

  • Proactively identify and resolve potential issues before they impact production environments, using data-driven insights and predictive analysis.

  • Develop and implement comprehensive monitoring and alerting systems to ensure platform health and performance.

  • Collaborate with the Support team and conduct thorough post-incident reviews to continuously improve platform reliability.

  • Conduct root cause analysis (RCA) for incidents and implement preventative measures, focusing on automation and systemic solutions.

  • Collaborate with development, operations, and security teams to ensure smooth platform operations, promoting a culture of shared responsibility for reliability.

  • Take ownership of platform SLAs and SLOs, ensuring they are met or exceeded, and proactively identify opportunities for improvement.

  • Evaluate and implement new tools and technologies to improve platform reliability and efficiency, staying up to date with the latest SRE trends and technologies.
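
As an illustration of what owning an SLO can involve in practice, the sketch below converts an availability target into an error budget (the downtime the target permits over a window). The function names, the 30-day window, and the sample figures are assumptions chosen for illustration; they are not taken from the platforms listed in this posting.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (assumed example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target with 21.6 minutes of downtime so far.
    print(f"budget: {error_budget_minutes(0.999, 30):.1f} min")
    print(f"remaining: {budget_remaining(0.999, 30, 21.6):.0%}")
```

A 99.9% target over 30 days allows roughly 43.2 minutes of downtime, so tracking the remaining fraction gives a simple signal for when to slow down changes and prioritize reliability work.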


Chaos Engineering & Resilience

  • Design, implement, and execute chaos engineering experiments to proactively identify weaknesses and vulnerabilities in integration platforms.

  • Develop and maintain a chaos engineering framework to systematically test platform resilience under various failure scenarios.

  • Analyze chaos experiment results and collaborate with engineering teams to implement improvements to platform resilience.

  • Participate in designing and implementing fault-tolerant and self-healing systems.


Disaster Recovery & Business Continuity

  • Collaborate with DevOps engineers to develop, maintain, and test disaster recovery plans for the integration platforms.

  • Participate in disaster recovery exercises to validate plan effectiveness and identify areas for improvement.

  • Ensure disaster recovery plans align with business continuity requirements.

  • Implement and maintain backup and recovery procedures for critical platform components.


Upstream/Downstream Dependency Management

  • Analyze integration platform dependencies on other systems (e.g., API Gateway, backend services) and assess how their reliability affects the overall service.

  • Implement monitoring and alerting for issues in upstream and downstream systems that could affect integration platforms.

  • Collaborate with other teams to improve the reliability and performance of dependent systems.

  • Design and implement strategies for handling failures in dependent systems, such as circuit breakers, retries, and fallbacks.
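
The failure-handling strategies named above (circuit breakers, retries, and fallbacks) can be combined roughly as in the following minimal sketch. It is a generic, single-threaded illustration and not the behavior of Kong, Solace, Mulesoft, or Informatica; `CircuitBreaker`, `call_with_retry_and_fallback`, and all thresholds and delays are hypothetical names and values introduced here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after reset_timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; skipping call")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retry_and_fallback(breaker: CircuitBreaker, func: Callable[[], T],
                                 fallback: Callable[[], T], attempts: int = 3,
                                 base_delay: float = 0.1,
                                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry through the breaker with exponential backoff, then use the fallback."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            break  # Do not keep hammering a dependency that is known to be down.
        except Exception:
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return fallback()
```

Retries absorb brief, transient errors, the breaker stops repeated calls to a dependency that keeps failing, and the fallback (for example, a cached or default response) keeps the caller responsive in the meantime. A production version would also need thread safety and per-dependency metrics.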


Collaboration & Communication

  • Work closely with the Support team to address platform-related issues and improve support processes, providing them with tools and knowledge to resolve issues efficiently.

  • Collaborate with Platform Engineers to optimize platform architecture and infrastructure, ensuring alignment with SRE best practices.

  • Partner with the Product Owner to define and communicate platform reliability metrics and performance to stakeholders through clear dashboards and reports.


Performance Optimization

  • Monitor platform performance and identify areas for optimization using performance profiling and load testing techniques.

  • Conduct performance testing and tuning to ensure optimal resource utilization and eliminate bottlenecks.

  • Collaborate with development teams to optimize application performance and provide guidance on best practices.

  • Implement caching strategies and other techniques to improve responsiveness and reduce latency.
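
One simple caching strategy of the kind mentioned above is a time-to-live (TTL) cache, which serves a stored value until it expires and only then calls the slow backend again. The sketch below is an assumed, in-process illustration; `TTLCache`, `get_or_load`, and the TTL value are hypothetical and not tied to any platform in this posting.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """In-memory cache whose entries expire ttl_seconds after they are stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh; otherwise call loader and store it."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]
        value = loader()  # Slow path: hit the backend (hypothetical) and refresh.
        self._entries[key] = (now, value)
        return value
```

The TTL trades freshness for latency: a longer TTL reduces backend load but serves staler data, so the value is usually chosen per data type rather than globally.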


Documentation and Knowledge Sharing

  • Create and maintain comprehensive documentation for daily activities, platform architecture, configuration, and operational procedures.

  • Ensure documentation is up to date and accessible.

  • Share knowledge and best practices with the team, fostering a culture of learning and collaboration.


Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or a related field.

  • 5+ years of experience in a similar role focused on platform reliability and operations, ideally within an SRE environment.

  • Strong understanding of Kong API Gateway, Solace PubSub+, Mulesoft Anypoint Platform, and Informatica PowerCenter.

  • Experience with cloud platforms such as AWS, Azure, or GCP.

  • Proficiency in scripting languages such as Python, Bash, or Go.

  • Experience with infrastructure-as-code tools such as Terraform or Ansible.

  • Experience with monitoring and alerting tools such as Datadog.

  • Strong understanding of networking concepts and protocols.

  • Excellent problem-solving and troubleshooting skills.

  • Excellent communication and collaboration skills, with the ability to communicate technical concepts clearly.

  • Strong understanding of SRE principles and practices.

  • Experience with containerization (Docker, Kubernetes).

  • Experience with CI/CD pipelines and automation tools.

  • Relevant certifications (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer Expert, Google Cloud Professional Cloud Architect).

  • Experience with Agile development methodologies.


📩 If you’re interested and meet the qualifications, please send your CV to Alina Pchelnikova at alina.pchelnikova@dcvtechnologies.co.uk


Published: 29.10.2025
