Platform/Site Reliability Engineer
Poland, Poland (Remote)
DCV Technologies
Platform/Site Reliability Engineer (SRE)
📌 Remote from Bulgaria and Poland
B2B Contract
The Platform Reliability Engineer is responsible for ensuring the reliability, performance, and availability of our critical platforms: Kong (API Management), Solace (Messaging), MuleSoft (iPaaS), and Informatica (ETL).
This role applies Site Reliability Engineering (SRE) principles — including automation, monitoring, and continuous improvement — to proactively identify and resolve potential issues, optimize platform performance, and collaborate with cross-functional teams to deliver exceptional service reliability.
This role requires a deep understanding of distributed systems, cloud technologies, and a passion for building resilient and scalable platforms.
The consultant will work closely with various platform teams in the Integration space and report directly to the Enterprise Integration Manager.
Platform Reliability & Performance (SRE Focus)
Ensure the reliability and availability of the Kong, Solace, MuleSoft, and Informatica platforms, applying SRE principles of automation, monitoring, and continuous improvement.
Proactively identify and resolve potential issues before they impact production environments, using data-driven insights and predictive analysis.
Develop and implement comprehensive monitoring and alerting systems to ensure platform health and performance.
Collaborate with the Support team and conduct thorough post-incident reviews to continuously improve platform reliability.
Conduct root cause analysis (RCA) for incidents and implement preventative measures, focusing on automation and systemic solutions.
Collaborate with development, operations, and security teams to ensure smooth platform operations, promoting a culture of shared responsibility for reliability.
Take ownership of platform SLAs and SLOs, ensuring they are met or exceeded, and proactively identify opportunities for improvement.
Evaluate and implement new tools and technologies to improve platform reliability and efficiency, staying up to date with the latest SRE trends and technologies.
Chaos Engineering & Resilience
Design, implement, and execute chaos engineering experiments to proactively identify weaknesses and vulnerabilities in integration platforms.
Develop and maintain a chaos engineering framework to systematically test platform resilience under various failure scenarios.
Analyze chaos experiment results and collaborate with engineering teams to implement changes that improve platform resilience.
Participate in designing and implementing fault-tolerant and self-healing systems.
Disaster Recovery & Business Continuity
Collaborate with DevOps engineers to develop, maintain, and test disaster recovery plans for the integration platforms.
Participate in disaster recovery exercises to validate plan effectiveness and identify areas for improvement.
Ensure disaster recovery plans align with business continuity requirements.
Implement and maintain backup and recovery procedures for critical platform components.
Upstream/Downstream Dependency Management
Analyze integration platform dependencies on other systems (e.g., API Gateway, backend services) and assess how their reliability affects the overall service.
Implement monitoring and alerting for issues in upstream and downstream systems that could affect integration platforms.
Collaborate with other teams to improve the reliability and performance of dependent systems.
Design and implement strategies for handling failures in dependent systems, such as circuit breakers, retries, and fallbacks.
Collaboration & Communication
Work closely with the Support team to address platform-related issues and improve support processes, providing them with tools and knowledge to resolve issues efficiently.
Collaborate with Platform Engineers to optimize platform architecture and infrastructure, ensuring alignment with SRE best practices.
Partner with the Product Owner to define and communicate platform reliability metrics and performance to stakeholders through clear dashboards and reports.
Performance Optimization
Monitor platform performance and identify areas for optimization using performance profiling and load testing techniques.
Conduct performance testing and tuning to ensure optimal resource utilization and eliminate bottlenecks.
Collaborate with development teams to optimize application performance and provide guidance on best practices.
Implement caching strategies and other techniques to improve responsiveness and reduce latency.
Documentation and Knowledge Sharing
Create and maintain comprehensive documentation for daily activities, platform architecture, configuration, and operational procedures.
Ensure documentation is up to date and accessible.
Share knowledge and best practices with the team, fostering a culture of learning and collaboration.
Qualifications
Bachelor’s degree in Computer Science, Engineering, or a related field.
5+ years of experience in a similar role focused on platform reliability and operations, ideally within an SRE environment.
Strong understanding of Kong API Gateway, Solace PubSub+, MuleSoft Anypoint Platform, and Informatica PowerCenter.
Experience with cloud platforms such as AWS, Azure, or GCP.
Proficiency in scripting languages such as Python, Bash, or Go.
Experience with infrastructure-as-code tools such as Terraform or Ansible.
Experience with monitoring and alerting tools such as Datadog.
Strong understanding of networking concepts and protocols.
Excellent problem-solving and troubleshooting skills.
Excellent communication and collaboration skills, with the ability to communicate technical concepts clearly.
Strong understanding of SRE principles and practices.
Experience with containerization (Docker, Kubernetes).
Experience with CI/CD pipelines and automation tools.
Relevant certifications (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer Expert, Google Cloud Professional Cloud Architect).
Experience with Agile development methodologies.
📩 If you’re interested and meet the qualifications, please send your CV to Alina Pchelnikova at alina.pchelnikova@dcvtechnologies.co.uk