Production Systems Engineer – Mass Recovery
Build resilience and shape the future of disaster recovery: drive enterprise-wide mass outage response and infrastructure robustness!
Krakow-based opportunity with a hybrid work model (2 days per week in the office).
As a Production Systems Engineer – Mass Recovery, you will be working for a leading financial institution committed to safeguarding the stability of the global financial system. You will help design and implement advanced IT resilience strategies, ensuring rapid, effective recovery from major incidents affecting critical services. This is your chance to be at the forefront of innovative disaster recovery solutions, making a tangible impact in a dynamic banking environment.
Your main responsibilities:
Develop and maintain detailed service dependency models across applications, platforms, and infrastructure layers to support disaster recovery efforts.
Identify, document, and analyze shared failure domains such as virtualization, storage, and network components.
Define scenario-based blast radius models to anticipate and mitigate mass outage impacts.
Support rapid failure correlation by analyzing service failures and providing actionable insights for recovery teams.
Validate and challenge existing resilience data sources, ensuring alignment with real system behaviors.
Document gaps in resilience, including RTO mismatches and missing recovery pathways, to enhance recovery strategies.
Collaborate with cross-functional teams and tooling platforms to extract and synthesize relevant operational data.
Contribute to designing fault-tolerant architectures and recovery procedures for high-availability systems.
You're ideal for this role if you have:
Minimum of 4 years’ experience in production engineering, site reliability engineering, or infrastructure engineering within large-scale environments.
Strong knowledge of virtualization platforms (ESX), cloud providers, and storage/big data systems.
Solid understanding of networking fundamentals and infrastructure topology.
Hands-on experience with CMDB platforms (such as ServiceNow) and observability tools (such as AppDynamics and Splunk).
Proven ability to analyze complex data sets, identify patterns, and derive practical insights.
Experience operating under high-pressure incident management scenarios.
Excellent communication skills in English; fluency is required.
It is a strong plus if you have:
Previous experience within banking or financial services, especially with HSBC or similar institutions.
Exposure to Disaster Recovery or Mass Recovery planning/execution.
Data manipulation and extraction skills.
Familiarity with Jira/Confluence and large distributed system environments.
Eligibility for the role:
Only candidates with an existing legal right to work in Europe will be considered.
We offer you:
ITDS Business Consultants is involved in a variety of innovative, professional IT projects for international companies in the financial industry in Europe. We offer an environment for professional, ambitious, and driven people. The offer includes:
Stable and long-term cooperation with very good conditions
Enhance your skills and develop your expertise in the financial industry
Work on the most strategic projects available in the market
Define your career roadmap and develop yourself in the best and fastest possible way by delivering strategic projects for different clients of ITDS over several years
Participation in social events, training, and work in an international environment
Access to an attractive Medical Package
Access to the Multisport Program
#GETREADY
Internal job ID #9014
You can report violations in accordance with ITDS’s Whistleblower Procedure available here.