Senior Site Reliability Engineer

DevOps

Wrocław

MANGOPAY

Full-time

Permanent, B2B

Senior

Remote

Job description

At Mangopay, our mission is to power the payment infrastructure and payment operations of the world's biggest and most exciting marketplaces & platforms.

We provide marketplaces and platforms with powerful modular payment and regulatory solutions. Since 2013, we have enabled the success of some of the biggest names in e-commerce, retail, and cutting-edge platforms, such as Vinted, Rakuten, Chrono24, La Redoute, Wallapop, and 2,500+ more.

Our team of 400+ is spread across Europe, with offices in Berlin, Dublin, Luxembourg, London, Madrid, Paris, and Warsaw. In an environment where marketplaces and fintech ventures are thriving, we're actively seeking exceptional individuals to tackle the challenges in our field and contribute to our growth. Our commitment to diversity is unwavering, and we are dedicated to promoting employee well-being, inclusivity, and equal opportunities. Joining Mangopay means you’ll be part of a dynamic, flexible, and rapidly growing team.

Job Description

We’re currently looking for motivated and results-driven Senior & Lead Site Reliability Engineers (SREs) to join our Platform team. As a valued member of the Technology department, you will have the opportunity to work closely with cross-functional teams to deploy and manage systems, drive operational efficiency through automation, and troubleshoot issues across multiple environments.

If you’re a senior engineer with a good cloud infrastructure background who is comfortable with ambiguity and aims to simplify and improve how infrastructure works, this could be the role for you. As a team, we’re responsible for designing, building, and operating the services we consume from AWS, along with the software we run on top, such as Kubernetes, Kafka, Redis, PostgreSQL, and more. We’re also responsible for operating our network and being on-call for the things we own and run.

To achieve this, we’re organised into three teams within the Platform Universe: Platform Engineering, Data Engineering, and Operations. Each squad is responsible for solving a specific set of problems for our customers and our engineers. We’re looking for engineers who are interested in joining our Operations SRE squad.

This role is a remote opportunity.

What will you be responsible for?

Designing and implementing automation tools and frameworks to streamline our operations and deployment processes. This will involve creating new tools as well as improving existing ones
Leading efforts to design and implement scalable, fault-tolerant systems that can handle our increasing user base and traffic
Identifying areas for performance optimisation and conducting capacity planning to accommodate future growth
Participating in architecture and design reviews to ensure that our systems are scalable, reliable, and secure. You will be working with other engineers to make sure that our systems are designed and built for the long term
Building, maintaining and continuously improving our monitoring, alerting, and logging systems. This includes setting up new tools and constantly finding ways to improve our existing ones
Identifying and troubleshooting production issues and providing quick resolution. You will be responsible for identifying problems and finding solutions, as well as working with other teams to ensure that they are resolved quickly
Collaborating with development teams and other teams to ensure that our systems are designed and built to be reliable, robust, and scalable
Monitoring and reporting on system performance and availability. You will be responsible for monitoring our systems to ensure that they are performing well and are available to our users

What do we expect from you?

Strong experience with Amazon Web Services (AWS) is a must. You should have a deep understanding of AWS services and how to use them effectively
Experience with migration projects and migrating environments is a must
Experience with containerization technologies such as Docker and Kubernetes
Extensive experience with Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Ansible
Excellent problem-solving and troubleshooting skills
Strong experience with at least one programming language such as Python, Java, or Go
Strong networking experience
Experience with monitoring and logging tools such as Grafana, ELK stack, DataDog, Splunk or others
Experience with CI/CD pipelines and tools such as TeamCity, GitLab, or CircleCI
Strong understanding of networking concepts and protocols
Excellent communication skills and ability to work in a team environment