Site Reliability Engineer
Summary of This Role
We are looking for a detail-oriented and technically strong Site Reliability Engineer to join our API Operations team. In this critical role, you will be responsible for monitoring, diagnosing, and resolving production incidents across our Apigee API implementations. You’ll work closely with API engineering, Developer Services, Product Management, platform, and governance teams to ensure the stability, reliability, and performance of deployed models and agentic solutions across the enterprise. You will join a dynamic team passionate about learning, applying cutting-edge and cost-effective technologies, and innovating to deliver high-quality, highly available API solutions.
What Part Will You Play?
Serve as the first line of defense for production incidents, ensuring rapid triage, root cause analysis, and resolution.
Monitor the health and performance of deployed APIs and the applications that integrate with them.
Track and investigate issues related to latency, failures, or broken integrations, escalating to the API engineering group where appropriate.
Collaborate with platform engineers to implement observability, logging, and alerting best practices for API services.
Build diagnostic tools, runbooks, and automated workflows to improve incident response time and reduce manual intervention.
Maintain knowledge bases and playbooks for repeatable troubleshooting and knowledge transfer.
Partner with governance and compliance teams to ensure incidents are documented and remediated in line with internal policy.
Contribute to retrospectives and continuous improvement efforts to harden production systems.
What Are We Looking For in This Role?
Minimum Qualifications
3+ years of experience in production support, site reliability engineering (SRE), or DevOps, preferably supporting Apigee APIs.
Strong understanding of cloud infrastructure (GCP preferred) and observability tools.
Proficiency in Python or shell scripting for automation and troubleshooting.
Strong analytical, communication, and incident management skills.
Bachelor’s degree in Computer Science, Engineering, or a related field.
Proficiency in programming languages such as Python and JavaScript.
Excellent problem-solving and analytical skills.
Excellent communication and collaboration skills.
English proficiency at B2-C1 level and Czech/Polish proficiency at B1-B2 level.
Preferred Qualifications
Experience with CI/CD tools and automation of alerting and monitoring.
Familiarity with API integrations.
What Are Our Desired Skills and Capabilities?
Ability to work proactively with a high level of initiative and accuracy.
Ability to manage multiple assignments effectively and meet established deadlines.
Strong interpersonal skills to interact professionally with staff and stakeholders.
Excellent organizational skills and attention to detail.
Critical thinking ability for tasks ranging from moderately to highly complex.
Flexibility in adapting to changing business needs and priorities.
Ability to work creatively and independently with minimal supervision.
Ability to utilize experience and judgment in accomplishing goals.
Experience in navigating organizational structures and collaborating across teams.
What will you get from us:
working in a global environment with international market-focused projects
using English on a daily basis
private medical care
onboarding training in your first days of work – you will get to know our company better
training for employees: with us you will develop your professional and personal potential
lunch pass/Pluxee
multisport cards at preferential prices
option to join UNUM group life insurance
fresh fruit every Wednesday and delicious coffee from Praska Palarnia every