Site Reliability Engineer

Egnyte Poland

Poznań

Type of work: Undetermined
Experience: Mid
Employment type: Permanent
Operating mode: Remote


Egnyte is the secure multi-cloud platform for content security and governance that enables organizations to better protect and collaborate on their most valuable content. Established in 2008, Egnyte has democratized cloud content security for more than 22,000 organizations, helping customers improve data security, maintain compliance, prevent and detect ransomware threats, and boost employee productivity on any app, any cloud, anywhere. For more information, visit www.egnyte.com.


Tech stack

Linux: advanced
Python / Golang: regular
Terraform: regular
Google Cloud Platform: regular
Jenkins: regular
Ansible: junior
Puppet: junior
Prometheus: junior
Kubernetes: nice to have
ELK: nice to have

Job description

Online interview

Egnyte is a product-focused company, not a software outsourcing business. We build and maintain our flagship software: a secure content platform called Egnyte, used by companies like Red Bull and Yamaha. With 200+ people working in our Poznań office, we remain a people-first workplace.

You will ensure the reliability of large-scale software: 16k+ customers, over 6,000 instances across geo-distributed data centers and cloud providers, and an average of 2k API requests per second as measured by New Relic. For us, people who own their work from start to finish are integral to Egnyte's success. Our engineers are part of the whole process: from design through coding and testing to deployment, and back again for further iterations. We are looking for an engineer who is eager to apply software development approaches to operations. You can, and will, touch every level of the infrastructure, depending on the day and the project you are working on. This role requires you to take on complex problems and deliver end-to-end solutions.

Your day-to-day at Egnyte:

Drive focused initiatives that improve operational efficiencies, reliability, and scalability of the platform and its applications
Participate in big projects like migrating solutions from self-hosted environments to the cloud, from virtual machines to Kubernetes, from monolith to microservices
Proactively propose and implement automation and observability solutions focusing on improving our core business
Address performance challenges, optimize and fine-tune production environments
Apply SRE best practices when making and documenting improvements to the infrastructure
Maintain and monitor our environments in a 16/5 rotation system

About you:

2+ years of experience in an SRE/SysAdmin/DevOps/NOC, software development, or equivalent role
Coding skills in Python or Golang
Experience with Linux Operating System administration
Knowledge of both self-hosted and cloud environments (preferably the Google Cloud Platform)
Experience with metric-based monitoring solutions
Experience managing large numbers of diverse systems with configuration management and infrastructure-as-code tools like Puppet, Ansible, and Terraform
Practical knowledge of CI/CD solutions, preferably GitLab CI or similar (Jenkins, Travis CI, CircleCI, etc.)
Troubleshooting skills to hunt down the root causes of issues and persistence in preventing them from happening again
Incident management skills: must be able to own, collaborate on, and resolve large-scale incidents under time pressure
Good English skills to effectively communicate about technical matters

Bonus points:

Practical knowledge of containers and container orchestration (Docker, Kubernetes)
Experience with Linux HA solutions such as HAProxy, LVS, Corosync & Pacemaker, etc.
Experience with message brokers (RabbitMQ, Kafka, or others) and databases (MySQL or others)
Operational knowledge of the ELK stack
GCP certificate, CCNA certificate, RHCE, or equivalent
Being an active user of and contributor to open source projects