Site Reliability Engineer (SRE)

HyperScience

New York

Type of work

Undetermined

Experience

Mid

Employment Type

Permanent

Operating mode

Office

Tech stack

Python

regular

Java

regular

Go

regular

C++

regular

Linux / Unix

regular

TCP/IP

regular

Job description

At HyperScience we bring AI to the enterprise. Our products help enterprises and government institutions function by automating certain kinds of office work and reducing the bureaucratic burden on both businesses and their customers. We take a heterogeneous approach to AI, using a blend of what are traditionally considered different fields of ML: deep learning, computer vision, and NLP, among others. We believe that AI is destined to be the biggest event in the history of human labor since the Industrial Revolution, and we want to be a part of it.

We're looking to hire our first Site Reliability Engineer in NYC. We're looking for hybrid systems and software engineers who take ownership of reliability, automation, and other issues related to 'keeping the lights on.' SREs are integrated within the core engineering team, and we want engineers who are eager to develop, maintain, and scale infrastructure software.

What You'll Achieve:

Ensure reliability, scalability, and performance by tackling problems in critical services and preventing their recurrence.
Create and manage build/deployment pipelines for continuous integration and continuous delivery to improve the quality and availability of business products.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.

Desired Experience:

You love analyzing, monitoring, and troubleshooting large-scale distributed systems.
You have extensive knowledge of networking and operating systems (e.g., processes, threads, concurrency).
You are comfortable with at least one programming language such as Python, Go, C++, Java, or Ruby, and with scripting languages such as Shell and Perl.
You're familiar with algorithms, data structures, and complexity analysis.
You have experience with Unix/Linux operating system internals and administration (e.g., filesystems, inodes, system calls) or networking (e.g., TCP/IP, routing, network topologies and hardware, SDN).

The Nice-to-Haves:

Expertise in designing, analyzing and troubleshooting large-scale systems.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Ability to debug and optimize code and automate routine tasks.