As a Senior Site Reliability Engineer, you will work alongside our autonomous cross-functional squads. You will advocate for high-quality engineering and best practice in production software, and provide the infrastructure to both build rapid prototypes and launch production-quality services. You must be a strong communicator who can explain what is required to build and deliver top-quality software products. You will be keen to work with the rest of the team and develop collaboratively.
You will promote test-driven development and other Agile best practices to ensure the software is resilient enough for our scientists to rely upon. You will be a core team member building and maintaining the underlying infrastructure that supports our AI-driven technology. You will also contribute to diverse areas such as authentication, network topology, sharded databases, scalable web services, and interfaces to external data sources and APIs.
Responsibilities:
- Implementing software solutions for cloud infrastructure in accordance with specification and best engineering practices.
- Working towards improving long-term infrastructure availability and reliability.
- Monitoring the infrastructure, platforms and core engineering services, and handling incident response for them.
- Constructing pipelines to automate infrastructure and software deployments.
- Troubleshooting infrastructure, network and software issues.
- Staying up to date with recent technology trends and tools.
- Automating repetitive manual processes and procedures.
- Participating in the on-call rotation to support Benevolent employees in their day-to-day activities.
We are looking for:
- Ability to code and fluency in at least one programming language (Python/Java/Go/C++ preferred).
- Hands-on experience with Kubernetes.
- Good understanding of, and experience administering, cloud technologies (we work with AWS, but experience with other cloud providers is also a benefit!).
- Comfortable working with Unix-based operating systems.
- Good understanding of infrastructure-as-code and tools such as Ansible, Terraform, Helm.
- Experience with cloud networking, cloud operations, automation and workload orchestration.
- Basic understanding of network protocols such as TCP and HTTP/S, of load balancing, and of the contexts in which they are used (for example, understanding the differences between an AWS Application Load Balancer and an AWS Network Load Balancer).
- Understanding of quality of service measurement tools (SLIs, SLOs, SLAs).
- Experience with monitoring and alerting solutions (for example InfluxDB/Grafana/Prometheus).
- High-level understanding of database technologies (for example, relational, NoSQL, graph) and their basic use cases.