As a Senior Site Reliability Engineer at Fortanix, you will be at the forefront of ensuring the reliability, scalability, and performance of our cutting-edge production environments. You’ll design and build operations as code, architecting automated solutions that enhance system stability. Partnering closely with our product engineering teams, you'll have a hands-on role in continuously improving the reliability of our platforms, ensuring our systems are robust and resilient. You'll develop and implement a comprehensive, actionable monitoring framework that detects and prevents issues before they impact our users.
In this role, you'll be a critical part of our production on-call rotation, responding to incidents with agility and executing post-incident reviews to drive continuous improvement. If you’re passionate about automation, enjoy tackling complex reliability challenges, and thrive in a fast-paced, high-impact environment, this role is for you!
Join us to shape the future of secure computing with a focus on building reliable, scalable, and secure production systems.
Key Responsibilities
- System Architecture & Design
  - Collaborate with software development teams to design scalable, reliable, and secure systems.
  - Architect and build robust infrastructure to handle growth and ensure system uptime.
- Automation & Infrastructure as Code (IaC)
  - Automate infrastructure deployment and management using tools like Terraform, Ansible, or CloudFormation.
  - Implement continuous integration and continuous deployment (CI/CD) pipelines for automated testing and deployment.
  - Write automation scripts and code for scaling and self-healing systems.
- Monitoring & Incident Management
  - Design and implement comprehensive monitoring and alerting solutions to detect anomalies and issues before they impact users.
  - Implement logging and observability tools to gain insight into system health and performance (e.g., Prometheus, Grafana, ELK stack).
  - Manage on-call rotations, ensure timely responses to incidents, and perform root cause analysis and post-mortems.
- Performance Tuning & Optimization
  - Perform load testing and system benchmarking to identify performance bottlenecks.
  - Optimize application and infrastructure performance, reducing latency and improving response times.
- Security & Compliance
  - Ensure systems are secure by design, incorporating security best practices (e.g., encryption, firewalls, access controls).
  - Stay up-to-date with security vulnerabilities and patch systems accordingly.
  - Implement compliance standards (e.g., GDPR, HIPAA) where applicable.
- Collaboration & Mentoring
  - Work closely with developers to ensure that applications are designed for reliability and scalability.
  - Serve as a mentor to junior engineers, fostering a culture of reliability and best practices.
  - Collaborate across teams (DevOps, Development, QA) to enhance system robustness.
- Disaster Recovery & High Availability
  - Develop and maintain disaster recovery and business continuity plans.
  - Ensure systems are highly available, designing systems that can withstand failures without service disruptions.
- Capacity Planning & Scalability
  - Forecast future system demand and plan for capacity increases as needed.
  - Design infrastructure that scales automatically to handle increased loads.
- Continuous Improvement & Reliability Culture
  - Analyze incidents and failures to identify opportunities for improving system reliability.
  - Drive a culture of reliability across the engineering organization, advocating for best practices and SRE principles.
- Cloud & Hybrid Infrastructure Management
  - Manage cloud infrastructure (AWS, GCP, Azure) and hybrid environments, ensuring optimal usage of cloud resources.
  - Implement cost optimization strategies for cloud resources while maintaining performance and reliability.
This role requires a deep understanding of both software engineering and infrastructure management, as well as strong collaboration and problem-solving skills.
Requirements
Technical Experience
Demonstrated expertise in modern enterprise Site Reliability Engineering is essential for this role. In addition, experience in the following areas is highly beneficial:
- Proficiency in Programming/Scripting Languages - Strong coding skills in languages such as Python, Go, or similar. Familiarity with scripting languages like Bash or PowerShell is also important.
- Problem Solving - Advanced experience with Linux administration and automation. Experience with production debugging and the ability to implement fast workarounds.
- CI/CD & DevOps - Advanced experience managing software deployment to the cloud via pipelines (for example, Bitbucket or GitLab). Understanding of DevOps practices for how modern software is deployed, upgraded, and monitored.
- Containers & Orchestration - Strong hands-on experience with container technologies like Docker and Kubernetes, and related tools like Helm or OpenShift. Experience with both managed (AKS, EKS, GKE) and unmanaged (on-prem) Kubernetes.
- Monitoring & Observability - Expertise with monitoring, alerting, and logging tools such as Prometheus, Grafana, Datadog, ELK stack, or similar. Understanding of metrics collection and analysis.
- Networking/Infra - Solid understanding of networking concepts (TCP/IP, DNS, VPN, load balancing, firewalls, etc.) and network performance tuning in cloud environments. Experience with high-level network infrastructure for data centre and cloud.
Key Requirements
- Bachelor's/Master's degree in Computer Science, Engineering, or a related field.
- Engineering: 8+ years of engineering experience, with 3+ years of core site reliability engineering experience.
- Experience with managing and resolving high-severity incidents in production environments. Ability to lead post-mortems and implement improvements.
- Solid understanding of Cloud technologies.
- Strong experience with automation practices and principles to reduce manual work and improve efficiency.
- Experience working in a cross-functional team environment, often collaborating with developers, QA, and security teams.
- Must be a team player.
Certifications (Optional but Preferred)
- Cloud Certifications: AWS Certified Solutions Architect, Google Cloud Certified - Professional Cloud Architect, Microsoft Certified: Azure Solutions Architect Expert.
- DevOps Certifications: Certified Kubernetes Administrator (CKA), HashiCorp Terraform Associate, or similar certifications.
Benefits
- Top range of market compensation
- A friendly culture that brings the best out of everybody
- Mediclaim Insurance – Employees and their eligible dependents including dental coverage
- Personal Accident Insurance
- Internet Reimbursement