Bengaluru, India

Full-Time

As a Senior Site Reliability Engineer at Fortanix, you will be at the forefront of ensuring the reliability, scalability, and performance of our cutting-edge production environments. You’ll design and build operations as code, architecting automated solutions that enhance system stability. Partnering closely with our product engineering teams, you'll have a hands-on role in continuously improving the reliability of our platforms, ensuring our systems are robust and resilient. You'll develop and implement a comprehensive, actionable monitoring framework that detects and prevents issues before they impact our users.

In this role, you'll be a critical part of our production on-call rotation, responding to incidents with agility and executing post-incident reviews to drive continuous improvement. If you’re passionate about automation, enjoy tackling complex reliability challenges, and thrive in a fast-paced, high-impact environment, this role is for you!

Join us to shape the future of secure computing with a focus on building reliable, scalable, and secure production systems.

Key Responsibilities

System Architecture & Design

Collaborate with software development teams to design scalable, reliable, and secure systems.
Architect and build robust infrastructure to handle growth and ensure system uptime.

Automation & Infrastructure as Code (IaC)

Automate infrastructure deployment and management using tools like Terraform, Ansible, or CloudFormation.
Implement continuous integration and continuous deployment (CI/CD) pipelines for automated testing and deployment.
Write automation scripts and code for scaling and self-healing systems.

Monitoring & Incident Management

Design and implement comprehensive monitoring and alerting solutions to detect anomalies and issues before they impact users.
Implement logging and observability tools to gain insight into system health and performance (e.g., Prometheus, Grafana, ELK stack).
Manage on-call rotations, ensure timely responses to incidents, and perform root cause analysis and post-mortems.

Performance Tuning & Optimization

Perform load testing and system benchmarking to identify performance bottlenecks.
Optimize application and infrastructure performance, reducing latency and improving response times.

Security & Compliance

Ensure systems are secure by design, incorporating security best practices (e.g., encryption, firewalls, access controls).
Stay up-to-date with security vulnerabilities and patch systems accordingly.
Implement compliance standards (e.g., GDPR, HIPAA) where applicable.

Collaboration & Mentoring

Work closely with developers to ensure that applications are designed for reliability and scalability.
Serve as a mentor to junior engineers, fostering a culture of reliability and best practices.
Collaborate across teams (DevOps, Development, QA) to enhance system robustness.

Disaster Recovery & High Availability

Develop and maintain disaster recovery and business continuity plans.
Ensure high availability by designing systems that can withstand failures without service disruption.

Capacity Planning & Scalability

Forecast future system demand and plan for capacity increases as needed.
Design infrastructure that scales automatically to handle increased loads.

Continuous Improvement & Reliability Culture

Analyze incidents and failures to identify opportunities for improving system reliability.
Drive a culture of reliability across the engineering organization, advocating for best practices and SRE principles.

Cloud & Hybrid Infrastructure Management

Manage cloud infrastructure (AWS, GCP, Azure) and hybrid environments, ensuring optimal usage of cloud resources.
Implement cost optimization strategies for cloud resources while maintaining performance and reliability.

This role requires a deep understanding of both software engineering and infrastructure management, as well as strong collaboration and problem-solving skills.

Requirements

Technical Experience

Demonstrated expertise in modern enterprise Site Reliability Engineering is essential for this role. In addition, experience in the following areas is highly beneficial:

Proficiency in Programming/Scripting Languages - Strong coding skills in languages such as Python, Go, or similar. Familiarity with scripting languages like Bash or PowerShell is also important.
Problem Solving - Advanced experience with Linux administration and automation. Experience with production debugging and the ability to implement fast workarounds.

CI/CD & DevOps - Advanced experience managing software deployment to the cloud via pipelines (e.g., Bitbucket, GitLab). Understanding of DevOps practices for how modern software is deployed, upgraded, and monitored.
Containers & Orchestration - Strong hands-on experience with container technologies like Docker and Kubernetes, and related tools like Helm or OpenShift. Experience with both managed (AKS, EKS, GKE) and unmanaged (on-prem) Kubernetes.
Monitoring & Observability - Expertise with monitoring, alerting, and logging tools such as Prometheus, Grafana, Datadog, ELK stack, or similar. Understanding of metrics collection and analysis.

Networking/Infra - Solid understanding of networking concepts (TCP/IP, DNS, VPN, load balancing, firewalls, etc.) and network performance tuning in cloud environments. Experience with high-level network infrastructure for data center and cloud environments.

Key Requirements

Bachelor's/Master's degree in Computer Science, Engineering, or a related field.
8+ years of engineering experience, including 3+ years of core Site Reliability Engineering experience.
Experience managing and resolving high-severity incidents in production environments. Ability to lead post-mortems and implement improvements.
Solid understanding of Cloud technologies.

Strong experience with automation practices and principles to reduce manual work and improve efficiency.
Experience working in a cross-functional team environment, often collaborating with developers, QA, and security teams.
Must be a team player.

Certifications (Optional but Preferred)

Cloud Certifications: AWS Certified Solutions Architect, Google Cloud Certified - Professional Cloud Architect, Microsoft Certified: Azure Solutions Architect Expert.
DevOps Certifications: Certified Kubernetes Administrator (CKA), HashiCorp Terraform Associate, or similar certifications.

Benefits

Top range of market compensation

A friendly culture that brings the best out of everybody

Mediclaim Insurance: covers employees and their eligible dependents, including dental coverage

Personal Accident Insurance

Internet Reimbursement

