HashiCorp is hiring a

Sr. Site Reliability Engineer - Incident Excellence (Hybrid)

Bengaluru, India

 

About HashiCorp

HashiCorp solves development, operations, and security challenges in infrastructure so organizations can focus on business-critical tasks. We build products to give organizations a consistent way to manage their move to cloud-based IT infrastructures for running their applications. Our products enable companies large and small to mix and match AWS, Microsoft Azure, Google Cloud, and other clouds as well as on-premises environments, easing their ability to deliver new applications.

We use the Tao of HashiCorp as our guiding principles for product development and operate according to a strong set of company principles for how we interact with each other. We value top-notch collaboration and communication skills, both among internal teams and in how we interact with our users.


Our Team

The HashiCorp Incident Excellence team is responsible for improving HashiCorp’s incident response while maximizing learning from incidents. Our focus is on helping all engineers feel confident when they are on-call and improving communication to efficiently resolve incidents and build trust in our brand. We partner closely with teams to drive a holistic incident management strategy and share learnings to help our business continuously improve.

About this Role

This engineering role is on a nascent engineering team. The team is responsible for products that touch many areas of engineering organizations at HashiCorp, so applicants will need to excel at collaboration, have product-focused mindsets, and be comfortable iterating in an agile manner towards solutions.

You will provide expert execution of the incident command process, including running and managing high-severity incident bridges and driving transparent communication that promotes maximum levels of internal and external customer satisfaction.

Collaborate with an array of technical stakeholders and executives to drive resolution during incidents and improve overall response for future incidents and technical escalations. 

Utilize top-notch troubleshooting techniques to identify, organize, and advocate for novel solutions to remediate customer impact on complex interconnected systems.

Participate in a closed-loop post-incident learning process that drives insights and meaningful action.

Drive iterative improvements in response through consistent drills, tabletops, and game-day exercises.

Push the boundaries of innovation in incident management to deliver best-in-class incident response.

In this role, you can expect to:

  • Be responsible for and drive incident management capabilities and culture.
  • Contribute to incident command on-call.
  • Build technical skills and relationships within a team of engineers and SREs.
  • Lead and refine our incident response strategy, ensuring rapid and effective response to operational disruptions.
  • Analyze incident trends and root causes to drive continuous improvements in system reliability and response processes.
  • Develop and maintain tools for incident detection, analysis, and resolution, automating responses where possible to minimize human intervention.
  • Create comprehensive incident response documentation and conduct training sessions to prepare all relevant teams for effective incident handling.
  • Work closely with development, operations, and security teams to coordinate incident response efforts and post-incident analyses.

 

You may be a good fit for our team if:

  • You have 5+ years of experience in site reliability engineering, systems administration, or software engineering, with a significant focus on incident response and operational reliability.
  • You have 1+ years of experience managing, coordinating, and ensuring resolution of major incidents.
  • You have professional experience with incident management in cloud environments.
  • You enjoy working on a variety of scopes spanning software engineering, cloud infrastructure, and SRE.
  • You have a proven track record of managing and resolving incidents in cloud-based environments, with expertise in major public cloud platforms (AWS, GCP, Azure).
  • You understand fundamental network technologies such as DNS, load balancing, SSL, TCP/IP, and HTTP.
  • You have a strong understanding of monitoring and alerting systems, with the ability to develop metrics and alarms that accurately reflect system health and operational risks.
  • You have experience with incident management tools and practices, including post-mortem analysis and root cause investigation.
  • You have a passion for consistently responding to and leading complex incidents in a 24x7x365 environment that uses a global follow-the-sun model.
  • You have a customer-centric attitude with a focus on providing best-in-class incident response for customers and stakeholders.
  • Familiarity with HashiCorp’s product suite and infrastructure automation tools is a plus.
  • You demonstrate strong leadership skills during periods of significant business impact, remaining calm and professional in high-pressure situations.
  • You have a strong desire to drive customer success with partner teams and management on high-profile issues critical to the long-term success of the business.
  • You have outstanding verbal and written communication skills, with the ability to convey information in a meaningful way to both engineers and executive-level management, during and outside of incidents.
  • You are adaptable to a wide variety of technologies and capable of incident response and troubleshooting in complex interconnected environments.

 
