Overview: The Incident Manager owns the outcomes of the incident management process and leads a team of site reliability engineers who provide 24/7 coverage within the technology department. The role involves strategic oversight, resource management, and effective coordination of response efforts to minimize disruptions. The Incident Manager ensures continuous improvement of incident management processes, drives root cause analysis, and fosters communication among stakeholders.
Key Responsibilities:
- Leadership & Oversight: Provide strategic direction for the team, and meticulous oversight of the incident management process, ensuring smooth navigation through the incident life cycle.
- Resource Management: Allocate resources effectively, including personnel and tools, to address incidents promptly and provide the necessary 24/7 coverage.
- Automation & Tooling: Oversee the development and maintenance of automation scripts and tools, built on our APM tools, to reduce manual intervention and improve system efficiency.
- Coordination & Communication: Coordinate with cross-functional teams, manage communication with stakeholders, and provide regular status updates.
- Decision-Making & Problem-Solving: Guide teams in making informed decisions and implementing solutions during incident responses. Leverage existing runbooks to minimize customer impact.
- Root Cause Analysis: Lead investigations to determine root causes and implement corrective actions to prevent recurrence.
- Continuous Improvement: Conduct post-incident reviews, analyze trends, and apply insights to enhance incident management processes.
- Documentation: Ensure comprehensive documentation of incidents and responses for future analysis and improvement.
Essential Skills:
- Strong background in cloud technologies and a proactive approach to identifying and resolving issues before they impact the business.
- Proficiency in using monitoring and alerting tools (e.g., New Relic, Datadog).
- Ability to analyze and interpret alerts and logs to pinpoint the source of the issue.
- Ability to quickly identify and prioritize critical issues.
- Experience with incident management processes and tools (e.g., PagerDuty).
- Strong problem-solving skills to diagnose and resolve system and application issues.
- Proficiency in using diagnostic tools and techniques (e.g., log analysis, tracing, profiling).
- Strong working knowledge of operating systems (Linux/Windows) and system administration tasks.
- Familiarity with key system components like CPU, memory, disk, and network.
- Basic knowledge of database management and troubleshooting (e.g., MySQL, PostgreSQL, MS-SQL).
- Experience with managing cloud resources and troubleshooting cloud-specific issues.
- Clear and concise communication skills to convey the status and impact of an outage to stakeholders.
- Ability to coordinate effectively with different teams (e.g., development, operations, support).
- Ability to remain calm and focused under pressure.
- Effective time management to handle multiple tasks and prioritize urgent issues.
- Ability to document the incident, including steps taken to diagnose and resolve the issue.