Incident Engineer
TLDR
Own end-to-end incident response across the AI and platform stack, ensuring rapid detection, triage, communication, and resolution of incidents affecting customers and internal systems.
Own the incident lifecycle: detection, triage, escalation, resolution, and postmortems
Act as the central command during major incidents (war rooms, stakeholder updates)
Define and enforce SLAs/SLOs, incident severity frameworks, and runbooks
Collaborate with Engineering, ML, and Integrations teams to resolve issues quickly
Monitor system health across integrations (agent desks, LLMs, ASR/TTS pipelines)
Drive root cause analysis (RCA) and preventive actions
Improve observability, alerting, and incident tooling
Maintain clear internal and customer-facing communication during incidents
3–6 years in Incident Management / SRE / Production Support roles
Strong understanding of distributed systems, APIs, and cloud environments (AWS)
Experience with observability tools (e.g., Datadog)
Familiarity with AI/ML systems, especially LLM integrations and voice stacks (ASR/TTS), is a plus
Experience with monitoring/tracing tools such as Langfuse
Excellent communication and stakeholder management skills
Ability to stay calm under pressure and drive structured resolution
Exposure to OpenAI or similar LLM platforms
Experience supporting customer-facing SaaS products
Automation mindset (runbooks, alert tuning, incident tooling)
Netomi builds an agentic AI platform for enterprise customer experience, helping large global brands such as Delta Air Lines and MetLife automate customer interactions at scale. Our no-code solution lets businesses deploy AI-driven customer support quickly and manage it across the entire customer journey.
- Founded: 2015
- Employees: 51-200
- Industry: Internet Software & Services