Drive incident response coordination across cross-functional teams while analyzing trends in production issues and implementing improvements through post-incident reviews in a collaborative technical
Serve as Incident Commander for major incidents — coordinating cross-functional response teams, driving investigation, making escalation decisions, and ensuring incidents are resolved within SLA targets.
Own all incident communications: draft and send clear, timely updates to senior leadership, Customer Success, and partner/customer contacts throughout the incident lifecycle, and manage customer-facing status page updates (status.xsolla.com).
Facilitate blameless Post-Incident Reviews (PIRs) for major incidents — leading root cause identification, assigning corrective actions with clear owners and deadlines, and tracking them to closure.
During non-incident periods, proactively analyze incident trends, recurring issues, and production bugs — identify patterns, create Problem tickets, and report findings and recommendations to product and engineering teams on a regular cadence.
Enforce the incident management framework across the organization, including the severity model, priority matrix, SLA targets, escalation procedures, and deployment readiness gates.
Oversee and mentor the Operations Engineer on your shift — coaching on triage, investigation, runbook execution, and documentation quality while conducting regular knowledge transfer sessions to build depth across the service portfolio.
Produce shift handoff reports and deliver regular operational reporting: incident trends, KPI performance (MTTD, MTTA, MTTR), SLA adherence, proactive detection rate, and repeat incident analysis.
Audit service catalogue completeness on a regular cadence and govern JIRA Service Management workflows for incident, PIR, and problem management.
Cover for the Operations Engineer role during vacations, absences, breaks, or surge incidents, including monitoring, triage, ticket creation, and runbook execution. Participate in the weekend on-call rotation for major incidents.
6+ years of experience in incident management, SRE, NOC leadership, or technical operations in a production environment supporting high-availability, high-transaction systems (payments, e-commerce, SaaS, or gaming platforms preferred).
Proven incident management experience — coordinating multi-team response, making real-time escalation decisions, and communicating with executive stakeholders under pressure.
Excellent written and verbal communication skills in English — ability to draft clear, concise executive updates at 3 AM under pressure, facilitate blameless PIRs, present operational metrics to senior leadership, and communicate incident status to customers and partners with clarity and professionalism.
Strong ITIL foundation — understanding of incident, problem, and change management lifecycles with practical experience implementing or operating ITIL-aligned workflows.
Technical depth across the observability stack — ability to read and interpret logs, traces, and metrics in Datadog (or equivalent: Grafana, Splunk, New Relic). Understanding of APM, SLOs, error budgets, burn-rate alerting, and synthetic monitoring.
Hands-on experience with incident tooling: Datadog, PagerDuty or OpsGenie, JIRA or JIRA Service Management, Slack, and Confluence.
Analytical mindset — ability to identify trends, patterns, and recurring issues from incident data and translate them into actionable recommendations for product and engineering teams.
Experience with SLA/SLO-driven operations where MTTD, MTTA, and MTTR are measured, reported, and improved.
Experience with or strong interest in AI/ML-assisted operations: anomaly detection, alert correlation, predictive alerting, automated remediation, or self-healing automation.
Comfort with 24x7 shift-based operations as part of a follow-the-sun model with handoff overlaps. Weekend on-call (rotating) for critical severities is required.
Experience in the gaming, payments, or fintech industry, and with customer/partner-facing incident communications and status page management.
JIRA Service Management administration experience (workflows, SLA timers, automation rules) and familiarity with Datadog Service Catalog, scorecards, and SLOs — especially burn-rate alerts and multi-window SLOs.
Experience building an operations function from scratch — defining processes, writing runbooks, establishing governance cadences. Background in Kubernetes, cloud infrastructure (GCP preferred), microservices architecture, or distributed systems. ITIL certification (Foundation or higher) is a plus.
The duties and responsibilities of this position may evolve over time to support the organization's goals and individual growth. This job description is intended to outline the general nature and level of work being performed and is not intended to be an exhaustive list of all duties, responsibilities, and qualifications required.
By submitting your application, you consent to Xsolla conducting background checks, where permitted by law, after the final interview stage. All checks will comply with local regulations, and your information will be handled confidentially.
Xsolla takes your privacy seriously and will not sell or externally distribute any personal data received during the hiring process. In accordance with applicable data protection laws, Xsolla is committed to protecting your personal information and respecting your privacy.
For any inquiries related to data privacy, please contact: [email protected]
Xsolla is a global commerce company that empowers game developers by providing tools and services to tackle the complexities of the video game industry. Catering to both indie and AAA developers, Xsolla partners with them to enhance funding, distribution, marketing, and monetization of their games. With a mission to connect opportunities and innovate resources, Xsolla has supported over 1,500 game creators in expanding their reach and growing their businesses worldwide.