Monitor and investigate production issues across a global platform, and help improve incident response and communication practices.
Serve as the primary dashboard monitor during your shift — continuously watch the GTO Operational Dashboard in Datadog, detect anomalies by correlating signals across APM, logs, metrics, synthetic tests, and Real User Monitoring, and determine whether alerts warrant an incident ticket or can be resolved through immediate investigation.
Triage and investigate production incidents — create incident tickets in JIRA Service Management, perform initial technical investigation using Datadog (traces, logs, infrastructure and application metrics), determine blast radius and likely root cause domain, and route to the correct team (Product SRE, Infrastructure SRE, or Engineering) using the smart routing model.
Own lower-severity incidents end-to-end, from detection through resolution: diagnose, execute runbook procedures, and resolve without escalation where possible. Escalate promptly when an incident remains unresolved past defined thresholds or requires a code-level fix.
Support the TSO Lead during major incidents as the technical right hand in the war room — surface real-time data (error rates, impact scope, deployment history, related alerts), maintain the incident ticket with live timeline entries and linked evidence, and execute mitigation actions as directed.
Draft incident communications under TSO Lead direction, including internal Slack updates, stakeholder notifications, and customer-facing status page updates (status.xsolla.com). Support clear, timely communication throughout the incident lifecycle.
During non-incident periods, analyze incident trends, recurring issues, and production bugs — compile data from Datadog, JIRA, and Slack, identify patterns, and contribute findings to regular reports for product and engineering teams.
Compile incident timelines and draft initial PIR documents for Post-Incident Review preparation. Track PIR action items post-session and flag overdue items to the TSO Lead.
Build and maintain operational automation (alert enrichment scripts, incident templates, Slack workflows, dashboard widgets) and contribute to runbook development — documenting new resolution procedures so they can be repeated by any Operations Engineer on any shift.
Conduct structured shift handoffs covering active incidents, at-risk services, upcoming deployments, and follow-up items. Participate in knowledge transfer sessions with SREs to continuously expand independent resolution capability.
Cover for the TSO Lead during vacations, absences, or emergencies — including severity classification, escalation decisions, stakeholder communications, and basic Incident Commander functions.
Publish periodic health reports for critical applications.
4+ years of experience in SRE, DevOps, production operations, NOC, or technical operations in a high-availability environment. Experience with platforms that handle payments, e-commerce, SaaS, or gaming workloads is preferred.
Strong troubleshooting and investigation skills — ability to take an alert or user-reported symptom and methodically trace it through the stack: application logs, APM traces, infrastructure metrics, database queries, and network paths.
Hands-on experience with Datadog (or equivalent observability platform: Grafana, Splunk, New Relic, Elastic) — navigating APM, building log queries, reading infrastructure dashboards, interpreting SLO burn rates, and configuring monitors and alerts.
Proficiency in at least one scripting language: Python, Go, or Bash. You will write automation scripts, build operational tooling, and work with APIs.
Clear written and verbal communication skills in English — ability to write incident tickets, investigation notes, Slack updates, shift handoff reports, status page communications, and PIR drafts that are clear, concise, and useful to both technical and non-technical audiences.
Working knowledge of Kubernetes and cloud infrastructure (GCP preferred, AWS/Azure acceptable) — understanding of pods, deployments, services, ingress, node health, and how to investigate Kubernetes-related production issues.
Understanding of SLOs, error budgets, and burn-rate alerting — knowing what a multi-window burn-rate alert means, how error budgets deplete, and how SLO breaches translate into incident severity.
Experience with incident management tooling: JIRA or JIRA Service Management, PagerDuty or OpsGenie, Slack, and Confluence.
Experience with or strong interest in AI/ML-assisted operations: anomaly detection, alert correlation, predictive monitoring, or automated remediation.
Comfort with 24x7 shift-based operations as part of a follow-the-sun model with handoff overlaps. Weekend on-call (rotating) is required.
Experience in the gaming, payments, or fintech industry — particularly environments where transaction processing, checkout flows, or player-facing services must meet strict uptime requirements.
Familiarity with Datadog Service Catalog, synthetic monitoring, and RUM; exposure to database operations (MySQL, PostgreSQL, Redis, Kafka); and experience with CI/CD pipelines and deployment tooling (GitLab CI, ArgoCD, Helm).
JIRA Service Management administration experience (workflows, automation rules, SLA timers) or ITIL Foundation certification — practical experience matters more than credentials.
The duties and responsibilities of this position may evolve over time to support the organization's goals and individual growth. This job description is intended to outline the general nature and level of work being performed and is not intended to be an exhaustive list of all duties, responsibilities, and qualifications required. By submitting your application, you consent to Xsolla conducting background checks, where permitted by law, after the final interview stage. All checks will comply with local regulations, and your information will be handled confidentially. Xsolla takes your privacy seriously and will not sell or externally distribute any personal data received during the hiring process. In accordance with applicable data protection laws, Xsolla is committed to protecting your personal information and respecting your privacy.
For any inquiries related to data privacy, please contact: [email protected]
For more vacancies, see the Xsolla Careers page.
Xsolla is a global commerce company that empowers game developers by providing tools and services to tackle the complexities of the video game industry. Catering to both indie and AAA developers, Xsolla partners with them to enhance funding, distribution, marketing, and monetization of their games. With a mission to connect opportunities and innovate resources, Xsolla has supported over 1,500 game creators in expanding their reach and growing their businesses worldwide.