Senior Site Reliability Engineer (SRE)

As a Senior SRE at Salla, you will lead reliability initiatives, handle complex incidents, improve platform performance, and guide engineering teams toward building resilient systems. You will also participate in the on-call rotation as part of our commitment to platform reliability.

Responsibilities

Reliability & Incident Management

  • Lead high-severity incident response and drive post-incident reviews.
  • Troubleshoot complex issues across applications, infrastructure, and networks.
  • Improve MTTR through better monitoring, alerts, and diagnostic tooling.
  • Participate in the on-call rotation supporting production systems.

Performance & Scalability

  • Identify and resolve performance bottlenecks and scaling challenges.
  • Conduct load testing and capacity planning for high-traffic scenarios.

Infrastructure & Operations

  • Enhance cloud-native infrastructure, deployment processes, and automation.
  • Improve resilience, fault-tolerance, and recovery mechanisms across systems.

Observability

  • Build and refine dashboards, alerts, metrics, logs, and traces.
  • Define SLIs/SLOs and improve visibility into system behavior.

Tooling & Automation

  • Develop tools that reduce operational toil and increase reliability.
  • Contribute to infrastructure-as-code, CI/CD pipelines, and GitOps workflows.

Collaboration

  • Work closely with engineering teams to ensure services are robust and production-ready.
  • Mentor engineers on reliability, debugging, and operational best practices.

Required Skills

  • Strong experience with Kubernetes, service mesh technologies, and cloud platforms (AWS/GCP/Azure).
  • Deep understanding of Linux, networking, distributed systems, and load balancers.
  • Hands-on experience with Terraform or similar IaC tools.
  • Experience with Prometheus, Grafana, Loki, Mimir, Elastic, or similar observability tools.
  • Proficiency in scripting/programming (Bash, Python, Go).
  • Experience with CI/CD and GitOps.
  • Strong debugging, incident response, and performance analysis skills.

Bonus Skills

  • Background in large-scale, high-traffic systems.
  • Experience with fault-tolerant design, DR, and HA patterns.
  • Familiarity with SLOs, SLIs, and error budgets.

Location Preference

  • Candidates located in time zones from GMT+0 to GMT+6 are preferred, to align with team collaboration and on-call coverage.

We have made e-commerce easy for you: you can now create an online store in just a few minutes, with no commission on sales!
