Qode
Sr. Site Reliability Engineer
TLDR
This role involves designing SLI/SLO-driven monitoring, incident prediction, and AI-enhanced operations for complex financial services platforms.
Role: Sr. Site Reliability Engineer (SRE) – Unified Observability & AIOps
Location: Austin, TX / Fort Mill, SC (Hybrid)
Job Type: Full Time
Role Summary
We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.
Key Responsibilities
Observability & Reliability Engineering
- Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
- Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
- Build actionable dashboards for operations, engineering, and leadership
- Implement alerting strategies using static and dynamic thresholds
Proactive Detection & AIOps
- Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
- Transition monitoring from reactive alerts to proactive insights
- Implement noise reduction, alert correlation, and root cause analysis
- Apply baseline modeling, seasonality detection, and anomaly scoring
Distributed Systems & Dependency Analysis
- Monitor and troubleshoot multi-service architectures involving:
  - Microservices
  - Downstream APIs
  - Kafka / streaming platforms
  - Cloud infrastructure (Terraform, IaC)
- Identify whether issues originate from:
  - Upstream/downstream dependencies
  - Streaming platform
  - Infrastructure
  - Application code
Tooling & Platforms
- Deep hands-on experience with Dynatrace (mandatory)
- Experience with:
  - OpenTelemetry
  - Prometheus / Grafana
  - ELK / EFK
  - Cloud-native monitoring (AWS/Azure/GCP)
- Strong skills in JSON-based telemetry manipulation and enrichment
GenAI & LLM Enablement
- Apply GenAI / LLMs for:
  - Incident summarization
  - Root cause explanation
  - Runbook recommendations
  - Auto-remediation suggestions
- Collaborate with platform teams to operationalize GenAI safely
Required Skills & Experience
✅ 15+ years in SRE / Production Engineering
✅ Strong Unified Observability background (not infra-only)
✅ Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
✅ SLI/SLO engineering experience in production systems
✅ Experience implementing dynamic thresholds and anomaly detection
✅ Knowledge of AI/ML concepts applied to Ops (AIOps)
✅ Distributed systems troubleshooting expertise
✅ Experience with Kafka or streaming data platforms
Differentiators (Highly Valued)
- Experience in financial services or regulated environments
- Proven reduction of alert noise and MTTR using AIOps
- GenAI / LLM integration into operations workflows
Qode is a technology-driven platform that transforms how recruiters and candidates connect by leveraging data and automation. Our solutions streamline the hiring process through machine learning, creating private talent pools and automating workflows, ultimately enhancing the quality of candidate evaluation and decision-making. With our no-code tools, we empower organizations to develop tailored recruitment strategies without needing extensive technical skills.
Industry: Internet Software & Services