Software Engineer (SDE-2) – DevOps, SRE & MLOps Platform Engineering
Location: Bengaluru
Employment Type: Full-time
Team: Platform Engineering / Reliability

About Blue Machines

Blue Machines powers large-scale, real-time Voice AI platforms and Agentic Workflows for global enterprises across BFSI, Healthcare, HRTech and customer experience domains.
Built and scaled from India, our platform has processed 14.5M+ minutes of production-grade AI agent conversations, operating latency-sensitive, always-on voice systems across geographies.

About the Role

We are hiring a hands-on DevOps / SRE engineer who will own platform reliability, observability and automation, and grow into MLOps and AI platform engineering.
This role focuses on designing, operating and evolving the infrastructure behind real-time Voice AI systems. You work directly on production systems at global scale, driving uptime, performance and resilience.

Key Responsibilities

Platform Reliability & SRE

Own 99.9%+ platform uptime for real-time Voice AI workloads.
Participate in on-call rotations, incident response and post-incident reviews.
Lead root cause analysis (RCA) and drive permanent reliability improvements.
Design and implement self-healing systems using automation, retries, circuit breakers and failover strategies.

Kubernetes & Cloud Infrastructure

Design, operate and scale Kubernetes clusters in public cloud environments.
Work with managed Kubernetes platforms such as GKE and apply cloud-native best practices.
Implement auto-scaling strategies (HPA, VPA, node pools, GPU workloads).
Manage infrastructure using Infrastructure as Code (Terraform).
Optimize infrastructure for performance, reliability and cost efficiency.

Observability & Incident Intelligence

Build and maintain monitoring, logging and alerting systems using Prometheus, Grafana, Loki and OpenTelemetry.
Define SLIs, SLOs and error budgets for platform and AI workloads.
Drive signal-based alerting to reduce noise and improve response quality.
Implement anomaly detection and predictive alerting for infrastructure and AI pipelines.

CI/CD & Platform Automation

Design and maintain CI/CD pipelines for services and infrastructure.
Build internal automation tooling for:

Progressive and canary deployments
Auto-scaling and capacity planning
Faster incident diagnosis and recovery

Enable self-service DevOps workflows for engineering teams.

MLOps & AI Platform Reliability

Own reliability and performance of STT, TTS and LLM inference pipelines.
Design provider routing, failover and SLA enforcement mechanisms.
Deploy, version and roll back AI models and inference services.
Monitor inference latency, quality and drift in production systems.
Operate GPU-backed inference workloads where applicable.

Security, Compliance & Resilience

Enforce DevSecOps practices across build and deploy pipelines.
Implement network policies, encryption, secrets management and access controls.
Drive disaster recovery, backup strategies and resilience testing.
Contribute to SOC2 / ISO compliance and audits.

Collaboration & Engineering Excellence

Partner with backend, AI and platform teams on architecture and reliability.
Influence system design through a reliability-first mindset.
Mentor junior engineers and raise the overall bar for operational excellence.

Qualifications

Must-Have

3–6 years of experience in DevOps, SRE or Platform Engineering roles.
Strong hands-on experience with Kubernetes and Docker in production environments.
Familiarity with public cloud platforms and managed Kubernetes services (such as GKE).
Strong understanding of distributed systems and production debugging.
Hands-on experience with observability systems.
Proficiency with Infrastructure as Code (Terraform).
Strong incident ownership and communication skills.

Good-to-Have

Experience with MLOps or AI inference platforms.
Familiarity with LLM pipelines, real-time streaming or telephony systems.
Experience operating GPU workloads.
Knowledge of AIOps, anomaly detection or intelligent alerting.
Cloud cost optimization experience.

Why Blue Machines

Build global-scale AI infrastructure from India.
Operate real-time Voice AI systems with 14.5M+ minutes in production.
Work on low-latency, high-reliability platforms.
Grow from DevOps/SRE into MLOps and AI platform engineering.
High ownership, deep technical impact and real production scale.
