MLOps and Platform Engineer (AI Platform Reliability)

Software Engineer (SDE-2) – DevOps, SRE & MLOps Platform Engineering
Location: Bengaluru
Employment Type: Full-time
Team: Platform Engineering / Reliability

About Blue Machines

Blue Machines powers large-scale, real-time Voice AI platforms and Agentic Workflows for global enterprises across BFSI, Healthcare, HRTech and customer experience domains.
Built and scaled from India, our platform has processed 14.5M+ minutes of production-grade AI agent conversations, operating latency-sensitive, always-on voice systems across geographies.

About the Role

We are hiring a hands-on DevOps / SRE engineer who will own platform reliability, observability and automation, and grow into MLOps and AI platform engineering.
This role focuses on designing, operating and evolving the infrastructure behind real-time Voice AI systems. You will work directly on production systems at global scale, driving uptime, performance and resilience.

Key Responsibilities

Platform Reliability & SRE

  • Own 99.9%+ platform uptime for real-time Voice AI workloads.
  • Participate in on-call rotations, incident response and post-incident reviews.
  • Lead root cause analysis (RCA) and drive permanent reliability improvements.
  • Design and implement self-healing systems using automation, retries, circuit breakers and failover strategies.

Kubernetes & Cloud Infrastructure

  • Design, operate and scale Kubernetes clusters in public cloud environments.
  • Work with managed Kubernetes platforms such as GKE and apply cloud-native best practices.
  • Implement auto-scaling strategies (HPA, VPA, node pools, GPU workloads).
  • Manage infrastructure using Infrastructure as Code (Terraform).
  • Optimize infrastructure for performance, reliability and cost efficiency.

Observability & Incident Intelligence

  • Build and maintain monitoring, logging and alerting systems using Prometheus, Grafana, Loki and OpenTelemetry.
  • Define SLIs, SLOs and error budgets for platform and AI workloads.
  • Drive signal-based alerting to reduce noise and improve response quality.
  • Implement anomaly detection and predictive alerting for infrastructure and AI pipelines.

CI/CD & Platform Automation

  • Design and maintain CI/CD pipelines for services and infrastructure.
  • Build internal automation tooling for:
    • Progressive and canary deployments
    • Auto-scaling and capacity planning
    • Faster incident diagnosis and recovery
  • Enable self-service DevOps workflows for engineering teams.

MLOps & AI Platform Reliability

  • Own reliability and performance of STT, TTS and LLM inference pipelines.
  • Design provider routing, failover and SLA enforcement mechanisms.
  • Deploy, version and roll back AI models and inference services.
  • Monitor inference latency, quality and drift in production systems.
  • Operate GPU-backed inference workloads where applicable.

Security, Compliance & Resilience

  • Enforce DevSecOps practices across build and deploy pipelines.
  • Implement network policies, encryption, secrets management and access controls.
  • Drive disaster recovery, backup strategies and resilience testing.
  • Contribute to SOC2 / ISO compliance and audits.

Collaboration & Engineering Excellence

  • Partner with backend, AI and platform teams on architecture and reliability.
  • Influence system design through a reliability-first mindset.
  • Mentor junior engineers and raise the overall bar for operational excellence.

Qualifications

Must-Have

  • 3–6 years of experience in DevOps, SRE or Platform Engineering roles.
  • Strong hands-on experience with Kubernetes and Docker in production environments.
  • Familiarity with public cloud platforms and managed Kubernetes services (such as GKE).
  • Strong understanding of distributed systems and production debugging.
  • Hands-on experience with observability systems.
  • Proficiency with Infrastructure as Code (Terraform).
  • Strong incident ownership and communication skills.

Good-to-Have

  • Experience with MLOps or AI inference platforms.
  • Familiarity with LLM pipelines, real-time streaming or telephony systems.
  • Experience operating GPU workloads.
  • Knowledge of AIOps, anomaly detection or intelligent alerting.
  • Cloud cost optimization experience.

Why Blue Machines

  • Build global-scale AI infrastructure from India.
  • Operate real-time Voice AI systems with 14.5M+ minutes in production.
  • Work on low-latency, high-reliability platforms.
  • Grow from DevOps/SRE into MLOps and AI platform engineering.
  • High ownership, deep technical impact and real production scale.

Founded in 2019, the Apna mobile app is India’s largest professional networking platform, dedicated to helping India’s burgeoning working class unlock unique professional networking and skilling opportunities. The app is currently live in 14 cities: Mumbai, Delhi-NCR, Bengaluru, Hyderabad, Pune, Ahmedabad, Jaipur, Ranchi, Kolkata, Surat, Lucknow, Kanpur, Ludhiana, and Chandigarh. Having raised $90+ million from marquee investors such as Insight Partners, Tiger Global, Lightspeed India, Sequoia Capital, Rocketship.vc and Greenoaks Capital, Apna is on a mission to enable livelihoods for billions in India. With over 10 million users, a presence in 14 cities and counting, and more than 100,000 employers that trust the platform, Apna gives India a new destination to discover relevant opportunities.
