Lead DevOps Engineer
TLDR
Owns cloud infrastructure, CI/CD, and observability for a Data & AI platform, building reusable IaC, Kubernetes ops, security controls, and cross-team stakeholder alignment.
- Strategy, Roadmap & Vision: build the roadmap for DevOps and observability across the Data & AI teams
- Design and build cloud infrastructure as code with Terraform (or Pulumi / CloudFormation), packaging reusable modules for AWS, Azure or GCP
- Own CI/CD pipelines in GitHub Actions, Jenkins or GitLab CI: build, test, security scanning, blue-green or canary deploys, and automated rollback
- Operate Kubernetes clusters (EKS, AKS or GKE) and container workloads with Lens, Helm, ArgoCD or Flux, including autoscaling, ingress, secrets and policy
- Build observability with Prometheus, Grafana, OpenTelemetry, ELK or Datadog: metrics, logs, traces, dashboards and SLO-driven alerting
- Implement security and compliance controls: IAM, SSO, secrets management (Vault / KMS), vulnerability scanning, policy-as-code (OPA, Checkov) and PCI-aware patterns
- Lead incident response (on-call, runbooks, blameless post-mortems) and continuous reliability work to drive down MTTR and toil
- Partner with developers on local dev experience, golden paths, internal platform tooling and developer self-service
- Help shape internal platform standards as the stack evolves, contributing to design reviews and sharing knowledge across the India and U.S. teams
- Participate in a collaborative DevOps environment, working closely with developers, AI engineers, QA, DBAs and product partners across environments
- 8+ years of professional DevOps, SRE or platform-engineering experience operating production services
- 3+ years of hands-on work building CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI or CircleCI) and managing infrastructure as code (Terraform, Pulumi or CloudFormation)
- Working knowledge of Kubernetes (EKS, AKS or GKE) and container tooling (Docker, Helm, ArgoCD or Flux)
- Strong scripting skills in Python, Bash or Go; solid SQL skills and strong comfort with at least one cloud platform (AWS, Azure or GCP)
- Hands-on experience with observability stacks: New Relic, Prometheus, Grafana, OpenTelemetry, ELK or Datadog
- Solid understanding of cloud security and compliance practices, particularly in PCI-compliant or regulated environments
- Proven ability to work independently and within a team, managing priorities across concurrent projects and time zones, including on-call rotations
- Strong written and verbal communication skills; able to work effectively with both technical and non-technical stakeholders
Bonus Skills:
- Experience operating Dataiku DSS, Snowflake, or other large-scale data and analytics platforms in production
- Experience with service meshes (Istio, Linkerd), API gateways, and zero-trust networking
- Experience with policy-as-code (OPA / Rego, Checkov, tfsec) and supply-chain security (SBOM, Sigstore)
- Experience with FinOps practices and cloud cost optimization
- Experience supporting ML or LLM workloads: GPU scheduling, model-serving infra, vector databases or LangSmith / Langfuse
- Experience with database administration / reliability for PostgreSQL, MySQL or Snowflake
- AWS / Azure / GCP Professional, CKA / CKAD, or HashiCorp Terraform certification
- Experience in loyalty, martech, adtech or a comparable data-rich B2B domain
Benefits
Health Insurance: comprehensive health coverage
Paid Time Off: flexible time off
Wellness Stipend: well-being perks that support our teammates and their dependents