Manager, DevOps
TL;DR
Lead a DevOps team that keeps AppZen's AWS infrastructure and Kubernetes platform robust, drives infrastructure-as-code improvements with Terraform, and collaborates closely with Product Engineering.
Manage, coach, and grow a team of 3-6 DevOps and platform engineers; own hiring, performance, growth plans, and 1:1s.
Set quarterly priorities aligned to engineering and business goals; communicate progress and risk clearly to leadership.
Build a healthy on-call culture: balanced rotations, blameless postmortems, and continuous reduction of toil.
Own the architecture, cost, and reliability of AppZen's AWS footprint across multiple regions and accounts.
Drive infrastructure-as-code standards using Terraform; champion modular, reviewable, version-controlled infrastructure.
Partner with Security and Compliance on SOC 2, ISO 27001, GDPR, and customer audit requirements; harden IAM, network, and secrets management.
Manage cloud spend: visibility, forecasting, and ongoing optimization (Savings Plans, rightsizing, multi-tenant efficiency).
Hands-on ownership of PostgreSQL in production: schema reviews, index and query tuning, vacuum/bloat management, replication, failover, point-in-time recovery, and major-version upgrades (RDS / Aurora).
Run and scale Elasticsearch / OpenSearch clusters: shard and index design, JVM and heap tuning, snapshot strategy, hot-warm tiers, and incident response under heavy ingest or query load.
Operate supporting datastores such as Redis (caching, queues), Kafka or SQS/SNS (streaming and async), and S3-backed data lakes; define patterns for high availability, durability, and disaster recovery.
Partner with engineering on capacity planning, performance benchmarking, data tier cost optimization, backup/restore drills, and customer data isolation for multi-tenant workloads.
Operate and improve our EKS-based Kubernetes platform: cluster lifecycle, autoscaling, multi-tenancy, and workload isolation.
Define golden paths for service teams using Helm, Kustomize, and GitOps tooling such as ArgoCD or Flux.
Set patterns for service mesh, ingress, and zero-downtime deployments.
Lead the design of internal developer platform capabilities so product teams can ship safely and quickly without infra friction.
Maintain and improve build, test, and deploy pipelines (e.g., GitHub Actions, Jenkins, ArgoCD); enforce supply-chain security and artifact provenance.
Drive measurable improvements in DORA metrics: lead time, deploy frequency, change failure rate, and MTTR.
Own the observability stack (e.g., Datadog, Prometheus, Grafana, OpenTelemetry); ensure consistent metrics, logs, and traces across services.
Define and operationalize SLOs and error budgets in partnership with service owners.
Lead incident command for high-severity events and convert learnings into durable systemic fixes.
8+ years of experience in DevOps, SRE, infrastructure, or platform engineering, with at least 2 years leading or managing engineers (formal or tech-lead capacity).
Deep, hands-on AWS experience across compute, networking, IAM, data, and observability services; comfortable designing for multi-account, multi-region SaaS.
Strong production experience with Kubernetes (preferably EKS), including upgrades, autoscaling, and securing multi-tenant clusters.
Demonstrated hands-on operations experience with PostgreSQL at scale (query and index tuning, replication, HA/failover, backups, and version upgrades) and with Elasticsearch / OpenSearch (cluster sizing, shard strategy, ingest tuning, and incident response).
Working knowledge of additional datastores commonly used in SaaS: Redis, Kafka or other message brokers, and object storage; comfortable evaluating tradeoffs between managed services (RDS, Aurora, ElastiCache, MSK, OpenSearch Service) and self-managed options.
Proficient with Terraform and modern IaC patterns; clear opinions on module design, state management, and PR-driven workflows.
Solid scripting and automation skills in at least one of Python, Go, or Bash.
Track record of designing and operating CI/CD pipelines at scale (GitHub Actions, Jenkins, ArgoCD, or similar).
Experience running production workloads under SOC 2 or comparable compliance frameworks; comfortable partnering with Security on audits and remediation.
Excellent communication and stakeholder skills; able to translate infrastructure tradeoffs into language product, finance, and customer teams understand.
Experience supporting AI/ML or data-heavy SaaS workloads (GPU fleets, vector stores, large async pipelines).
Familiarity with service mesh (Istio, Linkerd) and progressive delivery (Argo Rollouts, feature flags).
Background scaling FinOps practices and managing cloud spend at $5M+ annual run-rate.
Experience operating multi-tenant SaaS with strict data isolation requirements for enterprise finance customers.
Exposure to multi-cloud or hybrid-cloud environments (Azure, GCP).
AppZen builds autonomous spend-to-pay software that leverages artificial intelligence to help organizations efficiently understand and manage their enterprise spending. Targeting global enterprises, including a third of the Fortune 500, AppZen streamlines financial processes by integrating with existing systems, enabling smarter decision-making and reducing fraud and waste.
- Founded: 2012
- Employees: 201-500
- Industry: Internet Software & Services
- Total raised: $100M