AppZen is the leader in autonomous spend-to-pay software. Its patented artificial intelligence accurately and efficiently processes information from thousands of data sources so that organizations can better understand enterprise spend at scale to make smarter business decisions. It seamlessly integrates with existing accounts payable, expense, and card workflows to read, understand, and make real-time decisions based on your unique spend profile, leading to faster processing times and fewer instances of fraud or wasteful spend. Global enterprises, including one-third of the Fortune 500, use AppZen’s invoice, expense, and card transaction solutions to replace manual finance processes and accelerate the speed and agility of their businesses. To learn more, visit us at www.appzen.com.
About the Role:
We are seeking a highly skilled Senior DevOps Engineer to lead the design, implementation, and continuous improvement of our cloud infrastructure, Kubernetes, CI/CD pipelines, observability systems, and reliability practices. This role is critical in ensuring platform stability, scalability, security, and operational excellence across production and non-production environments. You will work closely with Engineering, Security, and Product teams to build resilient, automated, and high-performing infrastructure systems.
Key Responsibilities:
Infrastructure & Cloud Engineering: Design, implement, and manage scalable cloud infrastructure (AWS preferred)
Lead infrastructure-as-code initiatives (Terraform / CloudFormation)
Improve high availability, disaster recovery, and multi-region resilience
Optimize cloud cost and resource utilization
Kubernetes & Container Platform: Architect and manage production-grade Kubernetes clusters
Improve cluster reliability, auto-scaling, and performance
Implement workload monitoring, alerting, and SLO-based reliability standards
Enforce namespace isolation and resource governance
CI/CD & Automation: Design and optimize CI/CD pipelines (Jenkins, ArgoCD)
Implement zero-downtime deployment strategies
Automate environment provisioning (fully touchless builds with seed data)
Improve deployment reliability and rollback mechanisms
Observability & Reliability: Own monitoring, alerting, and logging strategy (Prometheus, Grafana, Datadog, etc.)
Ensure 100% monitoring coverage for critical services
Reduce Sev1/Sev2 incidents caused by infrastructure
Create and maintain runbooks (COPs) for incident response
Define SLOs, SLIs, and error budgets
Security & Compliance: Implement IAM best practices and least privilege access
Improve secrets management and credential rotation
Partner with security team on audits and compliance controls
Incident Management: Lead root cause analysis for major incidents
Drive postmortems and preventive improvements
Improve MTTR and overall operational maturity
Required Skills & Experience:
6+ years in DevOps / SRE / Cloud Engineering
Strong experience with AWS (VPC, IAM, EC2, S3, RDS, EKS, etc.)
Deep Kubernetes experience (production clusters)
Strong understanding of networking and Linux systems
Experience with Infrastructure as Code (Terraform preferred)
Experience implementing monitoring & alerting systems (Datadog, Prometheus, Grafana)
Strong scripting skills (Python / Bash)
Experience managing production systems with high availability requirements
Good understanding of databases such as Postgres and MySQL
Strong written and verbal communication skills
Ability to follow structured processes while being proactive in identifying improvements
Analytical and problem-solving mindset
Willingness to work the night shift on a long-term basis