Shape the utilities market of the future with us!
We are looking for a highly experienced
Senior DevOps Engineer to lead the installation, automation, and operational reliability of a modern
open-source data and integration platform. The platform underpins business-critical data pipelines and integrations built on technologies such as Apache Airflow, Apache NiFi, Apache Spark, Kafka, PostgreSQL, MQTT brokers, Docker, and Kubernetes.
This is a hands-on, senior individual contributor role with ownership across
infrastructure, reliability, security, automation, and operational excellence, supporting deployments on both
private and public cloud environments.
What is the role about?
Key Responsibilities
Platform Installation, Configuration & Operations
- Install, configure, upgrade, and maintain distributed open-source components, including:
  - Apache Airflow, Apache NiFi, and Apache Spark
  - Apache Kafka and its ecosystem
  - PostgreSQL
  - MQTT brokers
- Ensure platform stability, scalability, high availability, and fault tolerance.
- Perform capacity planning, performance tuning, and lifecycle management of all components.
Containerization & Orchestration
- Design, deploy, and operate containerized workloads using Docker.
- Build and manage production-grade Kubernetes clusters.
- Implement Kubernetes best practices for networking, storage, scaling, and security.
- Package and manage platform services using Helm or equivalent tooling.
Infrastructure as Code & Automation
- Design and maintain Infrastructure as Code (IaC) using Terraform for cloud and on-prem environments.
- Build configuration management and automation workflows using Ansible.
- Enable repeatable, environment-agnostic deployments across development, staging, and production.
- Automate provisioning, configuration, upgrades, scaling, and recovery processes.
Cloud, Hybrid & Private Infrastructure
- Deploy and operate workloads on public cloud platforms (AWS, Azure, GCP) and private/on-prem infrastructure.
- Design hybrid architectures with secure connectivity between environments.
- Optimize infrastructure design for resilience, performance, and cost efficiency.
Observability, Reliability & Incident Management
- Design and implement comprehensive monitoring, logging, and alerting for infrastructure and applications.
- Define, measure, and maintain SLAs, SLIs, and SLOs for critical platform services.
- Own incident response, root cause analysis, and post-incident reviews.
- Proactively identify risks, bottlenecks, and failure modes before they impact users.
Security & Secrets Management
- Implement infrastructure and platform security best practices across containers, Kubernetes, and networks.
- Manage secrets and credentials using tools such as Vault, Kubernetes Secrets, or cloud-native solutions.
- Own certificate lifecycle management, including rotation and renewal.
- Design and enforce network security controls, access policies, and zero-trust principles where applicable.
- Support compliance with internal security and governance requirements.
Backup, Disaster Recovery & Data Protection
- Design and implement automated backup strategies for Kafka, PostgreSQL, and other stateful services.
- Own disaster recovery planning and testing, including restore validation.
- Support multi-cluster or cross-region strategies where required.
- Ensure data durability, integrity, and recoverability.
Cost & Resource Optimization
- Implement infrastructure cost monitoring and visibility across environments.
- Right-size clusters, storage, and compute resources to balance performance and cost.
- Continuously optimize resource usage for cloud and hybrid deployments.
CI/CD & Release Engineering
- Build and maintain CI/CD pipelines for platform and infrastructure components.
- Enable safe deployment strategies such as rolling, blue-green, or canary deployments.
- Support Git-based workflows and infrastructure promotion across environments.
Documentation, Enablement & Collaboration
- Create and maintain operational documentation, runbooks, and architectural diagrams.
- Enable self-service capabilities for engineering teams wherever possible.
- Work closely with data engineers, backend engineers, and architects to support platform needs.
- Reduce operational friction through automation, standardization, and tooling improvements.
Required skills and qualifications
- 5+ years of hands-on experience in DevOps, Platform Engineering, or Site Reliability Engineering.
- Strong experience operating distributed, open-source systems in production.
- Proven expertise with:
  - Docker and Kubernetes
  - Terraform and Ansible
  - Linux systems and networking fundamentals
- Hands-on experience with Kafka, Spark, Airflow, NiFi, PostgreSQL, and messaging systems (including MQTT brokers).
- Experience supporting business-critical platforms with uptime and reliability requirements.
- Strong scripting skills (Bash, Python, or equivalent).
- Excellent troubleshooting and systems-level problem-solving skills.
Preferred skills and qualifications
- Experience with GitOps tools such as ArgoCD or Flux.
- Experience with observability stacks (Prometheus, Grafana, ELK/OpenSearch).
- Familiarity with service meshes, ingress controllers, and API gateways.
- Experience operating data-intensive or streaming platforms at scale.
- Prior experience in hybrid or on-prem-first environments.