Shape the utilities market of the future with us!
We are looking for a highly experienced
Senior DevOps Engineer to lead the installation, automation, and operational reliability of a modern
open-source data and integration platform. The platform underpins business-critical data pipelines and integrations built on technologies such as Apache Airflow, Apache NiFi, Apache Spark, Kafka, PostgreSQL, MQTT brokers, Docker, and Kubernetes.
This is a hands-on, senior individual contributor role with ownership across
infrastructure, reliability, security, automation, and operational excellence, supporting deployments on both
private and public cloud environments.
What is the role about?
Key Responsibilities
Platform Installation, Configuration & Operations
- Install, configure, upgrade, and maintain distributed open-source components, including:
  - Apache Airflow, Apache NiFi, and Apache Spark
  - Apache Kafka and its ecosystem
  - PostgreSQL
  - MQTT brokers
- Ensure platform stability, scalability, high availability, and fault tolerance.
- Perform capacity planning, performance tuning, and lifecycle management of all components.
Containerization & Orchestration
- Design, deploy, and operate containerized workloads using Docker.
- Build and manage production-grade Kubernetes clusters.
- Implement Kubernetes best practices for networking, storage, scaling, and security.
- Package and manage platform services using Helm or equivalent tooling.
Infrastructure as Code & Automation
- Design and maintain Infrastructure as Code (IaC) using Terraform for cloud and on-prem environments.
- Build configuration management and automation workflows using Ansible.
- Enable repeatable, environment-agnostic deployments across development, staging, and production.
- Automate provisioning, configuration, upgrades, scaling, and recovery processes.
Cloud, Hybrid & Private Infrastructure
- Deploy and operate workloads on public cloud platforms (AWS, Azure, GCP) and private/on-prem infrastructure.
- Design hybrid architectures with secure connectivity between environments.
- Optimize infrastructure design for resilience, performance, and cost efficiency.
Observability, Reliability & Incident Management
- Design and implement comprehensive monitoring, logging, and alerting for infrastructure and applications.
- Define, measure, and maintain SLAs, SLIs, and SLOs for critical platform services.
- Own incident response, root cause analysis, and post-incident reviews.
- Proactively identify risks, bottlenecks, and failure modes before they impact users.
Security & Secrets Management
- Implement infrastructure and platform security best practices across containers, Kubernetes, and networks.
- Manage secrets and credentials using tools such as Vault, Kubernetes Secrets, or cloud-native solutions.
- Own certificate lifecycle management, including rotation and renewal.
- Design and enforce network security controls, access policies, and zero-trust principles where applicable.
- Support compliance with internal security and governance requirements.
Backup, Disaster Recovery & Data Protection
- Design and implement automated backup strategies for Kafka, PostgreSQL, and other stateful services.
- Own disaster recovery planning and testing, including restore validation.
- Support multi-cluster or cross-region strategies where required.
- Ensure data durability, integrity, and recoverability.
Cost & Resource Optimization
- Implement infrastructure cost monitoring and visibility across environments.
- Right-size clusters, storage, and compute resources to balance performance and cost.
- Continuously optimize resource usage for cloud and hybrid deployments.
CI/CD & Release Engineering
- Build and maintain CI/CD pipelines for platform and infrastructure components.
- Enable safe deployment strategies such as rolling, blue-green, or canary deployments.
- Support Git-based workflows and infrastructure promotion across environments.
Documentation, Enablement & Collaboration
- Create and maintain operational documentation, runbooks, and architectural diagrams.
- Enable self-service capabilities for engineering teams wherever possible.
- Work closely with data engineers, backend engineers, and architects to support platform needs.
- Reduce operational friction through automation, standardization, and tooling improvements.
Required skills and qualifications
- 5+ years of hands-on experience in DevOps, Platform Engineering, or Site Reliability Engineering.
- Strong experience operating distributed, open-source systems in production.
- Proven expertise with:
  - Docker and Kubernetes
  - Terraform and Ansible
  - Linux systems and networking fundamentals
- Hands-on experience with Kafka, Spark, Airflow, NiFi, PostgreSQL, and messaging systems (including MQTT brokers).
- Experience supporting business-critical platforms with uptime and reliability requirements.
- Strong scripting skills (Bash, Python, or equivalent).
- Excellent troubleshooting and systems-level problem-solving skills.
Preferred skills and qualifications
- Experience with GitOps tools such as ArgoCD or Flux.
- Experience with observability stacks (Prometheus, Grafana, ELK/OpenSearch).
- Familiarity with service meshes, ingress controllers, and API gateways.
- Experience operating data-intensive or streaming platforms at scale.
- Prior experience in hybrid or on-prem-first environments.