Verda

Principal Cluster Engineer, Training Infrastructure

(EU)

Full-Time

Remote

TLDR

Drive the evolution and optimization of InfiniBand-connected GPU training infrastructure for next-gen machine learning workloads in a high-performing and collaborative environment.

Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud — with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.

We’re ambitious, curious, and gutsy doers. We practice a low hierarchy across the company and high morale in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just the job — we offer a career-defining opportunity to be part of building something big.

As a cherry on top, we’ve recently raised $64M in Series A funding and are ready to reach new heights.

About the role

We’re looking for a Principal Cluster Engineer to own and evolve our InfiniBand-connected GPU training infrastructure. This is a highly technical role focused on building and operating large-scale AI and HPC clusters that power the next generation of machine learning workloads.

You will work closely with ML researchers, cloud platform teams, datacenter operations, and procurement to ensure Verda’s GPU infrastructure is fast, reliable, and ready to support cutting-edge training workloads. In this role you will architect and operate large-scale InfiniBand fabrics, push storage and compute performance to their limits, build automation and observability tooling, and help define the technical and operational standards the team works to.

You’ll play a key role in translating customer and product requirements into real infrastructure capabilities, ensuring clusters are designed for performance, reliability, and scale.

Why Verda

Competitive cash and equity package, plus benefits (healthcare, lunch, wellbeing, etc.)
Profitable operations with rapid, sustained growth
A genuine once-in-a-lifetime opportunity to join one of Finland’s few true explosive growth stories, shaping a category-defining AI cloud from the ground up
Work alongside world-class engineers, researchers, and partners across the global AI ecosystem
A small, high-performing team of around 70 people representing 27 nationalities

Practicalities

Location: Remote - EU
Start Date: As soon as possible
Contract Type: Full-time
Working Language: English

Your responsibilities

Design, deploy, and continuously improve large-scale InfiniBand-connected GPU training clusters
Drive cluster-level storage performance, translating customer SLAs into internal throughput and IOPS targets
Build and maintain automation for cluster provisioning, OS imaging, firmware management, and day-two operations using Python
Contribute to infrastructure-as-code and CI/CD pipelines for cluster and platform management
Establish and own performance baselines across compute, network fabric, and storage layers
Identify, diagnose, and resolve performance bottlenecks across the full cluster stack
Implement and maintain observability tooling including metrics, alerting, and anomaly detection systems
Work closely with datacenter operations, cloud platform teams, ML researchers, and procurement to translate requirements into infrastructure architecture
Participate in the on-call rotation and help maintain production reliability of the training clusters

Your key competencies

7+ years of hands-on infrastructure or systems engineering experience
Experience operating large-scale HPC or AI training clusters (1000+ GPU nodes)
Strong production experience with InfiniBand fabrics
Experience working with NVIDIA GPU hardware in training workloads (Hopper or newer preferred)
Proven experience leading or tech-leading engineering teams, setting technical direction, reviewing work, and mentoring engineers
Experience with automation and scripting (Python preferred)
Experience working with infrastructure-as-code tools such as Terraform, Ansible, or Salt

Nice to have

Experience with the NVIDIA HPC software stack or UFM
Knowledge of NCCL and debugging distributed GPU training workloads
Experience tuning Linux kernels or using eBPF for performance optimization in HPC environments

Success criteria for this role in the next 6-12 months

Optimized production AI/HPC clusters with measurable improvements in reliability, performance, and job success rates
Implemented automation and tooling that significantly reduces operational overhead and speeds up incident resolution
Established strong operational practices for monitoring, alerting, capacity planning, and incident management
Built strong collaboration with datacenter operations, ML researchers, and cloud platform teams to translate workload requirements into infrastructure improvements
Mentored engineers and helped build deeper internal expertise in GPU cluster operations and performance engineering

What the process looks like

Introduction chat with the TA Partner (45 mins): Learn more about Verda and share your career aspirations.
Conversation with the CTO (30 mins): A focused discussion with our CTO to explore technical vision, infrastructure strategy, and how your experience aligns with the future of Verda’s AI platform.
Technical interview with the team (60 mins): Learn about the role and its requirements, dive deeper into your expertise, and discuss technical challenges.
Final interview (45 mins): Meet with our COO for a culture-fit conversation.

What's next

Apply sooner rather than later. This job ad will be removed when we’ve found the right person.

Please submit your application through our Careers page. We don’t accept applications sent by email.

Verda

Verda is creating a comprehensive European AI cloud platform designed for builders, researchers, and enterprises to effectively train, experiment with, and deploy AI models at scale. We power our infrastructure exclusively with renewable energy, fostering a more sustainable approach to AI development.
