Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud — with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.
We’re ambitious, curious, and gutsy doers. We keep hierarchy low across the company and morale high in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just a job: a career-defining opportunity to be part of building something big.
As a cherry on top, we’ve recently raised $64M in Series A funding and are ready to reach new heights.
About the role
We’re looking for a Principal Cluster Engineer to own and evolve our InfiniBand-connected GPU training infrastructure. This is a highly technical role focused on building and operating large-scale AI and HPC clusters that power the next generation of machine learning workloads.
You will work closely with ML researchers, cloud platform teams, datacenter operations, and procurement to ensure Verda’s GPU infrastructure is fast, reliable, and ready to support cutting-edge training workloads. In this role you will architect and operate large-scale InfiniBand fabrics, push storage and compute performance to their limits, build automation and observability tooling, and help define the technical and operational standards the team works to.
You’ll play a key role in translating customer and product requirements into real infrastructure capabilities, ensuring clusters are designed for performance, reliability, and scale.
Why Verda
- Competitive cash and equity package, plus benefits (healthcare, lunch, wellbeing, etc.)
- Profitable operations with rapid, sustained growth
- A genuine once-in-a-lifetime opportunity to join one of Finland’s few true explosive growth stories, shaping a category-defining AI cloud from the ground up
- Work alongside world-class engineers, researchers, and partners across the global AI ecosystem
- A small, high-performing team of around 70 people representing 27 nationalities
Practicalities
- Location: Remote - EU
- Start Date: As soon as possible
- Contract Type: Full-time
- Working Language: English
Your responsibilities
- Design, deploy, and continuously improve large-scale InfiniBand-connected GPU training clusters
- Drive cluster-level storage performance, translating customer SLAs into internal throughput and IOPS performance targets
- Build and maintain automation for cluster provisioning, OS imaging, firmware management, and day-two operations using Python
- Contribute to infrastructure-as-code and CI/CD pipelines for cluster and platform management
- Establish and own performance baselines across compute, network fabric, and storage layers
- Identify, diagnose, and resolve performance bottlenecks across the full cluster stack
- Implement and maintain observability tooling including metrics, alerting, and anomaly detection systems
- Work closely with datacenter operations, cloud platform teams, ML researchers, and procurement to translate requirements into infrastructure architecture
- Participate in the on-call rotation and help maintain production reliability of the training clusters
Your key competencies
- 7+ years of hands-on infrastructure or systems engineering experience
- Experience operating large-scale HPC or AI training clusters (1000+ GPU nodes)
- Strong production experience with InfiniBand fabrics
- Experience working with NVIDIA GPU hardware in training workloads (Hopper or newer preferred)
- Proven experience leading or tech-leading engineering teams, setting technical direction, reviewing work, and mentoring engineers
- Experience with automation and scripting (Python preferred)
- Experience working with infrastructure-as-code tools such as Terraform, Ansible, or Salt
Nice to have
- Experience with the NVIDIA HPC software stack or UFM
- Knowledge of NCCL and debugging distributed GPU training workloads
- Experience tuning Linux kernels or using eBPF for performance optimization in HPC environments
Success criteria for this role in the next 6-12 months
- Optimized production AI/HPC clusters with measurable improvements in reliability, performance, and job success rates
- Implemented automation and tooling that significantly reduces operational overhead and speeds up incident resolution
- Established strong operational practices for monitoring, alerting, capacity planning, and incident management
- Built strong collaboration with datacenter operations, ML researchers, and cloud platform teams to translate workload requirements into infrastructure improvements
- Mentored engineers and helped build deeper internal expertise in GPU cluster operations and performance engineering
What the process looks like
- Introduction chat with the TA Partner (45 mins): Learn more about Verda and share your career aspirations.
- Conversation with the CTO (30 mins): A focused discussion with our CTO to explore technical vision, infrastructure strategy, and how your experience aligns with the future of Verda’s AI platform.
- Technical interview with the team (60 mins): Learn about the role and its requirements, dive deeper into your expertise, and discuss technical challenges.
- Final interview (45 mins): Meet with our COO for a culture-fit conversation.
What's next
Apply sooner rather than later. This job ad will be removed when we’ve found the right person.
Please submit your application through our Careers page. We don’t accept applications sent by email.