Senior Machine Learning Engineer

TLDR

Build and operate core systems for large-scale machine learning training and inference on TensorWave's GPU platform, enhancing performance and operational efficiency.

Our mission at TensorWave Cloud is to build seamless, secure, reliable, and resilient AI infrastructure at scale, eliminating barriers and challenging the status quo to empower builders and support AI innovation.

About the role

We are seeking a Senior Machine Learning Engineer to build and operate the core systems that power large-scale ML training and inference across TensorWave’s GPU platform.

This role spans workload orchestration, cluster operations, performance optimization, and developer enablement for production ML workloads.

Responsibilities

  • Design, operate, and improve ML infrastructure systems supporting distributed training and inference workloads

  • Build reliable, repeatable workload execution and orchestration patterns across shared GPU environments

  • Troubleshoot performance, reliability, and scalability issues across the ML stack

  • Partner with ML, systems, and platform teams to improve developer experience and operational efficiency

Required Experience

  • Bachelor of Science in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience

  • Expertise supporting production ML systems using SLURM and Kubernetes

  • Strong understanding of GPU-accelerated workloads and distributed systems concepts

  • Solid Linux fundamentals and experience debugging infrastructure-level issues

  • Ability to build automation and tooling in languages such as Python or Go

Preferred Experience

  • Experience working across schedulers, orchestration platforms, or cluster managers

  • Familiarity with large-scale GPU environments or HPC-style systems

  • Experience improving infrastructure reliability, utilization, or performance at scale

What We Bring

  • Mission-driven company

  • Competitive Salary

  • Stock Options

  • 100% paid Medical, Dental, and Vision insurance

  • Flexible PTO

  • Paid Holidays

  • 401(k)

  • Paid Parental Leave

  • Flexible Spending Account

  • Short Term Disability Insurance

  • Life and Voluntary Supplemental Insurance

  • Mental Health Benefits through Spring Health

We’re looking for resilient, adaptable people to join our team: people who believe in the mission and think at massive scale. Solutions that worked on a handful of devices will not work at exascale. Be prepared to be pushed daily, to learn a lot, and to build the future.

TensorWave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, national origin, or veteran status.


TensorWave delivers a high-performance cloud computing platform that leverages AMD Instinct™ GPUs to supercharge AI research and advanced workloads. Tailored for developers and researchers in the AI space, our platform removes infrastructure hurdles, enabling innovators to focus on pushing the boundaries of technology.
