Build and operate core systems for large-scale machine learning training and inference on TensorWave's GPU platform, enhancing performance and operational efficiency.
Our mission at TensorWave Cloud is to build seamless, secure, reliable, and resilient AI infrastructure at scale, eliminating barriers and challenging the status quo to empower builders and support AI innovation.
About the role
We are seeking a Senior Machine Learning Engineer to build and operate the core systems that power large-scale ML training and inference across TensorWave’s GPU platform.
This role spans workload orchestration, cluster operations, performance optimization, and developer enablement for production ML workloads.
Responsibilities
Design, operate, and improve ML infrastructure systems supporting distributed training and inference workloads
Build reliable, repeatable workload execution and orchestration patterns across shared GPU environments
Troubleshoot performance, reliability, and scalability issues across the ML stack
Partner with ML, systems, and platform teams to improve developer experience and operational efficiency
Required Experience
Bachelor of Science in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience
Expertise supporting production ML systems using SLURM and Kubernetes
Strong understanding of GPU-accelerated workloads and distributed systems concepts
Solid Linux fundamentals and experience debugging infrastructure-level issues
Ability to build automation and tooling in Python, Go, or a similar language
Preferred Experience
Experience working across schedulers, orchestration platforms, or cluster managers
Familiarity with large-scale GPU environments or HPC-style systems
Experience improving infrastructure reliability, utilization, or performance at scale
What We Bring
Mission-driven company
Competitive Salary
Stock Options
100% paid Medical, Dental, and Vision insurance
Flexible PTO
Paid Holidays
401(k)
Parental Leave
Flexible Spending Account
Short-Term Disability Insurance
Life and Voluntary Supplemental Insurance
Mental Health Benefits through Spring Health
We’re looking for resilient, adaptable people to join our team: people who believe in the mission and think at massive scale. The solutions that worked on a handful of devices will not work at exascale. Be prepared to be pushed daily, to learn a lot, and to build the future.
TensorWave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, national origin, or veteran status.
TensorWave delivers a high-performance cloud computing platform that leverages AMD Instinct™ GPUs to supercharge AI research and advanced workloads. Tailored for developers and researchers in the AI space, our platform removes infrastructure hurdles, enabling innovators to focus on pushing the boundaries of technology.