Site Reliability Engineer

TLDR

Play a critical role in designing resilient infrastructure and automating deployment systems with modern tooling while collaborating closely with engineering teams.

Our mission at TensorWave Cloud is to build seamless, secure, reliable, and resilient AI infrastructure at scale, eliminating barriers and challenging the status quo to empower builders and support AI innovation.

About the role

We are seeking a Site Reliability Engineer with a strong background in software engineering to build and maintain highly scalable, secure, and resilient infrastructure.

You’ll play a critical role in designing low-level systems, automating infrastructure with modern tooling, and ensuring platform reliability.

This role is ideal for someone who’s comfortable working at the intersection of systems programming and DevOps: writing code in Go, JavaScript, Rust, C, or Zig while also managing infrastructure with NixOS, Kubernetes, and Terraform.

Responsibilities

  • Design, build, and maintain infrastructure systems using Linux and NixOS

  • Manage infrastructure-as-code with Terraform to provision and scale resources

  • Architect and operate Kubernetes clusters with a focus on performance, security, and automation

  • Write high-performance tooling and internal utilities in Go or Rust

  • Develop and maintain CI/CD pipelines for infrastructure and code deployments

  • Monitor system performance, resolve issues, and improve reliability through observability tooling

  • Collaborate closely with engineering teams to support deployment strategies and development workflows

Required Experience

  • Bachelor of Science in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience

  • 5+ years in DevOps, Site Reliability, or Infrastructure Engineering roles

  • Proficiency in at least one low-level language (Rust or Go)

  • Deep experience with Linux systems and configuration management

  • Hands-on experience with Terraform, Kubernetes, and containerized environments

  • Strong understanding of systems programming, performance tuning, and operating system internals

  • Familiarity with CI/CD practices and infrastructure monitoring/alerting tools

What We Bring

  • Mission-driven company

  • Competitive Salary

  • Stock Options

  • 100% paid Medical, Dental, and Vision insurance

  • Flexible PTO

  • Paid Holidays

  • 401(k)

  • Parental Leave

  • Flexible Spending Account

  • Short-Term Disability Insurance

  • Life and Voluntary Supplemental Insurance

  • Mental Health Benefits through Spring Health

We’re looking for resilient, adaptable people to join our team: people who believe in the mission and think at massive scale. The solutions that worked on a handful of devices will not work at exascale. Be prepared to be pushed daily, to learn a lot, and to literally build the future.

TensorWave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, national origin, or veteran status.

TensorWave delivers a high-performance cloud computing platform that leverages AMD Instinct™ GPUs to supercharge AI research and advanced workloads. Tailored for developers and researchers in the AI space, our platform removes infrastructure hurdles, enabling innovators to focus on pushing the boundaries of technology.
