Serve as the technical owner of Tempest, ensuring reliability and alignment with evolving infrastructure needs while driving performance improvements and resource management across systems.
Who we are
At Domino, we build software that helps the largest AI-driven organizations build and operate advanced data science and AI solutions at scale. Our platform integrates a streamlined model development environment, MLOps capabilities, and novel features for collaboration, reuse, and reproducibility, all of which make data science teams more productive, reduce time to value, and ensure compliance. Our customers, including Johnson & Johnson, GSK, Bristol Myers, UBS, FINRA and the US Navy, use our software to solve some of the most important challenges in the world, such as developing new medicines, securing our financial markets, or protecting our country. Backed by Sequoia Capital, Coatue Management, NVIDIA, Snowflake and other leading investors, we have been in business for a decade but are still a small team operating with the spirit of a startup. Especially in the world of AI today, we believe that the future is still being invented, and we want to be the ones building it. For more information, visit www.domino.ai.
What we are building
The Automation Team at Domino acts as a force multiplier for engineering, building the tools and systems that enable teams to ship code confidently and consistently. A core part of this mission is Tempest, an in-house platform that orchestrates realistic, long-duration workloads against live Kubernetes clusters and validates the results against real observability data. Today, when scale testing surfaces a bottleneck, a resource misconfiguration, or a regression in system behavior, the team can identify and report the issue — but we need someone who can take the next step: profiling services, tracing root causes through Prometheus and New Relic data, and partnering with platform engineers to drive durable fixes. Focused on iteration and continuous improvement, the team looks for targeted enhancements that create outsized impact, and this role will close the gap between detection and resolution at the infrastructure level.
What your impact will be
In your first year, you will:
What we look for in this role
What we value
#LI-Remote
The annual US base salary range for this role is listed below. For sales roles, the range provided is the role's On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role. This salary range will be narrowed during the interview process based on a number of factors, including the candidate's experience, qualifications, and location. Additional benefits for this role may include: equity, company bonus or sales commissions/bonuses; 401(k) plan; medical, dental, and vision benefits; and wellness stipends.