TO BE CONSIDERED FOR THIS ROLE, PLEASE SUBMIT AN UPDATED RESUME TRANSLATED TO ENGLISH
Why Housecall Pro?
Help us build solutions that build better lives. At Housecall Pro, we show up to work every day to make a difference for real people: the home service professionals that support America’s 100 million homes. We’re all about the Pro, and dedicate our days to helping them streamline operations, scale their businesses, and—ultimately—save time so they can be with their families and live well. We care deeply about our customers and foster a culture where our company, employees, and Pros grow and succeed together. Leadership is as focused on growing team members’ careers as they expect their teams to be on creating solutions for Pros.
🤜🤛 WHAT’S IN IT FOR YOU?
We know what you are thinking…WHAT IS THE ROLE AND WHAT WOULD YOU BE DOING? 👀
As a Staff Machine Learning Operations Engineer - DevOps/SRE, you’ll anchor operations for our LLM- and ML‑powered services running on AWS, Kubernetes, Snowflake, and Datadog, all built and governed as code with Terraform. As a staff‑level engineer, you’ll combine deep hands‑on expertise with strong communication, project leadership, and architectural judgment to raise the bar on performance, resilience, observability, and maintainability.
Our team is passionate, empathetic, hardworking, and above all else focused on improving the lives of our service professionals (our Pros). Our success is their success.
In your day to day, you will:
Day‑to‑day reliability & operations
Own SRE fundamentals for AI/ML services: define SLIs/SLOs, manage error budgets, triage incidents, lead on‑call, and drive blameless post‑mortems to durable fixes.
Operate and scale EKS‑based workloads (real‑time inference, batch jobs, data/feature pipelines), including autoscaling (HPA/cluster autoscaler/Karpenter), rollout strategies, and capacity planning.
Build and maintain proactive observability in Datadog (APM/traces, metrics, logs, synthetics, SLOs, dashboards, alert pipelines) with actionable, low‑noise alerts.
Keep environments healthy and consistent via Terraform (modular IaC, policies, drift detection), immutable builds, and standardized deployment patterns.
LLMOps & MLOps platform stewardship
Operate reliable model‑serving stacks (LLMs and traditional models), including traffic shaping/canary releases, versioning, and rollback safety.
Ensure retrieval/feature pipelines are robust and cost‑efficient end‑to‑end—data sourcing, transformation, validation, scheduling, and monitoring.
Manage data plane integrations with Snowflake (warehouses, tasks, streams, materialized views), tuning for performance/credits and enforcing governance/roles.
Instrument model pathways for latency, throughput, token/compute cost, drift, guardrails, and quality/evaluation signals; surface these in Datadog.
Performance & cost (FinOps)
Continuously reduce p95/p99 latency and variability across services and pipelines.
Optimize AWS (right‑sizing, spot/adaptive capacity, storage classes), Snowflake (warehouse sizing, auto‑suspend/resume, clustering/partitioning, caching), and Kubernetes (requests/limits hygiene, bin‑packing) for measurable savings.
Publish cost dashboards and unit‑economics (e.g., cost per 1k requests/tokens/model run) and drive roadmap items that improve both cost and performance.
Architecture, security, and delivery
Design resilient, multi‑AZ architectures with clear backup/restore, DR, and change‑management guardrails.
Strengthen least‑privilege access, secrets management, and data protections (IAM/KMS, network boundaries, Snowflake roles/shares).
Lead projects end‑to‑end: scope, plan, communicate milestones/risks, align stakeholders, and deliver reliably.
We think this role is for you if you have...
Staff‑level mastery (design + deep hands‑on) with:
AWS (EKS, EC2, VPC, IAM/IRSA, ALB/NLB, S3, KMS; comfort with scaling, networking, security boundaries).
Kubernetes (workload autoscaling, rollout strategies, Helm/GitOps patterns, capacity & cost optimization).
Terraform (modular design, environment separation, policy-as-code, drift control).
Datadog (APM/tracing, logs, metrics, synthetics, SLOs; building actionable dashboards and alert pipelines).
Snowflake (warehouse sizing and tuning, tasks/streams, performance optimization, cost/credit governance, RBAC).
Proven experience running LLM/ML production systems (model serving, data/feature pipelines, evaluation, and guardrails).
Strong communication and stakeholder management; able to lead cross‑functional projects and set architectural direction.
Track record of improving performance, resiliency, observability, and maintainability in complex, distributed systems.
Solid incident command, on‑call ownership, and post‑mortem leadership.
What will help you succeed???
✨ Let’s talk numbers! ✨
Our compensation range for this role begins at $7,500 USD per month 💵
Housecall Pro is a fintech company founded in 2013. We built a SaaS platform that helps Home Service Professionals operate their businesses. We created the application for plumbers, electricians, and other Pros in the home improvement/trades industries.
Housecall Pro is a simple, cloud-based field service management software platform aimed at helping companies keep track of jobs, monitor technician activity, and produce invoices easily.
Our core product helps our clients with scheduling, dispatching, job management, invoicing, payment processing, marketing, and more. They used to struggle with a ton of paperwork after hours. Now they can save time and manage their business in one app.
We support more than 27,000 businesses and have over 1,300 ambitious, mission-driven employees in San Diego, Denver, and all over the world (including 200+ talented and innovative Engineers).
Housecall Pro is the #1 software solution for home service businesses, empowering over 40,000 professionals with award-winning mobile software for streamlined business operations and growth.