Senior ML Ops / LLM Ops Engineer
TLDR
Drive the technical execution of the ML Ops pipeline, ensuring data governance and reliability through innovative ingestion and evaluation systems.
Overview
This role focuses on building and operating the ML Ops / LLM Ops pipeline that closes the loop: ingest production signal, redact it, store it, slice it, classify it, surface the failures, mine new eval cases, and alert on regressions. You drive the toolchain decisions, the data-governance posture, and the day-to-day reliability of the pipeline itself. The Head of AI sets vision and priorities; you own the technical execution end-to-end.
What will you do?
- Design and build a source-agnostic ingestion pipeline for production ML / LLM traffic
- Design storage tiering based on automotive and company requirements, policy-driven retention windows, and privacy requirements
- Build slicing dashboards and the query path engineers use to debug production at 11 p.m.
- Enable autoraters and lightweight LLM classifiers across production traffic
- Build the rule-based triage layer for obvious failures
- Stand up the eval-mining workflow and wire regression alerts to model and prompt deploys
- Implement PII redaction at the ingestion boundary and safety / abuse classification on inbound content
- Define the dashboard architecture, wipeout mechanisms, and tool and hosting selection, and operate the pipeline end-to-end
What are we looking for?
Must Have
- Proven experience building and operating data or ML platform systems in production, covering ingest, schema, storage, access control, alerts, and on-call
- Hands-on experience building and running ML / LLM evaluation systems in production (offline regression sets, online autoraters, LLM-as-judge pipelines, golden datasets)
- Hands-on experience with LLM tracing and observability tooling
- Experience shipping PII redaction or comparable data-handling controls in a regulated or multi-tenant environment, with a pragmatic approach to data governance
- Strong understanding of how ML and LLM-based systems fail in production: hallucination, retrieval failures, agent loops that don’t terminate, ASR / TTS degradation, and prompt or model regressions across deploys
- Production-level Python proficiency; this is a hands-on engineering role, not an advisory one. Comfortable leveraging AI in everything you build
Nice to Have
- Multi-tenant or white-label SaaS experience with per-tenant data isolation
- Azure experience and the ability to weigh the tradeoffs when making self-host vs. managed SaaS calls
- Experience with autorater methodology and contamination defenses
- Knowledge of vector databases, embedding-based clustering, or unsupervised failure-mode discovery
- Experience with data-versioning tooling (LakeFS, DVC, Delta Lake)
- GDPR / right-to-erasure work
- Embedded, automotive, or another constrained environment context
- Working knowledge of a language beyond English sufficient to validate non-English failure modes
- Prior experience with cloud platforms (Microsoft Azure and AWS)
- Prior experience with Claude Code
- Prior experience with GitHub
- Languages: Python primary, SQL, and some TypeScript for dashboards
- LLM APIs: Claude (Anthropic), OpenAI, open-source models as needed
- Android / AAOS ecosystem as client platforms
What can you expect from us?
- A permanent job contract for a long-term project;
- Tech equipment + SIM Card + personal smartphone;
- Health and Life Insurance;
- Social events and team buildings;
- A commitment to letting you grow with us and be rewarded accordingly;
- A dynamic and young team that will always be there to support you;
- Training in the latest technologies;
- Coffee, fruit, snacks, and a warm welcome when you pass by the office.
Caixa Mágica Software develops advanced software solutions tailored for the automotive industry, including embedded systems and Android Automotive applications, enabling seamless in-vehicle experiences. Serving both consumers and businesses, the company also modernizes existing SAP systems and provides a diverse range of IT and business consulting services.