About the Role
We are looking for a talented teammate to play a key role in bridging the gap between the laboratory and production. You will be responsible for deploying AI/ML models securely and scalably into real-world environments.
We are seeking an engineer who goes beyond classic DevOps processes (CI/CD, IaC) and gets excited about managing GPU workloads, model versioning, and optimizing AI infrastructures. If you are someone who digs into logs instead of just rebooting, and refuses to accept "it works on my machine" as an excuse, we want to meet you.
Responsibilities (What You Will Do)
Infrastructure Management: Manage, secure, and optimize Linux-based servers, container engines (Docker), and orchestration (Kubernetes) environments.
MLOps Pipelines: Build and automate pipelines that handle the full lifecycle of ML models, from training to inference.
GPU Resource Management: Manage GPU clusters efficiently, configure the NVIDIA Container Toolkit, and monitor GPU utilization.
CI/CD & Automation: Design secure pipelines on Jenkins/GitLab CI/GitHub Actions adhering to the "Build Once, Deploy Anywhere" principle.
Infrastructure as Code (IaC): Manage infrastructure using Terraform, handling state management, locking, and drift detection.
Observability: Monitor the health of systems and models (Prometheus, Grafana, ELK, etc.) and establish robust alerting mechanisms.
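To make the GPU-monitoring responsibility above concrete: a common first step is scraping `nvidia-smi`'s CSV query output into structured records that can then be exported to a system like Prometheus. This is a minimal sketch, not a prescribed implementation; the helper names are illustrative, and it assumes the standard `--query-gpu`/`--format=csv,noheader,nounits` flags of `nvidia-smi`:

```python
import subprocess
from typing import NamedTuple

class GpuStat(NamedTuple):
    index: int
    utilization_pct: int
    memory_used_mib: int

def parse_gpu_stats(csv_output: str) -> list[GpuStat]:
    """Parse 'index, utilization.gpu, memory.used' CSV rows
    as emitted by nvidia-smi with --format=csv,noheader,nounits."""
    stats = []
    for line in csv_output.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(GpuStat(int(index), int(util), int(mem)))
    return stats

def query_gpu_stats() -> list[GpuStat]:
    """Shell out to nvidia-smi; requires an NVIDIA driver on the host."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_stats(out)
```

Keeping the parsing separate from the subprocess call makes the logic testable on machines without a GPU.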
Requirements
Experience: Minimum 2 years of professional experience in DevOps, System Administration, or MLOps.
Linux Mastery: Deep understanding beyond basic commands; knowledge of disk management (LVM, partitioning), process management, network troubleshooting, and kernel modules.
Container Technologies: In-depth knowledge of Docker internals (layers, volumes, networking, multi-stage builds).
CI/CD: Experience building pipelines with modern tools, with a strong discipline in Secrets Management and Artifact handling.
Troubleshooting: Ability to perform Root Cause Analysis (RCA) by reading logs and analyzing metrics rather than relying on restarts.
Version Control: Proficiency in Git workflows (Git Flow / trunk-based development).
Scripting: Ability to automate operational tasks using Python or Bash.
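The troubleshooting and scripting requirements above often meet in practice: before restarting anything, rank the errors a service is actually emitting. A small illustrative sketch in stdlib Python (the log-line shape assumed here, "timestamp LEVEL message", is hypothetical; real formats vary per service):

```python
import re
from collections import Counter

# Assumed log-line shape: "<timestamp> <LEVEL> <message>" — adjust per service.
LOG_LINE = re.compile(r"^\S+\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")

def top_errors(lines, n=3):
    """Return the n most frequent ERROR messages — a first RCA step
    that points at a root cause instead of masking it with a restart."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("level") == "ERROR":
            counts[m.group("message")] += 1
    return counts.most_common(n)
```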
Preferred Qualifications (Nice-to-Haves)
GPU Workloads: Practical experience with NVIDIA Docker, CUDA drivers, and GPU scheduling on Kubernetes.
MLOps Tools: Familiarity with at least one MLOps tool such as MLflow, Kubeflow, DVC, or Airflow.
Terraform: Experience with modular infrastructure design and remote backend management.
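On the MLOps-tooling bullet above: the core idea behind DVC-style model versioning is content-addressing, where identical artifact bytes always map to the same version id. A stdlib-only sketch of that idea (the function name is illustrative and this is not DVC's actual API):

```python
import hashlib

def artifact_version(path: str, chunk_size: int = 1 << 20) -> str:
    """Derive a version id from a model artifact's bytes (content
    addressing): same bytes -> same id, any byte change -> new id."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so multi-GB model files don't need to fit in RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:12]
```

Storing artifacts under such ids (rather than mutable names like `model-latest`) is what makes rollbacks and reproducible deployments cheap.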
Over 12 years, we have helped the ecosystem grow by training roughly 5,000 engineers and researchers as information and communication technology professionals, and we deliver global projects. Together, we are coding the future!