Sr Linux System Administrator


You are an expert Linux systems operator who keeps fleets of servers healthy, secure, and performant at scale. At fal, you will be responsible for the bare-metal and OS-level foundation that our entire GPU cloud runs on. From provisioning and imaging thousands of GPU nodes to kernel tuning, storage management, and security hardening, you will ensure every machine in our fleet is production-ready and running at peak efficiency. You are deeply comfortable in a terminal, you think in terms of uptime and automation, and you take pride in infrastructure that just works.

 

Key Responsibilities

  • Own the full lifecycle of our bare-metal GPU server fleet: provisioning, imaging, configuration management, patching, and decommissioning across multiple data centers and providers.
  • Build and maintain our server automation stack using Ansible, Terraform, and custom tooling to manage OS configuration, kernel parameters, driver versions, and firmware updates at scale.
  • Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes). A minimal sketch of this kind of baseline check follows this list.
  • Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage.
  • Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation.
  • Own system observability: deploy and maintain node-level metrics collection, log aggregation, and alerting using Prometheus, node_exporter, Loki, and Grafana.
  • Collaborate with the Compute platform team to ensure smooth integration between our infrastructure layer (K8s, Nomad, FluxCD) and the underlying Linux hosts.
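
To give a concrete flavor of the kernel-tuning and configuration-drift work described above, here is a minimal Python sketch that compares a node's live sysctl values (read from /proc/sys) against a desired baseline. The parameter names and target values are illustrative assumptions, not fal's actual tuning profile; in practice this kind of check would normally be enforced through Ansible rather than run as an ad hoc script.

```python
#!/usr/bin/env python3
"""Minimal sketch: report kernel parameters that drift from a desired baseline.

Illustrative only: the keys and values below are assumptions, not a real profile.
"""
from pathlib import Path

# Hypothetical baseline: sysctl key (slash form) -> expected value as a string.
BASELINE = {
    "vm/swappiness": "10",
    "vm/nr_hugepages": "2048",
    "net/core/rmem_max": "268435456",
}


def read_sysctl(key: str) -> str | None:
    """Read a sysctl value from /proc/sys, returning None if the key is absent."""
    path = Path("/proc/sys") / key
    try:
        return path.read_text().strip()
    except OSError:
        return None


def drift_report(baseline: dict[str, str]) -> list[str]:
    """Return one human-readable line for every parameter that deviates."""
    drift = []
    for key, want in baseline.items():
        have = read_sysctl(key)
        if have != want:
            drift.append(f"{key.replace('/', '.')}: want {want}, have {have}")
    return drift


if __name__ == "__main__":
    for line in drift_report(BASELINE):
        print(line)
```

In a real fleet, output like this would feed node-level alerting or be remediated automatically by the configuration management layer rather than read by hand.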

Requirements

  • 8+ years of experience administering Linux systems at scale, ideally in GPU cloud, HPC, or large bare-metal environments.
  • Deep expertise in Linux internals: systemd, kernel tuning (sysctl, cgroups, namespaces), boot process, package management, and performance profiling (perf, bpftrace, sar).
  • Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init, PXE/iPXE, and custom imaging pipelines.
  • Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning.
  • Familiarity with the NVIDIA GPU software stack: drivers, CUDA toolkit, nvidia-smi, MIG, and container runtimes (nvidia-container-toolkit); a small nvidia-smi query example follows this list.
  • Proficiency in Python and Bash scripting for automation, monitoring, and fleet management tooling.
  • Excellent communication and a self-starter mindset—you take ownership and constantly seek improvement.
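
As an illustration of the Python scripting and NVIDIA-stack familiarity listed above, the sketch below snapshots per-GPU temperature and utilization on a single node using nvidia-smi's CSV query interface. The query fields are standard nvidia-smi options, but the 85 C alert threshold and the reporting format are assumptions made for the example.

```python
#!/usr/bin/env python3
"""Minimal sketch: snapshot GPU health on one node via nvidia-smi.

The CSV query interface is real; the threshold and output format are illustrative.
"""
import subprocess

QUERY_FIELDS = "index,name,temperature.gpu,utilization.gpu,memory.used"


def gpu_snapshot() -> list[dict[str, str]]:
    """Return one dict per GPU, parsed from nvidia-smi CSV output."""
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    keys = QUERY_FIELDS.split(",")
    return [dict(zip(keys, (v.strip() for v in line.split(","))))
            for line in out.strip().splitlines()]


if __name__ == "__main__":
    for gpu in gpu_snapshot():
        hot = int(gpu["temperature.gpu"]) >= 85  # assumed alert threshold
        flag = " <-- check cooling" if hot else ""
        print(f"GPU {gpu['index']} ({gpu['name']}): "
              f"{gpu['temperature.gpu']}C, {gpu['utilization.gpu']}% util{flag}")
```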

Nice to Have

  • Experience operating Kubernetes on bare metal (kubeadm, Kubespray) and managing GPU scheduling in K8s (device plugins, MIG slicing).
  • Hands-on experience with BMC/IPMI/Redfish for out-of-band server management and firmware lifecycle automation (a short Redfish query sketch follows this list).
  • Familiarity with fleet-scale observability: Prometheus federation, Thanos, or Victoria Metrics for multi-cluster monitoring.
  • Contributions to open-source infrastructure tooling or Linux distributions.
  • Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001).
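
As a sketch of the out-of-band management work mentioned above, the following Python snippet walks a BMC's Redfish /redfish/v1/Systems collection and reports each system's model and power state. The collection layout and the PowerState/Model properties follow the Redfish standard, but the BMC address, credentials, and the use of the requests library are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Minimal sketch: read power state and model for a server over Redfish.

/redfish/v1/Systems is a standard Redfish endpoint; address and credentials
below are hypothetical.
"""
import requests

BMC = "https://10.0.0.42"     # hypothetical BMC address
AUTH = ("admin", "changeme")  # hypothetical credentials


def list_systems() -> list[dict]:
    """Return basic details for every system the BMC exposes."""
    session = requests.Session()
    session.auth = AUTH
    session.verify = False  # many BMCs ship self-signed certificates
    root = session.get(f"{BMC}/redfish/v1/Systems", timeout=10).json()
    systems = []
    for member in root.get("Members", []):
        info = session.get(f"{BMC}{member['@odata.id']}", timeout=10).json()
        systems.append({
            "id": info.get("Id"),
            "model": info.get("Model"),
            "power": info.get("PowerState"),
        })
    return systems


if __name__ == "__main__":
    for s in list_systems():
        print(f"{s['id']}: {s['model']} is {s['power']}")
```

At fleet scale this kind of query would fan out across an inventory of BMCs and feed firmware and power-state dashboards rather than run against a single host.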

What we offer at fal

  • Interesting and challenging work
  • Competitive salary and equity
  • Ample learning and growth opportunities
  • Visa sponsorship and relocation support to San Francisco
  • Health, dental, and vision insurance (US)
  • Regular team events and offsites

Location

  • Remote


About fal

In the modern era, content is shifting from being human-made and algorithm-distributed to being generated on demand, personalized in real time for every audience, context, and moment. We're fal, and we're building the infrastructure powering this transformation. Our platform is the first of its kind: a generative media stack for developers that enables real-time, AI-generated content across image, video, and audio.

At the core is our serverless Python runtime, purpose-built to run massive ML models across thousands of GPUs with unmatched speed and efficiency. Applications built on fal already serve millions of users, and we're just getting started. Founded in 2021, we're scaling fast and backed by top investors including a16z, Bessemer, and Kindred. If you're an ambitious builder who wants to define the future of AI and media, we'd love to meet you.
