Sr Linux System Administrator


You are an expert Linux systems operator who keeps fleets of servers healthy, secure, and performant at scale. At fal, you will be responsible for the bare-metal and OS-level foundation that our entire GPU cloud runs on. From provisioning and imaging thousands of GPU nodes to kernel tuning, storage management, and security hardening, you will ensure every machine in our fleet is production-ready and running at peak efficiency. You are deeply comfortable in a terminal, you think in terms of uptime and automation, and you take pride in infrastructure that just works.

 

Key Responsibilities

  • Own the full lifecycle of our bare-metal GPU server fleet: provisioning, imaging, configuration management, patching, and decommissioning across multiple data centers and providers.
  • Build and maintain our server automation stack using Ansible, Terraform, and custom tooling to manage OS configuration, kernel parameters, driver versions, and firmware updates at scale.
  • Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes). A minimal sketch of this kind of baseline check follows this list.
  • Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage.
  • Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation.
  • Own system observability: deploy and maintain node-level metrics collection, log aggregation, and alerting using Prometheus, node_exporter, Loki, and Grafana.
  • Collaborate with the Compute platform team to ensure smooth integration between our infrastructure layer (K8s, Nomad, FluxCD) and the underlying Linux hosts.
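
To give a concrete flavor of the kernel-tuning and configuration-drift work described above, here is a minimal Python sketch that compares a node's live sysctl values (read from /proc/sys) against a desired baseline. The parameter names and target values are illustrative assumptions, not fal's actual tuning profile; in practice this kind of check would normally be enforced through Ansible rather than run as an ad hoc script.

```python
#!/usr/bin/env python3
"""Minimal sketch: report kernel parameters that drift from a desired baseline.

Illustrative only: the keys and values below are assumptions, not a real profile.
"""
from pathlib import Path

# Hypothetical baseline: sysctl key (slash form) -> expected value as a string.
BASELINE = {
    "vm/swappiness": "10",
    "vm/nr_hugepages": "2048",
    "net/core/rmem_max": "268435456",
}


def read_sysctl(key: str) -> str | None:
    """Read a sysctl value from /proc/sys, returning None if the key is absent."""
    path = Path("/proc/sys") / key
    try:
        return path.read_text().strip()
    except OSError:
        return None


def drift_report(baseline: dict[str, str]) -> list[str]:
    """Return one human-readable line for every parameter that deviates."""
    drift = []
    for key, want in baseline.items():
        have = read_sysctl(key)
        if have != want:
            drift.append(f"{key.replace('/', '.')}: want {want}, have {have}")
    return drift


if __name__ == "__main__":
    for line in drift_report(BASELINE):
        print(line)
```

In a real fleet, output like this would feed node-level alerting or be remediated automatically by the configuration management layer rather than read by hand.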

Requirements

  • 8+ years of experience administering Linux systems at scale, ideally in GPU cloud, HPC, or large bare-metal environments.
  • Deep expertise in Linux internals: systemd, kernel tuning (sysctl, cgroups, namespaces), boot process, package management, and performance profiling (perf, bpftrace, sar).
  • Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init, PXE/iPXE, and custom imaging pipelines.
  • Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning.
  • Familiarity with the NVIDIA GPU software stack: drivers, CUDA toolkit, nvidia-smi, MIG, and container runtimes (nvidia-container-toolkit); a small nvidia-smi query example follows this list.
  • Proficiency in Python and Bash scripting for automation, monitoring, and fleet management tooling.
  • Excellent communication and a self-starter mindset—you take ownership and constantly seek improvement.
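
As an illustration of the Python scripting and NVIDIA-stack familiarity listed above, the sketch below snapshots per-GPU temperature and utilization on a single node using nvidia-smi's CSV query interface. The query fields are standard nvidia-smi options, but the 85 C alert threshold and the reporting format are assumptions made for the example.

```python
#!/usr/bin/env python3
"""Minimal sketch: snapshot GPU health on one node via nvidia-smi.

The CSV query interface is real; the threshold and output format are illustrative.
"""
import subprocess

QUERY_FIELDS = "index,name,temperature.gpu,utilization.gpu,memory.used"


def gpu_snapshot() -> list[dict[str, str]]:
    """Return one dict per GPU, parsed from nvidia-smi CSV output."""
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    keys = QUERY_FIELDS.split(",")
    return [dict(zip(keys, (v.strip() for v in line.split(","))))
            for line in out.strip().splitlines()]


if __name__ == "__main__":
    for gpu in gpu_snapshot():
        hot = int(gpu["temperature.gpu"]) >= 85  # assumed alert threshold
        flag = " <-- check cooling" if hot else ""
        print(f"GPU {gpu['index']} ({gpu['name']}): "
              f"{gpu['temperature.gpu']}C, {gpu['utilization.gpu']}% util{flag}")
```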

Nice to Have

  • Experience operating Kubernetes on bare metal (kubeadm, Kubespray) and managing GPU scheduling in K8s (device plugins, MIG slicing).
  • Hands-on experience with BMC/IPMI/Redfish for out-of-band server management and firmware lifecycle automation (a short Redfish query sketch follows this list).
  • Familiarity with fleet-scale observability: Prometheus federation, Thanos, or Victoria Metrics for multi-cluster monitoring.
  • Contributions to open-source infrastructure tooling or Linux distributions.
  • Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001).
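
As a sketch of the out-of-band management work mentioned above, the following Python snippet walks a BMC's Redfish /redfish/v1/Systems collection and reports each system's model and power state. The collection layout and the PowerState/Model properties follow the Redfish standard, but the BMC address, credentials, and the use of the requests library are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Minimal sketch: read power state and model for a server over Redfish.

/redfish/v1/Systems is a standard Redfish endpoint; address and credentials
below are hypothetical.
"""
import requests

BMC = "https://10.0.0.42"     # hypothetical BMC address
AUTH = ("admin", "changeme")  # hypothetical credentials


def list_systems() -> list[dict]:
    """Return basic details for every system the BMC exposes."""
    session = requests.Session()
    session.auth = AUTH
    session.verify = False  # many BMCs ship self-signed certificates
    root = session.get(f"{BMC}/redfish/v1/Systems", timeout=10).json()
    systems = []
    for member in root.get("Members", []):
        info = session.get(f"{BMC}{member['@odata.id']}", timeout=10).json()
        systems.append({
            "id": info.get("Id"),
            "model": info.get("Model"),
            "power": info.get("PowerState"),
        })
    return systems


if __name__ == "__main__":
    for s in list_systems():
        print(f"{s['id']}: {s['model']} is {s['power']}")
```

At fleet scale this kind of query would fan out across an inventory of BMCs and feed firmware and power-state dashboards rather than run against a single host.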

What we offer at fal

  • Interesting and challenging work
  • Competitive salary and equity
  • Ample learning and growth opportunities
  • Visa sponsorship and relocation support to San Francisco
  • Health, dental, and vision insurance (US)
  • Regular team events and offsites

Location

  • Remote


About fal

In the modern era, content is shifting from being human-made and algorithm-distributed to being generated on demand, personalized in real time for every audience, context, and moment. We're fal, and we're building the infrastructure powering this transformation. Our platform is the first of its kind: a generative media stack for developers that enables real-time, AI-generated content across image, video, and audio.

At the core is our serverless Python runtime, purpose-built to run massive ML models across thousands of GPUs with unmatched speed and efficiency. Applications built on fal already serve millions of users, and we're just getting started. Founded in 2021, we're scaling fast and backed by top investors including a16z, Bessemer, and Kindred. If you're an ambitious builder who wants to define the future of AI and media, we'd love to meet you.
