Responsible for the lifecycle management of GPU servers, including provisioning, automation, security hardening, and performance tuning for AI workloads.
You are an expert Linux systems operator who keeps fleets of servers healthy, secure, and performant at scale. At fal, you will be responsible for the bare-metal and OS-level foundation that our entire GPU cloud runs on. From provisioning and imaging thousands of GPU nodes to kernel tuning, storage management, and security hardening, you will ensure every machine in our fleet is production-ready and running at peak efficiency. You are deeply comfortable in a terminal, you think in terms of uptime and automation, and you take pride in infrastructure that just works.
Key Responsibilities
- Own the full lifecycle of our bare-metal GPU server fleet: provisioning, imaging, configuration management, patching, and decommissioning across multiple data centers and providers.
- Build and maintain our server automation stack using Ansible, Terraform, and custom tooling to manage OS configuration, kernel parameters, driver versions, and firmware updates at scale.
- Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes).
- Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage.
- Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation.
- Own system observability: deploy and maintain node-level metrics collection, log aggregation, and alerting using Prometheus, node_exporter, Loki, and Grafana.
- Collaborate with the Compute platform team to ensure smooth integration between our infrastructure layer (K8s, Nomad, FluxCD) and the underlying Linux hosts.
- 8+ years of experience administering Linux systems at scale, ideally in GPU cloud, HPC, or large bare-metal environments.
- Deep expertise in Linux internals: systemd, kernel tuning (sysctl, cgroups, namespaces), boot process, package management, and performance profiling (perf, bpftrace, sar).
- Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init, PXE/iPXE, and custom imaging pipelines.
- Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning.
- Familiarity with the NVIDIA GPU software stack: drivers, CUDA toolkit, nvidia-smi, MIG, and container runtimes (nvidia-container-toolkit).
- Proficiency in Python and Bash scripting for automation, monitoring, and fleet management tooling.
- Excellent communication and a self-starter mindset—you take ownership and constantly seek improvement.
- Experience operating Kubernetes on bare metal (kubeadm, Kubespray) and managing GPU scheduling in K8s (device plugins, MIG slicing).
- Hands-on experience with BMC/IPMI/Redfish for out-of-band server management and firmware lifecycle automation.
- Familiarity with fleet-scale observability: Prometheus federation, Thanos, or VictoriaMetrics for multi-cluster monitoring.
- Contributions to open-source infrastructure tooling or Linux distributions.
- Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001).
- Interesting and challenging work
- Competitive salary and equity
- A lot of learning and growth opportunities
- We offer visa sponsorship and will help you relocate to San Francisco.
- Health, dental, and vision insurance (US)
- Regular team events and offsites
- Remote
Top Skills
Ansible
AppArmor
Bash
CUDA
GPU
Grafana
Kubernetes
Linux
NFS
NVMe
Prometheus
Python
RAID
SELinux
Terraform