Luma AI

Software Engineer - Reliability

Posted Yesterday
In-Office or Remote
2 Locations
170K-360K Annually
Expert/Leader
As a Software Engineer in Reliability, you'll architect and manage multi-cloud GPU infrastructure, ensuring performance, security, and scale while debugging complex hardware/software issues.
The summary above was generated by AI
About Luma AI
Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

Where You Come In
We are looking for a hands-on, first-principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure.
You will build, maintain, and scale Luma’s infrastructure across on-prem and multi-vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.

What You’ll Do
  • Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next-generation infrastructure operates.
  • Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
  • Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast-moving AI startup environment.
  • Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level.
  • Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
  • Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.

Who You Are
  • 8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment.
  • Deep Linux Mastery: You possess deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance.
  • Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI. 
  • Tenacious Troubleshooter: You thrive on solving complex, low-level problems where hardware and software intersect.
  • Startup DNA: You are energetic and thrive in a less structured, fast-paced environment.
  • Security-Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
  • Expert in High-Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimize throughput for massive distributed training jobs.

What Sets You Apart (Bonus Points)
  • Deep expertise with GPU tooling for NVIDIA and AMD GPUs, such as DCGM or ROCm.
  • Experience managing large-scale GPU clusters for AI/ML workloads (training or inference).
  • Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.

Compensation
The base pay range for this role is $170,000 – $360,000 per year.

Top Skills

AMD
AWS
Bash
Go
GPU
InfiniBand
Linux
NVIDIA
OCI
Python
RDMA
