Fluidstack

Director, SRE

Posted 23 Days Ago

In-Office

4 Locations

Senior level

The Director of SRE will build and lead the Site Reliability Engineering team, focusing on ensuring maximum performance of GPU infrastructure through automation, monitoring, incident management, and effective customer support.

About Fluidstack

We build and operate high-performance GPU clusters so the most ambitious teams can move fast, stay focused, and scale without friction. Our clusters power top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.

Our team is highly motivated and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale, but to win repeat business and customer referrals.

We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us.

You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.

About the Role

The Director, SRE will build our Site Reliability Engineering team from scratch, creating a team responsible for guaranteeing the maximum availability and performance of our GPU infrastructure.

This role involves building reliability into our Slurm and Kubernetes platforms from the ground up. You will work directly with customers on a daily basis to support workload installation, monitoring, and debugging.

Key responsibilities include implementing systems to detect and drain broken nodes across Fluidstack-operated infrastructure. You will collaborate closely with the Infrastructure team to develop provisioning and configuration automation using Infrastructure as Code and DevOps best practices.

Focus

Build comprehensive monitoring with active and passive health checks
Define SLIs and SLOs for our managed Slurm + Kubernetes clusters
Create actionable alerts that wake people up only when necessary
Write runbooks that anyone can follow at 3am
Implement Infrastructure as Code for all cluster deployments
Prepare disaster recovery plans
Reduce toil through aggressive automation
Design and implement incident management processes
Drive postmortems that prevent repeat failures
Mentor engineers on SRE principles and practices
Implement and improve CI/CD processes

About You

5+ years of SRE experience, including exposure to architecture and design
You've scaled infrastructure at a fast-growing company
You have experience with GPU workloads and HPC environments
You've managed Kubernetes or Slurm clusters in production
You write code to solve operational problems
You think in systems, not individual servers
You've automated yourself out of repetitive tasks
You can debug complex distributed systems under pressure
You've worked directly with demanding enterprise customers
You measure everything and make data-driven decisions
You've been on-call and improved the experience for others
You can explain complex systems simply

Nice to haves

Multi-region or multi-cloud deployments
Contributions to open source infrastructure tools
Familiarity with high-throughput network topologies for storage backplanes (e.g., RoCE, RDMA, InfiniBand)
Excitement about working with cutting-edge AI training and inference hardware and networks
Experience with bare metal automation

Benefits

Competitive total compensation package (cash + equity)
Retirement or pension plan, in line with local norms
Health, dental, and vision insurance
Generous PTO policy, in line with local norms

Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

Top Skills

CI/CD

GPU

Infrastructure as Code

Kubernetes

Slurm
