Andromeda (andromeda.ai)

Site Reliability Engineer - AI Infrastructure

Reposted 7 Days Ago
In-Office or Remote
Hiring Remotely in United States
Senior level
The Site Reliability Engineer will provision and manage Kubernetes clusters, build automation tools, debug customer issues, and improve infrastructure reliability.
The summary above was generated by AI

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.

We began with a single managed cluster — but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, much as the world's financial markets move capital.

We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

What You’ll Do
  • Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.

  • Build automation and tooling to streamline cluster deployments and integrations.

  • Debug customer issues across networking, storage, scheduling, and system layers.

  • Improve reliability and scalability of both training and inference infrastructure.

  • Design and implement monitoring, alerting, and observability for critical systems.

  • Collaborate with engineering and product teams to plan and deliver infrastructure for new services.

  • Participate in on-call and incident response, leading postmortems and reliability improvements.

What We’re Looking For

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.

  • Strong Linux systems and networking fundamentals.

  • Deep experience with Kubernetes and container orchestration at scale.

  • Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).

  • Strong automation and scripting skills (Python, Go, or Bash).

  • Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).

  • Track record of operating production systems and leading incident response.

Nice to Have
  • Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.).

  • Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).

  • Customer-facing support or consulting experience.

Why You’ll Love It Here

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.
