NVIDIA is on a journey to build the best cloud offering for AI workloads and to bring its latest GPU technology to our clients as a set of managed services under the DGX Cloud umbrella. We want to innovate on behalf of our clients and provide an easy, no-hassle way of using the latest and greatest NVIDIA products through scalable, managed, self-service APIs.
We are looking for a Principal HPC / Slurm engineer to drive the technical design and development of a new set of high-performing cloud services for artificial intelligence and high-performance computing. This is a unique opportunity to be a founding member of a team building at the intersection of highly scalable, fault-tolerant cloud services and AI. We are looking for an engineer who has a deep understanding of large-scale distributed cloud services, multi-tenant architectures, and serverless compute. HPC experience is a plus.
What you'll be doing:
Design and architect a set of new AI-oriented compute services for large training workloads
Build the distributed computing infrastructure and training services for large-scale distributed model training
Plan and coordinate across multi-functional teams, partners and vendors for execution of infrastructure build-outs
Work with engineering teams across all of NVIDIA to ensure their requirements are correctly translated into infrastructure needs
What we need to see:
Solid technical foundation in distributed computing and storage, including substantial experience with all of the following: server systems, storage, I/O, networking, and system software
Bachelor's degree or equivalent experience
12+ years of system software engineering experience on large-scale production systems
12+ years of experience architecting high-performance computing infrastructure at scale
Proven experience in high-performance computing, deep learning, and/or GPU-accelerated computing domains
Ability to understand and communicate complex designs, distributed infrastructure, and requirements to peers, customers, and vendors
General shared-storage knowledge, such as NFS, Lustre, GlusterFS, etc.
Familiarity with system-level architecture, such as interconnects, memory hierarchy, interrupts, and memory-mapped I/O
Ways to stand out from the crowd:
Experience with large-scale distributed systems, HPC, and ML training using Slurm and Kubernetes
Deep knowledge of both the software and hardware sides of HPC and ML infrastructure
NVIDIA is leading the way in groundbreaking developments in artificial intelligence, high-performance computing, and visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science-fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for great people like you to help us accelerate the next wave of artificial intelligence.
The base salary range is 272,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.