NVIDIA

Principal Site Reliability Engineer, AI Infrastructure

Sorry, this job was removed at 08:08 p.m. (PST) on Wednesday, Aug 13, 2025
In-Office or Remote
6 Locations

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you! NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for over 30 years. It’s an outstanding legacy of innovation that’s fueled by phenomenal technology and exceptional people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and exceptional talent. As an NVIDIAN, you’ll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.
 

What You Will Be Doing:

  • Architect, lead, and scale globally distributed production systems supporting AI/ML, HPC, and critical engineering platforms across hybrid and multi-cloud environments.

  • Design and lead implementation of automation frameworks that reduce manual tasks, promote resilience, and uphold standard methodologies for system health, change safety, and release velocity.

  • Define and evolve platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing approaches for sophisticated distributed systems.

  • Lead cross-organizational efforts to assess operational maturity, address systemic risks, and establish long-term reliability strategies in collaboration with engineering, infrastructure, and product teams.

  • Pioneer initiatives that influence NVIDIA’s AI platform roadmap, participating in co-development efforts with internal partners and external vendors, and staying ahead of academic and industry advances.

  • Publish technical insights (papers, patents, whitepapers) and drive innovation in production engineering and system design.

  • Lead and mentor global teams in a technical capacity, participating in recruitment, design reviews, and developing standard methodologies in incident response, observability, and system architecture.

What We Need to See:

  • 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure, with a strong track record of leading platform-scale efforts and high-impact programs.

  • Deep expertise in Linux/Unix systems engineering and public/private cloud platforms (AWS, GCP, Azure, OCI).

  • Expert-level programming in Python and one or more languages such as C++, Go or Rust.

  • Demonstrated experience with Kubernetes at scale, CPU/GPU scheduling, microservice orchestration, and container lifecycle management in production.

  • Hands-on expertise in observability frameworks (Prometheus, Grafana, ELK, Loki, etc.) and Infrastructure as Code (Terraform, CDK, Pulumi).

  • Proficiency in Site Reliability Engineering concepts like error budgets, SLOs, distributed tracing, and architectural fault tolerance.

  • Ability to influence multi-functional collaborators and drive technical decisions through effective written and verbal communication.

  • Proven track record of delivering long-term, forward-looking platform strategies.

  • Degree in Computer Science or a related field, or equivalent experience.

Ways to Stand Out from the Crowd:

  • Hands-on experience building platforms for large-scale AI training, inferencing, and data movement pipelines.

  • Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) and orchestration frameworks (e.g., Ray, Kubeflow).

  • Expertise in hardware fleet observability, predictive failure analysis, and power/resource-aware scheduling.

  • Experience leading operational readiness efforts and reliability engineering in GPU-heavy environments.

  • Track record of driving cultural improvements in incident management, root cause analysis, and postmortem processes across large teams.

Join us and build the infrastructure that powers the world’s most advanced AI. Apply now and make your mark at NVIDIA! Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 425,500 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until August 3, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
