NVIDIA

Director, Site Reliability and Software Engineering - DGX Cloud

Reposted 6 Days Ago
In-Office or Remote
Hiring Remotely in Santa Clara, CA
320K-575K Annually
Senior level
The role involves managing the DGX Cloud team's software and operations, leading projects, driving strategy, and fostering team development in a distributed and scalable environment.
The summary above was generated by AI

NVIDIA's invention of the GPU ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company”. We are looking to grow our company and our teams with the smartest people in the world. We are looking for you.

NVIDIA's GPUs are a hit in the market for deep learning, which is used in the research community and in industry to help solve many big data problems such as computer vision, speech recognition and translation, life science, image recognition, and natural language processing. NVIDIA GPU Cloud (NGC) is a GPU-accelerated platform that runs everywhere. Data scientists and researchers can now rapidly build, train, and deploy neural network models to address some of the most complicated AI challenges. In this environment, the NVIDIA GPU Cloud computing team is looking for leaders to work on a world-class deep learning platform.

What you'll be doing:

As a Site Reliability and Software Engineering leader in the DGXC Cloud Reliability organization, you will manage the software, automation, and operations of the multi-colo distributed NVIDIA GPU cloud clusters and contribute to product strategy. You will lead all aspects of cluster automation and operational excellence planning, and grow your team. You thrive in a fast-paced, iterative engineering environment and have experience delivering scalable distributed systems. Most importantly, you will have a track record of earning the respect of past teams and cross-functional partners as both a technical leader and a manager, and you will be able to work through influence rather than direct authority when needed. The NVIDIA GPU Cloud Computing team works with customers across the entire company, and the ability to work across multiple levels of technical and organizational leadership is critical. Operating with scale and speed, our world-class software engineers are just getting started, and as a leader, you will guide the way in solving reliability for both our internally critical and our externally visible systems.

  • Manage a team of Software and Site Reliability engineers, including program development, task planning, and code reviews.

  • Define team strategy and roadmap, and drive adoption of scalable SDLC practices, test infrastructure, and modern practices in NVIDIA’s DGX Cloud Computing environment.

  • Drive technical projects and provide leadership in an innovative and fast-paced environment.

  • Be responsible for the overall planning, tracking, and success of technical projects.

  • Work closely with project and product management teams to ensure best-in-class product development.

  • Contribute technically to projects for DGX Cloud Computing Services.

  • Interact with key internal stakeholders to provide operational and financial clarity on technical spend.

  • Drive decision-making, visibility, and operational rigor across business analytics initiatives such as budget, project, and portfolio reporting. Lead efforts related to executive reporting, dashboards, and operational CTO metrics, focusing on continuous improvement to maximize decision-making and executive visibility.

What we need to see:

  • 12+ years of overall experience in engineering management. 5+ years of leadership.

  • Bachelor's or Master's degree in Computer Science, or equivalent experience.

  • Experience in designing and implementing large-scale distributed systems. Experience with containers, virtualization environments, and cluster solutions. Experience in managing Technical Support / DevOps teams. Ability to set an appropriate bar for technical excellence and deliver projects on tight deadlines.

  • Strong knowledge of Unix/Linux.

  • Experience implementing tools, processes, internal instrumentation, and methodologies, and resolving blockers.

  • Demonstrated people management and leadership skills, with a proven track record of mentoring and coaching team members.

  • Ability to quickly learn and evaluate new technologies.

  • Ability to influence and establish relationships with other software and IT functional groups such as development, server, storage, and security teams.

We have some of the most forward-thinking and hardworking people in the world working for us and, due to unprecedented growth, our exclusive engineering teams are rapidly growing. If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you!

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 320,000 USD - 488,750 USD for Level 5, and 384,000 USD - 575,000 USD for Level 6.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 9, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

