Joining NVIDIA's DGX Cloud Lepton Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing the efficiency and resiliency of AI workloads and on developing scalable AI and data infrastructure tools and services. Our objective is to deliver a stable, scalable environment that gives AI researchers the resources and scale they need to innovate. DGX Cloud Lepton delivers NVIDIA-managed GPU/Kubernetes capacity for AI workloads.
As a Senior System Engineer, you’ll own the Lepton platform’s reliability and ensure security is a first-class part of day-to-day operations. You’ll have the autonomy to drive meaningful projects with strong mentorship and support. We practice blameless postmortems, iterate continuously, and encourage thoughtful risk-taking. If you’re looking for an impactful, rewarding role, we invite you to apply.
What you’ll be doing:
Platform fundamentals: design, build, and operate core services and node/cluster foundations for the Lepton platform; automate deployments, upgrades, and day-2 operations.
Vulnerability & patch management: own intake, prioritization, rollout, and rollback rhythms across OS, drivers/firmware, and platform components for the Lepton product.
Security as a product quality: define, deliver, and maintain secure-by-default baselines (host hardening, workload isolation, network segmentation, least-privilege access) for AI infrastructure at scale.
Identity & access stewardship: standardize patterns for service identity, role scoping, secrets handling, and certificate hygiene.
Trusted releases: drive change control and release practices that ensure traceability and integrity of what runs in production.
Monitoring & incident practice: establish health signals and SLOs; lead investigations, root-cause analyses, and follow-through actions that improve both reliability and security.
Risk & readiness: partner with product, SRE, and security stakeholders to assess risks for new features and close gaps with pragmatic controls.
Documentation & mentorship: publish runbooks and standards; review designs and coach engineers on secure operational practices.
What we need to see:
7+ years in systems/platform engineering, operating large-scale production environments.
Demonstrated ability to deliver secure, reliable platforms (hardening, access control, isolation, monitoring, and strong operational runbooks).
Experience with containerized/managed cluster environments; familiarity with GPU-accelerated platforms or the ability to ramp quickly.
Automation mindset with infrastructure-as-code and CI/CD; disciplined change management.
Clear communication and documentation skills; ability to turn requirements into practical, supportable designs.
Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
Ways to stand out from the crowd:
Hands-on engineering experience delivering and driving platform security baselines in multi-tenant environments.
Production Kubernetes experience (EKS/AKS/GKE) at a fundamental level, especially private clusters and PSA restricted defaults.
Supply-chain basics at scale: signed images (cosign) enforced via policy-as-code (Kyverno/OPA).
Familiarity with NVIDIA GPU platforms (GPU Operator/device plugin, MIG-aware operations).
You will also be eligible for equity and benefits.