Senior Technical Program Manager
As a Senior Technical Program Manager with a passion for data-driven operations, you will lead the DGX Cloud Fleet Health reporting program — delivering real-time, actionable insights on the availability and reliability of our GPU fleet. A core focus of this role is advancing Mean-Time-Between-Interruption (MTBI): understanding the root causes of fleet interruptions, surfacing patterns in the data, and driving cross-functional programs to measurably extend fleet uptime.
You will partner closely with Capacity Operations, Infrastructure, SRE, and Engineering teams to translate complex fleet signals into decisions that directly improve customer experience. Join us in making a significant impact on the world's most powerful AI infrastructure.
What You’ll Be Doing:
Define and own the metrics framework for measuring fleet health, reliability, and MTBI across a diverse and rapidly scaling GPU fleet.
Lead hands-on data investigations (querying telemetry, correlating failure signals, and building statistical models) to identify the root causes of interruptions and quantify their impact.
Own and drive execution of cross-functional MTBI improvement programs end-to-end, from translating analytical findings into a prioritized roadmap to holding teams accountable to milestones and delivering measurable reliability gains.
Build and maintain dashboards, automated anomaly detection, and alerting frameworks that surface gaps in fleet health reporting in real time.
Anticipate and close reporting gaps with new cloud providers and hardware platforms by working closely with Infrastructure bring-up teams.
Communicate complex data findings and program status clearly to senior leadership, turning raw signals into crisp narratives and recommendations.
What We Need to See:
8+ years of Technical Program Management experience, with at least 3 years in infrastructure, platform, or reliability-focused domains.
Strong hands-on data analytics skills: comfortable writing SQL, working with large telemetry datasets, and building dashboards (Grafana, Superset, Databricks, or equivalent).
Demonstrated ability to define and operationalize reliability metrics (MTBI, MTTR, availability SLAs) and drive engineering teams toward measurable improvements.
Proven ability to lead deep-dive investigations across ambiguous, multi-system problems and translate findings into long-term solutions.
Excellent executive communication skills: able to distill complex technical findings into clear, decision-ready narratives for senior leadership.
MS in EE, CS, or equivalent experience.
Ways to Stand Out from the Crowd:
Familiarity with NVIDIA GPU architectures and DGX/HGX infrastructure.
Experience with Databricks, Apache Spark, or other large-scale data processing platforms.
Hands-on experience with Grafana, Superset, or similar observability/BI tooling.
Background in cloud-native infrastructure, Kubernetes, or large-scale distributed systems.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 258,750 USD for Level 4, and 200,000 USD - 322,000 USD for Level 5. You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until July 4, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes. NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer.
As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.