Sr Manager, AI Systems Quality & Reliability, Annapurna AI Servers and Systems

Amazon Development Center U.S., Inc. · 3h ago
United States · Onsite · Full-time · Manager Level · 10+ yrs exp
H-1B verified · 2310 LCAs

Top focus

QA Manager · Systems Engineer

AWS Annapurna Labs is seeking a Senior Manager of Quality & Reliability Engineering to lead the QnR function within the Trainium Manufacturing, Quality and Reliability organization. You will own quality and reliability outcomes for all Trainium AI server products, from component qualification through fleet performance, leading an engineering team across multiple concurrent chip and system generations. This role defines reliability strategy for liquid-cooled and air-cooled platforms at rapidly scaling volumes, builds quality systems across a multi-supplier global manufacturing base, drives fleet failure investigations to root cause, and establishes the reliability characterization capabilities required for next-generation technologies.

Key job responsibilities

  • Lead and grow a QnR engineering team, hiring, developing, and retaining top reliability and quality engineering talent.
  • Set technical direction for component qualification, reliability testing (HALT, HTOL, thermal cycling, QRV), DFMEA, and vendor quality standards across all Trainium programs.
  • Own quality and reliability outcomes end-to-end, from DFM input during design through fleet reliability performance.
  • Drive component-specific manufacturing process quality improvements in partnership with Manufacturing Engineering, establishing incoming quality requirements and process controls at all supplier sites.
  • Build and maintain the reliability prediction and monitoring infrastructure, ensuring fleet performance is tracked against predictions, degradation trends are identified early, and corrective actions are data-driven.
  • Establish systematic failure analysis processes that connect field failures back to manufacturing history, supplier data, and component-level root cause for rapid containment.
  • Scale qualification processes to keep pace with multi-supplier, multi-generation production, including automation of qualification workflows and standardization of test methodologies across vendors.

About the team

Annapurna Labs is a wholly owned subsidiary of AWS, focused on developing custom silicon and servers including the Nitro, Graviton, and Trainium families of processors. Machine Learning Annapurna (MLA) functions as a vertically integrated team including software, firmware, hardware, and silicon design in a single organization. We are the Trainium Servers and Systems organization under MLA focused on Hardware Development, Software Development, Fleet Ops Systems, and Manufacturing, Quality, and Reliability. This position leads the Quality and Reliability Engineering function within the Manufacturing, Quality and Reliability team.

Qualifications

  • Experience in root cause analysis and error correction, identifying changes to procedures and systems to implement long-term fixes and avoid repeating issues
  • Bachelor's degree in Reliability Engineering, Electrical Engineering, Mechanical Engineering, Materials Science, Physics, or related field
  • 10+ years of reliability or quality engineering experience with server compute platforms, semiconductor packaging, or high-volume electronics manufacturing
  • 5+ years of people management experience leading reliability, quality, or hardware engineering teams
  • Experience establishing quality management systems and reliability programs across multiple manufacturing vendors or sites
  • Experience leading teams across multiple locations in complex manufacturing/production environments
  • Experience working in a fast-paced, rapidly changing operations environment
  • Master's degree or PhD in Reliability Engineering, Materials Science, or related field
  • Experience with liquid cooling reliability (cold plate, TIM, coolant loop failure modes)
  • Experience with advanced semiconductor packaging reliability (large-die BGA, warpage, solder joint fatigue)
  • Demonstrated ability to establish vendor quality standards and drive compliance across ODM/CM partners
  • Experience with reliability prediction methodologies (Weibull analysis, acceleration models, DFMEA)
  • Working knowledge of manufacturing quality tools (SPC, FMEA, 8D, DOE)
  • Strong executive communication skills, including the ability to translate technical reliability risk into business impact for senior leadership
  • Meets/exceeds Amazon's leadership principles requirements for this role
  • Meets/exceeds Amazon's functional/technical depth and complexity for this role

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company’s reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.

  • USA, TX, Austin: 208,300.00 - 281,800.00 USD annually
  • USA, WA, Seattle: 208,300.00 - 281,800.00 USD annually

Required skills

Reliability Engineering · Quality Engineering · Root Cause Analysis · Manufacturing Engineering · Reliability Testing · DFMEA · Vendor Quality Standards · Reliability Prediction · SPC · FMEA · 8D · DOE
Posted on JobRush — the end-to-end AI job-search platform.