ML Kernel Performance Engineer, Edge AI and Science
Amazon.com Services LLC • 4h ago
United States · Onsite · Full-time · Mid Level · 3+ yrs exp
H-1B verified · 2310 LCAs
Amazon Devices is an inventive research and development company that designs and engineers high-profile consumer products like the Kindle family, Fire Tablets, Fire TV, Health & Wellness devices, Amazon Echo, and Astro. We are building the next generation of edge AI capabilities through our advanced compression platform and custom neural accelerator silicon. Within Edge AI & Science, the AI Platform team builds a compression platform, the first of its kind, enabling 20-100x neural network compression for edge and cloud deployment.

As model sizes grow from billions to hundreds of billions of parameters, compute efficiency becomes the single largest return on engineering investment during training. The gap between eager-mode Python and optimized GPU execution is where months of training time are won or lost. We are looking for an ML Kernel Performance Engineer to work at the hardware-software boundary of this platform, crafting high-performance CUDA and Triton kernels that make our compression algorithms run at peak efficiency during training, fine-tuning, and inference. You will build the tooling and kernel libraries that democratize GPU performance optimization across the team, enabling scientists and engineers to profile, diagnose, and fix kernel bottlenecks without needing to be CUDA experts themselves.

Working alongside compression scientists and platform engineers, you will ensure that novel quantization schemes (ternary, nonary, mixed-precision) and sparse computation patterns translate into real throughput gains on GPU hardware. Your work will directly accelerate every training run in the organization and unlock deployment of compressed models to both edge devices and cloud inference.

Key job responsibilities
- Design and implement high-performance CUDA and Triton kernels for quantization-aware training, sparse matrix operations, and low-bit inference on modern GPU accelerators
- Analyze and optimize kernel-level performance for compression training workloads, conducting detailed performance analysis using profiling tools to identify and resolve bottlenecks that slow model training from days to weeks
- Implement kernel-level optimizations such as operator fusion, tiling, memory access pattern optimization, and scheduling for compression-specific compute patterns
- Build a kernel development harness that enables any team member to profile kernel performance, test forward/backward accuracy, and validate at production scale, lowering the bar from "CUDA expert" to "any engineer with agents"
- Maintain and extend the team's training kernels library with clean interfaces, CI, and examples that enable scientists to contribute kernel improvements alongside platform engineers
- Collaborate closely with Applied Scientists, compiler engineers, and hardware architects to co-design ML-centric solutions that unify software and hardware for both cloud and edge deployment
- Develop inference kernels for cloud deployment (custom backends for quantized models that keep weights packed in memory and reconstruct on the fly for compute)
- Build and maintain performance regression tests and benchmarking infrastructure that track kernel efficiency as models scale from billions to hundreds of billions of parameters

A day in the life
A scientist files a ticket: "QAT training on our large model is 4x slower than expected." You pull up the profiler, identify that a custom quantizer kernel is thrashing shared memory at scale, write a Triton replacement that tiles correctly for the layer shapes at that model size, validate accuracy in the test harness, and push it to the kernels repo. By end of day, the training run that was taking four days now takes one. You will also build the tooling that makes this workflow repeatable by others. You will participate in design discussions with Applied Scientists, translate their algorithmic ideas into efficient GPU implementations, and work in a startup-like environment where every engineering hour directly accelerates the team's ability to ship compressed models.

About the team
The AI Platform team builds Amazon's neural network compression platform. We compress models using knowledge distillation, network restructuring, and advanced quantization to achieve 20-100x compression while preserving model quality. Our platform packages these into automated pipelines that deploy to both custom edge silicon and GPU-based cloud inference. As model sizes grow, the proprietary advantage shifts from the science to the software (making it work at hundreds of billions of parameters is the moat). GPU kernel performance is the biggest single lever on training throughput, and we expect AI-assisted development tooling to significantly multiply engineering productivity, meaning a small team with the right harness can operate at the scale of a much larger one.

The ML Kernel Performance Engineer bridges science and platforms: you turn algorithmic innovations into production-grade GPU code that runs at scale. You will work alongside Applied Scientists, compiler engineers, hardware architects, and platform developers in a small, agile team building the next generation of edge AI for Amazon's consumer products.
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship design or architecture (design patterns, reliability and scaling) of new and existing systems experience
- Knowledge of Python and/or C++ programming
- Experience with CUDA kernels or ML/low-level kernels, and experience in developing and deploying LLMs in production on GPUs, Neuron, TPU or other AI acceleration hardware
- Bachelor's degree in computer science or equivalent
- 3+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
- Experience with GPU kernel optimization and GPGPU computing (CUDA, Triton, SYCL, or ROCm)
- Proficiency in low-level performance optimization for GPUs
- Understanding of GPU memory hierarchies and optimization strategies (shared memory, L1/L2 cache, register pressure, memory coalescing)
- Experience developing high-performance libraries for ML or HPC applications
- Knowledge of ML frameworks (PyTorch, TensorFlow) and their GPU backends
- Experience implementing custom PyTorch operators (torch.autograd.Function, C++ extensions)
- Experience with parallel programming and optimization techniques
- Background in neural network compression (quantization, pruning, knowledge distillation, low-rank factorization)
- Knowledge of mixed-precision training and inference (FP16, BF16, FP8, INT8, INT4)
- Experience with inference optimization (TensorRT, ONNX Runtime, vLLM, or similar)
- Familiarity with Transformer architectures, attention mechanisms, and their compute/memory profiles
- Experience with AWS Trainium/Inferentia or the Neuron Kernel Interface (NKI)
- Experience with edge deployment, model compilation, or hardware-aware optimization

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company’s reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.

USA, CA, Sunnyvale - 165,200.00 - 223,600.00 USD annually
Required skills
Python, C++, CUDA, Triton, GPGPU, ML, low-level performance optimization