Site Reliability Engineer
Your Role as a Cisco IT AI Infrastructure Site Reliability Engineer
Are you passionate about shaping the future of artificial intelligence and ensuring its robust, enterprise-scale operation? Join Cisco IT as an AI Infrastructure Site Reliability Engineer, where you will play a key technical role in building, developing, optimizing, and supporting complex AI architectures built on cutting-edge NVIDIA DGX and Cisco UCS-based AI platforms.
You will help drive innovation and reliability across our foundational AI infrastructure, supporting Cisco’s most critical business and technology initiatives. In this role, you will work closely with peers, technical teams, and internal business clients to ensure the reliability, scalability, and efficiency of our AI solutions.
You will leverage Site Reliability Engineering (SRE) best practices and automation to continually improve operational excellence and service quality for Cisco’s AI platforms.
Responsibilities include:
Design and Implementation: Contribute to the design, deployment, and optimization of AI infrastructure, including high-performance compute (HPC) systems, data center power and cooling, network topologies, and the AI factory software stack.
System Administration: Configure, maintain, and troubleshoot AI platforms, including NVIDIA DGX and Cisco Unified Computing System (UCS) servers with NVIDIA/AMD GPUs.
Collaboration: Work with internal business clients and technical teams to understand technical requirements and translate them into robust, scalable AI platform solutions.
SRE Practices: Apply SRE principles to automate operational processes, reduce toil, and help maintain service level objectives (SLOs) for AI platforms.
Automation: Develop and maintain CI/CD pipelines using Python, Ansible, Terraform, Go, and GitHub Actions to streamline operations and support continuous improvement.
Capacity and Performance: Participate in capacity planning, performance analysis, and instrumentation, and support non-functional requirements to keep AI infrastructure efficient and reliable.
Incident Response: Participate in incident management, troubleshooting, hardware break/fix support, and root cause analysis to resolve issues and ensure high availability.
Continuous Improvement: Contribute to ongoing process automation, optimization, and documentation to enhance service quality and operational efficiency.
Who You Are
You are an experienced Site Reliability Engineer or AI infrastructure engineer with a strong technical background in designing, building, and supporting complex AI and HPC systems.
You work well with others, communicate clearly, and enjoy solving challenging technical problems. You are eager to learn new technologies and approaches, and are passionate about building and operating high-performance, reliable AI infrastructure at scale.
Minimum Requirements
Bachelor’s degree in Computer Science, Information Technology, or a related field; or equivalent years of experience.
5+ years of experience administering and supporting Linux-based operating systems in enterprise environments.
Strong hands-on experience deploying and managing NVIDIA DGX or equivalent HPC clusters (e.g., Cray, HPE, IBM), including working with AI factory architectures.
Experience working with business clients or technical teams to gather requirements and deliver AI platform solutions.
Proficient in Python, Go, or C/C++, with practical experience in Git and CI/CD systems (e.g., GitHub Actions, GitLab, Jenkins).
Experience with Kubernetes clusters (Red Hat OpenShift preferred), Docker, Terraform, Ansible, and GitOps methodologies.
Strong troubleshooting, incident management, and process automation skills.
Preferred Qualifications
Master’s degree in a relevant field.
Certifications in NVIDIA, Cisco, Linux, Networking, Cloud, or related technologies.
Experience contributing to the design and operation of AI factories or large-scale HPC systems.
Expertise with Kubernetes, Hybrid Cloud, Virtualization, and container technologies.
Exposure to Agile and DevOps operating models and project tracking tools (e.g., Jira, Rally).
Demonstrated ability to collaborate and communicate advanced technical concepts in a team environment.
Why Cisco?
At Cisco, we’re revolutionizing how data and infrastructure connect and protect organizations in the AI era – and beyond. We’ve been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds.
These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint. Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you’ll see that the opportunities to grow and build are limitless.
We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere. We are Cisco, and our power starts with you.