Site Reliability Engineer Lead

Bank of America•18h ago

United States • Onsite • Full-time • Mid Level • 5+ yrs exp

H-1B sponsor

Top focus

SRE

Job Description

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.

Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and how we make an impact in the communities we serve.

Bank of America is committed to an in-office culture with specific requirements for office-based attendance, while allowing an appropriate level of flexibility for our teammates and businesses based on role-specific considerations. At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact.

Join us! This job is responsible for partnering with engineering and technology teams to implement measures prescribed by the Site Reliability Engineer teams it leads. Key responsibilities include ensuring appropriate instrumentation, tooling, ticketing, alerting and on call routines are in place for key services, demonstrating technical expertise within domains, and decomposing objectives into work units.

Job expectations include advancing efficient solution delivery practices and promoting exceptional design, engineering, and organizational practices. The individual in this role is accountable for establishing and maintaining partnerships with Application Development and Production Support teams to implement the measures prescribed through the collaboration of the Senior Site Reliability Engineer (SRE) and the SRE team(s) they are leading.

This individual's responsibilities include ensuring the appropriate instrumentation, tooling, ticketing, alerting and on-call routines are in place for key services. This role demonstrates a high level of technical expertise within one or more technical domains.

This role demonstrates the ability to decompose issues or objectives into units of work that can be assigned to other team members. This individual will advocate and advance more efficient solution delivery practices and evangelize great design, engineering and organizational practices.

Responsibilities

Collaborates with Development and Infrastructure teams to understand technical solutions and implement monitoring capabilities outlined in the application and system monitoring designs put forward by the Senior Site Reliability Engineer (SRE)
Develops and maintains reliability scripts, tools and libraries, and leverages them for common instrumentation, automation, and operational needs and when mentoring SRE resources on reliability practices and established tools/capabilities
Partners to implement code changes to make use of common reliability libraries and tools, and helps Application Production Services and Application Development teammates understand how to use them
Participates regularly in architecture community of practice meetings and communication via other channels
Identifies vulnerabilities and opportunities for reliability improvement, such as investigating low-level error rates and 'noise' in monitoring, and defines solutions to reduce manual support effort and/or improve system reliability
Engages as a subject matter expert in major incident triage efforts and failure scenario modelling, and diagnoses root causes with the Problem Manager for major incident/problem management investigations

Required Qualifications:

5+ years of experience in platform, systems, or infrastructure engineering, with a strong focus on automation and integration
Proficiency in SRE best practices; proven ability to reduce toil and improve observability of the environment
Experience with automation and orchestration tools (e.g., Ansible or similar), and scripting with Golang, Python, or equivalent
Experience supporting enterprise service mesh platforms
Experience with Infrastructure as Code (IaC) concepts and CI/CD pipelines supporting automated builds, validation, and deployments
Experience integrating provisioning workflows with platform services such as virtualization, networking, identity, monitoring, and configuration management systems
Strong focus on testing and reliability, including automated integration/validation testing and troubleshooting of complex workflows

Desired Qualifications:

Linux System Administration
Splunk Administration
OpenShift Containers
Dynatrace Administration
Grafana
Ansible Automation Horizon
CI/CD (Jenkins, XLR, Artifactory, Bitbucket)
Azure/AWS/GCP Cloud
Fast learner
Proven ability to work independently with minimal supervision and as part of a team with direct responsibilities
Systematic problem-solving approach, sense of ownership and drive
Ability to juggle competing priorities and adapt to changes in project scope

Skills:

Automation
Collaboration
Influence
Production Support
Result Orientation
Analytical Thinking
Application Development
Architecture
Solution Design
Stakeholder Management
Other: Terraform

Shift: 1st shift (United States of America)

Hours Per Week: 40

Required skills

Python • Ansible • Golang • Linux • Terraform • CI/CD • Splunk • OpenShift • Grafana • Dynatrace