Director, Site Reliability Engineering (SRE) - Hybrid (Raleigh or Greensboro)

Posted:
8/7/2024, 5:00:00 PM

Location(s):
North Carolina, United States ⋅ Raleigh, North Carolina, United States ⋅ Greensboro, North Carolina, United States

Experience Level(s):
Senior

Field(s):
DevOps & Infrastructure ⋅ Software Engineering

Workplace Type:
Hybrid

With a company culture rooted in collaboration, expertise and innovation, we aim to promote progress and inspire our clients, employees, investors and communities to achieve their greatest potential. Our work is the catalyst that helps others achieve their goals. In short, We Enable Possibility℠.

The Director, Site Reliability Engineering (SRE) is a pivotal role in the technology infrastructure team, responsible for ensuring the highest levels of reliability, scalability, and performance. This leadership role sets the vision and strategic direction for a skilled SRE team, aligning with the strategic objectives of the IT Infrastructure team and fostering a culture of continuous improvement and operational excellence. The role requires a deep understanding of cloud-based infrastructure services and technologies, distributed systems, product delivery platforms, DevOps, automation, and monitoring, along with a proactive approach to preventing and mitigating potential issues. The incumbent must also foster a culture of innovation and collaboration within a team of highly skilled engineers to meet the organization’s evolving needs and deliver a superior digital experience to our product teams and customers.

*This is a hybrid role requiring two days a week on-site at our Greensboro or Raleigh office.

Leadership & Strategy

  • Develop and implement a comprehensive SRE strategy that aligns with IT Infrastructure team, IT, and company objectives.
  • Lead the SRE team, setting clear goals and expectations, and providing mentorship and career development opportunities.
  • Collaborate with cross-functional teams to enhance system reliability and efficiency.

Technical Expertise

  • Oversee systems related to the availability of our infrastructure ecosystem, including cloud services and internal tooling.
  • Ensure the team has a deep understanding of and expertise in the system architecture, including but not limited to Kubernetes and OpenShift, across the entire product delivery stack.

Team Management

  • Manage the SRE team, ensuring effective resource allocation and prioritization of POCs and initiatives.
  • Drive the adoption of best practices in incident management and post-mortem analysis.

Incident Management

  • Lead the response to high-impact infrastructure incidents, ensuring swift resolution and minimal disruption.
  • Implement proactive monitoring and measures to prevent future incidents and improve system resilience.

Communications

  • Articulate the value and accomplishments of the SRE team to stakeholders at all levels.
  • Foster a transparent communication environment within the team and across the organization.
  • Work closely with shared infrastructure services teams (including other SRE teams) within the corporation to establish a productive and transparent partnership and help establish consistent SRE and Infrastructure practices across the company.

Knowledge & Skills:

  • Proven expertise in large-scale complex system engineering and administration including cloud-based infrastructure in Microsoft Azure.
  • Strong leadership skills with the ability to inspire and motivate a high-performing team.
  • Excellent problem-solving abilities and a data-driven approach to decision-making.
  • Technical leadership skills, including collaboration, technical problem-solving, and leading complex, mission-critical initiatives.
  • In-depth understanding of Kubernetes concepts, components, and APIs, with hands-on experience in orchestration of containerized applications using OpenShift (on-premises or in the cloud). Experience with OpenShift’s added-value features such as advanced CI/CD pipelines for containerized product delivery.
  • Experience with GitHub, GitHub Actions, and/or Argo CD or similar technologies.
  • Strong background in agile service delivery, with a focus on iterative service improvement.

Education & Experience:

  • A bachelor’s degree in Computer Science, Engineering, or related field; a master’s degree is preferred.
  • At least 10 years of experience in IT infrastructure, system administration, or reliability engineering, with a minimum of 5 years in a leadership role.
  • A track record of managing complex infrastructure initiatives and leading incident response efforts.


Do you like solving complex business problems and working with talented colleagues, and do you have an innovative mindset? Arch may be a great fit for you. If this job isn’t the right fit but you’re interested in working for Arch, create a job alert! Simply create an account and opt in to receive emails when we have job openings that meet your criteria. Join our talent community to share your preferences directly with Arch’s Talent Acquisition team.