SRE

Posted:
3/11/2025, 9:55:04 PM

Location(s):
Arizona, United States ⋅ Scottsdale, Arizona, United States

Experience Level(s):
Mid Level ⋅ Senior

Field(s):
DevOps & Infrastructure ⋅ Software Engineering

Important Information

  • Experience: More than 4 years
  • Job Mode: Full-time
  • Work Mode: Hybrid

Job Summary

  • Site Reliability Engineering (SRE) is a discipline that blends software engineering with infrastructure and operations, aimed at building scalable and highly reliable software systems.
  • Focus on application monitoring, emergency response, and change management to ensure reliability and efficiency.
  • Collaborate with development teams throughout the software lifecycle to solve system-related issues and automate routine tasks.
  • Enhance system reliability, scalability, and performance by leveraging modern tools and processes.

Responsibilities and Duties

  • Application Monitoring: Utilize tools and automation for continuous application monitoring and reliability.
  • Emergency Response: Respond promptly to emergency incidents, perform root cause analysis, and resolve ongoing production issues.
  • Change Management: Manage and streamline release and change management processes to improve system performance.
  • Collaboration: Partner with development teams to solve system issues, automate routine tasks, and eliminate toil.
  • Reliability and Scalability: Ensure systems are highly reliable, scalable, and efficient to meet performance standards.

Qualifications and Skills

  • Strong understanding of monitoring tools such as Azure Monitor, Application Insights, Prometheus, and Grafana.
  • Experience with Infrastructure as Code tools like Terraform, ARM/Bicep, or Pulumi.
  • Proficiency in release management tooling such as ArgoCD, Harness, and Octopus.
  • Familiarity with incident alert tools like PagerDuty or Opsgenie.
  • Expertise in container orchestration tools like Kubernetes and AKS.
  • Proficiency in at least one of the following scripting or programming languages (mandatory): C#, Python, Bash, or PowerShell.
  • Strong collaboration and problem-solving abilities to resolve system issues effectively.
  • Knowledge of project tracking and version control tools like Jira, SVN, and GitHub.

Role-specific Requirements

  • Proven experience in application monitoring and automated reliability processes.
  • Strong background in managing system reliability and performing root cause analysis during emergency responses.
  • Hands-on experience in change management processes and production environment releases.
  • Advanced knowledge of tools and practices for infrastructure automation and incident handling.
  • Familiarity with scalable system architecture principles and best practices.

Technologies

  • Monitoring Tools: Azure Monitor, Application Insights, Prometheus, Grafana
  • Infrastructure as Code: Terraform, ARM/Bicep, Pulumi
  • Release Management Tools: ArgoCD, Harness, Octopus
  • Incident Alert Tools: PagerDuty, Opsgenie
  • Container Orchestration: Kubernetes, AKS
  • Project Tracking and Version Control: Jira, SVN, GitHub
  • Scripting: C#, Python, Bash, or PowerShell

Skillset Competencies

  • Advanced monitoring and incident management techniques.
  • Infrastructure as Code and automation of routine workflows.
  • Expertise in release and change management processes.
  • Strong knowledge of container orchestration and scalable system design.
  • Excellent communication, collaboration, and problem-solving skills.
  • Ability to work effectively in cross-functional and virtual teams.

About Encora

Encora is a trusted partner for digital engineering and modernization, working with some of the world’s leading enterprises and digital-native companies. With over 9,000 experts in 47+ offices worldwide, Encora offers expertise in areas such as Product Engineering, Cloud Services, Data & Analytics, AI & LLM Engineering, and more. At Encora, hiring is based on skills and qualifications, embracing diversity and inclusion regardless of age, gender, nationality, or background.