Important Information
- Experience: More than 4 years
- Job Mode: Full-time
- Work Mode: Hybrid
Job Summary
- Site Reliability Engineering (SRE) is a discipline that blends software engineering with infrastructure and operations, aimed at building scalable and highly reliable software systems.
- Focus on application monitoring, emergency response, and change management to ensure reliability and efficiency.
- Collaborate with development teams throughout the software lifecycle to solve system-related issues and automate routine tasks.
- Enhance system reliability, scalability, and performance by leveraging modern tools and processes.
Responsibilities and Duties
- Application Monitoring: Utilize tools and automation for continuous application monitoring and reliability.
- Emergency Response: Respond promptly to emergency incidents, perform root cause analysis, and resolve ongoing production issues.
- Change Management: Manage and streamline release and change management processes to improve system performance.
- Collaboration: Partner with development teams to solve system issues, automate routine tasks, and eliminate toil.
- Reliability and Scalability: Ensure systems are highly reliable, scalable, and efficient to meet performance standards.
Qualifications and Skills
- Strong understanding of monitoring tools such as Azure Monitor, Application Insights, Prometheus, and Grafana.
- Experience with Infrastructure as Code tools like Terraform, ARM/Bicep, or Pulumi.
- Proficiency in release management tooling such as ArgoCD, Harness, or Octopus Deploy.
- Familiarity with incident alert tools like PagerDuty or Opsgenie.
- Expertise in container orchestration tools like Kubernetes and AKS.
- Proficiency in scripting with C#, Python, Bash, or PowerShell (at least one is mandatory).
- Strong collaboration and problem-solving abilities to resolve system issues effectively.
- Knowledge of project tracking and version control tools like Jira, SVN, and GitHub.
Role-specific Requirements
- Proven experience in application monitoring and automated reliability processes.
- Strong background in managing system reliability and performing root cause analysis during emergency responses.
- Hands-on experience in change management processes and production environment releases.
- Advanced knowledge of tools and practices for infrastructure automation and incident handling.
- Familiarity with scalable system architecture principles and best practices.
Technologies
- Monitoring Tools: Azure Monitor, Application Insights, Prometheus, Grafana
- Infrastructure as Code: Terraform, ARM/Bicep, Pulumi
- Release Management Tools: ArgoCD, Harness, Octopus Deploy
- Incident Alert Tools: PagerDuty, Opsgenie
- Container Orchestration: Kubernetes, AKS
- Project Tracking and Version Control: Jira, SVN, GitHub
- Scripting: C#, Python, Bash, or PowerShell
Skillset Competencies
- Advanced monitoring and incident management techniques.
- Infrastructure as Code and automation of routine workflows.
- Expertise in release and change management processes.
- Strong knowledge of container orchestration and scalable system design.
- Excellent communication, collaboration, and problem-solving skills.
- Ability to work effectively in cross-functional and virtual teams.
About Encora
Encora is a trusted partner for digital engineering and modernization, working with some of the world’s leading enterprises and digital-native companies. With over 9,000 experts in 47+ offices worldwide, Encora offers expertise in areas such as Product Engineering, Cloud Services, Data & Analytics, AI & LLM Engineering, and more. At Encora, hiring is based on skills and qualifications, embracing diversity and inclusion regardless of age, gender, nationality, or background.