Site Reliability Engineer (SRE)

Posted:
7/10/2024, 5:00:00 PM

Location(s):
Sepang, Malaysia

Experience Level(s):
Junior ⋅ Mid Level ⋅ Senior

Field(s):
DevOps & Infrastructure ⋅ Software Engineering


Job Description

Founded in September 2020, Asia Digital Engineering (ADE) is a wholly-owned subsidiary of AirAsia Group based at KLIA2, Kuala Lumpur, Malaysia. ADE leverages the AirAsia Group Engineering Department’s best practices and unsurpassed combined experience in the region. ADE offers a range of aircraft services focused on the Airbus A320, A321 & A330, covering line maintenance, component and warehouse services, and engineering support.

As a Site Reliability Engineer at ADE, you will work within our Product & Technology team and play a crucial role in maintaining and improving the reliability, availability, and performance of our digital services. You will work closely with our development and operations teams to build and support the infrastructure that powers our applications, ensuring they run smoothly and efficiently.

What you will do: 

  • Design, implement, and maintain scalable, reliable, and secure infrastructure using cloud technologies (currently GCP).

  • Develop and automate monitoring, alerting, and incident response processes to ensure the highest level of service availability.

  • Collaborate with development teams to enhance the reliability and performance of applications through best practices and automation.

  • Manage and resolve software development incidents or system failures by performing root cause analysis, implementing timely fixes and corrective measures, and conducting postmortems to prevent future occurrences.

  • Develop and maintain comprehensive documentation for infrastructure, processes, and procedures.

  • Participate in on-call rotations to provide 24/7 support for critical systems and respond to incidents promptly.

  • Continuously improve system observability and monitoring using tools such as Prometheus, Grafana, and Datadog.

  • Implement and manage CI/CD pipelines to streamline the deployment process and ensure rapid, reliable software releases.

  • Drive initiatives to optimize the cost, performance, and security of the infrastructure.

  • Stay up-to-date with industry trends and best practices in site reliability engineering and cloud technologies.

Your experience and skills:

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.

  • Proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role.

  • Strong knowledge of cloud platforms (AWS, GCP, Azure) and cloud-native technologies.

  • Proficiency in scripting and automation using languages such as JavaScript (Node.js), Python, Go, Bash, or similar.

  • Experience with infrastructure-as-code and configuration management tools (Terraform, Ansible, Chef, Puppet).

  • Experience with Git-based version control platforms such as GitLab and GitHub, including their automation offerings for CI/CD pipelines and workflow integrations.

  • Solid understanding of networking concepts and protocols.

  • Familiarity with containerization technologies (Docker, Kubernetes).

  • Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, ELK stack).

  • Strong problem-solving skills and the ability to troubleshoot complex issues in distributed systems.

  • Excellent communication and collaboration skills.

  • Ability to work in a fast-paced, dynamic environment and manage multiple priorities.

  • Experience with microservices architecture and related technologies.

  • Knowledge of database administration and optimization (SQL, NoSQL).

  • Familiarity with security best practices and compliance standards.

  • Contributions to open-source projects or active participation in the SRE community.