Site Reliability Engineer (Manufacturing IT Operations)

Posted:
9/1/2024, 5:00:00 PM

Location(s):
Taguig, Metro Manila, Philippines ⋅ Metro Manila, Philippines

Experience Level(s):
Mid Level ⋅ Senior

Field(s):
DevOps & Infrastructure ⋅ IT & Security ⋅ Software Engineering

Job Location

Taguig City

Job Description

Overview of the job

As a Site Reliability Engineer (SRE) in the Manufacturing IT Operations – Incident Response Team, you will be responsible for leading incident response efforts, ensuring swift and effective resolution of critical system issues. You will also play a critical role in ensuring the reliability, scalability, and performance of our systems and services. SRE combines software engineering and operations to build, maintain, and support highly available and efficient infrastructure. Your expertise in troubleshooting and root cause analysis will be essential in identifying and addressing the underlying causes of incidents. You will work closely with software engineers, DevOps teams, and other stakeholders to implement preventive measures and enhance system resilience. Collaborating with cross-functional teams, you will design, implement, and automate robust systems, monitoring tools, and processes. With a strong focus on stability and uptime, you will proactively identify and resolve performance bottlenecks, optimize system architecture, and drive continuous improvement. Your keen eye for continuous improvement will also drive post-incident reviews and contribute to the creation of incident management best practices. By actively monitoring system health, responding to incidents in a timely manner, and implementing proactive measures, you will play a pivotal role in maintaining the stability and availability of our services, ensuring an exceptional user experience for our customers.

Your team

You will report directly to the Incident Response Engineering Leader within the Manufacturing IT Operations team, who will provide guidance, support, and mentorship as you navigate your role. As a valued member of our dynamic Incident Response Team, you will collaborate closely with technically skilled professionals, including software engineers, DevOps specialists, Subject Matter Experts, and other SREs. In addition, you will have the opportunity to directly collaborate with our site customers and users, ensuring their needs and expectations are met through reliable and high-performing systems. Working within a cross-functional and collaborative environment, you will contribute to the success of our Incident Response team, which is dedicated to ensuring the reliability and availability of our site's systems. Our Incident Response team fosters a culture of technical expertise, continuous learning, and knowledge sharing, where ideas are encouraged, and innovation is embraced.

What success looks like

Success as a Site Reliability Engineer (SRE) spans four areas of the role, namely incident response, reliability, monitoring, and effective collaboration with customers and users:

  • Incident Response: Swiftly respond to and resolve critical incidents, ensuring minimal impact on system availability and user experience while driving continuous improvement in incident management processes.
  • Reliability: Ensure high system availability and reliability through robust monitoring, optimization of system architecture, and cross-functional collaboration to design and implement resilient systems.
  • Monitoring: Implement comprehensive monitoring solutions to gain real-time insights into system performance, enabling proactive incident response and continuous improvement of system visibility and resource optimization.
  • Working with Customers/Users: Collaborate directly with customers and users to understand their needs, proactively address concerns, and provide exceptional customer support to ensure reliable and performant systems that meet their expectations.

Responsibilities of the role

Incident Response:

  • Lead incident response efforts, swiftly resolving critical incidents to minimize downtime and user impact.
  • Implement effective incident management processes, ensuring clear communication, coordination, and documentation.
  • Conduct root cause analysis, implementing preventive measures and driving continuous improvement.

Reliability:

  • Ensure high system availability through robust monitoring, alerting, and automated incident response systems.
  • Optimize system architecture and configurations for improved performance, scalability, and fault tolerance.
  • Collaborate cross-functionally to design and implement resilient systems using industry best practices.

Monitoring:

  • Implement comprehensive monitoring solutions, providing real-time insights into system performance and health.
  • Configure and manage monitoring tools, ensuring accurate and actionable alerts for proactive incident response.
  • Continuously evaluate and enhance monitoring strategies to improve system visibility and resource optimization.

Upskilling:

  • Stay updated with industry trends, technologies, and best practices in Site Reliability Engineering.
  • Continuously develop technical skills in system architecture, automation, cloud technologies, and incident response.
  • Share knowledge, mentor team members, and foster a culture of learning and upskilling.

Managing Users’ and Customers’ Needs and Expectations:

  • Collaborate directly with users and customers to understand their needs and pain points.
  • Proactively address customer/user concerns, ensuring reliable and performant systems.
  • Provide exceptional customer support, communicate updates and resolutions, and gather feedback for continuous improvement.

Job Qualifications

Role Requirements

Technical Expertise and Experience:

  • Knowledge of or familiarity with system administration, including Linux/Unix environments and cloud platforms (such as AWS, Azure, or GCP).
  • Experience with configuration management tools and infrastructure-as-code frameworks (e.g., Terraform).
  • Proficiency in at least one programming language (e.g., Python, C#) and experience with scripting for automation tasks.
  • Understanding of networking protocols, network infrastructures, load balancing, and DNS management.
  • Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Familiarity with databases and proficiency in writing SQL queries.
  • Experience or familiarity with monitoring and observability tools (e.g., Prometheus, Grafana).
  • Knowledge of incident response methodologies, root cause analysis, and implementing preventive measures.
  • Understanding of security best practices and experience with implementing secure systems.
  • Experience with Manufacturing Execution Systems (e.g., Proficy) or in Manufacturing Operations is a plus.

Soft Skills:

  • Strong problem-solving and troubleshooting skills, with an ability to analyze complex issues and devise effective solutions.
  • Excellent communication and collaboration skills to work effectively with cross-functional teams, stakeholders, and customers.
  • Ability to thrive in a fast-paced, dynamic environment, managing multiple priorities and adapting to changing circumstances.
  • Strong attention to detail and a commitment to delivering high-quality work.
  • Proactive and self-motivated, with a continuous learning mindset and a drive to stay current with industry trends and technologies.
  • Strong teamwork and interpersonal skills, with an ability to build relationships and work effectively in a collaborative environment.
  • Ability to thrive under pressure and effectively manage incidents, ensuring timely resolutions and minimizing downtime.

This role requires a commitment to work a standard 5-day workweek, consisting of 4 weekdays and at least one weekend day (Saturday or Sunday). The nature of the Site Reliability Engineer (SRE) position necessitates coverage and support across the week to ensure the reliability and availability of our systems. This schedule allows for effective incident response and continuous monitoring of system health, as well as collaboration with cross-functional teams. We value work-life balance and will strive to provide a predictable and manageable schedule within this framework, while still meeting the needs of our customers and maintaining the stability of our services.

About us

We produce globally recognized brands and we grow the best business leaders in the industry. With a portfolio of trusted brands as diverse as ours, it is paramount that our leaders are able to lead with courage across the vast array of brands, categories, and functions. We serve consumers around the world with one of the strongest portfolios of trusted, quality, leadership brands, including Always®, Ariel®, Gillette®, Head & Shoulders®, Herbal Essences®, Oral-B®, Pampers®, Pantene®, Tampax® and more. Our community includes operations in approximately 70 countries worldwide.

Visit http://www.pg.com to learn more.

We are an equal opportunity employer and value diversity at our company. We do not discriminate against individuals on the basis of race, color, gender, age, national origin, religion, sexual orientation, gender identity or expression, marital status, citizenship, disability, HIV/AIDS status, or any other legally protected factor.

Job Schedule

Full time

Job Number

R000114677

Job Segmentation

Experienced Professionals