Site Reliability Engineer I

Posted:
10/8/2024, 5:54:31 PM

Experience Level(s):
Junior ⋅ Mid Level ⋅ Senior

Field(s):
DevOps & Infrastructure ⋅ Software Engineering

Workplace Type:
Hybrid

It takes powerful technology to connect our brands and partners with an audience of hundreds of millions of people. Whether you’re looking to write mobile app code, engineer the servers behind our massive ad tech stacks, or develop algorithms to help us process trillions of data points a day, what you do here will have a huge impact on our business—and the world.

Job Summary:

We are looking for an experienced Site Reliability Engineer (SRE) with a focus on automation, cloud computing, and incident management. This role involves maintaining the reliability of large-scale systems through effective monitoring, automation, and cloud infrastructure management. You will be responsible for detecting and resolving complex issues, managing incidents, and ensuring smooth change management. The ideal candidate will also have a solid understanding of operating system principles and container-based solutions, along with proficiency in both scripting and system-level languages.

Responsibilities

System Management, Issue Monitoring & Troubleshooting:

- Oversee system builds, configuration management, patching, and upgrades to ensure system integrity and performance.

- Monitor, detect, and triage issues across on-prem and cloud-based systems at scale.

- Apply creative and practical solutions to resolve non-routine technical problems using system methodologies (runbooks) and troubleshooting techniques.

- Proactively address performance bottlenecks and reliability concerns to ensure seamless operations.

- Collaborate with service, network, and facilities teams to streamline infrastructure management and resolve system issues.

Cloud Computing:

- Work with cloud platforms such as AWS and GCP to manage resources, optimize costs, and enhance performance.

- Optimize disaster recovery, backup, and failover mechanisms using cloud-native services.

Automation:

- Develop, implement, and maintain automation scripts and tools to simplify system deployment and monitoring processes.

- Build and maintain CI/CD pipelines to automate the release process and improve deployment efficiency.

- Automate repetitive operational tasks and streamline cloud/on-prem infrastructure management.

Monitoring & Reliability:

- Maintain monitoring and alerting systems to track infrastructure and application health.

- Build dashboards to visualize key performance metrics and SLIs/SLOs.

- Proactively identify and resolve system bottlenecks, potential points of failure, and performance issues.

- Participate in on-call rotations to provide operational support and troubleshoot critical issues.

Incident and Change Management:

- Lead the incident management process by handling priority incidents and escalations, performing root cause analysis, and driving post-incident reviews.

- Ensure timely communication during incidents and collaborate with relevant teams to restore services as quickly as possible.

- Maintain detailed incident records, track incident metrics, and contribute to continuous improvement initiatives to minimize future incidents.

- Ensure that all system changes are properly planned, documented, and approved following the change management process.

Collaboration:

- Work closely with software development and IT teams to ensure seamless integration of services and deployments.

- Provide mentorship and knowledge-sharing sessions on automation best practices and cloud infrastructure management.

- Support teams in migrating legacy systems to the cloud and implementing cloud-native solutions.

The ideal candidate should have a range of computer science skills, be results-oriented and driven, and possess a demonstrable sense of ownership.

Qualifications

- Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).

- 3+ years of experience in site reliability engineering, systems engineering, or a related role.

- Proficiency with at least one programming language (Python, Java, JavaScript, Go) for automation tasks and experience with infrastructure-as-code tools like Terraform, Ansible, or CloudFormation.

- Hands-on experience with cloud platforms (AWS, GCP) and managing large-scale cloud infrastructures.

- Experience with Linux and a deep understanding of operating system principles.

- Hands-on experience with container-based solutions such as Docker and container orchestration tools like Kubernetes.

- Proficiency in monitoring tools such as Prometheus, Grafana, OpenSearch, and Chronosphere.

- Knowledge of networking concepts such as TCP/IP and HTTP, as well as application security, monitoring, and storage systems.

- Solid understanding of CI/CD processes, system methodologies, and troubleshooting techniques.

- Familiarity with ITIL frameworks and experience in incident and change management processes.

- Experience working with globally distributed teams.

Bonus Points:

- Cloud certifications (e.g., AWS Certified Solutions Architect, Google Professional Cloud Engineer).

- Certified Kubernetes Administrator (CKA).

- ITIL 4 Foundation.

Yahoo is proud to be an equal opportunity workplace. All qualified applicants will receive consideration for employment without regard to, and will not be discriminated against based on age, race, gender, color, religion, national origin, sexual orientation, gender identity, veteran status, disability or any other protected category. Yahoo is dedicated to providing an accessible environment for all candidates during the application process and for employees during their employment. If you need accessibility assistance and/or a reasonable accommodation due to a disability, please submit a request via the Accommodation Request Form (www.yahooinc.com/careers/contact-us.html) or call 408-336-1409. Requests and calls received for non-disability related issues, such as following up on an application, will not receive a response.

Yahoo has a high degree of flexibility around employee location and hybrid working. In fact, our flexible-hybrid approach to work is one of the things our employees rave about. Most roles don’t require specific regular patterns of in-person office attendance. If you join Yahoo, you may be asked to attend (or travel to attend) on-site work sessions, team-building, or other in-person events. When these occur, you’ll be given notice to make arrangements. 

If you’re curious about how this factors into this role, please discuss with the recruiter.

Currently work for Yahoo? Please apply on our internal career site.

Yahoo

Website: http://www.yahoo.com/

Headquarters Location: Sunnyvale, California, United States

Employee Count: 5001-10000

Year Founded: 1994

IPO Status: Delisted

Last Funding Type: Series B

Industries: Email ⋅ Internet ⋅ Native Advertising ⋅ Online Portals ⋅ Search Engine ⋅ Social Media