Job Description
SUMMARY OF JOB PURPOSE
The DevOps / Site Reliability Engineer (SRE) possesses a strong background in deploying and managing infrastructure using modern DevOps practices, with expertise in Kubernetes, Terraform, and observability and monitoring platforms such as DataDog. The DevOps / SRE works closely with the development and operations teams to ensure the reliability, scalability, and performance of our systems.
PRIMARY JOB RESPONSIBILITIES
- Designs, deploys, and maintains cloud infrastructure using Kubernetes and Terraform, ensuring scalability, reliability, and performance.
- Collaborates with development teams to implement CI/CD pipelines and automate deployment processes.
- Monitors system performance, troubleshoots issues, and implements solutions to optimize performance and ensure uptime.
- Develops and maintains monitoring and alerting systems using observability tools such as DataDog.
- Implements and manages microservices architectures, ensuring seamless communication and scalability.
- Troubleshoots and resolves issues related to infrastructure, deployments, and performance, ensuring high availability and reliability of our systems.
- Stays updated on emerging technologies and industry trends and incorporates them into our infrastructure and practices where applicable.
- Participates in on-call rotation to address issues and incidents during weekdays, ensuring system reliability and availability.
- Collaborates closely with all other members of the team to take shared responsibility for the overall efforts that the team has committed to for each sprint.
- Establishes and maintains positive working relationships with other members of the organization across departments, divisions, and locations.
- Maintains the confidentiality of proprietary and sensitive information, exercising sound judgment and discretion in any disclosure of information related to EM and its endeavors.
- Upholds the values of Engle Martin and Our Foundation.
REQUIRED EDUCATION & EXPERIENCE
- Bachelor’s degree in computer science, engineering, or a related field, or equivalent work experience
- At least 3-5 years of experience in a DevOps role required; experience as a Site Reliability Engineer preferred
- Prior experience with cloud platforms such as AWS, Azure, or Google Cloud Platform (GCP)
- Prior experience with observability and monitoring platforms such as DataDog, Dynatrace, or Splunk
- Certification in relevant cloud technologies preferred (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator)
- Prior experience with Azure AKS preferred
- Experience with other DevOps tools and technologies such as Azure DevOps, Jenkins, GitLab CI/CD, etc. preferred
DESIRED KNOWLEDGE, SKILLS & ABILITIES
- Strong proficiency in Kubernetes and Terraform for managing and deploying infrastructure
- Solid understanding of microservices architecture and experience in deploying and managing microservices-based systems
- Proficiency in scripting languages such as Python, Shell, or Bash for automation tasks
- Familiarity with Agile methodologies and practices
- Knowledge of security best practices for cloud environments
- Excellent problem-solving skills and ability to troubleshoot complex issues in distributed systems
- Strong communication and collaboration skills, with the ability to work effectively across teams in a fast-paced, agile environment
- Willingness to participate in an on-call rotation to address issues during weekdays
- Commitment to professional and personal growth and development
WORKING CONDITIONS
Work is conducted primarily in an indoor office environment with protection from weather conditions and with exposure to noise typical of an office or administrative setting.
PHYSICAL ACTIVITIES AND REQUIREMENTS
Lifting and carrying up to 20 lbs.; frequent sitting, standing, walking, and bending; occasional kneeling, reaching, and stooping; handling office equipment; periodic driving may be required; visual acuity to prepare, read, and organize detailed hard copy and electronic documents; ability to speak and to hear the spoken word in normal face-to-face, web-based, and telephonic business communications. Willingness to travel in a work capacity, including occasional evening, overnight, and weekend hours. Willingness to accommodate occasional meetings and work activities that may be scheduled after normal daytime business hours.