Site Reliability Engineer (SRE) – Observability, Incident Management & Cloud Automation

Posted:
2/10/2026, 3:19:47 PM

Location(s):
Hyderabad, Telangana, India ⋅ Telangana, India

Experience Level(s):
Senior

Field(s):
DevOps & Infrastructure ⋅ Software Engineering

Job Summary

Synechron is seeking a seasoned Lead Site Reliability Engineer (SRE) to oversee incident and problem management, enhance system observability, and ensure operational resilience across enterprise systems. This role involves implementing best practices for reliability, automation, and monitoring in cloud environments, with a focus on reducing downtime, optimizing performance, and proactively managing service health. You will collaborate with application, infrastructure, and support teams to drive continuous improvement and ensure the stability of critical business services. Your contributions will support Synechron’s strategic goal of delivering highly available, secure, and efficient digital platforms.


Software Requirements

Required:

  • 7+ years of extensive hands-on experience with Splunk (search, dashboards, alerts) and Grafana (dashboards, queries, transformations)

  • Strong understanding of observability and monitoring principles, including log, metric, and trace-based monitoring

  • Deep knowledge of SRE principles: MTTR, MTTD, SLIs, SLOs, error budgets, and reliability practices

  • Practical experience with APM tools such as Dynatrace, App Insights, or similar (preferred)

  • 7+ years of scripting experience in Python, Shell, or PowerShell for operational automation

  • Familiarity with cloud environments: AWS and/or Azure (hands-on)


Preferred:

  • Experience with enterprise deployment tools and automation frameworks

  • Knowledge of cloud-native architectures and multi-cloud strategies

  • Exposure to security best practices and compliance requirements in cloud operations


Overall Responsibilities

  • Lead incident and problem management activities to resolve high-severity issues efficiently, minimizing downtime and impact

  • Develop and continuously improve observability through dashboards, logs, metrics, and traces to provide comprehensive system insights

  • Define, monitor, and improve SLIs, SLOs, error budgets, and service reliability metrics

  • Automate monitoring, remediation, and data collection processes for operational efficiency

  • Lead root cause analysis for recurring issues and implement permanent fixes to reduce incident recurrence

  • Collaborate with cross-disciplinary teams to align on reliability standards, capacity planning, and disaster recovery

  • Support deployment and release processes by ensuring system stability during changes

  • Drive proactive incident detection and reduce MTTR and MTTD through automation and improved alerting strategies

  • Document incident procedures, troubleshooting guides, and reliability metrics for audit and operational purposes

  • Mentor junior engineers, promote reliability best practices, and establish operational excellence standards


Technical Skills (By Category)

Monitoring & Observability (Essential):

  • Splunk (search, dashboards, alerts), Grafana (dashboards, queries, transformations) (7+ years)

  • Prometheus, CloudWatch or similar tools for metrics collection and monitoring

  • Log, metric, and trace-based observability practices for enterprise systems


Incident & Problem Management:

  • Root cause analysis, incident management workflows, and automation of remediation processes

  • Understanding of ITIL principles (preferred)


Automation & Scripting:

  • Python, Shell scripting, PowerShell for automating operational tasks and alerts (7+ years)


Cloud & Infrastructure:

  • AWS: CloudFormation, CDK, CodePipeline, CodeBuild, Lambda, EC2, ECS, DynamoDB

  • Azure or multi-cloud experience (preferred)


Security & Compliance:

  • Awareness of security best practices, certificate management, and compliance regulations (e.g., PCI-DSS, SOC)


Experience Requirements

  • 7+ years of experience supporting enterprise systems with a focus on reliability, incident response, and automation in cloud environments

  • Proven track record managing high-severity, critical services with automated remediation capabilities

  • Hands-on experience with CloudWatch, Splunk, Grafana, and observability frameworks

  • Experience working with cross-functional teams to improve operational stability and reliability in complex enterprise environments

  • Prior experience in regulated industries such as finance or banking is a plus


Day-to-Day Activities

  • Respond to and resolve high-priority incidents, conducting root cause analysis to prevent recurrence

  • Build and maintain dashboards, alerts, and monitoring pipelines for system health and performance

  • Automate incident detection, alerting, and remediation workflows to reduce manual intervention

  • Collaborate with development, security, and support teams to improve service reliability and security posture

  • Participate in deployment processes, ensuring stability during releases and upgrades

  • Conduct trend analysis on incidents, noise, and recurring issues to guide continuous improvement initiatives

  • Develop documentation, runbooks, and operational metrics to support audits and compliance efforts

  • Lead initiatives for capacity planning, disaster recovery, and performance tuning


Qualifications

  • Bachelor’s degree in Computer Science, Information Technology, or a related field

  • 7+ years of experience in reliability engineering, incident management, or DevOps supporting enterprise systems

  • Certifications such as AWS Certified DevOps Engineer, Azure DevOps Engineer, or equivalent preferred

  • Demonstrated success in automating monitoring and incident response in distributed systems


Professional Competencies

  • Critical thinking and strong analytical problem-solving skills

  • Leadership and teamwork capabilities, with mentorship experience

  • Excellent communication skills to articulate technical issues and collaborate with stakeholders

  • Ability to prioritize tasks, manage time effectively, and deliver results under pressure

  • Adaptability to evolving technologies, cloud environments, and compliance standards

  • A proactive mindset focused on continuous improvement and operational excellence

SYNECHRON’S DIVERSITY & INCLUSION STATEMENT

Diversity & Inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and an affirmative action employer. Our Diversity, Equity, and Inclusion (DEI) initiative ‘Same Difference’ is committed to fostering an inclusive culture that promotes equality, diversity, and an environment that is respectful to all. As a global company, we strongly believe that a diverse workforce helps build stronger, more successful businesses. We encourage applicants of all backgrounds, races, ethnicities, religions, ages, marital statuses, genders, sexual orientations, and disabilities to apply. We empower our global workforce by offering flexible workplace arrangements, mentoring, internal mobility, learning and development programs, and more.


All employment decisions at Synechron are based on business needs, job requirements and individual qualifications, without regard to the applicant’s gender, gender identity, sexual orientation, race, ethnicity, disabled or veteran status, or any other characteristic protected by law.
