Posted:
2/10/2026, 3:19:47 PM
Location(s):
Hyderabad, Telangana, India ⋅ Telangana, India
Experience Level(s):
Senior
Field(s):
DevOps & Infrastructure ⋅ Software Engineering
Job Summary
Synechron is seeking a seasoned Lead Site Reliability Engineer (SRE) to oversee incident and problem management, enhance system observability, and ensure operational resilience across enterprise systems. This role involves implementing best practices for reliability, automation, and monitoring in cloud environments, with a focus on reducing downtime, optimizing performance, and proactively managing service health. You will collaborate with application, infrastructure, and support teams to drive continuous improvement and ensure the stability of critical business services. Your contributions will support Synechron’s strategic goal of delivering highly available, secure, and efficient digital platforms.
Software Requirements
Required:
Extensive hands-on experience with Splunk (search, dashboards, alerts) and Grafana (dashboards, queries, transformations) (7+ years)
Strong understanding of observability and monitoring principles, including log, metric, and trace-based monitoring
Deep knowledge of SRE principles: MTTR, MTTD, SLIs, SLOs, error budgets, and reliability practices
Practical experience with APM tools such as Dynatrace, App Insights, or similar (preferred)
Proficiency in scripting with Python, Shell, or PowerShell for operational automation (7+ years)
Hands-on experience with cloud environments: AWS and/or Azure
Preferred:
Experience with enterprise deployment tools and automation frameworks
Knowledge of cloud-native architectures and multi-cloud strategies
Exposure to security best practices and compliance requirements in cloud operations
Overall Responsibilities
Lead incident and problem management activities to resolve high-severity issues efficiently, minimizing downtime and impact
Develop and continuously improve observability through dashboards, logs, metrics, and traces to provide comprehensive system insights
Define, monitor, and improve SLIs, SLOs, error budgets, and service reliability metrics
Automate monitoring, remediation, and data collection processes for operational efficiency
Lead root cause analysis for recurring issues and implement permanent fixes to reduce incident recurrence
Collaborate with cross-disciplinary teams to align on reliability standards, capacity planning, and disaster recovery
Support deployment and release processes by ensuring system stability during changes
Drive proactive incident detection and reduce MTTR and MTTD through automation and improved alerting strategies
Document incident procedures, troubleshooting guides, and reliability metrics for audit and operational purposes
Mentor junior engineers, promote reliability best practices, and establish operational excellence standards
Technical Skills (By Category)
Monitoring & Observability (Essential):
Splunk (search, dashboards, alerts), Grafana (dashboards, queries, transformations) (7+ years)
Prometheus, CloudWatch or similar tools for metrics collection and monitoring
Log, metric, and trace-based observability practices for enterprise systems
Incident & Problem Management:
Root cause analysis, incident management workflows, and automation of remediation processes
Understanding of ITIL principles (preferred)
Automation & Scripting:
Python, Shell, or PowerShell scripting for automating operational tasks and alerts (7+ years)
Cloud & Infrastructure:
AWS: CloudFormation, CDK, CodePipeline, CodeBuild, Lambda, EC2, ECS, DynamoDB
Azure or multi-cloud experience (preferred)
Security & Compliance:
Awareness of security best practices, certificate management, and compliance regulations (e.g., PCI-DSS, SOC)
Experience Requirements
7+ years of experience supporting enterprise systems with a focus on reliability, incident response, and automation in cloud environments
Proven track record managing high-severity, critical services with automated remediation capabilities
Hands-on experience with CloudWatch, Splunk, Grafana, and observability frameworks
Experience working with cross-functional teams to improve operational stability and reliability in complex enterprise environments
Prior experience in regulated industries such as finance or banking is a plus
Day-to-Day Activities
Respond to and resolve high-priority incidents, conducting root cause analysis to prevent recurrence
Build and maintain dashboards, alerts, and monitoring pipelines for system health and performance
Automate incident detection, alerting, and remediation workflows to reduce manual intervention
Collaborate with development, security, and support teams to improve service reliability and security posture
Participate in deployment processes, ensuring stability during releases and upgrades
Conduct trend analysis on incidents, alert noise, and recurring issues to guide continuous improvement initiatives
Develop documentation, runbooks, and operational metrics to support audits and compliance efforts
Lead initiatives for capacity planning, disaster recovery, and performance tuning
Qualifications
Bachelor’s degree in Computer Science, Information Technology, or a related field
7+ years of experience in reliability engineering, incident management, or DevOps supporting enterprise systems
Certifications such as AWS Certified DevOps Engineer, Azure DevOps Engineer, or equivalent (preferred)
Demonstrated success in automating monitoring and incident response in distributed systems
Professional Competencies
Critical thinking and strong analytical problem-solving skills
Leadership and teamwork capabilities, with mentorship experience
Excellent communication skills to articulate technical issues and collaborate with stakeholders
Ability to prioritize tasks, manage time effectively, and deliver results under pressure
Adaptability to evolving technologies, cloud environments, and compliance standards
A proactive mindset focused on continuous improvement and operational excellence
SYNECHRON’S DIVERSITY & INCLUSION STATEMENT
Diversity & Inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and an affirmative action employer. Our Diversity, Equity, and Inclusion (DEI) initiative ‘Same Difference’ is committed to fostering an inclusive culture – promoting equality, diversity and an environment that is respectful to all. As a global company, we strongly believe that a diverse workforce helps build stronger, more successful businesses. We encourage applicants of all backgrounds, races, ethnicities, religions, ages, marital statuses, genders, sexual orientations, and disabilities to apply. We empower our global workforce by offering flexible workplace arrangements, mentoring, internal mobility, learning and development programs, and more.
All employment decisions at Synechron are based on business needs, job requirements and individual qualifications, without regard to the applicant’s gender, gender identity, sexual orientation, race, ethnicity, disabled or veteran status, or any other characteristic protected by law.
Website: https://www.synechron.com/
Headquarters Location: New York, New York, United States
Employee Count: 5001-10000
Year Founded: 2001
IPO Status: Private
Industries: Consulting ⋅ IT Management ⋅ Software