SRE Lead – DBaaS Platform

Posted:
3/2/2026, 6:36:48 PM

Location(s):
Karnataka, India ⋅ Bengaluru, Karnataka, India

Experience Level(s):
Senior

Field(s):
DevOps & Infrastructure ⋅ Software Engineering

Job Title: SRE Lead – DBaaS Platform
Role Overview
We are seeking an experienced Site Reliability Engineering (SRE) Lead to strengthen
production reliability ownership for our Database-as-a-Service (DBaaS) platform. This role
will bring hyperscaler-grade (RDS-level) operational expertise to drive deep product
debugging, reliability engineering, and Dev collaboration across cloud-native database
services.
The SRE Lead will own platform stability, availability, performance, and incident excellence
across Azure/AWS/GCP-hosted database workloads.
Location: Hyderabad
Department: Customer Success
Reporting: Senior Director Customer Success/SRE

Key Responsibilities
1. Production Reliability Ownership
• Own end-to-end reliability, availability, and performance of the DBaaS platform.
• Define and enforce SLIs, SLOs, and SLAs across all supported database engines.
• Lead production incident response (P1/P2), RCAs, and long-term resilience improvements.
• Drive error budget governance with Engineering and Product teams.
2. Hyperscaler-Level Operational Excellence
• Bring RDS/Cloud SQL/Azure SQL Managed Instance operational patterns into the platform.
• Implement automation-first operations (self-healing, auto-remediation, failover orchestration).
• Standardize HA/DR architectures across multi-region deployments.
• Improve backup reliability, replication integrity, and failover predictability.
3. Deep Product Debugging & Dev Collaboration
• Partner with Product Engineering for deep database engine-level debugging.
• Troubleshoot complex performance bottlenecks (I/O, CPU, locking, replication lag).
• Support root cause analysis involving cloud infrastructure, storage, networking, and database internals.
• Influence platform architecture for operability and reliability.
4. Observability & Reliability Engineering
• Build unified observability across DBaaS (metrics, logs, traces).
• Define golden signals for database reliability.
• Improve proactive anomaly detection and capacity forecasting.
• Drive chaos testing and resilience validation practices.
5. Automation & Platform Hardening
• Lead reliability automation (runbooks → code).
• Improve provisioning, patching, upgrade, and scaling reliability.
• Standardize configuration management and drift detection.
• Enhance security posture aligned to enterprise compliance needs.
6. DevOps & Platform Governance
• Champion SRE best practices across engineering teams.
• Establish production readiness review frameworks.
• Define release reliability gates for DBaaS components.
• Mentor junior SREs and build a reliability-first culture.

Technical Requirements
Cloud Platforms (Mandatory – Multi-Cloud Preferred)
• Deep hands-on experience with:
  ◦ AWS RDS / Aurora
  ◦ Azure SQL MI / Azure Database Services
  ◦ GCP Cloud SQL / AlloyDB
• Strong understanding of cloud networking, storage, IAM, and HA architectures.
Database Expertise
• Strong operational knowledge of:
  ◦ Oracle
  ◦ PostgreSQL
  ◦ MySQL
  ◦ SQL Server
• Experience handling large-scale production databases (TB+ workloads).
• Performance tuning, replication troubleshooting, and backup recovery validation.
SRE & Platform Skills
• Strong scripting: Python / Bash / Go.
• Infrastructure as Code (Terraform / ARM / CloudFormation).
• CI/CD pipelines and release automation.
• Observability stack (Prometheus, Grafana, ELK, Datadog, etc.).
• Kubernetes exposure preferred.

Leadership Expectations
• 10+ years of overall experience, including 5+ years in SRE/Platform roles.
• Prior experience in hyperscaler environments or cloud-native SaaS products.
• Strong incident leadership and executive communication skills.
• Ability to influence cross-functional stakeholders.
• Experience building and leading SRE teams preferred.

Success Metrics (First 12 Months)
• Reduction in P1/P2 incidents by X%.
• Improvement in MTTR by X%.
• Defined SLO framework implemented across all DBaaS services.
• Automation coverage >70% of repeat operational tasks.
• Zero critical audit non-compliance findings.

Why Join Us
• Opportunity to build hyperscaler-grade DBaaS reliability.
• Direct impact on mission-critical enterprise workloads.
• Multi-cloud platform engineering exposure.
• High-visibility role working with Product, Engineering, and Leadership.