Job Title: Software Engineer - Application SRE
Location: Bangalore (India)
About Circles.Life
Founded in 2014, Circles is a global technology company reimagining the telco industry with its SaaS platform, Circles X, which helps telco operators launch and operate successful digital brands.
Having pioneered a successful blueprint for disrupting the telco space in Singapore, Circles has since launched its own digital telco, Circles.Life, in Singapore, Taiwan and Australia. Circles has also partnered with other telco operators to launch digital services, enabling our partners to accelerate growth and capture market share within a short period of time.
Today, Circles is partnering with operators in 14 countries to deliver delightful digital experiences to millions of people through our businesses.
We are backed by global investors such as Sequoia, Warburg Pincus, EDBI and Founders Fund – renowned backers of industry-shaking innovators.
Position Overview
We are seeking a talented Software Engineer - Application SRE to join our Site Reliability Engineering team. In this role, you will focus on improving the reliability, scalability, and performance of our mission-critical applications. You will work closely with application developers and operations teams to build automated solutions, monitor applications, and address system challenges to ensure high availability. This role combines software engineering skills with a passion for operational excellence to support a reliable and resilient application infrastructure.
Objectives of This Role
- Understand and document the non-functional performance and scalability requirements, including SLIs/SLOs, and validate them with business stakeholders.
- Manage SLIs/SLOs of customer-facing interfaces as well as backend services, and provide improvement plans where they are not met.
- Develop custom dashboards in observability platforms (e.g., New Relic, Dynatrace, Grafana) to represent a holistic view of system operational health.
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Support release engineering by providing automation support and by pushing changes to production when manual intervention is needed.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Provide primary operational support and engineering for multiple large distributed software applications
Key Responsibilities
- Application Reliability: Design, build, and maintain tools and systems that ensure the reliability and uptime of our core applications, focusing on automation and performance optimization.
- Incident Management: Actively participate in incident response, troubleshooting production issues, analyzing root causes, and implementing permanent fixes to improve overall system reliability.
- Problem Management: Conduct 5 Whys analysis on issues related to application design, code, and configuration to identify the most likely root cause and a solution that prevents recurrence.
- Automation and Tooling: Develop and maintain automated workflows for application deployment, monitoring, and scaling using CI/CD pipelines and infrastructure as code (IaC) tools.
- Monitoring and Observability: Implement and maintain application monitoring and logging systems to track performance metrics and detect anomalies early, using tools like Prometheus, Grafana, or Datadog.
- Service-Level Objectives (SLOs): Collaborate with cross-functional teams to define and maintain service-level objectives (SLOs) and indicators (SLIs) that align with business goals.
- Collaboration: Work closely with software engineers and operations teams to integrate reliability best practices into the development lifecycle, ensuring applications are scalable, reliable, and secure.
- Performance Tuning: Identify application performance bottlenecks and work with developers to optimize code, queries, and infrastructure to improve efficiency and reduce latency.
- Capacity Planning and Scalability: Monitor and plan for application capacity needs, ensuring systems can scale to meet growth demands and handle traffic spikes.
- Continuous Improvement: Participate in post-incident reviews and retrospectives, providing feedback to continuously improve system architecture and operational processes.
Required Skills and Experience
- 2-5 years of experience in software engineering, site reliability engineering, or a related role.
- Proficiency in at least one programming language (e.g., Python, Go, Java, Ruby) and strong scripting skills (e.g., Bash, Python).
- Hands-on experience with Spring Boot, Go, and React.
- Experience with cloud platforms such as AWS, Google Cloud, or Azure and familiarity with cloud-native architectures.
- Hands-on experience with monitoring, alerting, and observability tools (e.g., Datadog, Prometheus, Grafana, New Relic).
- Experience working with CI/CD pipelines and automation tools (e.g., Jenkins, GitLab CI, CircleCI).
- Familiarity with containers and orchestration tools (e.g., Docker, Kubernetes).
- Strong understanding of Infrastructure as Code (IaC) principles and tools such as Terraform, Ansible, or CloudFormation.
- Proven ability to diagnose and troubleshoot application performance and reliability issues in a production environment.
- Knowledge of version control systems (e.g., Git) and collaboration in a DevOps/SRE environment.
- Ability to work independently and collaboratively within cross-functional teams.
Preferred Qualifications
- Experience with microservice architectures and distributed systems.
- Understanding of database management (SQL and NoSQL) and caching technologies.
- Familiarity with security best practices for cloud environments and application reliability.
- Experience participating in on-call rotations for production systems and incident management.
- Exposure to blameless postmortems and a culture of continuous improvement.