Posted:
10/27/2025, 5:13:14 PM
Location(s):
Cluj-Napoca, Romania
Experience Level(s):
Junior ⋅ Mid Level ⋅ Senior
Field(s):
DevOps & Infrastructure ⋅ Software Engineering
Workplace Type:
Remote
About Betfair Romania Development:
Betfair Romania Development is the largest technology hub of Flutter Entertainment, with over 2,000 people powering the world’s leading sports betting and iGaming brands. Exciting, immersive and safe experiences are delivered to over 18 million customers worldwide, from our office in Cluj-Napoca. Driven by relentless innovation and commitment to excellence, we operate our own unbeatable portfolio of diverse proprietary brands such as FanDuel, PokerStars, SportsBet, Betfair, Paddy Power, and Sky Betting & Gaming.
Our Values:
The values we share at Betfair Romania Development define what makes us unique as a team. They empower us by giving meaning to our contributions, and they ensure that we consistently strive for excellence in everything we do. We are looking for passionate individuals who align with our values and are committed to making a difference.
Win together | Raise the bar | Got your back | Own it | Positive impact
About Flutter Functions:
The Flutter Functions division is a key component of Flutter Entertainment, responsible for providing essential support and services across the organization. The division encompasses various corporate functions, including finance, legal, human resources, technology, and more, ensuring seamless operations and strategic alignment throughout the company.
Role Overview:
The Site Reliability Engineer will be responsible for ensuring the reliability, availability, and performance of Flutter Entertainment's critical gaming and betting platforms across our global operations. This role combines software engineering expertise with operational excellence to maintain 24/7/365 service availability for millions of customers worldwide. As part of the Service Management Function within Flutter Functions, you will collaborate closely with development teams, infrastructure specialists, and business stakeholders to maintain the high-performance, scalable systems that power our iGaming & Sport platforms across multiple markets. Your role will involve implementing automation, monitoring, and incident response procedures to support Flutter's mission of delivering world-class entertainment experiences.
You understand and embrace the philosophy of continuous improvement and have experience leading teams that operate within a continuous improvement (CI) culture. You don't complain about recurring incidents; you drive process improvements and implement preventative measures to eliminate root causes. You work with internal and external teams to drive best-in-class practices, develop real-world solutions, and deliver positive user experiences in every interaction.
This role requires exceptional communication skills, as interaction and engagement with senior management during incident escalations and post-incident reviews will be a regular aspect of the role.
Key Accountabilities & Responsibilities:
Maintain 99.9%+ uptime for critical gaming and betting platforms serving millions of concurrent users
Design and implement monitoring, alerting, and observability solutions using tools such as Grafana, Splunk, and CloudWatch
Conduct capacity planning and performance optimization to ensure systems can handle peak loads during major sporting events
Establish and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical services with support from Service Management
Support ProdOps and Service Management teams during P1/P2 incident response, providing technical expertise and facilitating cross-functional coordination to minimize customer impact
Collaborate with Service Management on post-incident reviews, contributing technical insights and supporting the implementation of preventative measures to reduce repeat occurrences
Assist in developing and maintaining comprehensive runbooks and incident response procedures in partnership with Service Management teams
Grafana Stack Management: Design, deploy, and maintain comprehensive Grafana dashboards for real-time system visibility across all Flutter platforms
Advanced Visualization: Create custom Grafana panels and dashboards for business metrics, technical KPIs, and operational insights tailored to different stakeholder needs
Multi-Source Data Integration: Configure and optimize Grafana data sources including Prometheus, InfluxDB, Elasticsearch, CloudWatch, and custom APIs
Alerting Strategy: Implement intelligent alerting rules using Grafana Alerting, reducing alert fatigue while ensuring critical issues are promptly escalated
Performance Monitoring: Establish application performance monitoring (APM) using Grafana Agent and integrate it with the existing observability stack
Custom Metrics Development: Work with development teams to implement custom business and technical metrics that provide actionable insights
Partner with development teams to improve application reliability and deployment practices
Mentor junior team members and contribute to the development of SRE practices across Flutter
Participate in architecture reviews and provide reliability expertise for new system designs
Document procedures, troubleshooting guides, and system architecture for knowledge sharing
Look for ways to use AI to triage and investigate alerts, allowing for more rapid resolution
Use AI to find root causes by connecting the dots between code changes, alerts, and past incidents
Investigate the use of AI to improve collaboration and identify possible resolutions to incidents
Skills, Capabilities & Experience Required:
Cloud Platforms: Advanced experience with AWS, Azure, or Google Cloud Platform services and architecture
Containerization: Proficiency with Docker and Kubernetes for container orchestration and management
Programming: Strong scripting abilities in Python, Go, Bash, or PowerShell; familiarity with Java or .NET is advantageous
Monitoring & Observability: Hands-on experience with Prometheus, Grafana, ELK stack, or similar monitoring solutions
CI/CD: Proficiency with Jenkins, GitLab CI, Azure DevOps, or similar continuous integration tools
Database Technologies: Working knowledge of SQL databases (PostgreSQL, MySQL) and NoSQL solutions
Networking: Understanding of load balancers, CDNs, DNS, and network security principles
Benefits:
Hybrid & remote working options
€1,000 per year for self-development
Company share scheme
25 days of annual leave per year
20 days per year to work abroad
5 personal days/year
Flexible benefits: travel, sports, hobbies
Extended health, dental and travel insurance
Customized well-being programmes
Career growth sessions
Thousands of online courses through Udemy
A variety of engaging office events
Disclaimer:
We are an inclusive employer. By embracing diverse experiences and perspectives, we create a lasting, positive impact for our employees, customers, and the communities we’re part of. You don't have to meet all the requirements listed to apply for this role. If you need any adjustments to make this role work for you, let us know, and we’ll see how we can accommodate them.
We thank all applicants for their interest; however, only the candidates who best meet the job requirements will be contacted for an interview.
By submitting your application online, you agree that your details will be used to progress your application for employment. If your application is successful, your details will be used to administer your personnel record. If your application is unsuccessful, we will retain your details for a period no longer than three years, to consider you for prospective roles within the company.
Website: https://betfairromania.ro/
Headquarters Location: Cluj-Napoca, Cluj, Romania
Employee Count: 501-1000
Year Founded: 2007
IPO Status: Private
Industries: Information Services ⋅ Information Technology ⋅ Software