Principal SRE (AI Enablement Platform)-2

Posted:
5/6/2026, 1:14:35 AM

Experience Level(s):
Expert or higher ⋅ Senior

Field(s):
DevOps & Infrastructure ⋅ Software Engineering

Join ABC Fitness, the leading technology provider for the fitness industry!

What You’ll Do

• Architect and evolve core platform capabilities for reliability, including execution environments, CI/CD systems, and validation pipelines that support high-throughput, machine-assisted change.

• Design and implement fast, ephemeral, and strictly isolated execution environments where generated work can be built, tested, and safely discarded at scale.

• Transform CI/CD into a validation system by embedding automated verification (tests, integration harnesses, canarying, rollback signals) into promotion decisions.

• Build production-like validation environments that allow realistic system behavior testing without impacting live systems.

• Establish deep observability patterns for autonomous workflows, including tracing what ran, what failed, why, and what it cost across agents, tools, and orchestration layers.

• Define and implement guardrails-as-code, including access controls, policy enforcement, cost protections, and auditability for platform usage.

• Design for reliability from day one, including scalability, fault tolerance, performance optimization, and operational resilience.

• Lead technical design reviews and influence platform and infrastructure decisions across engineering teams.

• Define and document reusable infrastructure patterns, platform standards, and reference implementations that create a consistent paved path for teams.

 

What This Is Not

• Not a ticket queue or generic support role.

• Not incremental, operations-only work without ownership of architecture and adoption.

• Not “just Kubernetes admin”: Kubernetes is one layer in a broader platform problem.

 

What You’ll Need

• Typically 10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Platform Engineering.

• Deep experience designing and operating distributed systems at scale, including cloud platforms (e.g., AWS), Kubernetes, and infrastructure-as-code.

• Strong expertise in reliability engineering practices, including incident management, fault isolation, resiliency design, and system performance tuning.

• Experience building and operating CI/CD systems, test harnesses, and automated validation frameworks.

• Strong understanding of observability systems, including metrics, logging, tracing, and system-level debugging.

• Demonstrated ability to define technical standards and influence multiple teams through architecture, design review, and strong engineering judgment.

• Strong production mindset, with experience designing systems for scalability, availability, and operational efficiency.

• Experience implementing secure, multi-tenant infrastructure with strong isolation, IAM, and secrets management practices.

• Excellent cross-functional collaboration skills.

• Growth mindset and One Team orientation.

 

And It’s Great to Have

• Experience supporting AI/LLM-powered systems in production, including understanding of latency, cost, and orchestration challenges.

• Experience designing high-throughput ephemeral compute systems or sandboxed execution environments.

• Experience building internal developer platforms or platform-as-a-product capabilities.

• Familiarity with governance or regulated environments.

• Experience with advanced validation systems such as canarying, chaos engineering, or automated rollback strategies.

 

What Success Looks Like

• Faster delivery through platform-enabled validation and automation.

• Automated validation of changes before production, reducing reliance on manual review.

• Platform standards adopted across teams as the default paved path.

• Early detection of reliability issues through strong observability and validation systems.

• Reduced infrastructure complexity so engineers can focus on product and policy.

 

Why This Matters

ABC Fitness is evolving toward an AI-native engineering model where automation, agents, and platform systems handle increasing portions of the software lifecycle. This role builds the foundation that enables scalable, trustworthy, and high-velocity software delivery across the organization.

If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!