GenAI Site Reliability Engineering Architect - Senior Vice President

Posted:
4/23/2026, 2:53:28 AM

Location(s):
Bengaluru, Karnataka, India ⋅ Karnataka, India

Experience Level(s):
Expert or higher ⋅ Senior

Field(s):
AI & Machine Learning ⋅ DevOps & Infrastructure ⋅ Software Engineering

Workplace Type:
Hybrid

About the Role

We're seeking an exceptional Site Reliability Engineering Architect to lead the technical vision and operational excellence of our enterprise GenAI platform serving 180,000+ Citi employees globally. This is a senior individual contributor role for someone who wants to architect intelligent, self-healing infrastructure at the intersection of AI and reliability engineering—without the overhead of people management.

You'll work with cutting-edge AI infrastructure including Claude, Gemini, and proprietary Citi models running on OpenShift/Kubernetes, building the next generation of AI-Ops capabilities that transform traditional operations into intelligent, autonomous systems.

About Our Team

Our team operates like a research-driven startup within Citi, rapidly innovating on AI operations while maintaining enterprise-grade reliability, security, and compliance. We build and operate Citi Stylus Workspaces and other mission-critical GenAI platforms that demand exceptional reliability, security, and performance at global scale.

What You'll Do

Platform Architecture & Reliability

  • Design and architect highly available, GPU-accelerated OpenShift clusters optimized for GenAI workloads
  • Build Model-as-a-Service platforms enabling seamless LLM hosting, inference, and lifecycle management
  • Architect multi-cluster, multi-region infrastructure supporting global AI platform availability (99.9%+ SLA)
  • Implement intelligent resource scheduling and optimization for GPU workloads and AI inference engines

AI-Ops & Intelligent Automation

  • Design and implement agentic AI workflows for automated incident detection, diagnosis, and remediation
  • Build Model Context Protocol (MCP) integrations enabling AI-driven operational decision-making
  • Create self-healing systems leveraging log analysis, anomaly detection, and automated remediation pipelines
  • Transform operational toil into intelligent automation that learns and adapts

Observability & Performance

  • Design and implement comprehensive observability stacks with Prometheus and Grafana providing deep visibility into AI workloads
  • Build custom metrics, exporters, and dashboards for LLM-specific monitoring (token throughput, inference latency, GPU utilization)
  • Establish SLO/SLI frameworks and error budget management for AI services
  • Drive performance optimization through data-driven analysis

Platform Engineering & GitOps

  • Architect and deploy OpenShift operators for AI/ML workloads (OpenShift AI, NVIDIA GPU Operator, Knative)
  • Design custom Kubernetes operators and controllers for platform-specific automation needs
  • Architect and maintain GitOps-driven deployment pipelines for multi-cluster AI infrastructure
  • Manage cluster lifecycle operations including upgrades, patching, and capacity expansion

Technical Leadership

  • Define technical vision and roadmap for GenAI platform reliability and operational excellence
  • Lead production incident response, root cause analysis, and blameless post-mortem processes
  • Provide technical mentorship to SRE and DevOps teams on advanced automation and AI-Ops practices
  • Partner with engineering, security, and business leaders to align infrastructure strategy with organizational objectives

What You Bring

Core Technical Expertise (Must-Have)

OpenShift & Kubernetes Mastery

  • 5+ years of expert-level OpenShift 4.x administration and architecture experience
  • 5+ years of deep Kubernetes expertise, including custom operators, controllers, and CRDs
  • Hands-on experience with Red Hat Advanced Cluster Management (RHACM) and multi-cluster operations
  • Experience designing and implementing Kubernetes operators using Operator SDK or similar frameworks

AI/ML Infrastructure & Operations

  • Practical experience deploying and operating AI/ML platforms (OpenShift AI, Kubeflow, or similar)
  • Knowledge of GPU cluster provisioning, NVIDIA GPU Operator, and accelerated computing workloads
  • Understanding of LLM inference optimization and model serving frameworks (vLLM, TensorRT, ONNX)
  • Experience with Model-as-a-Service architectures and MLOps lifecycle management

Automation & Infrastructure as Code

  • 5+ years of expert-level experience with Terraform and Ansible for infrastructure provisioning and configuration management
  • Strong scripting skills in Python, Bash, and PowerShell for automation and tooling
  • Experience with GitOps workflows and declarative infrastructure management
  • Proficiency with Helm charts and Kubernetes manifest templating

Observability & Reliability Engineering

  • Deep expertise in Prometheus, Grafana, and metrics-driven reliability engineering
  • Experience designing custom metrics, exporters, and dashboards for specialized workloads
  • Knowledge of distributed tracing and log aggregation (Splunk or similar)
  • Understanding of SLO/SLI frameworks and error budget management

Cloud & Hybrid Infrastructure

  • Experience with AWS and Azure cloud platforms and hybrid cloud architectures
  • Knowledge of GPU instance types and cost optimization strategies
  • Understanding of cloud-native networking, storage, and security patterns
  • Familiarity with vSphere and on-premises virtualization platforms

Emerging AI-Ops Capabilities (Highly Valued)

  • Experience implementing agentic AI workflows and autonomous remediation systems
  • Knowledge of Model Context Protocol (MCP) or similar AI orchestration frameworks
  • Practical experience with AI-driven anomaly detection and predictive analytics
  • Familiarity with serverless frameworks (Knative) and event-driven architectures

Professional Experience

  • 15+ years of overall infrastructure, DevOps, or SRE experience
  • 5+ years in senior SRE, DevOps Architect, or Platform Engineering leadership roles
  • 5+ years of hands-on experience with OpenShift/Kubernetes in production environments
  • 3+ years of practical experience with AI/ML infrastructure and operations
  • Experience managing enterprise-scale platforms (100,000+ users, multi-region deployments)
  • Track record of successfully delivering complex infrastructure modernization projects
  • Experience operating in regulated industries (finance, healthcare, government)

Nice to Have

  • Experience with Go programming language for building operators, controllers, or automation tools
  • Familiarity with CI/CD and source control tools (Jenkins, Bitbucket, Git)
  • Experience with service mesh implementations (Istio)
  • Understanding of enterprise security frameworks and compliance requirements (SOC 2, PCI DSS)
  • Experience with secrets management (Vault or similar)
  • Knowledge of policy-as-code frameworks (OPA, Kyverno)

Who You Are

Beyond technical skills, you are:

  • An innovative problem solver who transforms complex operational challenges into scalable solutions
  • Passionate about AI-Ops and leveraging AI to revolutionize traditional reliability engineering
  • A hands-on technical leader comfortable diving deep into technical details while maintaining a strategic perspective
  • Relentlessly focused on eliminating toil through intelligent automation
  • Data-driven, with strong analytical skills and the ability to use metrics to drive improvements
  • An excellent communicator able to articulate complex technical concepts to diverse audiences
  • Collaborative, with experience working across teams (engineering, security, business)
  • Curious about emerging technologies, with a commitment to staying current
  • Pragmatic, with the ability to balance ideal solutions against practical constraints and timelines
  • Calm under pressure, with strong troubleshooting and crisis management skills

------------------------------------------------------

Job Family Group:

Technology

------------------------------------------------------

Job Family:

Architecture

------------------------------------------------------

Time Type:

Full time

------------------------------------------------------

Most Relevant Skills

Please see the requirements listed above.

------------------------------------------------------

Other Relevant Skills

For complementary skills, please see above and/or contact the recruiter.

------------------------------------------------------

Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.

If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.

View Citi’s EEO Policy Statement and the Know Your Rights poster.

Citi

Website: https://www.citigroup.com/

Headquarters Location: New York, New York, United States

Employee Count: 10001+

Year Founded: 1812

Last Funding Type: Post-IPO Equity

Industries: Banking ⋅ Credit Cards ⋅ Financial Services ⋅ Wealth Management