Posted:
6/15/2026, 6:16:41 AM
Location(s):
Center District, Israel ⋅ Raanana, Center District, Israel ⋅ Tel-Aviv, Tel-Aviv District, Israel ⋅ Tel-Aviv District, Israel
Experience Level(s):
Senior
Field(s):
AI & Machine Learning ⋅ Software Engineering
NVIDIA is powering the world's most advanced AI Factories. To ensure their seamless operation, we are building a mission-critical Observability and Prediction platform - delivered as both a high-scale SaaS solution and a robust on-premises deployment for our largest enterprise customers.
We are looking for a Senior Software Engineer to join the AIOps platform team and help build the core distributed systems that ingest massive telemetry streams from GPU clusters and operationalize predictive AI models at scale. You will work at the intersection of high-performance data engineering and production ML, turning research algorithms into reliable, mission-critical software.
What you'll be doing:
Architect and build an agentic AIOps system that autonomously monitors GPU fleet health, aggregates and correlates massive telemetry streams, surfaces intelligent alerts, and orchestrates multi-step diagnostic workflows and corrective actions - powering real-time dashboards, automated root-cause analysis, and proactive incident response.
Research, evaluate, and prototype data storage strategies and data representations across diverse database technologies and modalities, ensuring AI models are trained on high-quality, well-structured data that improves predictive accuracy and generalization.
High-Scale Engineering: Design distributed systems to handle the extreme telemetry density of large-scale AI clusters, ensuring efficient data ingestion, processing, and real-time analysis.
Instrument services with deep observability (metrics, logs, traces) to support rapid debugging and continuous performance improvement.
Build and own the model-serving infrastructure that operationalizes predictive algorithms at scale - packaging, versioning, deploying, and monitoring AI models in both SaaS and on-premises environments.
Contribute to the platform's core libraries and abstractions that accelerate development across the broader AIOps engineering team.
What we need to see:
B.Sc./M.Sc. in Computer Science, Computer Engineering, or a related technical field.
8+ years of software engineering experience building production distributed systems.
Core Systems Programming: Expert-level proficiency in languages such as Go, C++, or Rust, with a focus on high-performance, concurrent architectures.
Solid understanding of Kubernetes and container-based deployments for production services.
Experience deploying, monitoring, and maintaining ML models or data-intensive services in a production environment.
Comfort working in ambiguous, fast-moving environments where the product is still being shaped.
Ways to stand out from the crowd:
Experience building ML model-serving platforms or MLOps tooling (model registries, A/B rollout frameworks, feature stores) at scale.
A track record of taking systems from prototype to stable, production-grade platform serving real enterprise customers.
A "Systems" Thinker: You don't just write software; you understand the full stack, from how data moves across the wire to how it's processed in a distributed cluster.
Practical Innovation: The ability to simplify complex problems and build internal tools or frameworks that empower other engineering teams to move faster.
With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you are passionate about building mission-critical systems at the frontier of AI infrastructure, we want to hear from you.
Website: https://www.nvidia.com/
Headquarters Location: Santa Clara, California, United States
Employee Count: 10001+
Year Founded: 1993
IPO Status: Public
Last Funding Type: Grant
Industries: Artificial Intelligence (AI) ⋅ GPU ⋅ Hardware ⋅ Software ⋅ Virtual Reality