Senior SRE Engineer, NIM Factory

Posted:
8/16/2024, 3:03:36 PM

Location(s):
New York, New York, United States ⋅ Texas, United States ⋅ California, United States ⋅ New York, United States

Experience Level(s):
Senior

Field(s):
DevOps & Infrastructure ⋅ Software Engineering

Workplace Type:
Remote

NVIDIA is the platform upon which every new AI-powered application is built. We are seeking a senior SRE to monitor and operate both the factory automation for NVIDIA Inference Microservices (NIMs) and its deployed services. The right person for this role brings technical drive and creativity to change the way NVIDIA provides high-performance inferencing for every AI model. Our NIM offerings are easy to use, optimized for performance, and developed using a highly automated software factory. We create both downloadable containers and hosted services. You will apply your expertise to operate highly available services that make effective use of the thousands of GPUs involved in this operation. Your services provide best-in-class performance, accuracy, and availability. We are looking for technical talent to design, build, operate, and improve our factory capabilities, including the underlying infrastructure, pipelines, backends, Docker build, test harness, metrics, performance engineering, log ingestion, and more.

What you'll be doing:

  • Operate a software factory that takes an AI model in and produces a deployable service that is validated across Cloud, On-prem and Kubernetes environments. With the development team, define and deliver rapid iterations on the group's technical strategies and roadmaps to evolve the NIM factory for continuous delivery of packaged NIMs. You will be responsible for the operation of the factory, including its availability, observability, and stability. You will also track the deployment of our services into multiple cloud hosts and improve the efficiency, availability, and stability of these services.

  • Partner with internal and external SRE teams to provide the best experience for our developers and for the users of the resulting services. Your work ensures our operation is secure through the proper configuration and management of infrastructure, including containers, databases, and networking, and by following and improving standard processes for security, scalability, and cost optimization. This requires working closely with our security teams tasked with responding to security threats.

  • Collaborate broadly with multiple AI model teams to understand their requirements and build an efficient infrastructure that supports and improves development and production execution of these models. You will define metrics and drive improvements based on user feedback. You will mentor and collaborate throughout the team and with other teams to grow your colleagues and yourself. You will have a history of learning and of growing your own skills and those of the people around you.

What we need to see:

  • Demonstrated advanced system engineering skills operating and improving the observability and maintainability of distributed microservice cloud applications and services.

  • Effective experience working with multi-functional teams, principals and architects, and across organizational boundaries.

  • Experience mentoring and growing teams and team members, and the flexibility to adjust your direction and expectations given the needs of our customers.

  • Experience operating distributed containerized applications using technologies such as Docker, K8s, Cloud Endpoints, Helm, and Prometheus. Use of infrastructure as code, such as Terraform, Puppet, Ansible or others.

  • Experience identifying the root cause of failures and performance bottlenecks in distributed microservices or cloud systems. Understanding and application of good security practices for public-facing cloud services.

  • BS or MS in Computer Science, Computer Engineering or equivalent experience.

  • 7+ years of demonstrated experience as an SRE or Developer working on high-performance microservices and cloud software.

Ways to stand out from the crowd:

  • Excellent communication and interpersonal skills and the ability to engage a multi-functional team.

  • Experience with event-driven applications using various services such as Temporal, Kafka, Redis or others.

  • A history of building and deploying containers for Microservices, Cloud and On-prem deployments, and their associated CI/CD pipelines.

We are widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and creative people in the world working for us. If you're creative and autonomous with a real passion for technology, we want to hear from you. We are an equal opportunity employer and value diversity at our company.

The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA

Website: https://www.nvidia.com/

Headquarters Location: Santa Clara, California, United States

Employee Count: 10001+

Year Founded: 1993

IPO Status: Public

Last Funding Type: Grant

Industries: Artificial Intelligence (AI) ⋅ GPU ⋅ Hardware ⋅ Software ⋅ Virtual Reality