Our vision is to transform how the world uses information to enrich life for all.
Micron Technology is a world leader in innovating memory and storage solutions that accelerate the transformation of information into intelligence, inspiring the world to learn, communicate and advance faster than ever.
Principal / Senior Systems Performance Engineer
Micron Data Center and Client Workload Engineering in Hyderabad, India, is seeking a senior/principal engineer to join our dynamic team.
The successful candidate will primarily contribute to the ML development, ML DevOps, and HBM programs in the data center by analyzing how AI/ML workloads perform on the latest MU-HBM, Micron main memory, expansion memory, and near memory (HBM/LP) solutions; conducting competitive analysis; showcasing the benefits that workloads see from MU-HBM’s capacity, bandwidth, and thermals; contributing to marketing collateral; and extracting AI/ML workload traces to help optimize future HBM designs.
Job Responsibilities:
Responsibilities include, but are not limited to, the following:
- Design, implement, and maintain scalable & reliable ML infrastructure and pipelines.
- Collaborate with data scientists and ML engineers to deploy machine learning models into production environments.
- Automate and optimize ML workflows, including data preprocessing, model training, evaluation, and deployment.
- Monitor and manage the performance, reliability, and scalability of ML systems.
- Troubleshoot and resolve issues related to ML infrastructure and deployments.
- Implement and manage distributed training and inference solutions to enhance model performance and scalability.
- Utilize DeepSpeed, TensorRT, vLLM for optimizing and accelerating AI inference and training processes.
- Understand key considerations for ML models, such as transformer architectures, numerical precision, quantization, distillation, attention span and KV cache, and mixture-of-experts (MoE).
- Build workload memory access traces from AI models.
- Study system balance ratios for DRAM to HBM in terms of capacity and bandwidth to understand and model TCO.
- Study data movement between the CPU, GPU, and the associated memory subsystems (DDR, HBM) in heterogeneous system architectures via connectivity such as PCIe/NVLink/Infinity Fabric to understand the bottlenecks in data movement for different workloads.
- Develop an automated testing framework through scripting.
- Engage with customers and present at conferences to showcase findings, and develop whitepapers.
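The balance-ratio study above can be sketched in a few lines of Python. All part capacities, bandwidths, and prices below are hypothetical placeholders for illustration, not real product figures; a real TCO model would also fold in power, cooling, and utilization.

```python
# Illustrative DRAM-to-HBM balance-ratio model (all figures below are
# hypothetical placeholders, not real product specifications).

def tier_metrics(capacity_gb, bandwidth_gbs, cost_usd):
    """Cost-efficiency metrics for one memory tier."""
    return {
        "usd_per_gb": cost_usd / capacity_gb,     # capacity cost
        "usd_per_gbs": cost_usd / bandwidth_gbs,  # bandwidth cost
    }

def balance_ratios(dram, hbm):
    """Capacity and bandwidth ratios of the DRAM tier relative to HBM."""
    return {
        "capacity_ratio": dram["capacity_gb"] / hbm["capacity_gb"],
        "bandwidth_ratio": dram["bandwidth_gbs"] / hbm["bandwidth_gbs"],
    }

# Hypothetical two-tier system: bulk DDR capacity alongside an HBM stack.
dram = {"capacity_gb": 512, "bandwidth_gbs": 350, "cost_usd": 4000}
hbm = {"capacity_gb": 96, "bandwidth_gbs": 3200, "cost_usd": 9000}

ratios = balance_ratios(dram, hbm)
print(ratios["capacity_ratio"])   # DRAM holds ~5.3x the capacity of HBM...
print(ratios["bandwidth_ratio"])  # ...at ~0.11x the bandwidth
```

Sweeping the tier parameters over candidate configurations and comparing the resulting cost-per-GB and cost-per-GB/s is one simple way to reason about where capacity-bound versus bandwidth-bound workloads should place their data.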
Requirements:
- Strong programming skills in Python and familiarity with ML frameworks such as TensorFlow, PyTorch, or scikit-learn.
- Experience in data preparation: cleaning, splitting, and transforming data for training, validation, and testing.
- Proficiency in model training and development: creating and training machine learning models.
- Expertise in model evaluation: testing models to assess their performance.
- Skills in model deployment: launching inference servers, live inference, and batched inference.
- Experience with AI inference and distributed training techniques.
- Strong foundation in GPU and CPU processor architecture.
- Familiarity with server system memory (DRAM).
- Strong experience with benchmarking and performance analysis.
- Strong software development skills in leading scripting and programming languages and technologies (Python, CUDA, C, C++).
- Familiarity with PCIe and NVLink connectivity.
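As a minimal example of the scripted benchmarking this role calls for, the sketch below estimates host-memory copy bandwidth in pure Python. It illustrates basic methodology (warm-up pass, repeated trials, best-of-N timing) rather than serving as a rigorous benchmark; real memory-subsystem studies would use tools closer to the hardware.

```python
import time

def copy_bandwidth_gbs(size_mb=64, iters=5):
    """Rough host-memory copy bandwidth estimate in GB/s (sketch only)."""
    buf = bytearray(size_mb * 1024 * 1024)
    bytes(buf)  # warm-up: touch all pages once before timing
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        _ = bytes(buf)  # forces one full copy of the buffer
        best = min(best, time.perf_counter() - t0)
    # GB moved in the best-case iteration, divided by its duration
    return (size_mb / 1024) / best

if __name__ == "__main__":
    print(f"~{copy_bandwidth_gbs():.1f} GB/s host copy bandwidth")
```

Taking the best of several trials, rather than the mean, reduces the impact of scheduler noise and is a common convention in microbenchmarking.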
Preferred Qualifications:
- Experience in quickly building AI workflows: building pipelines and model workflows to design, deploy, and manage consistent model delivery.
- Ability to easily deploy models anywhere: using managed endpoints to deploy models and workflows across accessible CPU and GPU machines.
- Understanding of MLOps: the overarching concept covering the core tools, processes, and best practices for end-to-end machine learning system development and operations in production.
- Knowledge of GenAIOps: extending MLOps to develop and operationalize generative AI solutions, including the management of and interaction with a foundation model.
- Familiarity with LLMOps: focused specifically on developing and productionizing LLM-based solutions.
- Experience with RAGOps: focusing on the delivery and operation of retrieval-augmented generation (RAG) pipelines, a widely adopted reference architecture for generative AI and LLMs.
- Data management: collect, ingest, store, process, and label data for training and evaluation. Configure role-based access control; dataset search, browsing, and exploration; data provenance tracking, data logging, dataset versioning, metadata indexing, data quality validation, dataset cards, and dashboards for data visualization.
- Workflow and pipeline management: work with cloud resources or a local workstation; connect data preparation, model training, model evaluation, model optimization, and model deployment steps into an end-to-end automated and scalable workflow combining data and compute.
- Model management: train, evaluate, and optimize models for production; store and version models along with their model cards in a centralized model registry; assess model risks, and ensure compliance with standards.
- Experiment management and observability: track and compare different machine learning model experiments, including changes in training data, models, and hyperparameters. Automatically search the space of possible hyperparameters for a given model architecture; analyze model performance during inference; monitor model inputs and outputs for concept drift.
- Synthetic data management: extend data management with a new native generative AI capability. Generate synthetic training data through domain randomization to increase transfer learning capabilities. Declaratively define and generate edge cases to evaluate, validate, and certify model accuracy and robustness.
- Embedding management: represent data samples of any modality as dense multi-dimensional embedding vectors; generate, store, and version embeddings in a vector database. Visualize embeddings for interactive exploration. Find relevant contextual information through vector similarity search for RAG.
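The vector-similarity retrieval described in the last bullet can be sketched in a few lines. The corpus and embeddings below are toy placeholders; a production system would use learned embeddings, a vector database, and an approximate-nearest-neighbor index instead of this exact linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, corpus, k=2):
    """Return the k corpus ids most similar to the query embedding."""
    ranked = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy 3-dimensional "embeddings" standing in for real model output.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

print(top_k([1.0, 0.05, 0.0], corpus, k=2))  # ['doc_a', 'doc_b']
```

The retrieved ids would then be used to fetch the underlying documents and assemble them into the LLM prompt as grounding context.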
Education:
- Bachelor’s or higher (with 12+ years of experience) in Computer Science or a related field.
About Micron Technology, Inc.
We are an industry leader in innovative memory and storage solutions transforming how the world uses information to enrich life for all. With a relentless focus on our customers, technology leadership, and manufacturing and operational excellence, Micron delivers a rich portfolio of high-performance DRAM, NAND, and NOR memory and storage products through our Micron® and Crucial® brands. Every day, the innovations that our people create fuel the data economy, enabling advances in artificial intelligence and 5G applications that unleash opportunities — from the data center to the intelligent edge and across the client and mobile user experience.
To learn more, please visit micron.com/careers
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.
To request assistance with the application process and/or for reasonable accommodations, please contact hrsupport_india@micron.com
Micron prohibits the use of child labor and complies with all applicable laws, rules, regulations, and other international and industry labor standards.
Micron does not charge candidates any recruitment fees or unlawfully collect any other payment from candidates as consideration for their employment with Micron.