It's fun to work in a company where people truly BELIEVE in what they're doing!
Job Description:
Deployment & Infrastructure Management:
- Deploy, configure, and manage AI models, agentic systems, and supporting infrastructure in cloud (e.g., GCP) and on-premises environments.
- Implement and maintain CI/CD pipelines for AI/ML models and agentic applications (MLOps/AgentOps).
- Manage and optimize cloud resources, ensuring cost-effectiveness and scalability for AI workloads.
- Collaborate with infrastructure teams to ensure network, storage, and compute resources meet the demands of AI systems.
Monitoring, Logging & Alerting:
- Develop and implement comprehensive monitoring, logging, and alerting solutions for AI agents and infrastructure to ensure high availability and performance.
- Proactively identify and address potential issues, performance bottlenecks, and anomalies in production AI systems.
- Track key operational metrics and create dashboards for system health and performance.
Incident Response & Troubleshooting:
- Provide operational support for production AI systems, including incident response, root cause analysis, and resolution of technical issues.
- Develop and maintain runbooks and standard operating procedures for common operational tasks and incident management.
- Participate in on-call rotations as needed to support critical AI services.
Automation & Operational Excellence:
- Automate routine operational tasks, deployment processes, and system maintenance activities using scripting (e.g., Python, Bash) and automation tools.
- Contribute to the development and enforcement of operational best practices, security standards, and compliance requirements for AI systems.
- Work with development teams to improve the deployability, manageability, and observability of AI applications.
Collaboration & Documentation:
- Collaborate effectively with AI developers, data scientists, AI architects, and other stakeholders to ensure smooth transitions from development to production.
- Maintain clear and comprehensive documentation for system configurations, operational procedures, and troubleshooting guides.
- Provide feedback to development teams on operational aspects and system performance.