Posted:
3/9/2026, 8:53:56 AM
Location(s):
Nevada, United States ⋅ Las Vegas, Nevada, United States
Experience Level(s):
Junior ⋅ Mid Level ⋅ Senior
Field(s):
Operations & Logistics
Workplace Type:
On-site
About TensorWave
Our mission is simple: deliver seamless, secure, reliable, and resilient AI compute at scale. We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack, because breakthrough AI should move at the speed of ideas, not infrastructure.
About the Role
We are looking for an Operations Engineer to join our Global Operations Center team as the frontline of TensorWave’s customer infrastructure reliability. The role focuses on monitoring customer environment health, detecting issues before they impact workloads, and serving as the L1 response to customer-reported problems. It is based at TensorWave’s headquarters in Las Vegas and is part of our customer-facing 24/7 team. You’ll be responsible for monitoring systems, executing runbooks, and coordinating with on-site teams and engineering when escalation is needed.
TensorWave is building its Operations Center from the ground up, and early team members will have a direct impact on how we keep our customers’ most critical workloads running. This role is ideal for someone who is sharp under pressure, naturally detail-oriented, and motivated by the knowledge that their work directly protects customer outcomes.
What You’ll Do
Monitor customer environments in real time across TensorWave data centers using monitoring and observability platforms
Track key health indicators including GPU utilization, node availability, network performance, storage health, and Kubernetes cluster status
Identify anomalies, degradations, and emerging issues before they escalate into customer-impacting events
Maintain situational awareness of active customer workloads, scheduled maintenance windows, and known issues across the fleet
Provide regular health summaries and flag trends that may indicate systemic risks to customer environments
Serve as the first responder to customer-reported issues and system-generated alerts, performing initial triage and classification
Execute established runbooks to diagnose and resolve common infrastructure issues including node failures, connectivity problems, and resource contention
Escalate issues to L2 engineering or on-site data center teams with clear, actionable context
Maintain accurate incident records including timeline, actions taken, and resolution details in the ticketing system
Communicate status updates to internal stakeholders during active incidents, ensuring visibility across operations and customer-facing teams
Follow and contribute to operational runbooks and standard operating procedures, identifying gaps or improvements based on real-world incidents
Assist with monitoring and alerting tuning by providing feedback on alert quality, false positive rates, and coverage gaps
Document tribal knowledge, recurring issue patterns, and lessons learned to strengthen the team’s operational knowledge base
Participate in post-incident reviews, contributing observations from the frontline monitoring and response perspective
Support change management processes by monitoring customer environments during planned maintenance and infrastructure changes
Coordinate with on-site data center operations teams for hands-on remediation activities that require physical access
Who You Are
Required Qualifications
1–3 years of experience in a NOC, operations center, technical support, systems administration, or similar infrastructure operations role
Experience monitoring production infrastructure using observability tools (Grafana, Datadog, Prometheus, or similar)
Foundational Linux systems administration skills with the ability to navigate systems, read logs, and execute diagnostic commands
Basic understanding of networking fundamentals including TCP/IP, DNS, and VLANs
Experience following operational runbooks and structured triage procedures in a production environment
Strong written communication skills, particularly the ability to write clear incident updates and escalation summaries under time pressure
Demonstrated ability to stay calm, prioritize effectively, and work methodically during high-pressure situations
Familiarity with ticketing and incident tracking systems (PagerDuty, Jira, ServiceNow, or similar)
Willingness to work shift schedules including nights, weekends, and holidays as part of a 24/7 coverage model
Preferred Qualifications
Experience in a customer-facing operations role at a cloud provider, managed services provider, or colocation facility
Exposure to GPU infrastructure, HPC clusters, or AI/ML compute environments
Familiarity with Kubernetes concepts and basic container troubleshooting
Scripting ability in Python, Bash, or similar for basic automation and log analysis
Experience with high-performance networking concepts (RDMA, InfiniBand, or RoCE)
Background working across multiple geographically distributed data center sites
Relevant certifications (CompTIA Server+, Linux+, RHCSA, CCNA, or equivalent)
What We Offer
Stock Options
100% paid Medical, Dental, and Vision insurance for Employees
Company Health Savings Account Contributions
100% paid Short-Term and Long-Term Disability Insurance for Employees
Life and Voluntary Supplemental Insurance Options
Other Insurance Options, such as Pet & Legal Insurance
Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support
Flexible Spending Account
401(k)
Employee Assistance Program
Flexible PTO
Paid Holidays
Parental Leave
Other In-Office Perks
Equal Employment Opportunity
TensorWave is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of any protected status under applicable law.
Reasonable Accommodations
TensorWave provides reasonable accommodations in accordance with applicable laws. If you require accommodation during the hiring process, please contact [email protected].
Employment Eligibility
All offers of employment are contingent upon verification of identity and authorization to work in the United States, as required by law.
Background Checks
Where permitted by law, employment may be contingent upon the successful completion of a job-related background check.
Data Privacy Notice
By submitting an application, you acknowledge that TensorWave may collect, use, and retain your personal information for recruiting and employment-related purposes in accordance with applicable data privacy laws.
Website: https://www.tensorwave.com/
Headquarters Location: Las Vegas, Nevada, United States
Employee Count: 51-100
Year Founded: 2023
IPO Status: Private
Last Funding Type: Series A
Industries: AI Infrastructure ⋅ Artificial Intelligence (AI) ⋅ Cloud Computing ⋅ Cloud Infrastructure ⋅ Generative AI ⋅ IaaS