Career Area:
Technology, Digital and Data
Job Description:
Your Work Shapes the World at Caterpillar Inc.
When you join Caterpillar, you're joining a global team who cares not just about the work we do – but also about each other. We are the makers, problem solvers, and future world builders who are creating stronger, more sustainable communities. We don't just talk about progress and innovation here – we make it happen, with our customers, where we work and live. Together, we are building a better world, so we can all enjoy living in it.
- Own production reliability for assigned services through proactive monitoring, alerting, and operational excellence.
- Participate in 24x7 on‑call rotation, leading P1/P2 incident triage, stabilization, and resolution.
- Track SLIs and ensure adherence to SLOs, SLAs, and availability targets.
Alerting & Monitoring
- Design, implement, and tune actionable alerts to reduce noise and false positives.
- Build and maintain alerting using tools such as:
  - Datadog / Dynatrace / AppDynamics / Broadcom
  - CloudWatch, Azure Monitor
  - Synthetic monitoring tools (ThousandEyes or equivalents)
- Create and maintain operational dashboards for application, infrastructure, and business KPIs.
- Drive alert rationalization and standardization across teams.
Incident Management & RCA
- Lead or contribute to Root Cause Analysis (RCA) and Post‑Incident Reviews (PIRs).
- Perform event correlation across metrics, logs, traces, and deployments.
- Identify recurring issues and partner with engineering teams for permanent fixes.
- Produce clear RCA documentation including timeline, impact, root cause, and corrective actions.
Observability & Tooling
- Implement and operate observability platforms covering:
  - Metrics, logs, traces
  - Service topology and dependency mapping
- Work with OpenTelemetry‑based pipelines where applicable.
- Improve visibility into upstream/downstream dependencies.
- Support onboarding of applications into standard SRE tooling and frameworks.
Automation & Toil Reduction
- Identify manual and repetitive operational tasks and automate them using scripting or workflows.
- Contribute to self‑healing and auto‑remediation solutions.
- Improve MTTR through automation, runbooks, and tooling enhancements.
Collaboration & Governance
- Work closely with application teams, platform teams, and cloud engineers.
- Review application designs from a reliability and operability perspective.
- Contribute to SRE standards, best practices, and documentation.
Required Skills & Experience
Core Experience
- 5–6 years of experience in SRE, DevOps, Production Support, or Platform Engineering
- Strong experience handling production incidents (P1/P2) and RCAs
Monitoring & Alerting
- Hands‑on experience with monitoring and alerting tools such as:
  - Datadog, Dynatrace, AppDynamics, Broadcom
  - CloudWatch, Azure Monitor
  - Synthetic monitoring tools (ThousandEyes or similar)
- Experience designing noise‑free, service‑impact‑based alerts
RCA & Troubleshooting
- Strong skills in log analysis, metric correlation, and distributed tracing
- Experience performing structured RCAs and postmortems
- Understanding of incident patterns, failure modes, and resilience
Cloud & Infrastructure
- Experience with AWS and/or Azure
- Working knowledge of containers and orchestration (ECS/EKS/Kubernetes)
- Experience with databases (Postgres, Oracle, or similar)
Automation & Programming
- Proficiency in at least one scripting language: Python, Bash, or JavaScript
- Familiarity with CI/CD pipelines and IaC concepts (Terraform or CloudFormation experience is good to have)
Nice to Have
- Experience with OpenTelemetry
- Exposure to AIOps / event correlation / AI‑assisted RCA
- Experience with service maps, dependency graphs, and topology modeling
- Prior experience supporting mission‑critical or customer‑facing platforms
Behavioral & Soft Skills
- Strong problem‑solving and analytical mindset
- Clear communication during high‑pressure incidents
- Ability to collaborate across engineering, product, and operations teams
- Ownership mindset with a focus on long‑term reliability over short‑term fixes
What Success Looks Like in This Role
- Reduced alert noise and faster incident detection
- Improved MTTR and fewer repeat incidents
- High‑quality RCAs leading to permanent improvements
- Strong operational readiness of onboarded applications
Posting Dates:
April 16, 2026 - April 23, 2026
Caterpillar is an Equal Opportunity Employer. Qualified applicants of any age are encouraged to apply.