Infrastructure Engineer (SRE)

Travelopia

About The Role

You will be part of the Technology team responsible for improving the reliability, availability, performance and operational resilience of Travelopia’s platforms and services. Working across cloud, infrastructure, applications and service teams, you will use observability, automation, incident learning and engineering discipline to reduce operational risk, remove repeatable manual work and help teams deliver stable services for colleagues and customers.

What we’ll offer

Competitive base salary (300,000–500,000 ZAR)
Comprehensive health benefits
Retirement savings plan with company contributions
Generous paid time off allowance, plus a day off on your birthday
Flexible hybrid working schedules
Parental leave and family-supportive policies
Charity or volunteering time off
Group-wide travel discounts

What you’ll do

Improve the reliability, availability, and performance of production services through monitoring, observability, automation, and operational engineering.
Define and maintain service health using SLIs, SLOs, dashboards, actionable alerts, and operational reporting.
Support incident, major incident, and problem management by providing technical diagnosis, performing root cause analysis, and driving corrective actions.
Build and enhance observability across logs, metrics, traces, and synthetic monitoring, ensuring alerts are linked to runbooks and response procedures.
Automate operational processes using scripting, Infrastructure as Code (IaC), and configuration management to reduce manual effort and risk.
Collaborate with DevOps, infrastructure, security, and application teams to improve deployment quality, resilience, recovery capabilities, and operational readiness in a 24/7 support environment.

What you’ll bring

SRE & Cloud Operations Expertise – Strong understanding of Site Reliability Engineering principles, including availability, latency, error budgets, incident management, post-incident reviews, and continuous service improvement in AWS, Azure, or hybrid cloud environments.
Monitoring & Observability – Experience with monitoring and observability tools such as Grafana, Prometheus, Datadog, Splunk, CloudWatch, or Azure Monitor, including defining SLIs/SLOs, alerting, dashboards, and performance analysis.
Troubleshooting & Service Reliability – Proven ability to diagnose and resolve issues across applications, APIs, infrastructure, networks, operating systems, and cloud services, while managing incidents, problems, and changes effectively.
Automation, IaC & DevOps – Hands-on experience with Infrastructure as Code and automation tools (Terraform, CloudFormation, Ansible, Azure DevOps, GitHub Actions) along with scripting skills in TypeScript, Python, PowerShell, Bash, or Go.
CI/CD, Containers & Security – Knowledge of CI/CD pipelines, deployment strategies, Docker/Kubernetes, DevSecOps practices, vulnerability management, secrets management, and security scanning tools such as Snyk or Trivy.
Operational Excellence & Continuous Improvement – Experience with resilience, disaster recovery, capacity planning, runbook creation, AIOps capabilities, and a continuous improvement mindset aligned with SFIA operational and service management practices.

We are committed to building a diverse and inclusive workplace where individuals can be their authentic selves and contribute to meaningful outcomes. If you need accommodations during the recruitment process, please let us know here: [email protected]