Site Reliability Engineer

INSPYR Solutions - Miami, FL

posted 8 days ago

Full-time - Mid Level

Administrative and Support Services

About the position

The Site Reliability Engineer will play a critical role in ensuring the reliability, performance, and seamless operation of Royal Caribbean Cruise Lines' digital ecosystem, which includes guest-facing mobile apps, websites, and backend systems. The engineer will collaborate with development, operations, and product teams to build and maintain a highly resilient and scalable digital experience for guests.

Responsibilities

Respond to and resolve production incidents, prioritizing guest-facing issues to minimize disruption.
Conduct root cause analysis and implement preventive measures to avoid recurrence.
Build, maintain, and enhance monitoring tools and dashboards to provide visibility into system health and performance.
Develop and implement automation scripts and tools to streamline operations and improve system reliability.
Work closely with product teams to incorporate reliability principles into new feature development.
Create and maintain clear documentation on system architecture and incident postmortems.
Participate in the on-call rotation to acknowledge and escalate incidents.

Requirements

Strong knowledge of mobile (iOS, Android) and web technologies, backend systems, cloud infrastructure (AWS, Azure), and database technologies.
Proficiency in one or more programming languages (e.g., Python, Java, Go) for scripting and automation.
Experience with monitoring tools like Prometheus, Grafana, or Splunk.
Experience with incident management tools like PagerDuty or ServiceNow.
Understanding of security best practices and incident response.
Excellent written and verbal communication skills.
Ability to work with large volumes of customer data and use Oracle SQL (or similar) to query databases.

Nice-to-haves

5+ years of demonstrated proficiency in one or more scripting languages such as Python or Go.
3+ years of experience with Kubernetes or equivalent.
5+ years of software development experience in Java or JavaScript.
3+ years of experience with containers and container orchestrators like Docker and Kubernetes.
5+ years of experience debugging and fixing system/infrastructure and application issues.
5+ years of experience working with monitoring tools such as Prometheus, Grafana, or Google Stackdriver.
5+ years of experience with databases (SQL or NoSQL).
5+ years of experience with log analysis and building dashboards.
At least 6 years in a Reliability Engineering, DevOps, or infrastructure-focused role.

Benefits

Comprehensive medical benefits
Competitive pay
401(k) retirement plan
