Senior Site Reliability Engineer I/II

$135,000 - $195,000/Yr

Umbra Lab - Santa Barbara, CA

posted 3 months ago

Full-time - Mid Level

Remote - Santa Barbara, CA

Computer and Electronic Product Manufacturing

About the position

Umbra is seeking an experienced Site Reliability Engineer (SRE) to join our team in designing, building, operating, and scaling our mission-critical infrastructure. This role is pivotal in ensuring the reliability and scalability of our systems, which are essential for delivering high-quality satellite data to our customers. The SRE will work closely with cross-functional teams, including developers, product managers, and other stakeholders, to align on technical strategies and provide expert guidance. The position can be based in our Santa Barbara office or can be performed fully remotely, offering flexibility to the right candidate.

The ideal candidate will possess a deep understanding of the entire technology stack and architecture, enabling informed decisions regarding technical debt and trade-offs. They will demonstrate leadership in technical innovation, advocating for new technologies and best practices while continuously refining existing processes to enhance efficiency and effectiveness across projects and services. Effective communication with both technical and non-technical stakeholders is crucial, as the SRE will foster collaboration and understanding across diverse teams. The role also involves driving impactful changes that benefit the entire team and extend beyond individual contributions.

In this position, the SRE will ensure that critical systems meet service level agreements (SLAs) through proactive monitoring and effective incident response. They will develop and promote new technologies and tools, conducting research and creating proofs of concept to introduce solutions that enhance team capabilities. The SRE will lead by example in fostering a culture of excellence and reliability, continuously evaluating and improving team processes and workflows to increase efficiency and reduce complexity. Participation in on-call rotations to provide support and resolve complex technical issues is also a key responsibility.

Responsibilities

Ensure the reliability and scalability of critical systems, meeting SLAs through proactive monitoring and effective incident response.
Develop and promote new technologies and tools, conducting research and creating proofs of concept to introduce solutions that enhance the team's capabilities.
Lead by example in fostering a culture of excellence and reliability.
Continuously evaluate and improve team processes and workflows to increase efficiency and reduce complexity.
Collaborate closely with cross-functional teams, product managers, and stakeholders to align on technical strategy and provide expert guidance.
Participate in on-call rotations, providing support and resolving complex technical issues.
Perform all other duties as assigned.

Requirements

6+ years in a Site Reliability Engineer or DevOps role supporting a SaaS platform, with demonstrated expertise managing distributed systems.
Extensive experience with AWS services (EC2, S3, Lambda, VPC Networking) and deep knowledge of cloud infrastructure, networking, and security best practices.
Proficiency running, optimizing, and scaling Kubernetes clusters in production environments.
Experience using and writing Terraform to architect and manage production infrastructure.
Ability to apply infrastructure as code (IaC), GitOps practices, and automation tools to increase reliability and reduce manual tasks.
Proven success in leading teams or projects using Agile/Scrum methodologies.
Expertise in infrastructure and software architecture, capable of designing and implementing large-scale, reliable systems with minimal guidance.
Experience developing and managing comprehensive infrastructure monitoring and alerting strategies.

Nice-to-haves

Advanced understanding of cloud and application security, identity management, and compliance.
Expertise in service mesh and service registration technologies, focusing on performance and reliability.
Bachelor's degree in Computer Science or a related field, or equivalent professional experience.

Benefits

Flexible Time Off, Sick, Family & Medical Leave
Medical, Dental, Vision, Life, Long-Term Disability (LTD), Short-Term Disability (STD) (employer funded)
Voluntary Life, Critical Illness, Accidental, Hospital Indemnity, Pet Insurance (employee funded)
401(k) with 3% non-elective company contribution
Stock Options
