Principal Site Reliability Engineer, Datastores (ThousandEyes)

Cisco - San Francisco, CA

posted 3 months ago

Full-time - Principal

Computer and Electronic Product Manufacturing

About the position

The Principal Site Reliability Engineer for Datastores at ThousandEyes ensures the reliability and performance of the platform's mission-critical datastores, which include Elasticsearch, Kafka, MongoDB, and MySQL. The role covers all aspects of datastore reliability: availability, performance, change management, capacity planning, monitoring, and emergency response.

As a leader in this role, you will innovate, provide a strong technical vision, and collaborate with teams across the organization to build reliable, scalable, and highly available datastores on a multi-region platform at scale. You will partner with leaders across ThousandEyes as a subject matter expert in datastores, helping to design optimal architectures and processes. You will also serve as a role model for the engineering team, promoting effective delivery and teamwork.

The position calls for a reliability-focused engineering leader who is passionate about automation and operational excellence, particularly in managing ever-growing volumes of data. The ideal candidate has deep knowledge of datastores and experience building and supporting mission-critical systems. You will be expected to ensure that the ThousandEyes platform's services use the appropriate datastore infrastructure, designed and optimized for availability, latency, and performance. Clear communication with stakeholders is essential, as is a hands-on approach to writing software and automating processes to improve datastore reliability. You will also mentor and uplift the team, fostering a culture of learning and collaboration.

Responsibilities

Ensure the reliability and performance of mission-critical datastores.
Design and implement scalable and well-tested solutions focused on datastores.
Write high-quality code in Python, Go, or equivalent languages.
Utilize Infrastructure as Code skills, ideally with Terraform and Kubernetes.
Leverage cloud provider managed services, ideally AWS, in the context of the platform.
Collaborate with Engineering and Product Management to shape the future direction of the platform's datastores.
Mentor and support team members to enhance their skills and knowledge.

Requirements

Deep knowledge of datastores, including relational and NoSQL databases.
Experience building and supporting mission-critical datastores.
Strong technical vision and ability to communicate effectively with stakeholders.
Hands-on experience in writing software and automating processes.
Expertise in reliability engineering and delivering complex systems.
Ability to design scalable solutions with a focus on datastores.
Strong Infrastructure as Code skills, ideally with Terraform and Kubernetes.
Good knowledge of cloud provider managed services, ideally AWS.
Understanding of Unix/Linux systems, the kernel, system libraries, file systems, and client-server protocols.
Strong communication and documentation skills.

Nice-to-haves

Experience with multi-region platforms at scale.
Familiarity with self-service systems in datastore management.
Passion for mentoring and team development.

Benefits

Health insurance coverage.
401(k) retirement savings plan.
Paid holidays and vacation time.
Professional development opportunities.
