Cisco - San Francisco, CA
posted 4 months ago
The FedRAMP Site Reliability Engineer (SRE) role at Cisco ThousandEyes is pivotal in ensuring the reliability and performance of our Federal region's infrastructure and operations. The position covers all aspects of the Federal region's platform: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning, with a strong emphasis on security. The SRE team treats operations and infrastructure as code, which makes our distributed team more efficient and effective.
As a Senior Site Reliability Engineer, you will maintain a robust and scalable infrastructure capable of handling a high volume of incoming data daily. You will collaborate closely with software engineers to design and optimize the ThousandEyes platform's infrastructure and services so that they meet the highest standards for availability, latency, and performance.
You will also design, implement, and manage FedRAMP-compliant infrastructure and systems, establishing processes for continuous monitoring, logging, and auditing to ensure compliance with FedRAMP controls. In addition, you will work alongside security teams to identify and remediate vulnerabilities, conduct security assessments, and implement necessary security controls.
You will design and implement dynamic infrastructure solutions that support the growth and scaling of our platform, particularly in multi-region environments. Your expertise in automation will be crucial in enabling our infrastructure and platforms to scale, with a special focus on FedRAMP systems. Staying current on industry best practices, evolving security threats, and changes to FedRAMP guidelines will be essential to improving the security posture of our systems.
This role also includes designing, deploying, and maintaining cloud-native services in AWS that are elastic and resilient to failure, participating in incident response, and contributing to our 24x7 on-call rotation. You will also own capacity planning for the infrastructure and platform, helping teams prepare for future growth.