Nvidia - Santa Clara, CA
posted 3 months ago
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. This role combines software and systems engineering practices to ensure that our internal and external GPU cloud services operate with maximum reliability and uptime. SRE at NVIDIA is not just a technical role; it embodies a mindset and a set of engineering approaches aimed at optimizing production systems. The discipline requires a deep understanding of systems, networking, coding, database management, capacity management, and continuous delivery and deployment, as well as familiarity with open-source cloud technologies like Kubernetes and OpenStack.

The SRE team is responsible for the overall health of our services, ensuring that they run smoothly and efficiently. This involves careful planning and preparation so that developers can make changes to existing systems while the team monitors capacity, latency, and performance. Our software development efforts are heavily focused on automation, performance tuning, and enhancing the efficiency of production systems. SREs play a crucial role in minimizing manual work and proactively identifying potential outages, which is essential for maintaining product quality and a dynamic work environment.

At NVIDIA, we foster a culture of diversity, intellectual curiosity, and problem-solving. We believe that bringing together individuals with varied backgrounds and experiences leads to innovative solutions. Our team encourages collaboration, big thinking, and risk-taking in a blame-free environment. We prioritize self-direction in meaningful projects while providing the necessary support and mentorship for personal and professional growth.