NVIDIA - Santa Clara, CA
posted 3 months ago
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. This role combines software and systems engineering practices to ensure that our internal and external GPU cloud services operate with maximum reliability and uptime. SRE at NVIDIA requires a deep understanding of systems, networking, coding, database management, capacity management, and continuous delivery and deployment, as well as familiarity with open-source cloud technologies like Kubernetes and OpenStack.

The SRE team is responsible for the overall health of our services, ensuring that they run smoothly and efficiently. This involves careful preparation and planning so that developers can make changes to existing systems while the team monitors capacity, latency, and performance. SRE is not just a role but a mindset that emphasizes running better production systems through optimization and automation. Our software development efforts are geared towards eliminating manual work, enhancing performance, and increasing the efficiency of production systems. SREs at NVIDIA are responsible for understanding the big picture of how our systems interrelate, and they employ a variety of tools and approaches to address a wide range of challenges.

We prioritize practices such as minimizing reactive operational work, conducting blameless postmortems, and proactively identifying potential outages, all of which contribute to iterative improvements in product quality and create a dynamic work environment. Our culture values diversity, intellectual curiosity, problem-solving, and openness. It fosters collaboration and encourages team members to think big and take risks in a supportive, blame-free atmosphere. We aim to create an environment that promotes self-direction on meaningful projects while providing the support and mentorship needed for personal and professional growth.