Senior SRE Engineer

$140,000 - $258,750/Yr

Nvidia - Santa Clara, CA

posted 4 months ago

Full-time - Mid Level
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

NVIDIA is looking for a seasoned Site Reliability Engineer (SRE) to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization. As a Senior SRE Engineer, you will be part of a dynamic team that develops and maintains NVIDIA's internal cloud provisioning product for GPUs and Tegra systems. The position involves collaborating with business units across NVIDIA Software, including Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to address their infrastructure and systems needs.

As an SRE, you will work closely with software engineering teams to deploy new products and manage our infrastructure, associated processes, and systems. Attention to detail, problem-solving abilities, and a solid knowledge base are essential for success in this role.

Your responsibilities will include Kubernetes system administration for DevOps and CI/CD: designing and implementing clusters, segmenting clusters, and managing internal and external networking for multiple clusters and environments. You will monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization. You will also architect CI/CD pipelines for container build and deployment, develop tools for automating workflows, and maintain our infrastructure codebase. Finally, you will define and implement critical metrics using various analytics methods and dashboards, take part in prototyping and developing cloud infrastructure for NVIDIA, and use AI techniques to extract useful signals from the data generated by machines and jobs.

Responsibilities

  • Administer Kubernetes systems for DevOps and CI/CD.
  • Design and implement clusters, cluster segmentation, and internal/external networking for multiple clusters and environments.
  • Monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization.
  • Architect CI/CD pipelines for container build and deployment.
  • Design and develop tools for automating workflows.
  • Develop, improve and maintain our infrastructure codebase.
  • Define and implement critical metrics using various analytics methods and dashboards.
  • Take part in prototyping, designing, and developing cloud infrastructure for NVIDIA.
  • Use AI techniques to extract useful signals about machines and jobs from the data they generate.

Requirements

  • Kubernetes domain expertise with extensive experience building scalable, resilient platforms in both public and private cloud.
  • High proficiency in administering and configuring Kubernetes.
  • Proficiency with CI/CD tools such as Jenkins, GitLab CI, GitHub Actions, and Argo CD.
  • Experience with data analytics and visualization tools such as Kibana, Grafana, and Splunk.
  • Strong Ansible skills.
  • Experience with other configuration management tools such as Chef and Puppet is a plus.
  • Proficiency with source code management and binary repository systems such as GitLab, GitHub, Artifactory, and Perforce.
  • Knowledge of monitoring systems such as Zabbix, Alertmanager, PagerDuty and/or similar systems.
  • Well versed in Prometheus, including writing custom exporters and PromQL queries.
  • 8+ years of proven experience.
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.

Nice-to-haves

  • Experience managing NVIDIA hardware like GPUs and Tegras.
  • Background with GitLab CI.
  • Experience with building and deploying containers.
  • Solid understanding of containerization and microservices architecture.
  • Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS), and Certified Kubernetes Application Developer (CKAD) certifications preferred.
  • Ability to design simple systems that can work efficiently without needing much support.

Benefits

  • Competitive salaries
  • Generous benefits package
  • Equity options
  • Ongoing learning opportunities
  • Diversity and inclusion initiatives