This job is closed

Senior Compute SRE (GPU)

$166,600 - $296,300/Yr

Apple - Seattle, WA

posted 2 months ago

Full-time - Mid Level
Seattle, WA
Computer and Electronic Product Manufacturing

About the position

As a Senior Site Reliability Engineer at Apple, you will play a crucial role in supporting and scaling cloud services for development and operations engineers. This hands-on position focuses on maintaining and improving Site Reliability Engineering (SRE) practices for a private cloud service, ensuring constant uptime and seamless scalability for thousands of applications. You will collaborate closely with developers and architects to enhance the stability, security, and scalability of cloud systems.

Responsibilities

  • Design and deploy GPU-accelerated VM and container infrastructure using platforms such as KVM, QEMU, AWS, or Google Cloud.
  • Implement GPU-based Kubernetes clusters to support containerized applications and services.
  • Work with data scientists, developers, and other stakeholders to understand requirements and provide solutions for GPU-accelerated tasks.
  • Implement best practices for secure, scalable, and highly available environments.
  • Monitor and optimize resource utilization to ensure performance and cost-efficiency.
  • Actively participate in capacity planning, scale testing, and disaster recovery exercises.
  • Troubleshoot issues across the entire infrastructure stack.
  • Cultivate and maintain relationships with internal and external third-party vendors.

Requirements

  • 5+ years in a Site Reliability Engineering, DevOps, or infrastructure-focused role.
  • Proven experience with GPU-based virtual machine infrastructure and cloud platforms (e.g., AWS, GCP).
  • Experience with GPU hardware (e.g., NVIDIA, AMD) and the associated software stack (e.g., CUDA, cuDNN).
  • Experience with GitOps, deployment strategies, and CI/CD tools such as Spinnaker and Argo.
  • Ability to implement and coordinate telemetry using monitoring and observability tools such as Splunk, Grafana, and Prometheus.
  • Outstanding organizational and communication skills.

Nice-to-haves

  • Strong verbal and written communication skills.
  • Knowledge of Kubernetes, including deployment, management, and optimization of clusters.
  • Automation advocate - you truly believe in removing operational load via software.
  • A strong sense of ownership and being a great teammate who communicates clearly and transparently.
  • Self-motivated, inquisitive, and always looking to learn more.
  • Experience managing, scaling, and troubleshooting Golang and GPU applications.
  • Ability to work independently and manage multiple priorities effectively.
  • CNCF Certified Kubernetes Administrator (CKA) certification.

Benefits

  • Comprehensive medical and dental coverage
  • Retirement benefits
  • Discounted products and free services
  • Reimbursement for certain educational expenses, including tuition
  • Discretionary bonuses or commission payments
  • Relocation assistance