ASE - Site Reliability Engineer

Apple - Cupertino, CA

posted 2 months ago

Full-time

Computer and Electronic Product Manufacturing

About the position

The Apple Service Engineering - Site Reliability Engineering (SRE) team is seeking skilled Site Reliability Engineers who are adept at developing processes, tools, and automation for managing distributed systems in production environments. This role is crucial as it combines software and systems engineering with system administration practices to build and maintain large-scale, massively distributed, fault-tolerant systems. The software developed by the SRE team ensures that Apple's services are reliable, scalable, and secure, utilizing both open-source and proprietary technologies to provide managed data infrastructure services.

In this position, you will be instrumental in building the next generation of search infrastructure and platform services, collaborating cross-functionally with various teams within Apple Service Engineering, including store and commerce, search, and recommendations. Your work will involve creating platforms capable of rapidly scaling to serve both personalized and non-personalized data with minimal latency. The ideal candidate will possess a questioning mindset, be a supportive colleague under tight deadlines, and be able to devise elegant technical solutions to complex problems.

The ASE SRE team is responsible for developing applications and tooling that are safe, reliable, scalable, and fast. This role demands an innovative spirit and a high level of care and precision in engineering. Team members contribute to all major components of the Redis deployment infrastructure, which includes maintenance automation, backup service applications, monitoring and alerting tools/dashboards, and deployment architecture, all focused on stability, performance, and scalability.

Success in this role requires a deep understanding of core SRE concepts, including monitoring, alerting, and incident management, as well as database concepts such as consistency models and crash recovery semantics. Additionally, you will need to have expertise in performance engineering, service management across various platforms (bare metal, virtualized environments like EC2, and Kubernetes), and a solid grasp of system-level hardware and networking components. Knowledge of operating systems concepts and datacenter architecture is also essential.

Excellent communication skills and a strong customer focus are critical when engaging with internal platform customers. As part of a distributed team, the ability to work effectively with colleagues in different locations is vital, and prior experience in this area is advantageous.

Apple values craftsmanship, and performance is a key ingredient in delivering services and applications that are fluid and responsive. You will collaborate with engineers across Apple to define metrics, set targets, uncover optimization opportunities, and establish quality standards, ultimately shipping products and services that delight customers. This role is tailored for engineers who thrive on deep technical challenges that span large, cross-organizational projects, and your willingness to learn and implement new technologies will be pivotal in the continuous evolution of our organization.

Responsibilities

Develop processes, tools, and automation for managing distributed systems in production environments.
Build and maintain large-scale, fault-tolerant systems.
Collaborate cross-functionally with various ASE teams to enhance search infrastructure and platform services.
Create platforms that can rapidly scale to serve personalized and non-personalized data with low latencies.
Contribute to Redis deployment infrastructure, including maintenance automation and monitoring tools.
Engage with internal platform customers with a high degree of customer focus.
Define metrics, set targets, and uncover optimization opportunities for services and applications.

Requirements

Understanding of core SRE concepts including monitoring, alerting, and incident management.
Knowledge of database concepts such as consistency models, isolation levels, and crash recovery semantics.
Experience in performance engineering and profile-guided optimization.
Service management experience across bare metal, virtualized (EC2), and Kubernetes platforms.
Familiarity with system-level hardware and networking components.
Understanding of operating systems concepts including process scheduling and disk I/O.
Knowledge of datacenter architecture and design of multi-datacenter systems.

Nice-to-haves

Experience developing critical internet services and platform infrastructure.
Proficiency in programming languages such as Java, Go (golang), or Python.
Experience managing services running on Kubernetes.
Familiarity with EC2, EBS, and Terraform.
