Site Reliability Engineer - ASE

Apple - Cupertino, CA

posted 4 months ago

Full-time - Mid Level

Computer and Electronic Product Manufacturing

About the position

As a Site Reliability Engineer (SRE) at Apple Services Engineering (ASE), you will play a pivotal role in ensuring the reliability and performance of the systems that support Apple’s services, including iCloud. This position is not just about maintaining systems; it’s about crafting experiences that millions of customers rely on daily. You will be part of a team that is responsible for the design, engineering, and operation of services that must scale globally and remain highly available. Your work will directly impact the quality of Apple Services, and you will be expected to bring your passion for engineering and problem-solving to the forefront.

In this role, you will lead data-driven roadmaps and quarterly planning for a subset of core services, focusing on reliability. You will oversee the entire software lifecycle for these services, which includes infrastructure setup, capacity planning, deployment, monitoring, architecture, and software implementation. Collaboration with development teams will be crucial as you work to ensure that the services not only meet but exceed customer expectations. The ideal candidate will thrive in a fast-paced, collaborative environment and will be driven by a desire to solve complex engineering problems.

Your responsibilities will include implementing SRE principles such as monitoring, alerting, and automation, as well as managing the lifecycle of global services from inception through deployment and operations. You will also be expected to participate in on-call service support, ensuring that any issues are addressed promptly and effectively. This is an opportunity to work at the intersection of software development and operations, making a significant impact on the reliability of Apple’s services.

Responsibilities

Lead data-driven roadmap and quarterly planning for core services from a reliability perspective.
Oversee the entire software lifecycle for core services, including infrastructure setup, capacity planning, deployment, monitoring, architecture, and software implementation.
Collaborate closely with development teams to ensure high-quality service delivery.
Implement SRE principles such as monitoring, alerting, error budgets, fault analysis, and automation.
Participate in on-call service support to address issues promptly.

Requirements

5+ years of software development or production operations experience in a large-scale environment.
Strong attention to detail and ability to meet deadlines.
Proven experience with SRE principles, including monitoring, alerting, and automation.
Proficiency in an object-oriented programming language such as Java, Golang, or Python.
Excellent troubleshooting and problem-solving skills.
Strong written and verbal communication skills.
Experience in improving the lifecycle of global services from inception through deployment and operations.
Familiarity with Linux/Unix, networking, systems management, and systems security.
Experience managing large numbers of diverse systems.

Nice-to-haves

Solid understanding and hands-on experience with Kubernetes and container orchestration.
Experience leading a team or projects from scoping to delivery.
Broad leadership and partnership skills.
Experience managing customer-facing, internet-scale systems.
Experience managing distributed systems and large-scale systems operations.
