Apple - Austin, TX

posted 2 months ago

Full-time
Austin, TX
Computer and Electronic Product Manufacturing

About the position

As a Site Reliability Engineer (SRE) at Apple, you will play a crucial role in ensuring the reliability, scalability, and performance of the systems and services that integrate Apple Retail Stores and the Apple Online Store with major US carriers for iPhone activations. This position requires extensive hands-on experience as an SRE for large-scale, customer-facing cloud applications. You will collaborate closely with engineering and operations teams to design, build, and maintain robust infrastructure and automation solutions, and you will represent the SRE organization during design reviews and operational readiness exercises for both new and existing services.

You will be expected to analyze system statistics to provide a clear picture of the current state of our systems. A strong understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and other common reliability engineering concepts, is essential. You will also need excellent troubleshooting and problem-solving skills to proactively address critical production issues and work with the necessary partners to resolve them. Your responsibilities will include automating manual operations and improving processes through repeated iterations. You should have a good understanding of networking and load balancing concepts, and the ability to lead a small team to develop innovative solutions.

Participation in an on-call rotation is required, providing hands-on technical expertise during service-impacting events. This role is ideal for someone who is self-motivated, capable of making business-critical decisions, and comfortable working in a fast-paced, ever-changing environment.

Responsibilities

  • Ensure the reliability, scalability, and performance of systems and services.
  • Collaborate with engineering and operations teams to design, build, and maintain robust infrastructure and automation solutions.
  • Represent the SRE organization in design reviews and operational readiness exercises for new and existing services.
  • Analyze system statistics to provide a clear picture of the current state of systems.
  • Automate manual operations and improve processes through repeated iterations.
  • Lead a small team to develop innovative solutions for reliability engineering.
  • Participate in an on-call rotation providing hands-on technical expertise during service-impacting events.

Requirements

  • 5 years of hands-on experience as an SRE supporting large-scale microservices applications.
  • 5 years of experience in deploying, supporting, and monitoring cloud services in a large-scale, customer-facing environment.
  • 5 years of hands-on experience in developing Java-based applications.
  • 5 years of hands-on experience building complex queries and dashboards using Splunk.
  • 5 years of experience promoting observability of systems for monitoring, alerting, and metrics reporting using Datadog, Prometheus, and similar tools.
  • 5 years of proficiency with at least one scripting language, such as Python.
  • 5 years of hands-on experience working with Kubernetes, Docker, and containerization.
  • Proven track record of eliminating repetitive manual processes using automation.
  • 5 years of experience working on maintenance tasks for Oracle and Cassandra databases.

Nice-to-haves

  • BS in Computer Science or equivalent work experience.
  • Strong problem-solving, software development, and debugging skills.
  • Proven track record of taking ownership and successfully delivering results.
  • Experience in leading small teams (3-4 members) to design and develop scalable SRE solutions while collaborating with other teams.
  • Fluency in Japanese.