Apple - Austin, TX

posted 4 months ago

Full-time
Austin, TX
Computer and Electronic Product Manufacturing

About the position

As a Site Reliability Engineer (SRE) in the Retail Engineering team, you will play a crucial role in ensuring the reliability, scalability, and performance of the systems and services that integrate Apple Retail Stores and Apple Online Store with major US carriers for iPhone activations. This position requires a talented individual who can thrive in a dynamic environment and make a meaningful impact through technical expertise and dedication to excellence. You will work closely with engineering and operations teams to design, build, and maintain robust infrastructure and automation solutions.

The role demands extensive hands-on experience as an SRE for large-scale, customer-facing cloud applications. You should possess a solid understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and other common reliability engineering concepts. Excellent troubleshooting and problem-solving skills are essential, as you will be expected to represent the SRE organization in design reviews and operational readiness exercises for both new and existing services.

Collaboration with technical and non-technical teams will be a key part of your responsibilities, as you analyze statistics to provide a clear picture of the current state of our systems. A passion for automating manual operations and improving processes through repeated iteration is crucial. You should have a good understanding of networking and load balancing concepts and be capable of leading a small team to develop innovative solutions.

Self-motivation and the ability to make business-critical decisions in a fast-paced environment are essential. You will also be proactive in addressing critical production issues and ensuring their resolution while collaborating with necessary partners. Participation in an on-call rotation to provide hands-on technical expertise during service-impacting events is also part of the role.

Responsibilities

  • Ensure the reliability, scalability, and performance of systems and services.
  • Design, build, and maintain robust infrastructure and automation solutions.
  • Represent the SRE organization in design reviews and operational readiness exercises.
  • Collaborate with technical and non-technical teams to analyze system statistics.
  • Automate manual operations and improve processes through iteration.
  • Lead a small team to develop innovative SRE solutions.
  • Proactively address critical production issues and ensure their resolution.
  • Participate in an on-call rotation for service-impacting events.

Requirements

  • 5 years of hands-on experience as an SRE supporting large-scale microservices applications.
  • 5 years of experience in deploying, supporting, and monitoring cloud services in a large-scale, customer-facing environment.
  • 5 years of hands-on experience in developing Java-based applications.
  • 5 years of hands-on experience building complex queries and dashboards using Splunk.
  • 5 years of experience promoting observability of systems for monitoring, alerting, and metrics reporting using Datadog, Prometheus, and similar tools.
  • 5 years of proficiency with at least one scripting language, such as Python.
  • 5 years of hands-on experience working with Kubernetes, Docker, and containerization.
  • Proven track record of eliminating repetitive manual processes using automation.
  • 5 years of experience working on maintenance tasks for Oracle and Cassandra databases.

Nice-to-haves

  • BS in Computer Science or equivalent work experience.
  • Fluency in Japanese is a plus.
  • Strong problem-solving, software development, and debugging skills.
  • Proven track record of taking ownership and successfully delivering results.
  • Experience in leading small teams (3-4 members) to design and develop scalable SRE solutions.