Site Reliability Engineer

$135,400 - $250,600/Yr

Apple - San Diego, CA

posted 3 months ago

Full-time - Mid Level
San Diego, CA
Computer and Electronic Product Manufacturing

About the position

The Apple Information Apps Engineering teams are responsible for powering some of the most widely used applications at Apple, including Apple News, Stocks, Weather, and Books. Operating at a massive global scale, we meet high expectations through a commitment to best practices, delivering a vast array of information that people in over 150 countries use daily. We are seeking an experienced and dynamic Site Reliability Engineer (SRE) Operator to join our team, focusing on maintaining the reliability, availability, and performance of our systems. The ideal candidate will have a strong background in production monitoring, a deep understanding of development and operations, and a proven track record of managing large-scale production environments.

As part of our highly collaborative team, you will work closely with partner teams to achieve the best results for Apple. We prioritize finding effective, efficient solutions to each engineering challenge we encounter. Good ideas are valued and rewarded within our team culture.

As an SRE at Apple, you will be responsible for operating, monitoring, and triaging all aspects of our production and non-production environments. You will pioneer and implement the next generation telemetry system for Apple News, Stocks, Weather, and Books, prepare alert handling procedures and runbooks, and collaborate with our off-shore SRE team. You will also automate the deployment and orchestration of services into the cloud environment, participate in capacity planning and disaster recovery exercises, and support partner teams, including engineering, SRE, QA, and project management, by creating self-service solutions for them. Building and maintaining relationships with internal and external third-party vendors will also be a key part of your responsibilities.

Responsibilities

  • Operate, monitor, and triage all aspects of production and non-production environments.
  • Pioneer and implement the next generation telemetry system for Apple News, Stocks, Weather, and Books.
  • Prepare alert handling procedures and runbooks, collaborating with the off-shore SRE team.
  • Automate deployment and orchestration of services into the cloud environment and other routine processes.
  • Actively participate in capacity planning and disaster recovery exercises.
  • Interact with and support partner teams including engineering, SRE, QA, and project management.
  • Create self-service solutions for partner teams.
  • Cultivate and maintain relationships with internal and external third-party vendors.

Requirements

  • At least 3 years of demonstrated experience in a Site Reliability Engineering, DevOps, or infrastructure-focused role.
  • Linux expertise.
  • Support of internet-facing production services and distributed systems via deployments, on-call, and incident management.
  • Proficiency in implementing and coordinating telemetry using monitoring and observability tools like Splunk, Grafana, and Prometheus, or similar.
  • Experience troubleshooting and resolving Kubernetes issues from both an operating system and application perspective.
  • Hands-on scripting with Python.
  • Building and operating container orchestration systems such as Kubernetes or EKS.
  • Designing, building, and maintaining infrastructure with a cloud provider such as AWS.
  • Advocacy for automation and a history of removing operational toil via software.
  • Strong sense of ownership and team camaraderie with clear and transparent communication abilities.
  • Self-motivated, inquisitive, and always looking to learn more.

Nice-to-haves

  • Networking: TCP/IP fundamentals and basic troubleshooting.
  • Disaster recovery and capacity planning.
  • Deployment automation via Terraform or CloudFormation.
  • Systems built upon open source storage and search technologies including Cassandra, Kafka, Solr, Postgres, and Redis.

Benefits

  • Comprehensive medical and dental coverage.
  • Retirement benefits.
  • Discounted products and free services.
  • Reimbursement for certain educational expenses, including tuition.
  • Opportunity to participate in Apple's discretionary employee stock programs.
  • Eligibility for discretionary restricted stock unit awards and the Employee Stock Purchase Plan.
  • Potential for discretionary bonuses or commission payments.
  • Relocation assistance.