Alibaba Cloud • Posted about 1 month ago
$133,200 - $219,600/Yr
Full-time • Mid Level
Sunnyvale, CA

About the position

We are looking for a Site Reliability Engineer (SRE) specializing in databases to support Alibaba Cloud's OLAP platform. The role combines software and systems engineering to keep the platform running reliably and to provide stable OLAP database services to customers.

Responsibilities

  • Ensuring System Stability and High Availability: Run health checks on components of the foundational database platform, develop maintenance tools for routine inspections, and identify and resolve potential risks in advance.
  • Development of Operations Platforms and Tools: Design and implement automated operations platforms that can maintain large-scale online clusters. Monitor and maintain operational metrics, and optimize the system through data analysis. Help resolve capacity, performance, and stability issues in production systems.
  • Ensuring System Stability and High Availability: Design and implement high-availability systems, such as automatic fault localization, automatic recovery, adaptive disaster recovery, and implementation of cloud-native technologies, to ensure continuous business availability.
  • Incident Handling and Emergency Response: During major events such as promotional sales, ensure a smooth user experience under massive peak loads while keeping costs under control. Handle live production issues, including fault diagnosis, disaster recovery, intelligent scheduling, elastic scaling, and attack mitigation.
  • Close Collaboration with Development Teams: Work closely with product teams to promptly identify and optimize technical architectures, improving service response latency, performance, and availability. Actively participate in the discussion and design of business solutions, and drive ongoing service improvements.

Requirements

  • Bachelor's degree in Computer Science, or a related technical field, or equivalent practical experience.
  • 4+ years of work experience as a Site Reliability Engineer in databases or other cloud products.
  • Familiarity with the basic principles of the Linux kernel and with common tools and commands, plus strong diagnostic and optimization skills.
  • Proficiency in at least one of the following languages: Java, Python, or Go, with experience developing operations and maintenance tools.
  • Familiarity with open-source cloud platforms such as Kubernetes, OpenStack, and Cloud Foundry.

Nice-to-haves

  • Experience with OLAP systems such as ClickHouse, PostgreSQL, and Presto, as well as open-source databases and queue products such as Redis, MongoDB, HBase, Cassandra, Kafka, and Elasticsearch. Knowledge of their principles or hands-on operational experience is a plus.
  • Experience operating large-scale distributed systems, with proficiency in at least one major cloud platform.
  • Excellent problem-solving and analytical skills.

Job Keywords

Hard Skills
  • Alibaba Cloud
  • Cloud Database
  • Cloudfoundry
  • Elasticsearch
  • Go