Senior Software Engineer, DevOps

$190,000 - $240,000/Yr

Cointelegraph - San Francisco, CA

posted 5 months ago

Full-time
San Francisco, CA
Publishing Industries

About the position

The Site Reliability Engineer (SRE) role is pivotal in ensuring the reliability and performance of our services throughout their lifecycle, from inception and design through deployment, operation, and refinement. As an SRE, you will embed with engineering teams to apply industry best practices, ensuring that our systems, infrastructure, and applications are built and managed through automation.

Before services go live, you will support them through system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews. Once services are live, you will maintain them by measuring and monitoring their availability, latency, and overall system health. You will also scale systems sustainably through mechanisms like automation, and evolve them by advocating for changes that improve reliability and velocity.

You will practice sustainable incident response and conduct blameless postmortems to learn from incidents. Together with your engineering team, you will share an on-call rotation and serve as an escalation contact for service incidents, ensuring that we maintain high service levels and quickly address any issues that arise.

Responsibilities

  • Engage in and improve the whole lifecycle of services from inception and design through deployment, operation, and refinement.
  • Embed with Engineering teams to apply industry best practices.
  • Build and manage systems, infrastructure, and applications through automation.
  • Support services before they go live through system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Share an on-call rotation and be an escalation contact for service incidents.

Requirements

  • BS in Computer Science with 7+ years, or MS with 3 years, of working experience in Site Reliability Engineering (SRE).
  • Experience with programming in at least one of the following languages: C, C++, Java, Python, or Go.
  • Strong skills in observability, debugging, and performance tuning, with a willingness to dive into understanding, debugging, and improving any layer of the stack.
  • Experience in data transformation & ETL on large data sets using open technologies like Spark, SQL, and Python.
  • Experience in infrastructure technologies such as AWS, GCP, Azure, MySQL, Kubernetes, Docker, etc.
  • Deep knowledge of Linux internals and networking protocols such as TCP/IP.
  • Experience with advanced Data Lake and Data Warehouse concepts and with Data Modeling (e.g. Relational, Dimensional, internet-scale logs).
  • 3+ years of experience writing complex SQL, with strong knowledge of SQL optimization and an understanding of logical and physical execution plans.

Nice-to-haves

  • Extensive experience in supporting production services.
  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
  • Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.

Benefits

  • Performance-linked bonus
  • Equity
  • Competitive benefits package