Site Reliability Engineer II

Chewy - Richardson, TX

posted 3 months ago

Full-time - Mid Level

Sporting Goods, Hobby, Musical Instrument, Book, and Miscellaneous Retailers

About the position

As a Site Reliability Engineer II at Chewy, you will improve site reliability and resiliency, manage system operations, and implement infrastructure as code. The position is based in Richardson, Texas. You will use AWS services and containerization to move applications smoothly into production, and you will support the implementation and management of the Chewy platform standards that keep our services highly available and performant.

Your primary focus will be building a framework that automates and optimizes processes and reduces manual intervention, using tools such as Python and Terraform. You will also establish a site reliability framework whose results can be measured and reported to our customers, implement scalable processes with automation tools, and maintain security hardening on the load balancers, including regular upgrades and software maintenance. Day to day, you will take part in operations and developer/admin activities on the Chewy platform and share reports across the organization. Your work will help keep our systems reliable, secure, and performant, and ultimately improve the customer experience.

Responsibilities

Support the implementation and management of Chewy platform standards to facilitate the seamless transition of applications to production.
Leverage AWS services and employ containerization through Infrastructure as Code (IaC) techniques.
Provide a comprehensive framework for automating and optimizing processes, minimizing reliance on manual intervention.
Utilize tools such as Python and Terraform for efficient process automation.
Establish a robust framework for site reliability that can be measured and reported to customers.
Implement scalable processes using various process automation tools.
Maintain security hardening on the Load Balancer and oversee regular upgrades and software maintenance.
Engage in daily operations and regular Developer/Admin activities on the Chewy platform and share reports across the organization.

Requirements

Bachelor's degree in Computer Science, Information Science, Network Engineering, Cyber Security, Site Reliability, or related field and 5 years of experience; or a Master's degree and 3 years of experience.
Hands-on experience with cloud services, specifically AWS, including EC2 instances, ECS and EKS container platforms, IAM roles, VPC network configurations, load balancers, and other essential AWS services (3 years required).
Knowledge of the broader AWS global infrastructure, including regions and availability zones, and of designing systems for reliability and fault tolerance.
Experience with containerization technologies including Docker and AWS Fargate.
Expertise in orchestrating containers using Amazon Elastic Container Service (ECS) or Kubernetes.
Experience using observability tools such as Datadog or Splunk.
Experience with Service Level Objectives (SLOs) and the ability to measure service reliability.
Familiarity with CI/CD tools and processes, including pipelines-as-code (Jenkins, GitHub Actions).
Experience with configuration management and infrastructure-as-code (Terraform) for cloud provisioning.
Ability to troubleshoot and handle outages and incidents.

Benefits

Employee Referral Program
Commitment to equal opportunity and diversity and inclusion in the workplace
