Site Reliability Engineer

$122,720 - $124,800/Yr

Randstad - Merrimack, NH

posted 5 months ago

Full-time - Mid Level

Administrative and Support Services

About the position

As a Site Reliability Engineer (SRE), you will be at the forefront of ensuring the reliability and performance of our large-scale, distributed systems. This role combines software engineering and systems engineering principles to build and maintain fault-tolerant systems that can handle massive workloads. You will be responsible for managing Kubernetes clusters, particularly with Amazon EKS, and will need to demonstrate strong troubleshooting skills in this area. Your expertise in Python programming and API development will be crucial as you work to enhance our systems and automate processes.

In this position, you will also engage with various monitoring and data visualization tools such as Datadog, Splunk, ELK, Prometheus, and Grafana. Your hands-on experience with these tools will help in log aggregation, monitoring, and alerting, ensuring that our systems are always performing optimally. You will be expected to implement AWS products and services effectively, utilizing infrastructure as code tools like CloudFormation and Terraform to manage our cloud resources efficiently.

Collaboration is key in this role, as you will be working within a globally distributed team. Strong communication skills, both written and oral, are essential to convey complex technical information clearly. You will also be expected to contribute to a DevOps culture, embracing agile methodologies and continuously learning new technologies and practices to improve our systems and processes.

This position is a contract role based in either Merrimack, NH, or Westlake, TX, with a work schedule from 9 AM to 5 PM.

Responsibilities

Manage and administer Kubernetes clusters using Amazon EKS.
Troubleshoot issues related to Kubernetes and ensure system reliability.
Develop and maintain Python scripts and APIs for system automation.
Implement and manage CI/CD pipelines using tools like Ansible, Jenkins, and Argo CD.
Utilize log aggregation and monitoring tools for system performance tracking.
Work with AWS products and services to enhance system capabilities.
Implement infrastructure as code using CloudFormation and Terraform.
Collaborate with globally distributed teams to ensure seamless operations.
Communicate effectively with team members and stakeholders regarding system performance and issues.

Requirements

Minimum of 3 years of experience in systems and platform operations.
Strong experience in Kubernetes cluster administration.
Proficient in Python programming and API development.
Solid understanding of application networking principles.
Hands-on experience with log aggregation and monitoring tools (Datadog, Splunk, ELK, Prometheus, Grafana).
Experience with infrastructure as code tools (CloudFormation, Terraform).
Strong knowledge of container technologies, particularly Docker.
Experience implementing CI/CD pipelines with Ansible and Jenkins.
Understanding of AWS cloud security and account management.
Strong Linux and shell scripting skills.
Ability to work in an agile environment and adapt to new technologies.

Nice-to-haves

Experience with additional monitoring tools and frameworks.
Familiarity with DevOps practices and culture.
Knowledge of advanced AWS services and features.

Benefits

Health insurance coverage.
401(k) contribution plan.
Incentive and recognition program.
