TRC Companies, posted 27 days ago
$120,000 - $140,000/Yr
Full-time • Mid Level
Clifton Park, NY
Administrative and Support Services

About the position

We are looking to hire a Site Reliability Engineer (SRE) to build and maintain software that enables our customers to navigate our site quickly. The SRE will apply strong problem-solving skills to support product expansion and improve the customer experience in line with company objectives. The ideal candidate will have a background in the utility industry and strong knowledge of agile methodologies, specifically Scrum. The SRE will be responsible for ensuring the reliability, availability, and performance of our software products. This role involves working closely with both onshore and offshore teams to support migrations and product development. Excellent communication skills and the ability to collaborate effectively with diverse teams are essential. To be successful in this role, you should be meticulous and detail-oriented, with excellent technical and information security skills.

Responsibilities

  • Develop and provide operational support for full-stack software applications.
  • Ensure the reliability, availability, and performance of software products in production.
  • Collaborate with development operations staff to create, monitor, and troubleshoot the entire system, including infrastructure.
  • Increase system resilience and serve larger customer volumes with expert-level coding, bulletproof release, and change management skills.
  • Monitor and manage system health and performance, and carry out capacity planning.
  • Automate repetitive tasks to improve system efficiency and reduce manual intervention.
  • Collect operating system data and report performance metrics to stakeholders.
  • Manage cloud and database system maintenance, debugging production issues as they arise.
  • Experience building software and computer systems using a variety of languages (JavaScript, Python, etc.).
  • Comfortable working with cloud-native infrastructure, such as AWS Lambda and Azure Cloud Services.
  • Impeccable creative and communication skills.
  • Ability to problem solve in a fast-paced, high-stakes environment.
  • Monitor application and VPC traffic for suspicious behavior.
  • Create application policies and service level agreement (SLA) metrics for measuring operations.
  • Develop and maintain incident response plans and conduct post-incident reviews to prevent future occurrences.
  • Participate in on-call rotations to provide timely support for critical systems.
  • Foster a culture of continuous improvement and learning within the team.
  • Consult with staff, managers, and executives on best operational practices and provide technical advice.

Requirements

  • Bachelor's degree in computer science, cyber security, or a related field.
  • 5+ years of experience as a site reliability engineer or in a similar role.
  • Relevant industry certifications, such as the Site Reliability Engineering (SRE) Foundation certification.
  • Strong knowledge of agile methodologies, particularly Scrum.
  • Proficiency in scripting and automation tools (e.g., Python, Bash, Ansible).
  • Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Familiarity with cloud platforms (e.g., AWS, Azure, Google Cloud), preferably AWS.
  • Excellent communication skills, both written and verbal.
  • Proven ability to work effectively with offshore and onshore teams.
  • Strong problem-solving skills and a proactive approach to identifying and resolving issues.
  • 5+ years of experience with cloud infrastructure automation technologies (e.g., Terraform, AWS CodeDeploy).

Nice-to-haves

  • Master's degree in computer science or a related field.
  • Certifications in cloud technologies (e.g., AWS Certified Solutions Architect, Google Cloud Professional DevOps Engineer).
  • Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
  • Experience in performance tuning and optimization of software systems.
  • 6+ years of experience managing machine images (e.g., AWS AMIs), including remediation and patching, strongly preferred.
  • Familiarity with the energy efficiency domain is a plus.

Benefits

  • Medical, dental, vision, and disability insurance.
  • 401(k) plan with both traditional and Roth options and a company match.
  • Paid time off based on full-time or part-time status and level of seniority (ranging from 15 to 25 days per year).
  • All full-time employees receive a minimum of 8 paid holidays per year.
  • TRC ensures that all employees, including those who work part-time, receive paid sick, family, and disability leave in accordance with the laws of their state of residence.
© 2024 Teal Labs, Inc