Cloud Site Reliability Engineer

Venatôre - Liberty, NC

posted 7 days ago

Full-time

Liberty, NC

Professional, Scientific, and Technical Services

About the position

As a Cloud Site Reliability Engineer (SRE) at Venatore, you will play a crucial role in ensuring that our mission is never interrupted. Your primary responsibility will be to help transition legacy technologies to cloud infrastructure in an efficient and secure manner. This position requires a proactive approach to maintaining the production environment, monitoring availability, and taking a holistic view of system health. You will be tasked with building software and systems to manage platform infrastructure and applications, improving reliability, quality, and performance for cloud-hosted applications. In this role, you will measure and optimize system performance, focusing on pushing our capabilities forward, anticipating customer needs, and innovating to continually improve our services. You will provide primary operational support and engineering for multiple large, distributed software applications, gathering and analyzing metrics from both operating systems and applications to assist in performance tuning and fault finding. Collaboration with development teams will be essential as you work to improve services through rigorous testing and release procedures. Additionally, you will participate in system design consulting, platform management, and capacity planning, creating sustainable systems and services through automation and uplifts. Balancing feature development speed and reliability with well-defined service level objectives will be a key aspect of your responsibilities. Your expertise will help ensure that our cloud infrastructure remains robust and responsive to the needs of our users.

Responsibilities

Run the production environment by monitoring availability and taking a holistic view of system health.
Build software and systems to manage platform infrastructure and applications.
Improve reliability, quality, and performance for cloud-hosted applications.
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
Provide primary operational support and engineering for multiple large, distributed software applications.
Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding.
Partner with development teams to improve services through rigorous testing and release procedures.
Participate in system design consulting, platform management, and capacity planning.
Create sustainable systems and services through automation and uplifts.
Balance feature development speed and reliability with well-defined service level objectives.

Requirements

Bachelor's Degree in a STEM field.
DoD 8570 Level II (Security+) certification.
8+ years of related experience.
Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript.
Adept at shell/Bash scripting.
Experience with distributed storage technologies like NFS, HDFS, Ceph, and S3.
2+ years of experience working with container orchestration technologies, specifically Kubernetes.
Security clearance: Secret to start; must be able to obtain TS/SCI.
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks, along with the ability to propose and implement solutions.
Experience creating dashboards that track service health for both technical and non-technical audiences, preferably with Splunk.
Excellent written and verbal communication skills, with a strong attention to detail and a head for problem solving.
Skilled at working in tandem with a team, or unsupervised as required.

Nice-to-haves

Experience working with identity and access management technologies and solutions.
Experience with Agile development methodologies and collaboration tools such as Jira and Confluence.
Experience with monitoring and logging solutions, specifically Splunk.
Any of the following certifications: AWS Certified SysOps Administrator Associate, AWS Certified Solutions Architect Associate, or a Professional-level version of either, where applicable.
1+ years of experience working with GitLab.
Skilled at creating Ansible playbooks and working with AWX/Ansible Tower.
