Site Reliability Engineer

$135,000 - $175,000/Yr

Striveworks - Tampa, FL

posted 1 day ago

Full-time - Mid Level
Hybrid - Tampa, FL

About the position

As a Site Reliability Engineer (SRE) at Striveworks, you will be challenged, and trusted, on day one to take ownership of specific product deployments by maintaining, optimizing, and enhancing our on-premises and cloud computing environments. You will play a crucial role in the successful deployment of our software solutions to clients. You will be responsible for executing the technical aspects of implementation projects and for ensuring the seamless integration, customization, and configuration of our software. Your expertise will play a critical role for the company as we deploy new instances of Striveworks' machine learning operations (MLOps) capabilities to customer infrastructure. You are right for this opportunity if you value and possess technical expertise and you enjoy pushing the boundaries of your capabilities.

You will be responsible for maintaining Striveworks' software deployments both on-premises and with various cloud service providers, using Infrastructure-as-Code (IaC) methodologies. Your day-to-day will include:

  • Automating IaC to stand up virtual machines and deploy containers, services, and other infrastructure; leaning on expertise to deploy custom Kubernetes clusters in AWS, Azure, GCP, on-premises, or hybrid cloud environments
  • Working with platform developers and DevOps to define requirements and build solutions for customer use cases of the platform
  • Software deployments to unclassified, CUI, Secret, and Top Secret DOD networks
  • Incident response and initial triage of critical system faults

The SRE works on the DevOps team and acts as a liaison between DevOps, platform developers, and professional services teams, taking on operational tasks to ensure the efficient functioning of Striveworks' customer solutions. The SRE monitors, automates, and improves software reliability, performance, and availability, which supports the IT needs for various projects. The SRE works alongside a team of software engineers and data scientists to help them deploy and operate their work as functional products, learning from them so that building effective AI solutions becomes second nature.

You will directly contribute to the success of mission-critical systems for national security and commercial clients. You will be expected to wear multiple hats and to step into vacuums where improvements are needed, and you will be given the latitude to explore new technologies and solutions.

Responsibilities

  • Automating IaC to stand up virtual machines and deploy containers, services, and other infrastructure
  • Deploying custom Kubernetes clusters in AWS, Azure, GCP, on-premises, or hybrid cloud environments
  • Working with platform developers and DevOps to define requirements and build solutions for customer use cases of the platform
  • Software deployments to unclassified, CUI, Secret, and Top Secret DOD networks
  • Incident response and initial triage of critical system faults
  • Monitoring, automating, and improving software reliability, performance, and availability
  • Acting as a liaison between DevOps, platform developers, and professional services teams

Requirements

  • 2+ years of direct, hands-on experience in microservice deployment in Kubernetes
  • Helm Chart and Kustomizations development/deployment
  • Python programming
  • Automation and IaC (e.g., Terraform, Ansible)
  • Cloud infrastructure (e.g., AWS, Azure, GCP, or OpenStack)
  • Understanding of networking concepts, security, and disaster recovery best practices
  • Strong problem-solving skills and the ability to troubleshoot complex technical issues
  • Communication and collaboration skills; the ability to work effectively in a cross-functional team environment
  • Active Top Secret security clearance and intimate familiarity with DOD networking, tools, infrastructure, security requirements, and policies

Nice-to-haves

  • Bash programming
  • Experience managing and troubleshooting Linux systems (e.g., RHEL, Ubuntu, CentOS)
  • Experience deploying, maintaining, or contributing to CNCF projects
  • Proficiency with US federal information system security policies, including Security Technical Implementation Guides (STIGs), NIST 800-171, NIST 800-53, CMMC, ICD 503
  • Experience with DevSecOps/DevOps and CI/CD for administration and deployment of GPU-enabled servers
  • Experience with Network-Attached Storage (NAS) and Storage Area Networks (SAN) technologies
  • Deep knowledge of DevOps principles and practices for deploying and managing service mesh in cloud environments
  • Experience with both blue-green and canary deployment strategies
  • Experience designing, managing, and optimizing workloads across multiple cloud providers
  • Experience with Kubernetes and cloud-native applications and services in denied, disrupted, intermittent, and limited-bandwidth (DDIL) environments
  • DOD 8570 IAT II certification (Security+ CE); proficiency with security automation and familiarity with API security, container security, and cloud security

Benefits

  • Top-of-market salary and total compensation
  • Generous equity plan
  • Health/vision/dental insurance
  • Unlimited PTO
  • Parental leave