KnowBe4 - Clearwater, FL

posted 7 days ago

Full-time
Remote - Clearwater, FL
Educational Services

About the position

The Internal Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of internal systems and infrastructure. This role focuses on monitoring, automation, incident management, and maintaining self-hosted platforms to support smooth development operations. The Internal SRE collaborates with cross-functional teams to manage GitLab CI/CD workflows and cloud infrastructure on AWS, emphasizing proactive problem-solving and continuous improvement of system stability and efficiency.

Responsibilities

  • Manage and maintain GitLab environments to ensure high availability and security.
  • Design and implement CI/CD pipelines to automate software delivery.
  • Monitor and troubleshoot system performance issues using observability tools such as Prometheus, Grafana, or Datadog.
  • Collaborate with development teams to align infrastructure efforts with project needs and timelines.
  • Build and maintain infrastructure as code (IaC) solutions using tools like Terraform and Ansible.
  • Manage AWS services, including ECS, S3, API Gateway, DynamoDB, RDS, IAM, and VPC.
  • Participate in incident response, conducting root cause analysis and post-incident reviews.
  • Automate manual tasks to improve operational efficiency and reduce technical debt.

Requirements

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • Equivalent work experience in SRE, DevOps, or infrastructure management may substitute for formal education.
  • Experience managing and securing self-hosted GitLab environments.
  • Expertise in designing and maintaining automated pipelines for continuous delivery.
  • Strong knowledge of AWS services, including ECS, S3, API Gateway, DynamoDB, RDS, IAM, VPC, and Lambda.
  • Proficiency in Terraform, Ansible, or similar tools.
  • Experience with Prometheus, Grafana, Datadog, or other observability platforms.
  • Proficiency in Python, Bash, or other scripting languages to automate tasks.
  • Ability to lead incident response efforts and conduct root cause analysis.
  • Strong interpersonal skills to work effectively across teams and with stakeholders.

Benefits

  • Competitive salary based on experience and skills.
  • Opportunities for professional development and training.