Raas Infotek - Atlanta, GA
posted 2 months ago
The Site Reliability Engineer (SRE) position is a critical role focused on ensuring the reliability, availability, and performance of our systems. As an SRE, you will be responsible for maintaining highly available and scalable systems, drawing on extensive experience in cloud technologies and infrastructure management. You will work closely with development teams to implement best practices in automation, monitoring, and incident response, ensuring that our services meet the highest standards of reliability and performance.

In this role, you will use your strong proficiency in scripting and programming languages such as Python, Bash, Ruby, or Go to develop automation tools that improve operational efficiency. Your expertise in cloud platforms like AWS, Azure, or Google Cloud Platform will be essential as you manage infrastructure using infrastructure-as-code tools like Terraform or CloudFormation. You will also be expected to have a solid understanding of containerization and orchestration technologies, particularly Docker and Kubernetes, to deploy and manage applications in a cloud environment.

As part of your responsibilities, you will implement and maintain monitoring and logging solutions using tools such as Prometheus, the ELK stack, or Splunk, and use distributed tracing frameworks like Jaeger or Zipkin to troubleshoot complex issues in production environments. Your strong problem-solving skills will be crucial in identifying and resolving incidents quickly, minimizing downtime and ensuring a seamless user experience.

Collaboration is key in this role: you will work in cross-functional teams and communicate clearly with both technical and non-technical stakeholders. Your ability to mentor and lead others will also be valuable as you help foster a culture of reliability and operational excellence within the organization.