Site Reliability Engineer

Disability Solutions - Atlanta, GA

posted 2 months ago

Full-time - Entry Level

Hybrid - Atlanta, GA

5,001-10,000 employees

Administrative and Support Services

About the position

As a Site Reliability Engineer at Honeywell Connected Enterprise (HCE), you will play a crucial role in ensuring the reliability, availability, and performance of our software systems. Your responsibilities will include designing, implementing, and maintaining the infrastructure and tools necessary for monitoring and managing our applications. Your expertise in automation and troubleshooting will be essential in identifying and resolving issues to minimize downtime and optimize system performance. You will collaborate with cross-functional teams to drive continuous improvement and implement best practices for system reliability.

This position is based in Atlanta, Georgia, and operates on a hybrid work schedule, allowing for a blend of in-office and remote work. In this role, you will have a significant impact on the reliability and performance of our software systems, ensuring seamless operations and customer satisfaction.

You will be involved in hands-on design, analysis, development, and troubleshooting of highly distributed large-scale production systems and event-driven, cloud-based services. Your primary focus will be on Linux administration, managing a fleet of Linux and Windows VMs as part of the application solution. You will also engage in infrastructure as code development using tools like Terraform, shell scripting, and Python. Your responsibilities will extend to ensuring the repeatability, traceability, and transparency of our infrastructure automation.

You will support on-call rotations for operational duties that have not been addressed with automation and promote healthy software development practices, including compliance with chosen software development methodologies such as Agile. Additionally, you will create and maintain monitoring technologies and processes that improve visibility into our applications' performance and business metrics, keeping operational workload in check. Partnering with security engineers, you will develop plans and automation to respond to new risks and vulnerabilities effectively. Your role will also involve participating in technical training events, game day scenarios, and professional conferences to enhance your skills and knowledge.

Responsibilities

Design, implement, and maintain infrastructure and tools for monitoring and managing applications.
Perform hands-on design, analysis, development, and troubleshooting of large-scale production systems.
Manage a fleet of Linux and Windows VMs as part of the application solution.
Develop infrastructure as code using Terraform, shell, and Python.
Ensure repeatability, traceability, and transparency of infrastructure automation.
Support on-call rotations for operational duties not addressed by automation.
Promote healthy software development practices and compliance with methodologies like Agile.
Create and maintain monitoring technologies and processes for application performance.
Collaborate with security engineers to respond to risks and vulnerabilities.
Participate in technical training events and professional conferences.

Requirements

2+ years of experience in system administration, application development, infrastructure development, or related areas.
2+ years of experience in Azure cloud administration and solution design.
2+ years of programming experience in languages like JavaScript, Python, PHP, Go, Java, or Ruby.
3+ years of mastery in infrastructure automation technologies (Terraform, CodeDeploy, Puppet, Ansible, Chef).
2+ years of expertise in container/container-fleet orchestration technologies (Kubernetes, AKS, EKS, Docker, etc.).
2+ years of cloud and container-native Linux administration/build/management skills.

Nice-to-haves

Versatility with troubleshooting diverse hosting technologies including web server platforms, application platforms, and operating systems.
Expertise with cloud-based software development lifecycles (CI/CD).
Experience with cloud database operations and deployment (RDS MySQL/Postgres/Aurora).
Familiarity with site and infrastructure monitoring systems (ELK, Datadog, AppDynamics, etc.).
Strong problem-solving and root cause analysis skills.
Excellent presentation and communication skills.

Benefits

Employer subsidized Medical, Dental, Vision, and Life Insurance.
Short-Term and Long-Term Disability.
401(k) match.
Flexible Spending Accounts and Health Savings Accounts.
Employee Assistance Program (EAP).
Educational Assistance.
Parental Leave.
Paid Time Off for vacation, personal business, sick time, and parental leave.
12 Paid Holidays.
