Hitachi - Dallas, TX
posted 2 months ago
The Lead Site Reliability Engineer is a key member of the Site Reliability Engineering (SRE) team at Hitachi Digital Services. The role is responsible for the availability, reliability, and performance of our services and platforms in a highly transactional 24x7 environment.

As a Lead Site Reliability Engineer, you will monitor application performance, implement improvements, and ensure overall application stability. You will apply automation and software solutions to tasks that can benefit from them, particularly those currently performed manually. You will troubleshoot issues related to operating systems, networking, and databases in both cloud-based and on-premises environments, handle live production incidents, debug application and infrastructure issues, and implement SRE best practices.

You will also conduct system analysis and configuration management, and develop improvements to system software performance, availability, and reliability. You will work closely with software engineers and quality assurance teams to ensure that the system meets non-functional requirements such as performance, security, and availability.

Documentation is a critical part of this position: you will document system knowledge, create runbooks, and ensure that critical system information is readily available to those who need it. Additionally, you will maintain and monitor the deployment of servers, Docker containers, databases, and general backend infrastructure so that it runs smoothly and efficiently.