Hitachi - Dallas, TX

posted 2 months ago

Full-time - Mid Level
Dallas, TX
Furniture, Home Furnishings, Electronics, and Appliance Retailers

About the position

The Lead Site Reliability Engineer is a crucial member of the Site Reliability Engineering (SRE) team at Hitachi Digital Services. This role is essential for ensuring the availability, reliability, and performance of our services and platforms in a highly transactional 24x7 environment.

As a Lead Site Reliability Engineer, you will be responsible for monitoring application performance, implementing improvements, and ensuring overall application stability. You will apply automation and software solutions to tasks that would benefit from them, particularly those currently performed manually. Your expertise will be vital in troubleshooting issues related to operating systems, networking, and databases in both cloud-based and on-premises environments. You will also handle live production incidents, debug and troubleshoot application and infrastructure issues, and implement SRE best practices.

In this role, you will conduct system analysis and configuration management, and develop improvements for system software performance, availability, and reliability. Collaboration is key: you will work closely with software engineers and quality assurance teams to ensure that the system meets non-functional requirements such as performance, security, and availability.

Documentation is also a critical aspect of this position. You will be expected to document system knowledge, create runbooks, and ensure that critical system information is readily available to those who need it. Additionally, you will maintain and monitor the deployment of servers, Docker containers, databases, and general backend infrastructure, ensuring that everything runs smoothly and efficiently.

Responsibilities

  • Monitor application performance and implement improvements for stability.
  • Apply automation and software solutions to manual tasks.
  • Troubleshoot operating system, networking, and database issues in cloud and on-premises environments.
  • Handle live production incidents and debug application and infrastructure issues.
  • Conduct system analysis and configuration management.
  • Develop improvements for system software performance, availability, and reliability.
  • Collaborate with software engineers and quality assurance teams to meet non-functional requirements.
  • Document system knowledge and create runbooks for critical information.
  • Maintain and monitor the deployment of servers, Docker containers, and databases.

Requirements

  • Bachelor's Degree in Computer Science or related field, or equivalent experience.
  • 8+ years of experience in full-stack application support, DevOps, or SRE roles.
  • Experience in JavaScript, TypeScript, and web development technologies.
  • Proficiency in scripting languages such as PowerShell and/or Python.
  • Strong troubleshooting skills using built-in browser tools.
  • Knowledge of DevOps methodologies and tools, including CI/CD concepts and tools like Jenkins and CodePipeline.
  • Familiarity with automation and configuration tools such as Puppet and Ansible.
  • Experience with public clouds (GCP, AWS, Azure) and implementing projects on them is a plus.
  • Excellent verbal and written communication skills.
  • Ability to collaborate with local and remote teams across different time zones.
  • Ability to present and lead technical discussions.

Nice-to-haves

  • Experience with additional programming languages or frameworks.
  • Familiarity with container orchestration tools like Kubernetes.
  • Knowledge of monitoring and logging tools such as Prometheus or Grafana.

Benefits

  • Industry-leading benefits and support for holistic health and wellbeing.
  • Flexible work arrangements based on role and location.
  • Participation in bonus/variable/commission pay programs.
  • Support for diversity, equity, and inclusion initiatives.