Hitachi Digital Services - Dallas, TX

posted 4 days ago

Full-time - Senior
Dallas, TX
1,001-5,000 employees

About the position

As an SRE Lead at Hitachi Digital Services, you will be responsible for ensuring the availability, reliability, and performance of both cloud-based and on-premises platforms. This role involves leading a team of engineers to troubleshoot and optimize systems while promoting automation and SRE best practices. You will also manage incident processes, drive innovation in generative AI applications, and mentor team members to uphold high operational standards.

Responsibilities

  • Lead a team of platform, application, and incident SREs to manage and resolve complex production issues.
  • Improve application performance, availability, and reliability.
  • Implement observability solutions for proactive issue identification and optimization.
  • Manage processes for incidents, changes, releases, and deployments.
  • Develop automation tools (IaC, alert as code, dashboard as code) to enhance efficiency.
  • Conduct POCs to implement tools supporting generative AI platforms.
  • Analyze trends in incidents, problems, and alerts to drive operational improvements.
  • Document SOPs, critical systems information, and best practices for current and future use.
  • Provide technical guidance and mentorship to junior SRE team members.
  • Stay updated on advancements in generative AI technologies and responsible AI practices.

Requirements

  • Proven experience with SRE principles and practices in managing on-premises and cloud applications.
  • Knowledge of generative AI applications and related technologies.
  • Strong leadership skills, with the ability to drive team performance and continuous improvement.
  • Analytical skills for resolving complex technical issues, ensuring system reliability, and minimizing downtime.
  • Excellent communication and collaboration skills to work effectively with cross-functional teams.
  • Expertise in core SRE practices such as anomaly detection, root cause analysis, and predictive maintenance.
  • Proficiency in defining SLIs, SLOs, and error budgets.
  • Experience leading an operations team in application production environments.
  • Knowledge of programming and scripting languages (e.g., Java, Python, PowerShell).
  • Hands-on experience with Kubernetes and OpenTelemetry.
  • Understanding of generative AI, large language models (LLMs), and responsible AI.
  • Familiarity with DevOps methodologies, tools, and automation (e.g., CI/CD pipelines, Terraform, Helm).
  • Experience with public cloud platforms (e.g., AWS, Azure, GCP) and private cloud environments.

Nice-to-haves

  • Knowledge of fine-tuning models, prompt engineering, retrieval-augmented generation (RAG), and cost optimization techniques.

Benefits

  • Industry-leading benefits and support for holistic health and wellbeing.
  • Flexible arrangements that work for you, depending on role and location.
  • A culture that champions life balance and encourages new ways of working.