Ledgent - Newport Beach, CA

posted about 2 months ago

Full-time - Senior
Newport Beach, CA
Administrative and Support Services

About the position

The Lead Site Reliability Engineer (SRE) will provide technical leadership and accountability for platform engineering, system design, and implementation to meet product non-functional requirements such as quality, security, reliability, availability, and performance. This role involves optimizing design and engineering processes, overseeing production operations, and developing solutions to enhance system reliability and automation.

Responsibilities

  • Lead the design, build, and implementation of orchestration and tooling solutions that make administration tasks efficient.
  • Establish best practices for structuring, automating, building, deploying, and monitoring complex distributed software products.
  • Ensure reliability and traceability of software releases and deployments.
  • Create and maintain platform architecture and design specifications.
  • Design and implement monitoring and recovery tools for high availability and disaster recovery.
  • Develop highly available infrastructure and platform components for evolving product lines.
  • Implement security engineering best practices in deployed platforms.
  • Triage alerts, diagnose and resolve critical issues, and manage change implementations.
  • Coordinate, document, and track critical incidents and root cause analysis for rapid issue resolution.
  • Collaborate with Delivery Engineers and DevExp Engineers to enhance continuous integration/continuous deployment orchestration.
  • Lead, grow, and mentor other SRE team members.
  • Promote the DevSecOps culture and SRE mindset, mentoring others on reliability best practices.
  • Identify opportunities for automation, alert noise reduction (a better signal-to-noise ratio), and prevention of recurring issues.
  • Maintain a strong understanding of IaaS, PaaS, and SaaS offerings for cloud-based environments.
  • Design and implement processes and automation for performance testing.
  • Ensure documentation and operational processes support the solution lifecycle.

Requirements

  • 10-15 years of experience in infrastructure, system engineering, or software engineering.
  • Advanced knowledge of software engineering in test and test automation frameworks.
  • Advanced knowledge of at least three of these key areas: cloud-native and IaaS architecture, design, cloud engineering, or container orchestration solutions.
  • Strong understanding of business technology drivers and their impact on architecture design.
  • Advanced knowledge of observability engineering, with hands-on experience in monitoring platforms.
  • Systematic problem-solving approach with strong communication skills.
  • Hands-on experience in designing, analyzing, scaling, and troubleshooting distributed systems.
  • Well-versed in SRE methodologies and passionate about automation and software engineering.
  • Ability to communicate technical strategy effectively across the organization.
  • Demonstrated ability to launch and deliver multiple engineering projects on time and within budget.

Nice-to-haves

  • Subject matter expert in AWS or other public cloud providers.
  • Expertise in microservices lifecycle management.
  • Strong experience with logging and monitoring tools like ELK stack, Prometheus, and Datadog.
  • Expert knowledge of software release tooling like Jenkins or Azure DevOps.
  • Expert-level knowledge of containerization technologies and Docker image management.
  • Expert-level knowledge of Kubernetes.

Benefits

  • Health insurance
  • 401k
  • Paid holidays
  • Flexible scheduling
  • Professional development opportunities