Site Reliability Engineer III (Onsite)

Ledgent - Newport Beach, CA

posted about 2 months ago

Full-time - Senior

Administrative and Support Services

About the position

The Lead Site Reliability Engineer (SRE) will provide technical leadership and accountability for platform engineering, system design, and implementation to meet product non-functional requirements such as quality, security, reliability, availability, and performance. This role involves optimizing design and engineering processes, overseeing production operations, and developing solutions to enhance system reliability and automation.

Responsibilities

Lead the design, build, and implementation of orchestration and tooling solutions that streamline administration tasks.
Establish best practices for structuring, automating, building, deploying, and monitoring complex distributed software products.
Ensure reliability and traceability of software releases and deployments.
Create and maintain platform architecture and design specifications.
Design and implement monitoring and recovery tools for high availability and disaster recovery.
Develop highly available infrastructure and platform components for evolving product lines.
Implement security engineering best practices in deployed platforms.
Triage alerts, diagnose and resolve critical issues, and manage change implementations.
Coordinate, document, and track critical incidents and root cause analysis for rapid issue resolution.
Collaborate with Delivery Engineers and DevExp Engineers to enhance continuous integration/continuous deployment orchestration.
Lead, grow, and mentor other SRE team members.
Promote the DevSecOps culture and SRE mindset, mentoring others on reliability best practices.
Identify opportunities for automation, improving the signal-to-noise ratio of alerts, and preventing recurring issues.
Maintain a strong understanding of IaaS, PaaS, and SaaS offerings for cloud-based environments.
Design and implement processes and automation for performance testing.
Ensure documentation and operational processes support the solution lifecycle.

Requirements

10-15 years of experience in infrastructure, system engineering, or software engineering.
Advanced knowledge of software engineering in test and test automation frameworks.
Advanced knowledge in at least three of the following areas: cloud-native and IaaS architecture, design, cloud engineering, or container orchestration solutions.
Strong understanding of business technology drivers and their impact on architecture design.
Advanced knowledge of observability engineering, with hands-on experience in monitoring platforms.
Systematic problem-solving approach with strong communication skills.
Hands-on experience in designing, analyzing, scaling, and troubleshooting distributed systems.
Well-versed in SRE methodologies and passionate about automation and software engineering.
Ability to communicate technical strategy effectively across the organization.
Demonstrated ability to launch and deliver multiple engineering projects on time and within budget.

Nice-to-haves

Subject matter expert in AWS or other public cloud providers.
Expertise in microservices lifecycle management.
Strong experience with logging and monitoring tools such as the ELK stack, Prometheus, and Datadog.
Expert knowledge of software release tooling such as Jenkins or Azure DevOps.
Expert-level knowledge of containerization technologies and Docker image management.
Expert-level knowledge of Kubernetes.

Benefits

Health insurance
401k
Paid holidays
Flexible scheduling
Professional development opportunities
