Site Reliability Engineer (SRE)

INSPYR Solutions - Houston, TX

posted 5 days ago

Full-time - Mid Level

Houston, TX

Administrative and Support Services

About the position

The Site Reliability Engineer (SRE) will be a key member of the Digital IT team, focusing on building OpenShift/Kubernetes capabilities and ensuring the reliability of critical IT/OT applications. This role involves automating IT infrastructure tasks, implementing SRE best practices, and collaborating with application developers to enhance user experience. The ideal candidate will have extensive experience in deploying OpenShift and Kubernetes, along with a strong background in software development and systems administration.

Responsibilities

Maintain the survivability and reliability of critical IT/OT resources.
Write and build CI/CD pipelines and build/release processes for IT/OT workflow applications.
Mentor the IT/OT DevOps team in best practices for CI/CD deployments using ADO and Git.
Perform periodic load and scalability testing to establish baselines, detect drift, and inform capacity planning.
Conduct weekly operational state reviews covering performance trends, anomalies, errors, and other availability events with SREs, product owners, and development teams.
Participate in quarterly business and operational reviews aligning on roadmaps, development velocity, efficiency, growth trends, patching, etc.
Plan and execute periodic Disaster Recovery exercises including both tabletop and simulated failures (fault injection).

Requirements

Bachelor's degree and 7 years of IT experience.
Senior level experience with OpenShift and Kubernetes.
Familiarity with continuous integration/deployment processes and tools such as IDEs (Eclipse), source code management (Git/Stash), ADO Pipelines, Maven, Nexus artifacts, etc.
Strong understanding of SRE practices: incident response, change/release management, capacity planning, infrastructure automation, elastic environments, chaos engineering, and blameless postmortems.
Expertise in application performance monitoring, observability, and proactive alert correlation, including monitoring containers and failure-based alerting.
Scripting experience in languages such as Python and Bash.
Experience deploying applications on OpenShift in both public and private clouds.
Excellent written and oral communication skills.
Demonstrated ability to communicate to non-technical audiences on technical issues.
Demonstrated ability to communicate on a technical level to a technical audience.
Strong interpersonal skills, adaptable and able to learn quickly.
Requires limited supervision and has excellent time management skills.
Self-motivated self-starter.
Ability to work and interact with others in a structured/team environment.

Nice-to-haves

Experience in cloud/virtual technologies and management - OpenShift, VMware, AWS, Azure, etc.
Familiarity with security best practices for containerized applications.
Knowledge of DevOps practices and tools.
Knowledge, skills, and abilities to automate the creation of Platform as a Service (PaaS) infrastructure using industry-standard tools such as Ansible and Chef.
Familiarity with Industrial Control System (ICS) security architecture - Purdue model.

Benefits

Comprehensive medical benefits
Competitive pay
401(k) retirement plan
...and much more!
