Cloud Reliability Engineer

Ampcus - Charleston, SC

posted 5 months ago

Full-time - Mid Level

Remote - Charleston, SC

Professional, Scientific, and Technical Services

About the position

The Cloud Reliability Engineer will play a pivotal role in the Banking Solutions business, reporting directly to the Head of Cloud / API Engineering. This position is integral to the company's digital transformation journey, focusing on driving customer-centric innovation and automation. The engineer will be responsible for ensuring the reliability, availability, and performance of applications and services, with a strong emphasis on minimizing downtime and optimizing response times. This role will also involve leading incident response efforts, developing monitoring solutions, and implementing automation tools to enhance operational efficiency.

In this capacity, the Cloud Reliability Engineer will strategize the transition from private to public cloud, ensuring that all applications and services are reliable and performant. The engineer will collaborate with various teams, including development, QA, and DevOps, to align on reliability goals and incident response processes. Additionally, the role requires conducting capacity planning, performance tuning, and resource optimization, as well as managing deployment pipelines and release processes to ensure consistency and reliability across environments.

The engineer will also be tasked with creating and maintaining documentation for operational procedures and best practices, as well as developing disaster recovery plans to ensure business continuity. This position requires a commitment to continuous learning and adaptability to evolving technologies and business needs, with a focus on driving continuous improvement and operational excellence.

Responsibilities

Define and drive the building blocks of reliability engineering during the transition from private to public cloud.
Ensure the reliability, availability, and performance of applications and services, focusing on minimizing downtime and optimizing response times.
Lead incident response efforts, including identification, triage, resolution, and post-incident analysis.
Develop and maintain monitoring solutions and alerting mechanisms for infrastructure, application performance, and user experience metrics.
Implement automation tools and processes to automate routine tasks, scale infrastructure, and ensure seamless deployments.
Conduct capacity planning, performance tuning, and resource optimization for environments.
Collaborate with security teams to implement security best practices and ensure compliance with security standards.
Manage deployment pipelines, release processes, and configuration management for app deployments.
Identify areas for improvement in reliability, performance, and efficiency through data analysis and root cause analysis.
Create and maintain documentation, runbooks, and knowledge base articles for operational procedures and troubleshooting guides.
Develop and test disaster recovery plans, backup strategies, and failover mechanisms for app services.
Collaborate with development, QA, DevOps, and product teams to ensure alignment on reliability goals and performance metrics.
Participate in on-call rotations and provide 24/7 support for critical incidents.

Requirements

Specific experience in reliability engineering for a large-scale transition from private to public cloud.
Proficiency in development technologies, architectures, and platforms (web, API).
Experience in cloud platforms (e.g., AWS, Azure, Google Cloud) and infrastructure as code (IaC) tools.
Knowledge of monitoring tools (e.g., Dynatrace, LogRocket, Datadog) and logging frameworks (e.g., ELK Stack).
Experience in incident management, including incident response and root cause analysis.
Strong troubleshooting skills to diagnose complex technical issues.
Proficiency in scripting languages (e.g., Python, Bash) and automation tools (e.g., Ansible, Terraform).
Experience in implementing CI/CD pipelines using tools like Jenkins or Azure DevOps.
Expertise in setting up monitoring solutions and creating dashboards.
Commitment to continuous learning and staying updated with industry trends.

Nice-to-haves

Familiarity with APM (Application Performance Monitoring) tools.
Adaptability to evolving requirements and technologies.
Strong interpersonal communication and negotiation skills.

Benefits

Health insurance
401(k)
Paid holidays
Flexible scheduling
Professional development opportunities
