Visa - Austin, TX

posted 2 months ago

Full-time - Mid Level
Hybrid - Austin, TX
Credit Intermediation and Related Activities

About the position

As a Staff Site Reliability Engineer in Product Reliability Engineering at Visa, you will maintain and support Visa's Data Platform, focusing on cloud-based Big Data and Kafka platforms. The role involves driving innovation, ensuring system availability and performance, and improving operational efficiency while collaborating with teams across Visa to enhance platform reliability.

Responsibilities

  • Design, build, and manage Big Data and Kafka infrastructure on AWS, GCP, and Azure.
  • Manage and optimize Big Data and Apache Kafka clusters for high performance, reliability, and scalability.
  • Develop tools and processes to monitor and analyze system performance and identify potential issues.
  • Collaborate with other teams to design and implement solutions to improve reliability and efficiency of the Big Data cloud platforms.
  • Ensure security and compliance of the platforms within organizational guidelines.
  • Conduct root cause analysis of major production incidents and develop learning documentation.
  • Identify and implement high-availability solutions for services with a single point of failure.
  • Plan and perform capacity expansions and upgrades in a timely manner to avoid scaling issues and bugs.
  • Automate repetitive tasks to reduce manual effort and prevent human errors.
  • Tune alerting and set up observability to proactively identify issues and performance problems.
  • Work closely with Level 3 teams to review new use cases and cluster hardening techniques.
  • Create standard operating procedure documents and guidelines for managing and utilizing the platforms.
  • Leverage DevOps tools and disciplines in day-to-day operations.
  • Ensure platforms meet performance and service level agreement requirements.
  • Perform security remediation, automation, and self-healing as required.
  • Develop automations and reports to minimize manual effort.

Requirements

  • 5 or more years of relevant work experience with a Bachelor's Degree, or at least 2 years of work experience with an Advanced degree (e.g., Masters, MBA, JD, MD), or 0 years of work experience with a PhD.
  • Demonstrated experience with AWS and GCP cloud platforms.
  • Experience with managing and optimizing Big Data and Kafka clusters.
  • Proficient in scripting languages (Python, Bash) and SQL.
  • Familiarity with big data tools (Spark, Kafka, etc.) and frameworks (HDFS, MapReduce, etc.).
  • Strong knowledge in system architecture and design patterns for high-performance computing.
  • Good understanding of data security and privacy concerns.
  • Excellent problem-solving and troubleshooting skills.
  • Strong communication and collaboration skills.
  • Understanding of Linux, networking, CPU, memory, and storage.

Nice-to-haves

  • Experience with infrastructure automation technologies like Docker, Kubernetes, Ansible, Terraform.
  • Knowledge of observability tools like Grafana, Opera, and Splunk.

Benefits

  • Medical
  • Dental
  • Vision
  • 401(k)
  • FSA/HSA
  • Life Insurance
  • Paid Time Off
  • Wellness Program