Lead Site Reliability Engineer

Mastercard - O'Fallon, MO

posted 3 months ago

Part-time - Mid Level

Credit Intermediation and Related Activities

About the position

The Lead Site Reliability Engineer (SRE) position at Mastercard is a pivotal role within the Enterprise Data Accessibility BizOps team, aimed at enhancing the reliability and efficiency of large-scale, distributed services and infrastructures. The SRE will leverage their expertise in software and systems engineering to build and manage cloud operations, CI/CD pipelines, and automation best practices. This role is essential for ensuring that the services and infrastructures are not only reliable and fault-tolerant but also scalable and cost-effective. The SRE will be responsible for overseeing the production environment, ensuring operational readiness, and collaborating closely with developers to implement technology services that meet operational criteria such as system availability, performance, and deployment automation.

In this role, the SRE will engage in a variety of tasks including defining strategies for application performance monitoring, managing incident responses, and maintaining services post-launch by monitoring system health and availability. The SRE will also be involved in continuous optimization efforts within the production environment, ensuring that the systems are resilient and capable of handling the demands placed upon them. A significant aspect of the role involves practicing sustainable incident response and conducting blameless postmortems to foster a culture of learning and improvement.

The SRE will work with a global team, requiring effective communication and collaboration across different time zones. This position not only demands technical expertise but also a systematic problem-solving approach, strong communication skills, and a proactive mindset to drive improvements in customer experience and operational efficiency. The SRE will play a crucial role in the DevOps transformation at Mastercard, advocating for change and standardization across development, quality, release, and product organizations, ultimately aligning product priorities with operational needs.

Responsibilities

Plan, manage, and oversee all aspects of a Production Environment for Enterprise Data Accessibility.
Define strategies for application performance monitoring, unit cost, and chaos engineering.
Identify opportunities for continuous optimization in the production environment.
Understand MTTR, SLO, and SLI definitions and apply them to services.
Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.
Ensure reliable, fault-tolerant, efficiently scalable and cost-effective data, services and infrastructures.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Practice sustainable incident response and blameless postmortems.
Isolate problems between hardware and software, working with appropriate teams and vendors until resolution is reached.
Perform ad hoc requests from users such as data research and process issue investigations.
Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation and refinement.
Analyze ITSM activities of the platform and provide feedback to development teams on operational gaps or resiliency concerns.
Support services before they go live through system design consulting, capacity planning and launch reviews.
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating.
Take a holistic approach to problem solving during production events across the technology stack to optimize mean time to recover.
Work with a global team spread across tech hubs in multiple geographies and time zones.
Share knowledge and explain processes and procedures to others.

Requirements

Bachelor's degree in computer science, software engineering, or a similar field.
Experience in cloud technologies and operations.
Experience supporting APIs and cloud technologies.
Experience in monitoring/alerting tools such as Splunk and Dynatrace.
5+ years of DevOps, SRE, or general systems engineering experience.
5+ years of experience in running production systems.
2+ years of hands-on experience with industry-standard CI/CD tools such as Git/Bitbucket, Jenkins, Maven, Artifactory, and Chef.
Experience architecting and implementing data governance processes and tooling (such as data catalogs, lineage tools, role-based access control, PII handling).
Strong coding ability in Python or other languages like Java, C#, Golang, C, C++, Perl or Ruby, and a solid grasp of SQL fundamentals.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Ability to help debug and optimize code and automate routine tasks.
Ability to support many different stakeholders and deal with difficult situations with urgency.
Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Appetite for change and pushing the boundaries of automation.
Experience in working across development, operations, and product teams to prioritize needs and build relationships.

Nice-to-haves

Strong Big Data, Oracle and SQL Server experience.
SQL tuning experience.
Strong PowerBI experience.
Strong Data Observability experience.
Operations experience in supporting highly scalable systems.
Ability to operate in a 24x7 environment spanning global time zones.
Self-motivated, with a creative approach to solving software problems.

Benefits

Insurance (including medical, prescription drug, dental, vision, disability, life insurance)
Flexible spending account and health savings account
Paid leave (including 16 weeks of new parent leave and up to 20 paid days of bereavement leave)
10 annual paid sick days
10 or more annual paid vacation days based on level
5 personal days
10 annual paid U.S. observed holidays
401k with a best-in-class company match
Deferred compensation for eligible roles
Fitness reimbursement or on-site fitness facilities
Eligibility for tuition reimbursement
Gender-inclusive benefits
And many more.
