DevOps Site Reliability Engineer

Ford - Madison, WI

posted 3 months ago

Full-time - Mid Level

Onsite - Madison, WI

Transportation Equipment Manufacturing

About the position

At Ford, we are committed to building a better world through innovative technology and mobility solutions. The Monitoring as a Service (MaaS) Team is at the forefront of this mission, focusing on developing and enhancing services that prioritize customer needs. As a DevOps and Site Reliability Engineer (SRE), you will play a crucial role in shaping our global monitoring and observability platform. This position combines software engineering and systems engineering disciplines to ensure that our software systems are not only available but also scalable and maintainable.

You will be responsible for constructing API libraries and automation scripts, consulting with product teams to onboard new applications to various monitoring tools, and improving existing application workflows. Your role will involve deploying applications to containers, delivering a positive user experience for our internal customers, and leveraging your software development and systems administration skills to enhance our monitoring capabilities. You will also be tasked with architecting and developing automation solutions that improve the resilience, recoverability, availability, and scalability of our applications. By collaborating with development teams, you will help design and operate scalable software systems while proactively identifying stability risks and establishing mitigation plans.

This position requires a strong background in programming, particularly in Python, and experience with various monitoring tools and cloud services. As a member of the MaaS team, you will have the opportunity to innovate and push our capabilities forward, ensuring that we meet and exceed customer needs. You will also provide technical guidance and mentorship to other team members, participate in incident response, and conduct performance analysis of both new and existing systems. This role is essential in establishing a Site Reliability Engineering mindset within the organization, ensuring maximum availability and uptime for our services.

Responsibilities

Construct API libraries and automation scripts based on existing project workflows, mainly developing in Python.
Consult with Product Teams to onboard new applications to Splunk, Dynatrace, VictorOps, and other Monitoring Applications.
Work with First Responders and Product teams to improve and support tooling for existing applications, including participating in an On-Call rotation schedule for incident management.
Integrate and consolidate application workflows efficiently.
Deploy applications to containers using Cloud Run and Tekton pipelines.
Deliver a positive web user interface/experience to internal Ford customers.
Architect, design, and develop automation to improve resilience, recoverability, availability, and scalability of supported applications.
Measure and optimize system performance, focusing on customer needs and innovation.
Identify and reduce or eliminate toil via automation to maximize engineering and innovation time.
Collaborate with development teams to design, build, and operate scalable and resilient software systems using Cloud native principles.
Conduct performance analysis and optimization of new and in-production systems.
Provide technical guidance and mentorship to other team members.
Participate in incident response, support, recovery, and postmortem analysis.

Requirements

Bachelor's degree in Computer Science, Computer Engineering, Systems Engineering or related field, or a combination of education and equivalent work experience.
3+ years of experience as a DevOps Engineer and/or Site Reliability Engineer.
5+ years of experience programming with one or more of Python, Go, Java/Scala, C, C++, or similar technologies.
3+ years of experience with APM and other monitoring tools such as Dynatrace, New Relic, ELK, Splunk, Prometheus, Sensu, Nagios, Kafka, or Datadog.
1+ years with Google Cloud and its library of services.

Nice-to-haves

Master's Degree in Computer Science, Computer Engineering, Systems Engineering or related field.
Strong experience with J2EE, NoSQL/SQL datastores, Spring Boot, GCP/AWS/Azure, and Docker/Kubernetes in developing multi-tier applications.
Experience with automated test-driven development in CI/CD Pipelines.
Thorough understanding of software development and agile programming.
Understanding of, and ability to implement, effective observability strategies to improve MTTD and MTTR.
Experience with RESTful APIs and microservices platforms.
Working knowledge of the TCP/IP stack, Internet routing and load balancing.

Benefits

Immediate medical, dental, and prescription drug coverage.
Flexible family care, parental leave, new parent ramp-up programs, subsidized back-up childcare.
Vehicle discount program for employees and family members, and management leases.
Tuition assistance.
Established and active employee resource groups.
Paid time off for individual and team community service.
A generous schedule of paid holidays, including the week between Christmas and New Year's Day.
Paid time off and the option to purchase additional vacation time.
