DevOps Site Reliability Engineer

Ford - Honolulu, HI

posted 3 months ago

Full-time - Mid Level

Onsite - Honolulu, HI

Transportation Equipment Manufacturing

About the position

We are the movers of the world and the makers of the future. At Ford, we are part of something bigger than ourselves, and we are looking for an experienced DevOps and Site Reliability Engineer (SRE) to join our Monitoring as a Service (MaaS) Team. This role is critical in shaping the future of mobility by leveraging advanced technology to redefine the transportation landscape, enhance customer experience, and improve people's lives.

The MaaS Team is focused on building and evolving services with customers in mind, providing robust monitoring tools powered by AI and easy-to-use dashboards. This position will involve ensuring that our software systems are available, scalable, and maintainable, combining software engineering and systems engineering disciplines.

As a DevOps/SRE, you will lead the development, enhancement, and extension of our global monitoring and observability platform. Your responsibilities will include constructing API libraries and automation scripts, consulting with product teams to onboard new applications to various monitoring applications, and improving tooling for existing applications. You will also be involved in deploying applications to containers, delivering a positive user interface for internal customers, and performing destructive testing to discover vulnerabilities.

This role requires a strong background in software development and systems administration, excellent problem-solving skills, and the ability to collaborate with development teams to design and operate scalable software systems. In this position, you will proactively identify stability risks, conduct performance analysis, and provide technical guidance to team members. You will participate in incident response and postmortem analysis, ensuring maximum availability and uptime for our systems. This role is essential for driving innovation and improving the reliability and quality of our software solutions, ultimately enhancing the customer experience and meeting the evolving needs of our users.

Responsibilities

Constructing API Libraries & automation scripts based on existing project workflows, mainly developing in Python
Consulting with Product Teams to onboard new applications to Splunk, Dynatrace, VictorOps, and other Monitoring Applications
Working with First Responders and Product teams to improve and support tooling for existing applications
Integrating & consolidating application workflows efficiently
Deploying applications to containers using Cloud Run and Tekton pipelines
Delivering a positive web user interface/experience to our internal Ford customers
Architecting, designing, and developing automation to improve resilience, recoverability, availability, and scalability of supported applications
Recognizing, validating, and evangelizing emerging technologies and architectures that align with business objectives
Developing tooling to improve reliability, quality, and time-to-market for software solutions
Measuring and optimizing system performance, with an eye toward pushing our capabilities forward
Identifying and reducing or eliminating toil via automation to maximize the time spent on engineering and innovation
Collaborating with development teams to design, build, and operate scalable and resilient software systems using Cloud native principles
Proactively identifying stability risks and working with engineering leadership to establish appropriate mitigation plans
Regularly reviewing key technical metrics such as transaction errors, logging, response times, caching strategies, conversion/bounce rates, capacity, and resource utilization
Assisting in establishing an SRE mindset to ensure maximum availability/uptime
Conducting performance analysis and optimization of new and in-production systems
Providing technical guidance and mentorship to other team members
Participating in incident response, support, recovery, and postmortem analysis

Requirements

Bachelor's degree in Computer Science, Computer Engineering, Systems Engineering, or a related field, or an equivalent combination of education and work experience
3+ years of experience as a DevOps Engineer and/or Site Reliability Engineer
5+ years of experience programming with one or more of: Python, Go, Java/Scala, C or C++, or similar technologies
3+ years of experience with any APM and other monitoring tools such as Dynatrace, New Relic, ELK, Splunk, Prometheus, Sensu, Nagios, Kafka, Datadog
1+ years with Google Cloud and its library of services

Nice-to-haves

Master's Degree in Computer Science, Computer Engineering, Systems Engineering or related field
Strong experience with J2EE, NoSQL/SQL Datastore, Spring Boot, GCP/AWS/Azure & Docker/K8s in developing multi-tier applications
Experience with automated test-driven development in CI/CD Pipelines
Thorough understanding of software development and agile programming
Understanding of, and ability to implement, effective observability strategies to improve MTTD/MTTR
Experience with RESTful APIs and microservices platforms
Working knowledge of the TCP/IP stack, Internet routing and load balancing
Ability to solve complex architecture, design, and business problems, and to simplify, optimize, and remove bottlenecks

Benefits

Immediate medical, dental, and prescription drug coverage
Flexible family care, parental leave, new parent ramp-up programs, subsidized back-up childcare
Vehicle discount program for employees and family members, and management leases
Tuition assistance
Established and active employee resource groups
Paid time off for individual and team community service
A generous schedule of paid holidays, including the week between Christmas and New Year's Day
Paid time off and the option to purchase additional vacation time
