DevOps / Site Reliability Engineer

Ford - Trenton, NJ

posted 3 months ago

Full-time - Entry Level

Onsite - Trenton, NJ

Transportation Equipment Manufacturing

About the position

We are the movers of the world and the makers of the future. We get up every day, roll up our sleeves and build a better world -- together. At Ford, we're all a part of something bigger than ourselves. Are you ready to change the way the world moves?

Enterprise Technology plays a critical part in shaping the future of mobility. If you're looking for the chance to leverage advanced technology to redefine the transportation landscape, enhance the customer experience and improve people's lives, this is the opportunity for you. Join us and challenge your IT expertise and analytical skills to help create vehicles that are as smart as you are.

The Monitoring as a Service (MaaS) Team is building and evolving its services with customers in mind. MaaS will enable teams to modernize and disrupt by providing robust monitoring tools powered by AI and easy-to-use dashboards. Monitoring increases transparency of applications' performance end-to-end, regardless of hosting location (on-prem or in the cloud), which means a better view into how we can proactively manage our apps and improve performance.

In this position, we are seeking an experienced DevOps and Site Reliability Engineer (SRE) to join our team and lead the development, enhancement, and extension of our global monitoring and observability platform. As a DevOps/SRE, your role will combine software engineering and systems engineering disciplines to ensure that software systems are available, scalable, and maintainable. This individual will play a pivotal role in shaping the evolving needs of our customers, including code and pipeline development, best practices with associated templates, and automation to remove toil and facilitate adoption.

Please note: this job is posted as remote. If the selected candidate lives within 50 miles of Dearborn, MI, it may require a hybrid onsite schedule of up to 60% of the time.

Responsibilities

Constructing API Libraries & automation scripts based on existing project workflows, mainly developing in Python
Consulting with Product Teams to onboard new applications to Splunk, Dynatrace, VictorOps, and other Monitoring Applications
Work with First Responders and Product Teams to improve and support tooling for existing applications; may include participating in an on-call rotation for incident management
Integrating & consolidating application workflows efficiently
Deploying applications to containers using Cloud Run and Tekton pipelines
Delivering a positive web user interface/experience to our internal Ford customers
Leverage experience to safely perform destructive testing to seek and discover vulnerabilities
Architect, design and develop automation to improve resilience, recoverability, availability, and scalability of supported applications
Recognize, validate, and evangelize emerging technologies and architectures that align with business objectives
Develop tooling to improve reliability, quality, and time-to-market for software solutions
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
Identify and reduce or eliminate toil via automation to maximize the time spent on engineering and innovation
Collaborate with development teams to design, build, and operate scalable and resilient software systems using Cloud native principles
Proactively identify stability risks and work with engineering leadership to establish appropriate mitigation plans
Regularly review key technical metrics such as transaction errors, logging, response times, caching strategies, conversion/bounce rates, capacity, and resource utilization
Assist in establishing an SRE mindset to ensure maximum availability/uptime

Requirements

Strong background in software development and systems administration
Excellent problem-solving, troubleshooting, and communication skills
Experience in constructing API Libraries and automation scripts, particularly in Python
Familiarity with monitoring applications such as Splunk, Dynatrace, and VictorOps
Experience in deploying applications to containers using Cloud Run and Tekton pipelines
Ability to conduct performance analysis and optimization of systems
Experience in developing automation to improve application resilience and scalability
Knowledge of Cloud native principles and practices

Nice-to-haves

Experience with AI-powered monitoring tools
Familiarity with incident management processes
Understanding of software development best practices
Experience in working with cross-functional teams

Benefits

Health insurance
401k retirement plan
Paid time off
Flexible work hours
Professional development opportunities
