Ford - Lansing, MI

posted 3 months ago

Full-time - Mid Level
Onsite - Lansing, MI
Transportation Equipment Manufacturing

About the position

Ford Motor Company is seeking an experienced DevOps and Site Reliability Engineer (SRE) to join our Monitoring as a Service (MaaS) Team. This role is pivotal in leading the development, enhancement, and extension of our global monitoring and observability platform. As a DevOps/SRE, you will combine software engineering and systems engineering disciplines to ensure that our software systems are available, scalable, and maintainable. You will play a crucial role in addressing the evolving needs of our customers, which includes code and pipeline development, establishing best practices with associated templates, and automating processes to reduce toil and facilitate adoption.

In this position, you will be responsible for constructing API libraries and automation scripts based on existing project workflows, primarily using Python. You will consult with product teams to onboard new applications to various monitoring applications such as Splunk, Dynatrace, and VictorOps. Additionally, you will work closely with first responders and product teams to improve and support tooling for existing applications, which may include participating in an on-call rotation schedule for incident management. Your role will also involve integrating and consolidating application workflows efficiently, deploying applications to containers using Cloud Run and Tekton pipelines, and delivering a positive web user interface/experience to our internal Ford customers.

You will leverage your strong background in software development and systems administration to perform destructive testing to discover vulnerabilities and to architect and develop automation that improves the resilience, recoverability, availability, and scalability of supported applications. You will also be responsible for measuring and optimizing system performance, proactively identifying stability risks, and collaborating with development teams to design, build, and operate scalable and resilient software systems using cloud-native principles. Your contributions will help establish an SRE mindset within the team to ensure maximum availability and uptime, and you will conduct performance analysis and provide technical guidance and mentorship to other team members.

Responsibilities

  • Constructing API Libraries & automation scripts based on existing project workflows, mainly developing in Python
  • Consulting with Product Teams to onboard new applications to Splunk, Dynatrace, VictorOps, and other Monitoring Applications
  • Working with First Responders and Product Teams to improve and support tooling for existing applications, including participating in an on-call rotation schedule for incident management
  • Integrating & consolidating application workflows efficiently
  • Deploying applications to containers using Cloud Run and Tekton pipelines
  • Delivering a positive web user interface/experience to internal Ford customers
  • Performing destructive testing to seek and discover vulnerabilities
  • Architecting, designing, and developing automation to improve resilience, recoverability, availability, and scalability of supported applications
  • Recognizing, validating, and evangelizing emerging technologies and architectures that align with business objectives
  • Developing tooling to improve reliability, quality, and time-to-market for software solutions
  • Measuring and optimizing system performance
  • Identifying and reducing or eliminating toil via automation to maximize engineering and innovation time
  • Collaborating with development teams to design, build, and operate scalable and resilient software systems using cloud-native principles
  • Proactively identifying stability risks and working with engineering leadership to establish appropriate mitigation plans
  • Regularly reviewing key technical metrics such as transaction errors, logging, response times, caching strategies, conversion/bounce rates, capacity, and resource utilization
  • Assisting in establishing an SRE mindset to ensure maximum availability/uptime
  • Conducting performance analysis and optimization of new and in-production systems
  • Providing technical guidance and mentorship to other team members
  • Participating in incident response, support, recovery, and postmortem analysis

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, Systems Engineering, or a related field, or a combination of education and equivalent work experience
  • 3+ years of experience as a DevOps Engineer and/or Site Reliability Engineer
  • 5+ years of experience programming in one or more of Python, Go, Java/Scala, C, C++, or similar languages
  • 3+ years of experience with APM and other monitoring tools such as Dynatrace, New Relic, ELK, Splunk, Prometheus, Sensu, Nagios, Kafka, DataDog
  • 1+ years with Google Cloud and its library of services

Nice-to-haves

  • Master's degree in Computer Science, Computer Engineering, Systems Engineering, or a related field
  • Strong experience with J2EE, NoSQL/SQL datastores, Spring Boot, GCP/AWS/Azure, and Docker/Kubernetes in developing multi-tier applications
  • Experience with automated test-driven development in CI/CD Pipelines
  • Thorough understanding of software development and agile programming
  • Understanding of, and ability to implement, effective observability strategies to improve MTTD/MTTR
  • Experience with RESTful APIs and microservices platforms
  • Working knowledge of the TCP/IP stack, internet routing and load balancing
  • Ability to solve complex architecture, design, and business problems, and to simplify, optimize, and remove bottlenecks

Benefits

  • Immediate medical, dental, and prescription drug coverage
  • Flexible family care, parental leave, new parent ramp-up programs, subsidized back-up childcare and more
  • Vehicle discount program for employees and family members, and management leases
  • Tuition assistance
  • Established and active employee resource groups
  • Paid time off for individual and team community service
  • A generous schedule of paid holidays, including the week between Christmas and New Year's Day
  • Paid time off and the option to purchase additional vacation time