Ford - Santa Fe, NM

posted 3 months ago

Full-time
Santa Fe, NM
Transportation Equipment Manufacturing

About the position

Site Reliability Engineering at Ford Motor Company plays a critical role in maintaining and improving the reliability, scalability, and performance of our services. As a Site Reliability Engineer (SRE), you will work closely with our development teams to build and maintain large-scale, distributed systems, ensuring that our products meet our high standards for availability and user experience. Your contributions will be vital in enhancing the overall performance of our services, and you will be expected to write, configure, and deploy code that improves service reliability for both existing and new systems. You will set standards for others regarding code quality and provide helpful and actionable feedback on code or production changes.

In this role, you will drive the repair and optimization of complex systems, taking into account a wide range of contributing factors. You will lead debugging, troubleshooting, and analysis of service architecture and design, and participate in an on-call rotation to address any issues that arise. Documentation is also a key aspect of this position; you will be responsible for writing design documents, system analyses, runbooks, and playbooks, while also providing design feedback to help elevate the design skills of your colleagues.

You will implement and manage monitoring solutions using tools such as Dynatrace, Splunk, and OpenTelemetry to ensure visibility and proactive issue detection across our platforms. Working within the Google Cloud Platform (GCP) infrastructure, you will optimize performance and cost while scaling resources to meet demand. Collaboration with development teams will be essential as you enhance system reliability and performance, applying a platform engineering mindset to system administration tasks. Additionally, you will develop and maintain automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery. Troubleshooting and resolving issues in our development, testing, and production environments will be part of your daily responsibilities, and you will participate in postmortem analyses to create preventative measures for future incidents.

Responsibilities

  • Write, configure, and deploy code that improves service reliability for existing or new systems; set the standard for others with respect to code quality
  • Provide helpful and actionable feedback and review for code or production changes
  • Drive repair/optimization of complex systems with consideration towards a wide range of contributing factors
  • Lead debugging, troubleshooting, and analysis of service architecture and design
  • Participate in on-call rotation
  • Write documentation: design documents, system analyses, runbooks, and playbooks. Provide design feedback and help raise the design skills of others
  • Implement and manage monitoring solutions using Dynatrace, Splunk, and OpenTelemetry to ensure visibility and proactive issue detection across our platforms
  • Work within GCP infrastructure, optimizing performance and cost and scaling resources to meet demand
  • Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks
  • Develop and maintain automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery
  • Troubleshoot and resolve issues in our dev, test, and production environments
  • Participate in postmortem analysis and create preventative measures for future incidents

Requirements

  • Bachelor's degree in Computer Science, Engineering, or equivalent experience
  • 3+ years of experience as an SRE, DevOps Engineer, or in a similar role
  • Strong experience with monitoring and observability tools, particularly Dynatrace and OpenTelemetry
  • Proficient with cloud services, with a strong preference for Google Cloud Platform (GCP) experience
  • Solid programming skills in Java, with a good understanding of software development best practices
  • Experience managing and optimizing PostgreSQL databases
  • Familiarity with front-end development frameworks, particularly React
  • Ability to debug, optimize code, and automate routine tasks
  • Strong problem-solving skills and the ability to work under pressure in a fast-paced environment
  • Excellent verbal and written communication skills