Merck KGaA Darmstadt Germany - North Wales, PA

posted 4 months ago

Full-time - Senior
Onsite - North Wales, PA
Chemical Manufacturing

About the position

As a Senior Specialist in Data Engineering at Merck & Co., Inc., you will play a crucial role in designing, developing, and maintaining data pipelines that extract data from various sources to populate our data lake and data warehouse. The position requires collaboration with the data governance team to implement data quality checks and maintain data catalogs, ensuring the integrity and usability of our data assets. You will use orchestration, logging, and monitoring tools to build resilient data pipelines, applying test-driven development practices when building ELT/ETL pipelines. A strong understanding of concepts such as data lakes, data warehouses, lakehouses, data meshes, and data fabrics is essential for this role.

In addition to pipeline development, you will develop data models for cloud data warehouses such as Redshift and Snowflake, and create pipelines to ingest data into these environments. You will analyze data using SQL and collaborate with Data Analysts, Data Scientists, and Machine Learning Engineers to identify and transform data for ingestion, exploration, and modeling.

You will leverage serverless AWS services such as Glue, Lambda, and Step Functions, and use Terraform to deploy infrastructure on AWS. Containerizing Python code with Docker will also be a key part of the role, along with version control in Git and familiarity with common branching strategies. You will build pipelines that handle large datasets using PySpark, develop proofs of concept in Jupyter Notebooks, and create technical documentation as needed. The position calls for a proactive approach to problem-solving and a commitment to high standards of data quality and governance.

Responsibilities

  • Design, develop, and maintain data pipelines that extract data from a variety of sources to populate the data lake and data warehouse.
  • Work with the data governance team to implement data quality checks and maintain data catalogs.
  • Use orchestration, logging, and monitoring tools to build resilient pipelines.
  • Apply a test-driven development methodology when building ELT/ETL pipelines.
  • Understand and apply concepts such as data lake, data warehouse, lakehouse, data mesh, and data fabric where relevant.
  • Develop data models for cloud data warehouses like Redshift and Snowflake.
  • Develop pipelines to ingest data into cloud data warehouses.
  • Analyze data using SQL and collaborate with Data Analysts, Data Scientists, and Machine Learning Engineers to identify and transform data for ingestion, exploration, and modeling.
  • Use serverless AWS services like Glue, Lambda, and Step Functions.
  • Use Terraform to deploy infrastructure on AWS.
  • Containerize Python code using Docker.
  • Use Git for version control and understand various branching strategies.
  • Build pipelines that handle large datasets using PySpark (see the sketch after this list).
  • Develop proof of concepts using Jupyter Notebooks and create technical documentation as needed.
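The PySpark pipeline work listed above typically amounts to reading raw files from a landing zone, applying light cleansing, and writing partitioned output to the data lake. The following is a minimal, illustrative sketch only; the bucket paths, column names, and cleansing steps are placeholders, not details from this posting.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-ingest").getOrCreate()

    # Hypothetical S3 landing-zone path; in practice this would come from job configuration.
    raw = spark.read.option("header", True).csv("s3://example-landing/orders/")

    cleaned = (
        raw.dropDuplicates(["order_id"])                        # basic data-quality check
           .withColumn("order_ts", F.to_timestamp("order_ts"))  # normalize types
           .withColumn("order_date", F.to_date("order_ts"))     # derive a partition column
    )

    # Write to a curated zone of the data lake, partitioned for downstream SQL and modeling work.
    (cleaned.write
            .mode("overwrite")
            .partitionBy("order_date")
            .parquet("s3://example-curated/orders/"))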

Requirements

  • Bachelor's degree or equivalent in Mathematics, Computer Science, Engineering, Artificial Intelligence, or a related field, and 5 years of experience in the position offered or a related role.
  • 5 years of experience with SQL and PySpark; Git, Docker, and Terraform; Agile methodology; and Jenkins pipelines.
  • 1 year of experience with feature engineering pipelines and their reusability in both training and inference stages; creating Docker images for ML models and custom Python scripts; operationalizing and governing machine learning models using AWS SageMaker; designing, developing, and maintaining pipelines using Python and serverless AWS services; and AWS services such as S3, ECS, Fargate, Glue, Step Functions, CloudWatch, Lambda, and EMR.