CTG - Morrisville, NC

posted about 2 months ago

Full-time - Mid Level
Morrisville, NC
5,001-10,000 employees
Professional, Scientific, and Technical Services

About the position

CTG is seeking a Data Engineer for our client in Morrisville, NC. The position is part of a team developing large language models (LLMs) and multimodal LLMs, particularly for European languages. The primary goal is to work on the data side of building robust multilingual AI models. Candidates must be proficient in English and at least one of the following languages: German, Italian, French, or Portuguese.

The Data Engineer will develop and maintain web scraping and data extraction processes to gather large-scale text and image data from various sources. The role involves cleaning, preprocessing, and tagging data to ensure its quality and usability, and working with data formats such as Parquet, JSONL, and CSV to ensure efficient storage and retrieval. It also requires implementing and optimizing data pipelines for high-volume data processing.

Collaboration with data scientists and machine learning engineers is essential to support the evaluation and improvement of large language models. The engineer is also expected to stay current with the latest research and advancements in data engineering, web scraping, and machine learning, and to participate actively in academic research and reading groups.

Strong proficiency in Python and a solid understanding of HTML, JSON, and web technologies are crucial. A Master's degree is required, along with 2-4 years of relevant experience. Excellent verbal and written communication skills in English are necessary, as the candidate will interact with a diverse group of professionals.

Responsibilities

  • Develop and maintain web scraping and data extraction processes to gather large-scale text and image data from diverse sources.
  • Clean, preprocess, and tag text and image data to ensure data quality and usability.
  • Work with different data formats such as Parquet, JSONL, and CSV, ensuring efficient data storage and retrieval.
  • Collaborate with data scientists and machine learning engineers to support the evaluation and improvement of large language models.
  • Stay up to date with the latest research and advancements in data engineering, web scraping, and machine learning.
  • Implement and optimize data pipelines for high-volume data processing.

Requirements

  • Master's degree required
  • 2-4 years of relevant experience
  • Strong proficiency in Python
  • Solid understanding of HTML, JSON, and web technologies
  • Excellent verbal and written English communication skills
  • Proficiency in at least one of the following languages: German, Italian, French, or Portuguese