CTG - Morrisville, NC
posted about 2 months ago
CTG is seeking a Data Engineer for our client in Morrisville, NC. This position is part of a team developing large language models (LLMs) and multimodal LLMs, particularly for European languages. The primary goal is to work on the data side to help build robust multilingual AI models. Candidates must be proficient in English and at least one of the following languages: German, Italian, French, or Portuguese.

The Data Engineer will develop and maintain web scraping and data extraction processes to gather large-scale text and image data from various sources. The role involves cleaning, preprocessing, and tagging data to ensure its quality and usability, and working with data formats such as Parquet, JSONL, and CSV to ensure efficient storage and retrieval. It also requires implementing and optimizing data pipelines for high-volume data processing. Collaboration with data scientists and machine learning engineers is essential to support the evaluation and improvement of large language models. The engineer is also expected to stay current with the latest research and advancements in data engineering, web scraping, and machine learning, and to participate actively in academic research and reading groups.

Strong proficiency in Python and a solid understanding of HTML, JSON, and web technologies are crucial for success in this position. A Master's degree is required, along with 2-4 years of relevant experience. Excellent verbal and written communication skills in English are necessary, as the candidate will interact with a diverse group of professionals.