Ccube - Columbus, OH

posted 3 months ago

Full-time
Columbus, OH

About the position

We are seeking a skilled Python Spark Developer to join our team in Columbus, OH. This is a full-time role that requires onsite presence from day one on a hybrid schedule of three days per week. The primary responsibility of the Python Spark Developer is to develop and maintain data platforms using Python, Spark, and PySpark, and to migrate existing data processes to PySpark on AWS, ensuring that the transition is smooth and efficient.

In this role, you will design and implement robust data pipelines for processing large datasets. Working closely with AWS and Big Data technologies, you will write unit tests for Spark transformations and helper methods to ensure the reliability and performance of data processing tasks. You will also create Scala/Spark jobs for data transformation and aggregation, which are critical to our data analytics initiatives, and write Scaladoc-style documentation so the code remains clear and maintainable for the team. Optimizing Spark queries for performance is another key part of the role, as it directly affects the efficiency of our data processing workflows.

You will also integrate with various SQL databases, including Microsoft SQL Server, Oracle, Postgres, and MySQL, which requires a solid understanding of database management and query optimization. A strong grasp of distributed systems concepts, such as the CAP theorem, partitioning, replication, consistency, and consensus, is essential for success in this position. This role offers an exciting opportunity to work with cutting-edge technologies in a collaborative environment, contributing to the development of innovative data solutions.
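
For illustration only, below is a minimal sketch of the kind of PySpark pipeline this role involves: reading from a relational source over JDBC, applying an aggregation, and writing the result to S3. The table names, columns, connection details, and bucket paths are hypothetical and are not part of this posting.

    # Minimal PySpark pipeline sketch (hypothetical names and paths throughout).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-daily-aggregation").getOrCreate()

    # Read a source table over JDBC (hypothetical Postgres connection details).
    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db.example.com:5432/sales")
        .option("dbtable", "public.orders")
        .option("user", "etl_user")
        .option("password", "***")
        .load()
    )

    # Aggregate order totals per customer per day.
    daily_totals = (
        orders
        .withColumn("order_date", F.to_date("created_at"))
        .groupBy("customer_id", "order_date")
        .agg(
            F.sum("amount").alias("total_amount"),
            F.count("*").alias("order_count"),
        )
    )

    # Write the result to S3 as partitioned Parquet (hypothetical bucket).
    (
        daily_totals.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3a://example-data-lake/curated/daily_order_totals/")
    )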

Responsibilities

  • Develop and maintain data platforms using Python, Spark, and PySpark.
  • Handle migration to PySpark on AWS.
  • Design and implement data pipelines.
  • Produce unit tests for Spark transformations and helper methods (see the sketch after this list).
  • Create Scala/Spark jobs for data transformation and aggregation.
  • Write Scaladoc-style documentation for code.
  • Optimize Spark queries for performance.
  • Integrate with SQL databases (e.g., Microsoft SQL Server, Oracle, Postgres, MySQL).
  • Understand distributed systems concepts (CAP theorem, partitioning, replication, consistency, and consensus).
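
As an illustration of the unit-testing expectation above, here is a minimal pytest-style sketch of a test for a small Spark transformation, using a local SparkSession. The helper function, column names, and test data are hypothetical.

    # Hypothetical transformation and pytest-style unit test (local SparkSession).
    import pytest
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F


    def add_order_total(df):
        """Hypothetical helper: add a line-item total column (price * quantity)."""
        return df.withColumn("total", F.col("price") * F.col("quantity"))


    @pytest.fixture(scope="session")
    def spark():
        # Local session so the test runs without a cluster.
        return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


    def test_add_order_total(spark):
        df = spark.createDataFrame([(10.0, 2), (3.5, 4)], ["price", "quantity"])
        result = add_order_total(df).select("total").collect()
        assert [row["total"] for row in result] == [20.0, 14.0]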

Requirements

  • Proficiency in Python, Scala (with a focus on functional programming), and Spark.
  • Familiarity with Spark APIs, including RDD, DataFrame, MLlib, GraphX, and Streaming.
  • Experience working with HDFS, S3, Cassandra, and/or DynamoDB.
  • Deep understanding of distributed systems.
  • Experience building or maintaining cloud-native applications.
  • Familiarity with serverless approaches using AWS Lambda is a plus.
  • Bachelor's Degree in Computer Science/Programming or similar is preferred.