As a Data Engineer on our Development Team, you will design, develop, and optimize data pipelines within an AWS ecosystem for a US-based B2B marketplace company. You will use your PySpark expertise to process large-scale datasets and to keep our data systems reliable and performant. You will collaborate with cross-functional teams, including data scientists and analysts, to deliver high-impact solutions that support business objectives.
Key Responsibilities:
- Design, develop, and implement data pipelines using PySpark within AWS environments
- Leverage AWS services such as S3, Glue, EMR, Lambda, and Redshift for building scalable data solutions
- Optimize PySpark workflows for performance, reliability, and cost-efficiency
- Collaborate with stakeholders to understand data requirements and translate them into technical solutions
- Ensure data quality and integrity through robust testing and monitoring processes
- Implement data governance, security, and compliance best practices in all development activities
- Document technical designs, processes, and workflows to support ongoing maintenance and team knowledge sharing
Requirements:
- Bachelor’s degree in Computer Science, Engineering, or a related field
- 4+ years of experience in data engineering, with a focus on building and optimizing data pipelines using PySpark
- Strong experience with AWS services, including S3, Glue, Lambda, EMR, and Redshift
- Proficiency in Python programming and familiarity with related frameworks and libraries
- Solid understanding of distributed computing and experience with Apache Spark
- Hands-on experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation) is a plus
- Strong analytical and problem-solving skills, with attention to detail and a proactive approach to troubleshooting
- Excellent communication and collaboration skills, with the ability to work in a dynamic, team-oriented environment
- Upper-intermediate English proficiency