We are seeking a highly skilled PySpark Developer with at least 5 years of experience in big data processing and analytics. The ideal candidate will design, implement, and optimize large-scale data processing pipelines using Apache Spark and Python.
Responsibilities:
- Develop, test, and maintain PySpark-based ETL pipelines to process and analyze large datasets (a minimal illustrative sketch follows this list).
- Collaborate with data engineers, data scientists, and business stakeholders to understand data requirements and design optimal solutions.
- Optimize PySpark applications for performance and scalability in distributed computing environments.
- Work with Hadoop-based data platforms and integrate with ecosystem tools such as Hive, HDFS, and Kafka.
- Ensure data quality and integrity through robust validation and monitoring practices.
- Debug and resolve issues in production and pre-production environments.
- Document technical solutions and best practices.
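
To give candidates a concrete sense of the work, here is a minimal sketch of a batch ETL pipeline of the kind described above. The bucket paths, column names, and aggregation logic are purely hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point for any PySpark application.
spark = SparkSession.builder.appName("events-etl").getOrCreate()

# Extract: read raw JSON events (path is hypothetical).
raw = spark.read.json("s3://example-bucket/raw/events/")

# Transform: drop malformed rows, derive a date column, aggregate per day.
daily = (
    raw.filter(F.col("event_id").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
       .groupBy("event_date", "event_type")
       .agg(F.count("*").alias("event_count"))
)

# Load: write partitioned Parquet for downstream consumers (e.g., Hive).
daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/curated/daily_event_counts/"
)

spark.stop()
```

In practice, pipelines like this also carry the validation, monitoring, and performance-tuning work (partitioning, caching, join strategies) called out in the responsibilities above.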
Technical Skills:
- 5+ years of experience in data engineering or big data development, with a strong focus on PySpark.
- Proficiency in Python programming, with experience in libraries commonly used in data processing (e.g., Pandas, NumPy).
- Strong understanding of Apache Spark concepts: Spark Core, Spark SQL, and Spark Streaming (see the Spark SQL sketch after this list).
- Experience with distributed data processing frameworks and working in cloud-based environments (e.g., AWS, Azure, GCP).
- Solid knowledge of big data technologies like Hadoop, Hive, HDFS, Kafka, or Airflow.
- Hands-on experience with relational and NoSQL databases (e.g., PostgreSQL, Cassandra).
- Familiarity with CI/CD pipelines and version control (e.g., Git).
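
As a small illustration of the Spark SQL and Python-library proficiency listed above, here is a sketch that registers a hypothetical `orders` DataFrame as a SQL view and pulls a small aggregate into pandas:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Hypothetical orders data registered as a temporary SQL view.
orders = spark.createDataFrame(
    [(1, "books", 12.99), (2, "games", 59.99), (3, "books", 8.50)],
    ["order_id", "category", "amount"],
)
orders.createOrReplaceTempView("orders")

# Spark SQL runs on the same distributed engine as the DataFrame API.
totals = spark.sql(
    "SELECT category, SUM(amount) AS revenue FROM orders GROUP BY category"
)

# Small results can be pulled into pandas for local analysis.
print(totals.toPandas())

spark.stop()
```

Note that `toPandas()` collects results to the driver, so it is only appropriate for small result sets.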
Soft Skills:
- Strong analytical and problem-solving skills.
- Ability to work collaboratively in a team and communicate technical concepts effectively.
- Detail-oriented, with a commitment to delivering high-quality code.
Preferred Qualifications:
- Experience with streaming data using Spark Streaming or Kafka (see the streaming sketch after this list).
- Knowledge of machine learning workflows and integration with big data pipelines.
- Understanding of containerization with Docker and orchestration with Kubernetes.
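
For the streaming qualification above, here is a minimal Structured Streaming sketch (the current successor to the DStream-based Spark Streaming API) that consumes from Kafka. The broker address and topic name are hypothetical, and the job assumes the `spark-sql-kafka-0-10` connector package is available on the cluster:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Read a continuous stream of records from a Kafka topic
# (broker and topic names are hypothetical).
stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events")
         .load()
)

# Kafka values arrive as bytes; cast to string before processing.
events = stream.select(F.col("value").cast("string").alias("payload"))

# Write each micro-batch to the console; a production job would
# target a sink such as Parquet, a database, or another topic.
query = events.writeStream.format("console").start()
query.awaitTermination()
```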