We are seeking an experienced Data Engineer to design and optimize scalable data pipelines that drive our global data and analytics initiatives.
In this role, you will leverage technologies such as Apache Spark, Airflow, and Python to build high-performance data processing systems and ensure data quality, reliability, and lineage across Mastercard’s data ecosystem.
The ideal candidate combines strong technical expertise with hands-on experience in distributed data systems, workflow automation, and performance tuning to deliver impactful, data-driven solutions at enterprise scale.
Responsibilities:
- Design and optimize Spark-based ETL pipelines for large-scale data processing.
- Build and manage Airflow DAGs for scheduling, orchestration, and checkpointing (a minimal sketch follows this list).
- Implement partitioning and shuffling strategies to improve Spark performance.
- Ensure data lineage, quality, and traceability across systems.
- Develop Python scripts for data transformation, aggregation, and validation.
- Execute and tune Spark jobs using spark-submit.
- Perform DataFrame joins and aggregations for analytical insights (see the PySpark sketch after this list).
- Automate multi-step processes through shell scripting and variable management.
- Collaborate with data, DevOps, and analytics teams to deliver scalable data solutions.
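To give a concrete flavor of the orchestration and deployment responsibilities above, the sketch below shows a minimal Airflow DAG that launches a Spark job through spark-submit. The DAG id, schedule, resource settings, and script path are illustrative assumptions, not an existing pipeline.

```python
# Illustrative sketch only: a minimal Airflow DAG that runs a Spark ETL job
# daily via spark-submit. All ids, paths, and resource settings are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_transactions_etl",      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_etl = BashOperator(
        task_id="run_spark_etl",
        bash_command=(
            "spark-submit "
            "--master yarn --deploy-mode cluster "
            "--num-executors 20 --executor-memory 8g "
            "/opt/jobs/transactions_etl.py --run-date {{ ds }}"  # {{ ds }} = logical run date
        ),
    )
```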
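Similarly, the next sketch illustrates the kind of DataFrame work the role involves: a join followed by an aggregation, with an explicit repartition on the join key to manage shuffle behavior. The table paths, column names, and partition count are assumed for the example.

```python
# Illustrative PySpark sketch: join two DataFrames and aggregate, repartitioning
# on the join key first to control shuffle behavior. Paths, columns, and the
# partition count are assumptions for the example, not a real schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transactions_summary").getOrCreate()

transactions = spark.read.parquet("/data/raw/transactions")  # placeholder path
merchants = spark.read.parquet("/data/raw/merchants")        # placeholder path

summary = (
    transactions
    .repartition(200, "merchant_id")                  # partition on the join key
    .join(merchants, on="merchant_id", how="left")
    .filter(F.col("status") == "SETTLED")             # hypothetical status column
    .groupBy("merchant_id", "merchant_name")
    .agg(
        F.count("*").alias("txn_count"),
        F.sum("amount").alias("total_amount"),
    )
)

summary.write.mode("overwrite").parquet("/data/curated/merchant_summary")
```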
Qualifications:
- Bachelor’s degree in Computer Science, Data Engineering, or a related field (or equivalent experience).
- At least 7 years of experience in data engineering or big data development.
- Strong expertise in Apache Spark architecture, optimization, and job configuration.
- Proven experience authoring, scheduling, checkpointing, and monitoring Airflow DAGs.
- Skilled in data shuffling, partitioning strategies, and performance tuning in distributed systems.
- Expertise in Python programming, including data structures and algorithmic problem-solving.
- Hands-on experience with Spark DataFrames and PySpark transformations, including joins, aggregations, and filters.
- Proficient in shell scripting, including managing and passing variables between scripts.
- Experienced with spark-submit for job deployment and tuning.
- Solid understanding of ETL design, workflow automation, and distributed data systems.
- Excellent debugging and problem-solving skills in large-scale environments.
- Experience with AWS Glue, EMR, Databricks, or similar Spark platforms.
- Knowledge of data lineage and data quality frameworks such as Apache Atlas (a simple validation sketch follows this list).
- Familiarity with CI/CD pipelines, Docker/Kubernetes, and data governance tools.
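As a small illustration of the data quality expectations above, the sketch below runs basic validation checks (row count, null keys, duplicate keys) on a PySpark DataFrame. The column names, path, and thresholds are hypothetical; in practice such rules would be wired into the team’s data quality tooling.

```python
# Illustrative sketch: basic data quality checks on a PySpark DataFrame.
# Column names, the input path, and thresholds are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()
df = spark.read.parquet("/data/curated/merchant_summary")  # placeholder path

total_rows = df.count()
null_keys = df.filter(F.col("merchant_id").isNull()).count()
duplicate_keys = (
    df.groupBy("merchant_id").count().filter(F.col("count") > 1).count()
)

assert total_rows > 0, "dataset is empty"
assert null_keys == 0, f"{null_keys} rows have a null merchant_id"
assert duplicate_keys == 0, f"{duplicate_keys} duplicate merchant_id values"
```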