Job Title: Data Engineer
Location: Philadelphia, PA (Hybrid)
Experience: 5+ years
Role Summary:
We are seeking an experienced Data Engineer with strong expertise in PySpark and data pipeline operations. This role focuses heavily on performance tuning Spark applications, managing large-scale data pipelines, and ensuring high operational stability. The ideal candidate is a strong technical problem-solver, highly collaborative, and proactive in automation and process improvements.
Key Responsibilities:
Data Pipeline Management & Support
- Operate and support Business-as-Usual (BAU) data pipelines, ensuring stability, SLA adherence, and timely incident resolution.
- Identify and implement opportunities for optimization and automation across pipelines and operational workflows.
Spark Development & Performance Tuning
- Design, develop, and optimize PySpark jobs for efficient large-scale data processing.
- Diagnose and resolve complex Spark performance issues such as data skew, shuffle spill, executor OOM errors, slow-running stages, and partition imbalance.
Platform & Tool Management
- Use Databricks for Spark job orchestration, workflow automation, and cluster configuration.
- Debug and manage Spark on Kubernetes, addressing pod crashes, OOM kills, resource tuning, and scheduling problems.
- Work with MinIO/S3 storage for bucket management, permissions, and large-volume file ingestion and retrieval.
Collaboration & Communication
- Partner with onshore business stakeholders to clarify requirements and convert them into well-defined technical tasks.
- Provide daily coordination and technical oversight to offshore engineering teams.
- Participate actively in design discussions and technical reviews.
Documentation & Operational Excellence
- Maintain accurate and detailed documentation, runbooks, and troubleshooting guides.
- Contribute to process improvements that enhance operational stability and engineering efficiency.
Required Skills & Qualifications:
Primary Skills (Must-Have)
- PySpark: Advanced proficiency in transformations, performance tuning, and Spark internals.
- SQL: Strong analytical query design, performance tuning, and foundational data modeling (relational & dimensional).
- Python: Ability to write maintainable, production-grade code with a focus on modularity, automation, and reusability.
Secondary Skills (Highly Desirable)
- Kubernetes: Experience with Spark-on-K8s, including pod diagnostics, resource configuration, and log/monitoring tools.
- Databricks: Hands-on experience with cluster management, workflow creation, Delta Lake optimization, and job monitoring.
- MinIO/S3: Familiarity with bucket configuration, policies, and efficient ingestion patterns.
- DevOps: Experience with Git, CI/CD, and cloud environments (Azure preferred).