Title : Databricks and AWS Focused Data Engineer
Location : Columbus, OH - Onsite
Overview :
We are seeking an experienced data engineer to deliver high-quality, scalable data solutions on Databricks and AWS for one of our Big Four clients. You will build and optimize pipelines, implement medallion architecture, integrate streaming and batch sources, and enforce strong governance and access controls to support analytics and ML use cases.
Key Responsibilities :
- Build and Maintain Data Pipelines : Develop scalable data pipelines using PySpark and Spark within the Databricks environment (see the ingestion sketch after this list).
- Implement Medallion Architecture : Design workflows using raw, trusted, and refined layers to drive reliable data processing.
- Integrate Diverse Data Sources : Connect data from Kafka streams, file-based extracts, and APIs.
- Data Cataloging and Governance : Model and register datasets in enterprise data catalogs, ensuring robust governance and accessibility.
- Access Control : Manage secure, role-based access patterns to support analytics, AI, and ML needs.
- Team Collaboration : Work closely with peers to achieve required code coverage and deliver high-quality, well-tested solutions.
- Optimize and Operationalize : Tune Spark jobs (partitioning, caching, broadcast joins, AQE), manage Delta Lake performance (Z-Ordering, OPTIMIZE, VACUUM), and implement cost and reliability best practices on AWS (see the tuning sketch after this list).
- Data Quality and Testing : Implement data quality checks and validations (e.g., Great Expectations, custom PySpark checks), unit / integration tests, and CI / CD for Databricks Jobs / Workflows (see the data quality sketch after this list).
- Infrastructure as Code : Provision and manage Databricks and AWS resources using Terraform (workspaces, clusters, jobs, secret scopes, Unity Catalog objects, S3, IAM).
- Monitoring and Observability : Set up logging, metrics, and alerts (CloudWatch, Datadog, Databricks audit logs) for pipelines and jobs.
- Documentation : Produce clear technical documentation, runbooks, and data lineage for governed datasets.
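To make the pipeline and medallion responsibilities above concrete, here is a minimal ingestion sketch: it lands a Kafka topic into a bronze (raw) Delta table with Structured Streaming, then builds a trusted (silver) table in batch. All names are illustrative assumptions, not the client's actual objects: the topic "orders", the broker address, the S3 checkpoint path, the tables demo.bronze_orders / demo.silver_orders, and the parsed schema.

```python
# Minimal bronze -> silver sketch; all names (topic, broker, bucket, tables, schema)
# are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # on Databricks this session already exists

# Bronze (raw): land the Kafka payload as-is with ingestion metadata.
bronze_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")     # placeholder broker
    .option("subscribe", "orders")                        # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .select(
        F.col("key").cast("string"),
        F.col("value").cast("string").alias("raw_payload"),
        F.col("timestamp").alias("ingest_ts"),
    )
)

(
    bronze_stream.writeStream.format("delta")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/bronze_orders")  # placeholder path
    .trigger(availableNow=True)    # process what is currently available, then stop
    .outputMode("append")
    .toTable("demo.bronze_orders")
    .awaitTermination()
)

# Silver (trusted): parse, validate, and deduplicate in batch.
silver = (
    spark.read.table("demo.bronze_orders")
    .withColumn(
        "order",
        F.from_json("raw_payload", "order_id STRING, customer_id STRING, amount DOUBLE, ts TIMESTAMP"),
    )
    .select("order.*", "ingest_ts")
    .filter(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
)
silver.write.format("delta").mode("overwrite").saveAsTable("demo.silver_orders")
```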
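The tuning responsibility spans several techniques; this sketch shows a representative subset only: enabling AQE, broadcasting a small dimension table, and running OPTIMIZE / ZORDER and VACUUM on a Delta table. The tables demo.dim_customer and demo.gold_orders_enriched and the join column customer_id are assumptions for illustration, and the SQL maintenance commands presume a Databricks runtime with Delta Lake.

```python
# Representative tuning and maintenance steps; table and column names are assumed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes a Databricks runtime with Delta Lake

# Adaptive query execution (on by default in recent runtimes, shown explicitly here).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Broadcast the small dimension table so the large fact table avoids a shuffle join.
orders = spark.read.table("demo.silver_orders")
customers = spark.read.table("demo.dim_customer")
enriched = orders.join(F.broadcast(customers), "customer_id", "left")
enriched.write.format("delta").mode("overwrite").saveAsTable("demo.gold_orders_enriched")

# Compact small files, co-locate rows on a frequent filter column, and clean up
# files outside the default retention window.
spark.sql("OPTIMIZE demo.silver_orders ZORDER BY (order_id)")
spark.sql("VACUUM demo.silver_orders")
```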
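For the data quality and testing responsibility, the sketch below implements a small custom PySpark null-rate check with pytest-style unit tests. The function names and thresholds are hypothetical, not a prescribed framework; Great Expectations could serve the same role.

```python
# A custom null-rate check plus pytest-style tests; names and thresholds are hypothetical.
import pytest
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def null_rate(df: DataFrame, column: str) -> float:
    """Return the fraction of rows where `column` is null."""
    total = df.count()
    if total == 0:
        return 0.0
    return df.filter(F.col(column).isNull()).count() / total


def assert_null_rate_below(df: DataFrame, column: str, threshold: float) -> None:
    """Fail the run when nulls in `column` exceed the allowed threshold."""
    rate = null_rate(df, column)
    if rate > threshold:
        raise ValueError(f"{column}: null rate {rate:.2%} exceeds threshold {threshold:.2%}")


def _local_spark() -> SparkSession:
    return SparkSession.builder.master("local[1]").appName("dq-tests").getOrCreate()


def test_null_rate_counts_null_fraction():
    df = _local_spark().createDataFrame([("a", 1), ("b", None), ("c", None)], ["id", "amount"])
    assert null_rate(df, "amount") == pytest.approx(2 / 3)


def test_assert_null_rate_below_raises_on_dirty_data():
    df = _local_spark().createDataFrame([("a", 1), ("b", None), ("c", None)], ["id", "amount"])
    with pytest.raises(ValueError):
        assert_null_rate_below(df, "amount", threshold=0.1)
```

In a pipeline notebook, a call such as assert_null_rate_below(silver, "order_id", 0.0) could gate publication of the trusted table; the exact gates and coverage targets would follow the team's CI / CD conventions.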
Required Skills & Qualifications :
- Databricks : 6-9 years of experience with expert-level proficiency
- PySpark / Spark : 6-9 years of advanced hands-on experience
- AWS : 6-9 years of experience with strong competency, including S3 and Terraform for infrastructure-as-code
- Data Architecture : Solid knowledge of the medallion pattern and data warehousing best practices
- Data Pipelines : Proven ability to build, optimize, and govern enterprise data pipelines
- Delta Lake and Unity Catalog : Expertise in Delta Lake internals, time travel, schema evolution / enforcement, and Unity Catalog RBAC / ABAC
- Streaming : Hands-on experience with Spark Structured Streaming, Kafka, checkpointing, exactly-once semantics, and late-arriving data handling
- CI / CD : Experience with Git-based workflows and CI / CD for Databricks (e.g., Databricks Repos, dbx, GitHub Actions, Azure DevOps, or Jenkins)
- Security and Compliance : Experience with IAM, KMS, encryption, secrets management, token / credential rotation, and PII governance
- Performance and Cost : Demonstrated ability to tune Spark jobs and optimize Databricks cluster configurations and AWS usage for cost and throughput
- Collaboration : Experience working in Agile / Scrum teams, peer reviews, and achieving code coverage targets
Preferred Skills & Qualifications :
- Certifications : Databricks Data Engineer Professional, AWS Solutions Architect / Developer, HashiCorp Terraform Associate
- Data Catalogs : Experience with enterprise catalogs such as Collibra or Alation, and lineage tooling such as OpenLineage
- Orchestration : Databricks Workflows and / or Airflow
- Additional AWS : Glue, Lambda, Step Functions, CloudWatch, Secrets Manager
- Testing : pytest, chispa, Great Expectations, dbx test
- Domain Experience : Analytics and ML feature pipelines, MLOps integrations