Talent.com
Data Engineer
Institute Of Foundation Models • Sunnyvale, California, United States
30+ days ago
Job type
  • Full-time
Job description

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

As a Data Engineer specializing in Natural Language Processing (NLP) and large-scale data processing, you will quickly and effectively gather, curate, and prepare high-quality datasets to support cutting-edge NLP research. Your role will be instrumental in enabling researchers by delivering essential data through efficient and scalable engineering practices, including web crawling, LLM-generated content refinement, and robust data pipelines, primarily leveraging Python and related technologies.

Key Responsibilities

  • Rapidly collect, curate, and preprocess datasets based on detailed specifications provided by NLP researchers, delivering data within tight timelines (typically within 1-2 days).
  • Develop and maintain efficient web crawling solutions, APIs, and automated workflows to continuously improve data collection processes.
  • Refine and evaluate outputs from Large Language Models (LLMs) to generate structured datasets suitable for model training and benchmarking.
  • Implement scalable data pipelines, ensuring efficient data processing, storage, retrieval, and distribution to research teams.
  • Collaborate closely with researchers and engineers to ensure collected data meets specified quality and relevance criteria.
  • Document data collection methodologies, dataset characteristics, and pipeline architecture clearly and effectively.
  • Engage with peer teams and participate in technical reviews to uphold best practices and data quality standards.
  • Represent MBZUAI at industry and research forums, showcasing technical capabilities in large-scale data processing and AI data infrastructure.
  • Perform all other duties as reasonably directed by the line manager commensurate with these functional objectives.

Academic Qualifications

  • Bachelor's degree in Computer Science, Data Science, Engineering, or a related technical field required.
  • Master’s degree or equivalent experience in Computer Science, Data Engineering, or related technical fields preferred.

Professional Experience - Required

  • Extensive experience in data engineering, data processing, and automation using Python.
  • Demonstrated proficiency in designing and deploying web crawling solutions, automated data extraction, and processing pipelines.
  • Strong understanding of data structures, algorithms, databases, SQL, and performance optimization.
  • Experience working with cloud infrastructure and distributed data processing frameworks (e.g., AWS, Spark, Kafka, Kubernetes).
  • Excellent problem-solving abilities, attention to detail, and the capability to rapidly address technical challenges.
  • Strong communication and collaboration skills with cross-functional teams.

Professional Experience - Preferred

  • Proven track record of supporting NLP or AI research teams with rapid and reliable data delivery.
  • Experience with refining outputs from large-scale AI models, such as LLM-generated data.
  • Contributions to open-source projects, coding competitions, or high visibility in coding communities (e.g., GitHub, Stack Overflow).
  • Familiarity with the latest advancements in NLP data processing and large language model technologies.

$100,000 - $500,000 a year

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401K Plan
  • Generous paid time off, sick leave and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life insurance and disability