No longer accepting applications

Founding Data Engineer

Elicit • Oakland, CA, US

13 days ago

Job type

Full-time

Job description

Oakland, CA (or remote within US timezones)

About Elicit

Elicit is an AI research assistant that uses language models to help professional researchers and high-stakes decision makers break down hard questions, gather evidence from scientific / academic sources, and reason through uncertainty.

What we're aiming for:

Elicit radically increases the amount of good reasoning in the world.

For experts, Elicit pushes the frontier forward.

For non-experts, Elicit makes good reasoning more accessible. People who don't have the tools, expertise, time, or mental energy to make carefully-reasoned decisions on their own can do so with Elicit.

Elicit is a scalable ML system based on human-understandable task decompositions, with supervision of process, not outcomes. This expands our collective understanding of safe AGI architectures.

Visit our Twitter to learn more about how Elicit is helping researchers and making progress on our mission.

Why we're hiring for this role

Two main reasons:

Currently, Elicit operates over academic papers and clinical trials. One of your key initial responsibilities will be to build a complete corpus of these documents, available as soon as they're published, combining different data sources and ingestion methods. Once that's done there is a growing list of other document types and sources we'd love to integrate!

One of our main initiatives is to broaden the sorts of tasks you can complete in Elicit. We need a data engineer to figure out the best way to ingest massive amounts of heterogeneous data in a way that makes it usable by LLMs. We need your help to integrate with our customers' custom data providers so that they can create task-specific workflows over them.

In general, we're looking for someone who can architect and implement robust, scalable solutions to handle our growing data needs while maintaining high performance and data quality.

Probably less relevant to you, but ICOI:

Backend: Node and Python, event sourcing

Frontend: Next.js, TypeScript, and Tailwind

We like static type checking in Python and TypeScript!

All infrastructure runs in Kubernetes across a couple of clouds

We use GitHub for code reviews and CI

We deploy using the gitops pattern (i.e. deploys are defined and tracked by diffs in our k8s manifests)

Am I a good fit?

Consider these questions:

How would you optimize a Spark job that's processing a large amount of data but running slowly?

What are the differences between RDD, DataFrame, and Dataset in Spark? When would you use each?

How does data partitioning work in distributed systems, and why is it important?

How would you implement a data pipeline to handle regular updates from multiple academic paper sources, ensuring efficient deduplication?

If you have a solid answer for these—without reference to documentation—then we should chat!

Location and travel

We have a lovely office in Oakland, CA; there are people there every day but we don't all work from there all the time. It's important to us to spend time with our teammates, however, so we ask that all Elicians spend about 1 week out of every 6 with teammates.

5+ years of experience as a data engineer: owning make-or-break decisions about how to ingest, manage, and use data

Strong proficiency in Python (5+ years experience)

You have created and owned a data platform at rapidly-growing startups—gathering needs from colleagues, planning an architecture, deploying the infrastructure, and implementing the tooling

Experience with architecting and optimizing large data pipelines, ideally with particular experience with Spark; ideally these are pipelines which directly support user-facing features (rather than internal BI, for example)

Strong SQL skills, including understanding of aggregation functions, window functions, UDFs, self-joins, partitioning, and clustering approaches

Experience with columnar data storage formats like Parquet

Strong opinions, weakly held, about approaches to data quality management

Creative and user-centric problem-solving

You should be excited to play a key role in shipping new features to users—not just building out a data platform!

Nice to Have

Experience in developing deduplication processes for large datasets

Hands-on experience with full-text extraction and processing from various document formats (PDF, HTML, XML, etc.)

Familiarity with machine learning concepts and their application in search technologies

Experience with distributed computing frameworks beyond Spark (e.g., Dask, Ray)

Experience in science and academia: familiarity with academic publications, and the ability to accurately model the needs of our users

Hands-on experience with industry standard tools like Airflow, DBT, or Hadoop

Hands-on experience with standard paradigms like data lake, data warehouse, or lakehouse

What you'll do

Building and optimizing our academic research paper pipeline

You'll architect and implement robust, scalable systems to handle data ingestion while maintaining high performance and quality.

You'll work on efficiently deduplicating hundreds of millions of research papers, and calculating embeddings.

Your goal will be to make Elicit the most complete and up-to-date database of scholarly sources.

Expanding the datasets Elicit works over

Our users want Elicit to work over court documents, SEC filings, … your job will be to figure out how to ingest and index a rapidly increasing ontology of documents.

We also want to support less structured documents, spreadsheets, presentations, all the way up to rich media like audio and video.

Larger customers often want us to integrate private data into Elicit for their organization to use. We'll look to you to define and build a secure, reliable, fast, and auditable approach to these data connectors.

Data for our ML systems

You'll figure out the best way to preprocess all of the data mentioned above to make it useful to models.

We often need datasets for our model fine-tuning. You'll work with our ML engineers and evaluation experts to find, gather, version, and apply these datasets in training runs.

Your first week:

Start building foundational context

Get to know your team, our stack (including Python, Flyte, and Spark), and the product roadmap.

Familiarize yourself with our current data pipeline architecture and identify areas for potential improvement.

Make your first contribution to Elicit

Complete your first Linear issue related to our data pipeline or academic paper processing.

Have a PR merged into our monorepo, demonstrating your understanding of our development workflow.

Gain an understanding of our CI/CD pipeline, monitoring, and logging tools specific to our data infrastructure.

Your first month:

You'll complete your first multi-issue project

Tackle a significant data pipeline optimization or enhancement project.

Collaborate with the team to implement improvements in our academic paper processing workflow.

You're actively improving the team

Contribute to regular team meetings and hack days, sharing insights from your data engineering expertise.

Add documentation or diagrams explaining our data pipeline architecture and best practices.

Suggest improvements to our data processing and storage methodologies.

You're flying solo

Independently implement significant enhancements to our data pipeline, improving efficiency and scalability.

Make impactful decisions regarding our data architecture and processing strategies.

You've developed an area of expertise

Become the go-to resource for questions related to our academic paper processing pipeline and data infrastructure.

Lead discussions on optimizing our data storage and retrieval processes for academic literature.

You actively research and improve the product

Propose and scope improvements to make Elicit more comprehensive and up-to-date in terms of scholarly sources.

Identify and implement technical improvements to surpass competitors like Google Scholar in terms of coverage and data quality.

Compensation, benefits, and perks

In addition to working on important problems as part of a productive and positive team, we also offer great benefits (with some variation based on location):

Flexible work environment: work from our office in Oakland or remotely with time zone overlap (between GMT and GMT-8), as long as you can travel for in-person retreats and coworking events

Fully covered health, dental, vision, and life insurance for you, generous coverage for the rest of your family

Flexible vacation policy, with a minimum recommendation of 20 days per year plus company holidays

401K with a 6% employer match

A new Mac + $1,000 budget to set up your workstation or home office in your first year, then $500 every year thereafter

$1,000 quarterly AI Experimentation & Learning budget, so you can freely experiment with new AI tools to incorporate into your workflow, take courses, purchase educational resources, or attend AI-focused conferences and events

A team administrative assistant who can help you with personal and work tasks

For all roles at Elicit, we use a data-backed compensation framework to keep salaries market-competitive, equitable, and simple to understand. For this role, we target starting ranges of:

Senior (L4): $185-270k + equity

Expert (L5): $215-305k + equity

We're optimizing for a hire who can contribute at an L4 (senior) level or above.

We also offer above-market equity for all roles at Elicit, as well as employee-friendly equity terms (10-year exercise periods).
