Software Engineer, Training & Inference Infrastructure

DatologyAI • Redwood City, CA, United States
30+ days ago
Job type
  • Full-time
Job description

About the Company

Companies want to train their own large models on their own data. The current industry standard is to train on a random sample of your data, which is inefficient at best and actively harmful to model quality at worst. There is compelling research showing that smarter data selection can train better models faster; we know because we did much of this research. Given the high costs of training, this presents a huge market opportunity. We founded DatologyAI to translate this research into tools that enable enterprise customers to identify the right data on which to train, resulting in better models at lower cost. Our team has pioneered deep learning data research, built startups, and created tools for enterprise ML. For more details, check out our recent blog posts sharing our high-level results for text models and image-text models.

We've raised over $57M in funding from top investors like Radical Ventures, Amplify Partners, Felicis, Microsoft, Amazon, and notable angels like Jeff Dean, Geoff Hinton, Yann LeCun and Elad Gil. We're rapidly scaling our team and computing resources to revolutionize data curation across modalities.

This role is based in Redwood City, CA. We are in office 4 days a week.

About the Role

We're looking for an engineer with deep experience building and operating large-scale training and inference systems. You will design, implement, and maintain the infrastructure that powers both our internal ML research workflows and the high-performance inference pipelines that deliver curated data to our customers.

As one of our early hires, you will influence technical direction, partner directly with researchers and product engineers, and take ownership of systems that are central to our company's success.

What You'll Work On

  • Architect and maintain training infrastructure that is reliable, scalable, and cost-efficient.
  • Build robust model serving infrastructure for low-latency, high-throughput inference across heterogeneous hardware.
  • Automate resource orchestration and fault recovery across GPUs, networking, OS, drivers, and cloud environments.
  • Partner with researchers to productionize new models and features quickly and safely.
  • Optimize training and inference pipelines for performance, reliability, and cost.
  • Ensure all infrastructure meets the highest bar for reliability, security, and observability.

About You

  • At least 5 years of professional software engineering experience.
  • Expertise in Python and experience with deep learning frameworks (PyTorch preferred).
  • An understanding of modern ML architectures and an intuition for how to optimize their performance, particularly for training and/or inference.
  • Familiarity with inference tooling like vLLM, SGLang, or custom model-parallel systems.
  • Proven experience designing and running large-scale training or inference systems in production.
  • Familiarity, or the ability to quickly gain it, with PyTorch, NVIDIA GPUs and the software stacks that optimize them (e.g. NCCL, CUDA), as well as HPC technologies such as InfiniBand, NVLink, and AWS EFA.
  • Commitment to engineering excellence: strong design, testing, and operational discipline.
  • Collaborative, humble, and motivated to help the team succeed.
  • Ownership mindset: you're comfortable learning fast and tackling problems end-to-end.

Don't meet every single requirement? We still encourage you to apply. If you're excited about our mission and eager to learn, we want to hear from you!

Compensation

At DatologyAI, we are dedicated to rewarding talent with a highly competitive salary and significant equity. The base salary for this position ranges from $180,000 to $250,000.

The candidate's starting pay will be determined based on job-related skills, experience, qualifications, and interview performance.

We offer a comprehensive benefits package to support our employees' well-being and professional growth:

  • 100% covered health benefits (medical, vision, and dental).
  • 401(k) plan with a generous 4% company match.
  • Unlimited PTO policy.
  • Annual $2,000 wellness stipend.
  • Annual $1,000 learning and development stipend.
  • Daily lunches and snacks are provided in our office!
  • Relocation assistance for employees moving to the Bay Area.