Overview
Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is seeking a strong candidate to join us in developing and maintaining our ML infrastructure, including large GPU training and inference clusters.
Responsibilities
- Design, deploy, and maintain cloud-based ML training (Slurm) and inference (Kubernetes) clusters
- Implement and manage network-based cloud file systems and blob / S3 storage solutions
- Develop and maintain Infrastructure as Code (IaC) for resource provisioning
- Implement and optimize CI / CD pipelines for ML workflows
- Design and implement custom autoscaling solutions for ML workloads
- Ensure security best practices across the ML infrastructure
- Provide developer-friendly tools and practices for efficient ML operations
Ideal Experience
- Strong proficiency in cloud platforms (AWS, Azure, or GCP) with focus on ML / AI services
- Extensive experience with Kubernetes and Slurm cluster management
- Expertise in Infrastructure as Code tools (e.g., Terraform, Ansible)
- Proven track record in managing and optimizing network-based cloud file systems and object storage
- Experience with CI / CD tools and practices (e.g., CircleCI, GitHub Actions, ArgoCD)
- Strong understanding of security principles and best practices in cloud environments
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Loki)
- Familiarity with ML workflows and GPU infrastructure management
- Demonstrated ability to handle complex migrations and breaking changes in production environments
Nice to have
- Experience with custom autoscaling solutions for ML workloads
- Knowledge of cost optimization strategies for cloud-based ML infrastructure
- Familiarity with MLOps practices and tools
- Experience with high-performance computing (HPC) environments
- Understanding of data versioning and experiment tracking for ML
- Knowledge of network optimization for distributed ML training
- Experience with multi-cloud or hybrid cloud architectures
- Familiarity with container security and vulnerability scanning tools
EEO and Privacy
Black Forest Labs is an equal opportunity employer. We do not discriminate on the basis of any protected status under applicable law. Employment is contingent on compliance with applicable laws and regulations. Voluntary self-identification of disability information is requested for government reporting purposes; participation is voluntary and will not affect hiring decisions. Any information provided is confidential.