Cloud Platform Engineer

SambaNova Systems • Palo Alto, CA, United States

6 days ago

Job type

Full-time

Job description

The era of pervasive AI has arrived. In this era, organizations will use generative AI to unlock hidden value in their data, accelerate processes, reduce costs, and drive efficiency and innovation, fundamentally transforming their businesses and operations at scale.

SambaNova Suite™ is the first full-stack, generative AI platform, from chip to model, optimized for enterprise and government organizations. Powered by the intelligent SN40L chip, the SambaNova Suite is a fully integrated platform, delivered on-premises or in the cloud, combined with state-of-the-art open-source models that can be easily and securely fine-tuned using customer data for greater accuracy. Once a model is adapted with customer data, customers retain ownership of it in perpetuity, so they can turn generative AI into one of their most valuable assets.

About SambaNova Systems

Join the company that's building the future of AI computing. At SambaNova, we are disrupting the AI and high-performance computing space with our integrated hardware and software platform. Our DataScale systems and SambaFlow software are pushing the boundaries of what's possible with generative AI and large language models. We are a team of passionate innovators tackling some of the world's most challenging computational problems.

The Role

As a Cloud Platform Engineer, you will specialize in our AI Inferencing Service and serve as the guardian of its reliability, performance, and scalability. You will bridge the gap between software development and operations, applying an engineering mindset to solve operational challenges. Your primary focus will be ensuring our inference endpoints have exceptional uptime, low-latency response times, and efficient resource utilization, directly impacting the experience of our customers and the success of our AI products. This role includes participating in a shared on-call rotation to maintain 24/7 service reliability.

What You'll Do

Service Ownership & On-Call: Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions. This includes implementing and supporting AI infrastructure in new regions, such as Asia, Europe, and Latin America, to support the growth of our business. Participate in a balanced on-call rotation to provide 24/7 support for the service.

On-Call & Work-Life Balance

We believe a sustainable on-call schedule is critical for long-term success and team health. Our on-call philosophy is built on the following principles:

Balanced Rotation: The on-call rotation is shared equally across the team, typically following a primary/secondary (follow-the-sun) model to ensure no single person bears a disproportionate burden.
Focus on Prevention: We invest heavily in automation, robust testing, and system design to prevent pages before they happen. The goal of on-call is not to heroically fight fires, but to manage rare, complex failures and use those learnings to make the system more resilient.
Actionable Alerts: We have a strict policy against alert fatigue. Alerts must be actionable and require immediate human intervention.

Incident Management: Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence.
Monitoring & Alerting: Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization. A key responsibility is ensuring alerts are actionable and have a low false-positive rate, minimizing on-call fatigue.
Performance & Scalability: Proactively identify and eliminate performance bottlenecks. Design and implement auto-scaling policies to handle variable inference loads cost-effectively. Use insights from on-call incidents to drive improvements that enhance system stability and scalability.
Infrastructure as Code (IaC): Manage and evolve our cloud infrastructure (AWS, GCP, and/or Azure, along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable.
CI/CD & Automation: Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates. A core goal is to automate manual toil identified during on-call shifts, reducing future operational overhead.
Capacity Planning: Forecast infrastructure needs based on product roadmaps and usage trends. Work with finance and engineering teams to manage cloud costs and optimize spending.
SLOs & SLIs: Define, measure, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments.

What We're Looking For (Must-Haves)

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.

3-5+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure).

Strong programming/scripting skills in languages like Python, Go, or Java.

Proven experience with containerization and orchestration technologies (Docker, Kubernetes).

Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).

Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation).

Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD).

Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems.

What Will Make You Stand Out (Nice-to-Haves)

Experience in a hybrid environment bridging cloud and on-premises/data center infrastructure.

Direct experience supporting ML/AI inferencing services in production.

Familiarity with GPU-accelerated computing and with optimizing workloads for NVIDIA GPUs, for the purpose of mapping them to RDUs.

Knowledge of model-serving frameworks like vLLM, SGLang, or Ray.

Understanding of MLOps principles and practices.

Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached).

Strong Linux/Unix system administration fundamentals.

Why SambaNova?

Massive Impact: You will be a key part of a critical platform with high visibility and direct impact on our product and engineers.

Cutting-Edge Technology: Work with a world-class team on one of the most advanced AI stacks in the industry.

Autonomy and Growth: We trust you to make technical decisions. This is a greenfield opportunity to build something remarkable from the ground up.

Competitive Compensation: Including equity, excellent benefits, and a flexible work environment.

Submission Guidelines

Please note that in order to be considered an applicant for any position at SambaNova Systems, you must submit an application form for each position for which you believe you are qualified.

EEO Policy

SambaNova Systems is an Equal Opportunity / Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to age (40 and over), color, disability, gender identity, genetic information, marital status, military or veteran status, national origin/ancestry, race, religion, creed, sex (including pregnancy, childbirth, breastfeeding), sexual orientation, or any other applicable status protected by federal, state, or local laws.

Benefits Summary for US-Based, Full-Time Employment Positions

SambaNova offers a competitive total rewards package, including base salary, equity, and benefits. We cover 95% of the premium for employee medical insurance and 77% of the premium for dependents, and we offer a Health Savings Account (HSA) with employer contribution. We also offer Dental, Vision, Short- and Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans, in addition to Flexible Spending Account (FSA) options like Health Care, Limited Purpose, and Dependent Care. Our library of well-being benefits available to you and your dependents includes a full subscription to Headspace, Gympass+ membership with access to physical gyms, One Medical membership, counseling services with an Employee Assistance Program, and much more.
