Are you ready to revolutionize agentic AI workflows? We are looking for a Lead Software Engineer to elevate our MCP Server into a powerful, enterprise-grade service, enabling seamless interaction between diverse tools and AI agents. In this role, you will play a crucial part in designing the new MCP Gateway, responsible for efficient request routing, policy enforcement, and establishing a solid foundation for scalable multi-agent systems. Your expertise will drive improvements in scalability, performance, and developer experience, ensuring a robust platform that meets enterprise demands.
About the Team
The Graph DX AI Runtime Team is at the forefront of building and maintaining the MCP Server and Gateway, which serve as the crucial communication backbone between agents and tools. We are dedicated to simplifying the developer experience, enabling efficient workflows, and facilitating reliable interactions. Our primary focus is on speed, security, and seamless integration, allowing teams to concentrate on creating intelligent applications rather than managing complex infrastructure.
Your Responsibilities
- Scale our enterprise AI / MCP Server and Gateway to support multi-agent workflows across Apollo, including routing, orchestration, and integration.
- Develop robust server infrastructure ensuring exceptional reliability, performance, and security under high-demand conditions.
- Build and maintain tools for agent discovery, communication, and coordination.
- Define effective deployment strategies and runtime optimizations to enhance efficiency and reduce operational overhead.
- Create frameworks and patterns for seamless multi-agent collaboration and AI-driven orchestration.
- Integrate observability, logging, and monitoring to gain complete visibility into server and agent operations.
- Explore AI-enhanced developer workflows to optimize orchestration and agent interactions.
- Collaborate with internal teams to adapt the MCP Server to meet evolving product and developer needs.
Technical Challenges You Will Address
- Build and scale the MCP Gateway to ensure reliable agentic workflows across diverse environments.
- Design and implement high-performance routing infrastructure centered on reliability, scalability, and security.
- Establish routing patterns and coordination mechanisms that enable timely interaction between agents and tools.
- Define deployment strategies and runtime optimizations to minimize latency while balancing operational overhead.
- Explore AI-driven routing strategies to enhance context retrieval and decision-making accuracy.
- Coordinate with cross-functional teams to ensure smooth integration of the MCP Server and Gateway with Apollo's control plane.
- Integrate comprehensive observability and monitoring into the routing layer for insights into traffic flows, tool availability, and agent interactions.
What We Are Looking For
Required Skills:
- Expertise in agent-to-tool orchestration, routing, and coordination within scalable, fault-tolerant systems.
- Proficiency in the Rust programming language.
- Strong background in distributed systems, server architecture, and high-performance backend development.
- Proven experience in protocol design, message routing, and server-side orchestration frameworks.
- Experience in developing and maintaining robust runtime infrastructure for AI-driven workflows.
- Hands-on proficiency with observability, monitoring, and debugging frameworks in complex systems.
- Commitment to clean, maintainable code, high reliability, and scalable architecture.
- Experience in strategic system design, with an eye towards long-term scalability and maintainability.
- Technical leadership skills, including mentoring junior engineers and promoting engineering best practices.
- Ability to influence architectural decisions across teams and align engineering efforts with product and business objectives.
- Demonstrated production ownership experience leading incident response and performance optimization in impactful backend systems.
Bonus Skills:
- Understanding of AI / ML-enabled developer tooling or autonomous system orchestration.
- Familiarity with cloud-native architectures, containerization, and orchestration frameworks.
- Track record of performance optimization and cost-efficient scaling of high-throughput distributed systems.