Tesla’s networking solution for AI training: Tesla has developed a new networking protocol, Tesla Transport Protocol over Ethernet (TTPoE), to meet the high-bandwidth, low-latency requirements of its Dojo supercomputer for AI training in automotive applications.
- TTPoE is designed to replace TCP (Transmission Control Protocol) in Tesla’s supercomputing environment, offering microsecond-scale latency and simple hardware offload capabilities.
- The protocol runs over standard Ethernet switches, maintaining compatibility with existing network infrastructure while optimizing performance for AI workloads.
Key features of TTPoE: The new protocol simplifies traditional TCP processes to reduce latency and improve efficiency in high-performance computing environments.
- Connection establishment and teardown are streamlined, cutting the number of packets exchanged and eliminating TCP-style wait states.
- The protocol is designed to be handled entirely in hardware, which makes it transparent to software and potentially faster than standard TCP implementations.
- Congestion control is simplified, using a fixed-size SRAM buffer instead of TCP’s dynamic congestion window, optimized for the low-latency, low-packet-loss environment of a supercomputer network.
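The fixed-buffer idea above can be sketched in a few lines. This is an illustrative model, not Tesla's implementation: the amount of unacknowledged data in flight is capped by a fixed buffer size, with none of TCP's window growth or shrinkage; all names and sizes here are assumptions for illustration.

```python
# Sketch of a fixed-window sender: outstanding (unacknowledged) data is
# capped by a fixed SRAM buffer rather than a dynamic TCP-style
# congestion window. Illustrative only; names/sizes are assumptions.

BUFFER_BYTES = 1 * 1024 * 1024  # fixed transmit SRAM, 1 MB

class FixedWindowSender:
    def __init__(self, buffer_bytes=BUFFER_BYTES):
        self.capacity = buffer_bytes
        self.in_flight = 0  # bytes sent but not yet acknowledged

    def can_send(self, nbytes):
        # Unlike TCP, the window never grows or shrinks: the send
        # limit is simply the free space left in the fixed buffer.
        return self.in_flight + nbytes <= self.capacity

    def send(self, nbytes):
        if not self.can_send(nbytes):
            return False  # stall until ACKs free buffer space
        self.in_flight += nbytes
        return True

    def on_ack(self, nbytes):
        # Acknowledged data is dropped from the buffer, freeing space.
        self.in_flight = max(0, self.in_flight - nbytes)
```

In a low-loss, low-latency supercomputer network, this stall-when-full behavior is rare enough that the simplification costs little while removing most of TCP's congestion-control state.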
Hardware implementation: Tesla has created a custom hardware solution to implement TTPoE efficiently.
- A hardware block called the TTP MAC sits between the chip and standard Ethernet hardware, incorporating CPU-like design features for efficient packet handling.
- The TTP MAC includes a 1 MB transmit SRAM buffer, capable of tolerating about 80 microseconds of network latency without significant bandwidth loss.
- This implementation is part of what Tesla calls a “Dumb-NIC” (Network Interface Card), designed to be cost-effective for large-scale deployment in host nodes feeding the Dojo supercomputer.
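The 80 microsecond figure follows from the buffer's latency-bandwidth product. As a back-of-the-envelope check (assuming a 100 Gbps line rate, which is not stated above, and taking 1 MB as 10^6 bytes):

```python
# Latency-bandwidth check: how much round-trip latency a fixed transmit
# buffer can cover at full line rate before the sender must stall.
# The 100 Gbps line rate is an assumption for illustration.
buffer_bytes = 1e6                 # 1 MB transmit SRAM
line_rate_bps = 100e9              # assumed 100 Gbps link
line_rate_Bps = line_rate_bps / 8  # 12.5 GB/s

tolerated_latency_s = buffer_bytes / line_rate_Bps
print(f"{tolerated_latency_s * 1e6:.1f} microseconds")  # prints "80.0 microseconds"
```

At 12.5 GB/s, 1 MB of in-flight data corresponds to 80 µs of latency that can be hidden before buffer space runs out, matching the figure above.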
Mojo: Expanding Dojo’s capabilities: Tesla has introduced Mojo, a system to increase the data ingestion capabilities of the Dojo supercomputer.
- Mojo cards incorporate the TTP MAC, a host chip with a PCIe Gen 3 x16 interface, and 8 GB of DDR4 memory.
- These cards are installed in remote host machines, allowing Tesla to scale up bandwidth by adding more hosts to the network as needed.
- The use of slightly older technologies like PCIe Gen 3 and DDR4 helps keep costs down while still meeting performance requirements.
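A quick estimate shows why the older PCIe generation still suffices. Using the Gen 3 per-lane rate of 8 GT/s with 128b/130b encoding (standard PCIe figures; the implied ~100 Gbps Ethernet link is an assumption):

```python
# Rough per-direction bandwidth of a PCIe Gen 3 x16 link, to show why
# an older host interface can still feed a ~100 Gbps-class NIC.
lanes = 16
gen3_rate_GTps = 8.0   # 8 GT/s per lane (PCIe Gen 3)
encoding = 128 / 130   # 128b/130b line encoding overhead

gbytes_per_s = gen3_rate_GTps * encoding * lanes / 8  # GB/s per direction
print(f"{gbytes_per_s:.2f} GB/s")  # prints "15.75 GB/s", i.e. ~126 Gbps
```

Roughly 15.75 GB/s per direction comfortably exceeds the ~12.5 GB/s needed to saturate a 100 Gbps Ethernet link, so Gen 3 leaves headroom despite its age.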
Implications for supercomputing and networking: Tesla’s approach offers insights into potential optimizations for high-performance computing networks.
- The simplification of TCP for use in a controlled, high-quality network environment demonstrates how established protocols can be adapted for specific use cases.
- While TTPoE is not suitable for general internet use due to its fixed congestion window, it showcases how custom protocols can enhance performance in specialized environments.
- Compared to other supercomputing network solutions like InfiniBand, Tesla’s Ethernet-based approach may provide a more cost-effective way to meet the bandwidth needs of AI training workloads.
Looking ahead: Tesla’s TTPoE and Mojo implementations represent a novel approach to addressing the unique networking challenges posed by AI training in automotive applications.
- As AI workloads continue to demand ever-increasing amounts of data and computing power, we may see more custom networking solutions emerge in the supercomputing space.
- The balance between performance optimization and cost-effectiveness demonstrated by Tesla’s approach could influence future developments in high-performance computing infrastructure.
- While TTPoE is currently specific to Tesla’s needs, the concepts behind it may inspire similar optimizations in other specialized computing environments where traditional TCP might be a bottleneck.
Tesla’s TTPoE at Hot Chips 2024: Replacing TCP for Low Latency Applications