We present NetReduce, a novel RDMA-compatible in-network reduction architecture that accelerates distributed DNN training. Unlike existing designs, NetReduce maintains a reliable connection between end-hosts over Ethernet and does not terminate the connection inside the network. The advantage is that the congestion-control and reliability mechanisms of RoCE can be fully reused; at the same time, there is no need to implement a costly network-protocol processing stack in the switch, as InfiniBand does. The prototype, implemented on an FPGA, is an out-of-the-box solution that requires no modification to commodity devices such as NICs or switches. To coordinate the end-hosts and the switch, NetReduce customizes the transport protocol only for the first packet of a data message, so it remains compliant with RoCE v2. A dedicated status-monitoring module reuses the reliability mechanism of RoCE v2 to handle packet loss. A message-level credit-based flow control algorithm is also proposed to fully utilize bandwidth and avoid buffer overflow. We study the effect of intra-machine bandwidth on training performance in multi-machine multi-GPU scenarios and give sufficient conditions under which hierarchical NetReduce outperforms other algorithms. We also extend the design from rack-level aggregation to the more general spine-leaf topology in the data center. NetReduce accelerates training by up to 1.7x for CNN-based computer-vision tasks and 1.5x for transformer-based NLP tasks. Simulations of large-scale systems indicate that NetReduce scales better than the state-of-the-art ring all-reduce.
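To make the message-level credit-based flow control concrete, the following is a minimal sketch rather than the paper's implementation: the constant MAX_CREDITS, the class names Sender and AggregationSwitch, and the synchronous credit return are all illustrative assumptions. The sketch only shows the core idea that a sender consumes one credit per in-flight message and the switch returns the credit once the corresponding aggregation buffer is freed, which keeps the link busy without overflowing the switch buffer.

from collections import deque

MAX_CREDITS = 4  # assumed number of aggregation buffers exposed by the switch

class Sender:
    def __init__(self, switch):
        self.switch = switch
        self.credits = MAX_CREDITS   # one credit per assumed switch buffer
        self.pending = deque()       # whole messages waiting for a credit

    def submit(self, message):
        self.pending.append(message)
        self._try_send()

    def on_credit_returned(self):
        # A credit comes back once the switch has aggregated and forwarded
        # the corresponding message, freeing its buffer slot.
        self.credits += 1
        self._try_send()

    def _try_send(self):
        # Send whole messages only while credits remain, so the switch buffer
        # can never overflow while the link is kept busy.
        while self.credits > 0 and self.pending:
            self.credits -= 1
            self.switch.receive(self.pending.popleft(), reply_to=self)

class AggregationSwitch:
    def receive(self, message, reply_to):
        # ... aggregate gradients from all workers for this message ...
        reply_to.on_credit_returned()  # return the credit for the freed buffer

In this simplified model the credit is returned immediately; in a real deployment the return would be an asynchronous acknowledgment, but the invariant is the same: the number of outstanding messages never exceeds the number of aggregation buffers.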