📚FAQ
Distributed AI training needs to compress the data transmitted over the network.
At present, network data compression for AI training rests mainly on two facts:
1. Most of the data involved in AI training is sparse; around 99.9% of the gradient exchanges in distributed stochastic gradient descent (SGD) training are redundant.
2. Changes in activation values can be compressed.
There is therefore a solid basis for greatly compressing the data transmitted during communication.
References can be found in the appendix.
Generally speaking, current compression techniques in the industry can keep the training speed on a 1 Gbps network at about 80% of the speed achievable on a 10 Gbps network, i.e., roughly a 20% performance loss.
Appendix:
Industry research on compressing the data transmitted over the network during AI training:
https://arxiv.org/abs/1712.01887
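The Deep Gradient Compression work linked above is built on top-k gradient sparsification with local accumulation of the untransmitted remainder. The minimal PyTorch sketch below illustrates the idea only; the function name, the 0.1% keep ratio, and the omitted pieces (momentum correction, exchange of the sparse updates) are simplifications for illustration, not DEKUBE's implementation:

```python
import torch

def sparsify_gradient(grad: torch.Tensor, keep_ratio: float = 0.001):
    """Keep only the largest-magnitude entries of a gradient tensor.

    Returns the indices/values to transmit plus the residual that stays on
    the worker and is added back into the next step's gradient
    (local gradient accumulation, as in Deep Gradient Compression).
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * keep_ratio))
    # Select the top-k entries by absolute value.
    _, idx = torch.topk(flat.abs(), k)
    values = flat[idx]
    # Everything not transmitted is kept locally as a residual.
    residual = flat.clone()
    residual[idx] = 0.0
    return idx, values, residual.view_as(grad)

# Example: only ~0.1% of a 1M-element gradient is actually sent.
g = torch.randn(1000, 1000)
idx, vals, residual = sparsify_gradient(g)
print(f"{vals.numel()} of {g.numel()} values transmitted")
```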
All models generate structured data with specific tags.
When the cluster's master node aggregates the computation results, it performs syntax and semantic checks on the data it receives.
A compute-node container can only load the container image specified by the demand-side user. This check is done through native container and K8S functionality: verifying the hash and tags of the loaded image.
Computing tasks, processes and results will be recorded on the blockchain for subsequent review.
Under this model, anyone attempting to submit fraudulent results would incur a cost greater than any reward they could obtain; cheating becomes a net loss and is therefore not economically viable.
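The image check mentioned above essentially amounts to comparing the digest of the image a node actually pulled against the digest pinned in the task specification; a real deployment relies on the container runtime and K8S doing this natively. A minimal sketch with placeholder names:

```python
import hashlib

def verify_image_digest(image_archive_path: str, expected_sha256: str) -> bool:
    """Compare the SHA-256 digest of a pulled image archive against the digest
    pinned by the demand-side user; any mismatch means the node is not running
    the image it was asked to run."""
    h = hashlib.sha256()
    with open(image_archive_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256.lower()

# Placeholder usage; the expected digest would come from the task specification.
# verify_image_digest("model-runtime.tar", "<sha256 digest from the task spec>")
```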
Traditional clusters use Ethernet with fixed IPs, and existing distributed computing projects can only use fixed public IPs; connecting the servers requires deployment and configuration by professionals. DEKUBE nodes, in contrast, sit behind non-fixed IPs on the Internet, and ordinary PC users do not know how traditional clusters are deployed and configured. Using the computing resources of ordinary users therefore raises two problems:
1. Peer-to-peer (P2P) networks are normally used for transmission between non-fixed IPs on the Internet, as in BitTorrent, eMule, and traditional blockchains, but such transfers are slow and unstable. That is why we have enhanced the peer-to-peer network layer, increasing its speed and stability many times over to meet the needs of inter-GPU communication.
2. We have built the world's only compute-node client that makes the entire onboarding process simple for ordinary users: it can be used directly, without typing commands or writing code. Traditional clusters, including cloud computing service providers, and existing distributed computing projects cannot do this.
During large-scale training, the sheer number of nodes means that hardware, software, and the network will inevitably suffer intermittent failures and interruptions; for example, an A100 card running at full load has a failure rate of roughly 1 in 10,000 per day.
Therefore, the training process for large language models usually includes its own fault-tolerance mechanism.
For specific solutions, please refer to:
https://arxiv.org/abs/2205.01068
https://arxiv.org/abs/2204.02311
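A common building block of such fault tolerance, also used in the training runs described in the reports linked above, is periodic checkpointing so that training can resume from the last saved state after a node fails. A minimal PyTorch-style sketch (the file name and the atomic-write detail are illustrative choices, not DEKUBE specifics):

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # illustrative path

def save_checkpoint(model, optimizer, step):
    """Write the training state atomically so a crash mid-save cannot corrupt it."""
    tmp = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint(model, optimizer):
    """Resume from the last checkpoint if one exists; otherwise start from step 0."""
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```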
The user's model data and output results are stored encrypted on the storage nodes, and only the user can access their own confidential information. Even system administrators cannot access it.
The host's communication with the cluster master node and the client server takes place over an encrypted channel, and its contents cannot be read from the outside.
The client monitors host-user actions; if a host user tries to read the client's data, that host is kicked out of the cluster.
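Purely as an illustration of encryption at rest, the sketch below uses symmetric encryption from the `cryptography` package; the library choice and key handling are assumptions made for the example, not a description of DEKUBE's actual scheme:

```python
from cryptography.fernet import Fernet

# The key is held only by the data owner; the storage node never sees it.
key = Fernet.generate_key()
cipher = Fernet(key)

def encrypt_artifact(path: str) -> bytes:
    """Encrypt a model file or result so the storage node only stores ciphertext."""
    with open(path, "rb") as f:
        return cipher.encrypt(f.read())

def decrypt_artifact(blob: bytes) -> bytes:
    """Only the key holder (the user) can recover the plaintext."""
    return cipher.decrypt(blob)
```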
Large-model training generally requires GPUs of matched performance, so DEKUBE's task-allocation system assigns GPUs with similar performance to the same task.
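As a rough sketch of what such performance alignment could look like (the scores mirror the table later in this FAQ, but the grouping tolerance and function are illustrative, not DEKUBE's actual scheduler):

```python
from collections import defaultdict

# Illustrative relative performance scores (see the GPU table further below).
GPU_SCORE = {"RTX4090": 0.69, "RTX3090": 0.48, "RTX3060": 0.24}

def group_by_performance(nodes, tolerance=0.1):
    """Bucket available GPUs so that a single task only receives GPUs whose
    relative performance differs by at most `tolerance`."""
    buckets = defaultdict(list)
    for node, gpu in nodes:
        score = GPU_SCORE[gpu]
        buckets[round(score / tolerance)].append(node)
    return list(buckets.values())

# Example: 4090s land in one bucket, the 3090 and 3060 in their own buckets.
print(group_by_performance([("n1", "RTX4090"), ("n2", "RTX3090"),
                            ("n3", "RTX4090"), ("n4", "RTX3060")]))
```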
A typical GPU requires four CPU cores.
System memory (RAM) should be at least twice the GPU's VRAM capacity.
The CPU and motherboard need to allocate at least 8 PCI-E lanes to each GPU.
There are two types of GPU memory consumption in deep learning:
model states and residual states, both of which consume a lot of GPU memory.
1. Model states come in three main types: optimizer states, gradients, and parameters.
Splitting these three types of data across the GPUs yields three partitioning schemes (a sketch of the per-GPU memory accounting follows the list):
(where Ψ is the number of model parameters, K is the ratio of the optimizer data volume to the parameters, and N_d is the number of GPUs the data is split across)
P_os: only the optimizer states are evenly partitioned.
P_os+g: the gradients and optimizer states are evenly partitioned.
P_os+g+p: the optimizer states, gradients, and parameters are all evenly partitioned.
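Using the symbols defined above, the standard ZeRO-style accounting from the literature (an assumption here, not spelled out in this FAQ; with mixed-precision Adam, fp16 parameters and gradients take 2Ψ each and the optimizer states take KΨ with K ≈ 12) gives the per-GPU memory for model states as:

$$
\text{No partitioning: } (2+2+K)\,\Psi \qquad
P_{os}:\; 4\Psi + \frac{K\Psi}{N_d} \qquad
P_{os+g}:\; 2\Psi + \frac{(2+K)\Psi}{N_d} \qquad
P_{os+g+p}:\; \frac{(2+2+K)\Psi}{N_d}
$$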
The P_os and P_os+g schemes do not increase communication volume; they are lossless partitioning methods that can greatly reduce GPU memory requirements. The P_os+g+p scheme can reduce the per-GPU memory requirement almost without limit when enough GPUs are available, allowing low-end GPUs to participate in the training of very large models, but it does increase communication volume. When one GPU's P_os+g+p data is split across 64 GPUs, the inter-GPU traffic grows by about 50%. This traffic can be compressed in many ways, so in general the increase in communication load is modest.
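The roughly 50% figure matches the standard communication accounting for fully partitioned training (again taken from the ZeRO literature rather than this FAQ): plain data-parallel all-reduce moves about 2Ψ values per step, while additionally all-gathering the partitioned parameters for the forward and backward passes brings the total to about 3Ψ:

$$
\frac{3\Psi}{2\Psi} = 1.5 \;\Rightarrow\; \text{about } 50\% \text{ more inter-GPU traffic.}
$$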
2. Residual states refer to memory occupied in addition to the model states, including activation values, various buffers, and unusable memory fragments; the main occupant is the activation values.
Activation optimization also uses sharding, together with checkpointing, a technique that trades computation for memory space and is widely used to optimize GPU memory usage. In the 2017 SKT paper, checkpointing reduced the memory requirement to 1/3 of the original while retaining 94% of the original performance.
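For illustration, this compute-for-memory trade is available off the shelf in PyTorch as activation (gradient) checkpointing; the toy sketch below recomputes each block's activations during the backward pass instead of storing them (the model and sizes are arbitrary examples, not DEKUBE code):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A toy stack of layers; only the block boundaries keep their activations,
# everything inside each block is recomputed during the backward pass.
blocks = nn.ModuleList([nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(8)])

def forward_with_checkpointing(x):
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 1024, requires_grad=True)
loss = forward_with_checkpointing(x).sum()
loss.backward()  # activations inside each block are recomputed here
```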
During model training, temporary buffers of varying sizes are frequently created. The solution is to pre-allocate a fixed buffer rather than creating buffers dynamically during training; if the data to be transmitted is small, multiple sets of data are batched and transmitted together to improve efficiency.
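As a rough sketch of this fixed-buffer idea (the buffer size and tensor shapes below are arbitrary examples, not DEKUBE internals), small tensors are packed into one pre-allocated flat buffer so they can be sent in a single fused operation instead of many small ones:

```python
import torch

# Pre-allocate one flat communication buffer once, before training starts.
BUFFER_SIZE = 1 << 20  # 1M elements, illustrative
comm_buffer = torch.empty(BUFFER_SIZE)

def pack_for_transfer(tensors):
    """Copy several small tensors into the pre-allocated buffer so they can be
    transmitted (or all-reduced) in one fused operation."""
    offset = 0
    shapes = []
    for t in tensors:
        n = t.numel()
        comm_buffer[offset:offset + n].copy_(t.flatten())
        shapes.append((offset, t.shape))
        offset += n
    return comm_buffer[:offset], shapes  # one contiguous chunk + metadata to unpack

# Example: three small gradients packed into a single transfer.
chunk, meta = pack_for_transfer([torch.randn(64, 64), torch.randn(128), torch.randn(32, 8)])
print(chunk.numel(), "elements in one fused buffer")
```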
One of the main causes of GPU memory fragmentation is that, after checkpointing, the activations that are not saved are constantly created and destroyed. The solution is to pre-allocate a contiguous region of GPU memory, keep the model states and checkpoints resident in it, and use the remaining GPU memory for dynamic creation and destruction.
For clusters of more than 100 GPU cards, a bandwidth above 1 Gbps is recommended (both upload and download bandwidth greater than 1 Gbps).
For clusters of more than 10 GPU cards, a bandwidth above 500 Mbps is recommended (both upload and download bandwidth greater than 500 Mbps).
The current price of Amazon's 8-card A100 80G instance is $40.96 per hour, or about $3,686 per card per month.
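Worked out from the hourly price (assuming a 30-day month of continuous use):

$$
\frac{\$40.96/\text{h}}{8\ \text{cards}} = \$5.12\ \text{per card-hour},\qquad
\$5.12 \times 24 \times 30 \approx \$3{,}686\ \text{per card per month}.
$$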
Assuming that the training performance of the NVIDIA A100 80G is 1, the computing power and power consumption of each graphics card are as follows:
| GPU | Training performance (relative to A100 80G) | TDP (W) | Performance per watt (relative to A100 80G) |
| --- | --- | --- | --- |
| A100 80G | 1 | 300 | 1 |
| A100 40G | 0.86 | 250 | 1.032 |
| RTX4090 | 0.69 | 450 | 0.459 |
| RTX4080 | 0.505 | 320 | 0.473 |
| RTX4070 | 0.42 | 200 | 0.63 |
| RTX3090 | 0.48 | 350 | 0.41 |
| RTX3080 | 0.38 | 320 | 0.35 |
| RTX3070Ti | 0.34 | 290 | 0.35 |
| RTX3060 | 0.24 | 170 | 0.42 |
| RTX3050 | 0.16 | 130 | 0.37 |
| RTX2080Ti | 0.29 | 250 | 0.35 |
| RTX2080 | 0.21 | 215 | 0.29 |
| RTX2070 | 0.19 | 175 | 0.32 |
| RTX2060 | 0.16 | 160 | 0.3 |
| AMD Radeon VII | 0.12 | 300 | 0.12 |
| GTX1080Ti | 0.2 | 250 | 0.24 |
| GTX1070 | 0.13 | 150 | 0.26 |
| GTX1660super | 0.09 | 125 | 0.216 |
| AMD Rx580 8G | 0.095 | 185 | 0.154 |
Taking the average network loss of 20% into account, the performance and performance-per-watt of actual distributed AI training relative to a centralized A100 80G computing cluster are as follows:
| GPU | Training performance (relative to A100 80G) | TDP (W) | Performance per watt (relative to A100 80G) |
| --- | --- | --- | --- |
| RTX4090 | 0.552 | 450 | 0.367 |
| RTX4080 | 0.404 | 320 | 0.378 |
| RTX4070 | 0.336 | 200 | 0.504 |
| RTX3090 | 0.384 | 350 | 0.328 |
| RTX3080 | 0.304 | 320 | 0.28 |
| RTX3070Ti | 0.272 | 290 | 0.28 |
| RTX3060 | 0.192 | 170 | 0.336 |
| RTX3050 | 0.128 | 130 | 0.296 |
| RTX2080Ti | 0.232 | 250 | 0.28 |
| RTX2080 | 0.168 | 215 | 0.232 |
| RTX2070 | 0.152 | 175 | 0.256 |
| RTX2060 | 0.128 | 160 | 0.24 |
| AMD Radeon VII | 0.096 | 300 | 0.096 |
| GTX1080Ti | 0.16 | 250 | 0.192 |
| GTX1070 | 0.104 | 150 | 0.208 |
| GTX1660super | 0.072 | 125 | 0.173 |
| AMD Rx580 8G | 0.076 | 185 | 0.123 |
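For example, the RTX4090 row follows from the previous table by applying the 20% network loss (values rounded as in the table):

$$
0.69 \times 0.8 = 0.552,\qquad 0.459 \times 0.8 \approx 0.367.
$$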
It is estimated that it will take two years to develop, test, and deploy a cloud computing platform for managing distributed AI computing power. This requires developing a collective bidding system for computing-power supply and demand; adapting to the major training frameworks (TensorFlow, PyTorch, Keras, Theano, etc.) and to the AMD ROCm computing platform; modifying the Linux kernel to form a multi-GPU runtime environment; developing virtual machines and clients suitable for providing a distributed computing environment on personal computers, as well as a disaster-recovery and scheduling system for the master nodes; and developing network compression algorithms and optimizing the peer-to-peer network layer to reduce network latency.