1. AI training requires very high bandwidth for GPU-to-GPU communication. Centralized clusters use NVLink and InfiniBand to guarantee transfer rates. How can distributed AI training meet this requirement?

Distributed AI training needs to compress the data transmitted over the network.

At present, network data compression for AI training rests mainly on two facts:

1. Most of the data in AI training is sparse: 99.9% of the gradient exchanges in distributed stochastic gradient descent training are redundant.

2. Changes in activation values can also be compressed.

Therefore, there is a solid basis for greatly compressing the data transmitted over the network.
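Fact 1 is the idea behind gradient sparsification. As an illustrative sketch (not DEKUBE's actual implementation), a top-k sparsifier transmits only the largest-magnitude gradient entries and their indices:

```python
# Illustrative top-k gradient sparsification: only the largest-magnitude
# entries are transmitted, the rest are treated as redundant.
# This is a generic sketch, not DEKUBE's actual compression code.
import numpy as np

def sparsify_topk(grad: np.ndarray, keep_ratio: float = 0.001):
    """Keep only the top `keep_ratio` fraction of entries by magnitude."""
    flat = grad.ravel()
    k = max(1, int(len(flat) * keep_ratio))
    # Indices of the k largest-magnitude entries.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]          # transmit only indices and values

def densify(idx, values, shape):
    """Rebuild a dense gradient on the receiving side."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

grad = np.random.randn(1000, 1000)
idx, vals = sparsify_topk(grad, keep_ratio=0.001)
print(len(vals))                   # 1000 entries instead of 1,000,000
```

In practice such schemes accumulate the untransmitted residuals locally so no gradient information is permanently lost.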

References can be found in the appendix.

Generally speaking, current industry compression techniques can keep the training speed on a 1 Gbps network at 80% of the speed on a 10 Gbps network, i.e. a 20% performance loss.


Industry research on AI training network transmission data compression:



2. How is the AI training workload proven?

All models generate structured data with specific tags.

When the cluster's master node aggregates the computation results, it performs syntax and semantic checks on the received data.

A compute node's container can only load the container image specified by the demand-side user. This check is performed through native container and Kubernetes (K8s) functionality: verifying the hash and tags of the loaded image.
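As an illustration of the hash check (the function names below are hypothetical, not DEKUBE's actual API), a node can refuse to load any image whose digest differs from the one the user pinned:

```python
# Hedged sketch of the image check: the compute node computes the
# sha256 digest of the image it is about to load and refuses it unless
# it matches the digest the demand-side user pinned.
import hashlib

def image_digest(image_tar_path: str) -> str:
    """Compute the sha256 digest of an exported image tarball."""
    h = hashlib.sha256()
    with open(image_tar_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

def verify_image(image_tar_path: str, pinned_digest: str) -> bool:
    """Load the image only if its digest equals the user-pinned digest."""
    return image_digest(image_tar_path) == pinned_digest
```

Container runtimes perform an equivalent check natively when an image is referenced by digest rather than by a mutable tag.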

Computing tasks, processes, and results are all recorded on the blockchain for subsequent review.

In this model, the cost of faking results exceeds any reward a cheater could earn, so cheating leads to a net loss and is not economically viable.

3. How does it differ from traditional computing clusters and existing distributed computing projects?

Traditional clusters use fixed IPs on Ethernet, and existing distributed computing projects can only use fixed public IPs; in both cases, server-to-server connections must be deployed and configured by professionals. DEKUBE runs on non-fixed IPs on the Internet, and ordinary PC users do not know how traditional clusters are deployed and configured. This creates two problems when using the computing resources of ordinary users:

1. Peer-to-peer (P2P) networks are generally used for transfers between non-fixed IPs on the Internet, as in BitTorrent, eMule, and traditional blockchains. But such transfers are slow and unstable. That is why we have enhanced the peer-to-peer network layer, increasing speed and stability many times over to meet the needs of inter-GPU communication.

2. We have built the world's only compute node client that makes the entire onboarding process simple for ordinary users: it can be used directly, without typing commands or writing code. Neither traditional clusters (including cloud computing providers) nor existing distributed computing projects can do this.

4. How are faults tolerated in AI large-model training?

During large-scale training, with so many nodes involved, the hardware, software, and network inevitably suffer intermittent failures and interruptions; for example, an A100 card running at full load for one day has a failure rate of about 1/10,000.
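The arithmetic behind that figure shows why fault tolerance is mandatory at scale; a quick sketch:

```python
# Back-of-envelope check of the failure figure above: with a failure
# probability of 1/10000 per card per full-load day, expected daily
# failures scale linearly with cluster size.
def expected_failures_per_day(num_gpus: int, p_fail: float = 1e-4) -> float:
    return num_gpus * p_fail

print(expected_failures_per_day(10_000))   # a 10,000-GPU cluster: ~1 failure/day
```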

Therefore, the training process of a large language model typically includes its own fault-tolerance mechanism.

For specific solutions, please refer to:



5. How does DEKUBE ensure security?

The user's model data and output results are stored encrypted on the storage nodes; only the user can access their own confidential information, and not even system administrators can read it.

The host communicates with the cluster master node and the client server over an encrypted channel, so its traffic cannot be read by outsiders.

The client monitors the host user's actions; if the host user tries to read the client's data, the host is kicked out of the cluster.

6. Home computers contain many types of GPU cards with uneven performance. Are there any alignment requirements on the GPU cards when training large models?

Large-model training generally requires GPUs of comparable performance, so DEKUBE's task allocation system assigns GPUs with similar performance to the same task.
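One simple way to express this policy (an illustrative sketch, with made-up benchmark scores rather than measured DEKUBE data) is to sort GPUs by a benchmark score and bin cards of similar performance:

```python
# Illustrative performance-aligned allocation: GPUs are sorted by a
# benchmark score and grouped into bins of similar cards, and a task is
# only assigned within one bin. Scores below are made-up examples.
def group_by_performance(gpus, tolerance=0.15):
    """Group GPUs whose scores are within `tolerance` of the bin leader."""
    ordered = sorted(gpus.items(), key=lambda kv: kv[1], reverse=True)
    bins, current = [], []
    for name, score in ordered:
        if current and score < current[0][1] * (1 - tolerance):
            bins.append(current)     # score too far below the bin leader
            current = []
        current.append((name, score))
    if current:
        bins.append(current)
    return bins

fleet = {"RTX4090": 1.0, "RTX4080": 0.78, "RTX3090": 0.72, "GTX1660": 0.22}
print(group_by_performance(fleet))
```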

7. What are the CPU, motherboard, and memory requirements for AI training?

Each GPU typically requires four CPU cores.

System memory (RAM) should be at least twice the GPU's memory capacity.

The CPU and motherboard need to provide at least 8 PCI-E lanes to each GPU.
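The three rules of thumb above can be combined into a single admission check; the thresholds are those stated in the text, while the function itself is illustrative:

```python
# The rules of thumb above as one admission check. Thresholds
# (4 cores, 2x memory, 8 PCI-E lanes per GPU) come from the text;
# the function name and structure are illustrative.
def meets_requirements(num_gpus, gpu_mem_gb, cpu_cores, ram_gb, pcie_lanes):
    return (
        cpu_cores >= 4 * num_gpus                # 4 CPU cores per GPU
        and ram_gb >= 2 * gpu_mem_gb * num_gpus  # RAM >= 2x total GPU memory
        and pcie_lanes >= 8 * num_gpus           # 8 PCI-E lanes per GPU
    )

# A single 24 GB GPU with 16 cores, 64 GB RAM and 16 PCI-E lanes passes:
print(meets_requirements(1, 24, 16, 64, 16))   # True
```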

8. GPU memory consumption in large-model AI training is very high, while the GPU memory of ordinary gaming graphics cards is small. How is this problem solved?

Deep learning consumes GPU memory in two ways:

model states and residual states, both of which consume a lot of GPU memory.

1. Model states fall into three main types: optimizer states, gradients, and parameters.

Splitting these three types of data across GPUs yields three splitting methods

(where Ψ is the number of model parameters, K is the ratio of optimizer-state size to parameter size, and Nd is the number of GPUs the data is split across):

POS: only the optimizer states are partitioned evenly.

POS+g: the gradients and optimizer states are partitioned evenly.

POS+g+p: the optimizer states, gradients, and parameters are all partitioned evenly.
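The per-GPU memory of each method can be written out from the symbols above. The sketch below assumes mixed-precision training, with 2Ψ bytes of fp16 parameters, 2Ψ bytes of fp16 gradients, and KΨ bytes of optimizer state (K = 12 for Adam with fp32 master weights), as in the ZeRO line of work:

```python
# Per-GPU memory for each splitting method, written out from the
# symbols defined above. Assumes mixed precision: 2*Psi bytes of fp16
# parameters, 2*Psi of fp16 gradients, K*Psi of optimizer state
# (K = 12 for Adam with fp32 master weights).
def memory_per_gpu(psi, nd, k=12, method="baseline"):
    if method == "baseline":    # no splitting at all
        return (2 + 2 + k) * psi
    if method == "POS":         # optimizer states split across Nd GPUs
        return 2 * psi + 2 * psi + k * psi / nd
    if method == "POS+g":       # gradients and optimizer states split
        return 2 * psi + (2 + k) * psi / nd
    if method == "POS+g+p":     # parameters, gradients and optimizer split
        return (2 + 2 + k) * psi / nd
    raise ValueError(method)

psi, nd = 7e9, 64               # e.g. a 7B-parameter model on 64 GPUs
for m in ("baseline", "POS", "POS+g", "POS+g+p"):
    print(m, round(memory_per_gpu(psi, nd, method=m) / 2**30, 1), "GiB")
```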

The POS and POS+g methods do not increase communication volume; they are lossless splitting methods that greatly reduce GPU memory requirements. The POS+g+p method can reduce the per-GPU memory requirement almost without limit when enough GPUs are available, letting low-end GPUs participate in training very large models, but it increases communication volume: when one GPU's POS+g+p data is split across 64 GPUs, inter-GPU traffic grows by 50%. This traffic can be compressed in many ways, so in general the increase in communication load is modest.

2. Residual states refer to memory occupied beyond the model states, including activation values, various buffers, and unusable memory fragments; the main consumer is the activation values.

Activation optimization also uses sharding, together with checkpointing, a method that trades computation for memory and is widely used to optimize GPU memory. In the 2017 SKT paper, checkpointing reduced the memory requirement to 1/3 of the original while retaining 94% of the original performance.
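The compute-for-memory trade can be shown with a toy sketch: store activations only every few layers, and recompute the missing ones from the nearest checkpoint when needed. The "layers" here are toy functions; real frameworks implement the same idea (e.g. torch.utils.checkpoint in PyTorch):

```python
# Toy illustration of activation checkpointing: store activations only
# every `stride` layers, and recompute the missing ones from the
# nearest checkpoint during the backward pass.
def forward_with_checkpoints(x, layers, stride=3):
    checkpoints = {0: x}                   # position 0 = the input
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % stride == 0:
            checkpoints[i + 1] = x         # save one activation per stride
    return x, checkpoints

def recompute_activation(layers, checkpoints, target, stride=3):
    """Recompute the activation at position `target` (after `target` layers)."""
    start = (target // stride) * stride    # nearest checkpoint at or below
    x = checkpoints[start]
    for layer in layers[start:target]:
        x = layer(x)
    return x

layers = [lambda v, i=i: v + i for i in range(6)]   # toy "layers"
out, ckpts = forward_with_checkpoints(0, layers)
print(out, sorted(ckpts))   # 3 stored activations instead of 6
```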

During model training, temporary buffers of various sizes are frequently created. The solution is to pre-allocate a fixed buffer so that buffers are no longer created dynamically during training; and if the data to be transmitted is small, multiple pieces of data are packed together and sent in one transfer to improve efficiency.
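The fixed-buffer idea can be sketched as follows (illustrative names; `send` stands in for the real network call):

```python
# Sketch of a fixed fused buffer: small payloads are packed into one
# pre-allocated buffer and flushed as a single transfer instead of
# many tiny ones. Illustrative, not DEKUBE's actual code.
class FusedBuffer:
    def __init__(self, capacity, send):
        self.buf = bytearray(capacity)   # allocated once, never resized
        self.used = 0
        self.send = send

    def push(self, payload: bytes):
        assert len(payload) <= len(self.buf), "payload exceeds buffer size"
        if self.used + len(payload) > len(self.buf):
            self.flush()                 # buffer full: transmit and reuse
        self.buf[self.used:self.used + len(payload)] = payload
        self.used += len(payload)

    def flush(self):
        if self.used:
            self.send(bytes(self.buf[:self.used]))
            self.used = 0

sent = []
fb = FusedBuffer(capacity=8, send=sent.append)
for chunk in (b"ab", b"cd", b"ef", b"gh", b"ij"):
    fb.push(chunk)
fb.flush()
print(sent)   # two transfers instead of five
```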

One main cause of GPU memory fragmentation is that, with checkpointing, the activations that are not saved are continually created and destroyed. The solution is to pre-allocate a contiguous block of GPU memory, keep the model states and checkpoints resident in it, and use the remaining GPU memory for dynamic creation and destruction.
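A minimal sketch of this pre-allocation scheme (class and method names are illustrative, not DEKUBE's actual code): one contiguous region is reserved up front, model states and checkpoints stay resident at the bottom, and short-lived activations use the rest as a bump allocator that is reset wholesale instead of fragmenting:

```python
# Hedged sketch of the anti-fragmentation scheme described above.
class GpuMemoryPool:
    def __init__(self, total, resident):
        assert resident <= total
        self.total = total
        self.resident = resident     # bytes below this offset never move
        self.top = resident          # next free offset in dynamic region

    def alloc(self, size):
        """Bump-allocate `size` bytes from the dynamic region."""
        if self.top + size > self.total:
            raise MemoryError("dynamic region exhausted")
        offset = self.top
        self.top += size
        return offset

    def reset_dynamic(self):
        """Free all short-lived allocations at once (no fragmentation)."""
        self.top = self.resident

pool = GpuMemoryPool(total=1 << 30, resident=600 << 20)  # 1 GiB, 600 MiB resident
a = pool.alloc(100 << 20)
b = pool.alloc(100 << 20)
print(a >> 20, b >> 20)   # offsets 600 and 700 (MiB)
```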

9. If a large number of miners are in one place, what network configuration is recommended?

For more than 100 GPU cards, a bandwidth of at least 1 Gbps is recommended (both upload and download bandwidth greater than 1 Gbps).

For more than 10 GPU cards, a bandwidth of at least 500 Mbps is recommended (both upload and download bandwidth greater than 500 Mbps).
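The two recommendations can be written as a simple lookup. The thresholds and values are exactly those stated above; for 10 cards or fewer the text gives no figure, so the sketch returns None rather than invent one:

```python
# The two bandwidth recommendations above as a lookup table.
def recommended_bandwidth_mbps(num_gpus: int):
    """Minimum symmetric (upload and download) bandwidth in Mbps."""
    if num_gpus > 100:
        return 1000    # more than 100 cards: at least 1 Gbps each way
    if num_gpus > 10:
        return 500     # more than 10 cards: at least 500 Mbps each way
    return None        # no recommendation given for 10 cards or fewer

print(recommended_bandwidth_mbps(120), recommended_bandwidth_mbps(20))
```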

10. Please list the computing-power-to-power-consumption ratios, relative to the A100, of AMD cards at RX 580 or above and Nvidia cards at GTX 1660 or above.

The current Amazon price for an 8-card A100 80 GB instance is $40.96 per hour, or about $3,686 per card per month.

Assuming that the training performance of the NVIDIA A100 80G is 1, the computing power and power consumption of each graphics card are as follows:

Considering the average network loss of 20%, the performance-to-watt ratio of actual distributed AI training relative to centralized A100 80G computing clusters is as follows:

11. If a new centralized competitor appears and starts from zero, how long would it take them to develop and deploy a working competing system?

It is estimated that it would take two years to develop, test, and deploy a cloud computing platform for managing distributed AI computing power. This requires:

1. Developing a collective bidding system for computing-power supply and demand.

2. Adapting to the various training frameworks (TensorFlow, PyTorch, Keras, Theano, etc.).

3. Adapting to the AMD ROCm computing platform and modifying the Linux kernel to form a multi-GPU runtime environment.

4. Developing virtual machines and clients that provide a distributed computing environment on personal computers.

5. Building a disaster-recovery and scheduling system for the master nodes.

6. Developing network compression algorithms and optimizing the peer-to-peer network layer to reduce network latency, among other work.

Last updated