📖Litepaper

Background: Why distributed AI training matters

The emergence of ChatGPT has driven rapid development of the AI industry over the past year, with exponential growth in large models and their parameter counts (GPT-3 has 175B parameters, GPT-4 has reportedly reached 1.5 trillion, and a single training run costs on the order of $100 million). At the same time, high-end GPUs for AI computing are severely supply-constrained (a single H100 card costs nearly $40,000), so the time and money required to train large models have risen sharply.

On the other hand, a large amount of scattered AI computing power sits idle. The vast majority of PC graphics cards are unused most of the time, and Ethereum's transition from PoW to PoS has released a large pool of GPU resources that now sit idle as well. From the standpoint of economics and efficient resource use, it makes sense to put these idle GPUs back to work.

Vision: The leading distributed AI training network

DEKUBE aspires to revolutionize the global landscape of artificial intelligence by establishing itself as the premier provider of distributed AI computing power. Our vision is to eradicate barriers to innovation, making AI accessible to all corners of society and industries. Through our cutting-edge and secure platform, we aim to empower enterprises and individuals alike, enabling them to harness the full potential of distributed AI computing seamlessly and efficiently. DEKUBE's commitment to excellence lies in creating a robust and transparent ecosystem, ensuring that users worldwide can leverage our state-of-the-art AI infrastructure to drive scientific research, technological innovation, and societal progress. In essence, we envision DEKUBE as the catalyst for a new era in AI, where accessibility, scalability, and transformative impact converge to shape the future of intelligent computing.

Challenge: The technical challenges of distributed AI training

Although the future looks promising, putting this scattered AI computing power to use is a very difficult task. We talked to hundreds of professors and CTOs in the AI industry, and more than 99% of them raised concerns about GPU-to-GPU communication. In their experience, AI training requires very high communication bandwidth: NVLink plus InfiniBand network cards of 100 Gbps or more can meet the demand, whereas traditional peer-to-peer networks are very slow; the communication speed between two 100 Mbps home connections is often only tens of Kbps. This seems to make the task impossible. Another difficulty is that configuring one's own computer to provide cloud computing services has a considerable technical threshold, and the vast majority of PC users lack professional command-line skills.

Solutions

Introduction

So how can we make use of these scattered computing resources?

We need to do at least three things. First, increase the transmission speed of traditional P2P networks as much as possible, to make full use of the bandwidth that network providers actually supply. Second, compress the volume of communication data transmitted during AI training. Third, package the process by which users offer their own computing power into an ordinary application, skipping complex code and command-line steps.

In response to the first issue, we developed the P2P network with an enhanced network layer that lets traditional peer-to-peer connections run as fast as the bandwidth their network providers advertise. For the second issue, we integrated several existing compression technologies for AI training communication, reducing the data transmitted to less than 5% of the original volume. For the third, we developed a one-click-install compute-node client that connects easily to the DEKUBE cluster and provides cloud computing services to those who need computing power.

P2P Network

We have developed the P2P network: an industry-leading, high-speed, decentralized, and secure public chain with a throughput of more than 12,000 TPS and a well-developed peer-to-peer network layer that far exceeds the speed of traditional peer-to-peer networks, supporting the demanding network transmission needs of AI training.

How can the transmission speed of a P2P network be increased? Consider that every home connection can reach mainstream websites very quickly, at the speed the network provider claims, because those sites' servers sit on the telecom carriers' backbone networks. Our solution is therefore to add relay nodes on the backbone networks of various regions: a P2P node first sends its data to a backbone relay node, which then forwards it to the target node. To build this network-enhanced blockchain, we spent five years writing more than 800,000 lines of C++ code at a cost of tens of millions of dollars. In our tests, the improved P2P network reaches the speed claimed by the network provider, hundreds of times higher than the previous lower bound on transfer speed.
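To make the relay idea concrete, here is a minimal, hypothetical sketch of a backbone relay in Python. Everything here (the port, the framing, the one-connection-per-thread design) is illustrative; DEKUBE's actual relay protocol, routing, and authentication are not specified in this document.

```python
import socket
import threading

# Hypothetical sketch of the backbone relay described above. A node sends
# one header line ("target_host:port\n") followed by its payload; the relay,
# sitting on the carrier backbone, forwards the payload to the target node.

RELAY_PORT = 9000  # illustrative port

def handle(conn: socket.socket) -> None:
    with conn:
        header = b""
        while not header.endswith(b"\n"):       # read the target address line
            byte = conn.recv(1)
            if not byte:
                return                          # peer closed early
            header += byte
        host, port = header.decode().strip().rsplit(":", 1)
        with socket.create_connection((host, int(port))) as target:
            while chunk := conn.recv(65536):    # stream payload to the target
                target.sendall(chunk)

def serve() -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", RELAY_PORT))
    srv.listen()
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()
```

A sending node connects to its nearest relay, writes the target's address line, then streams its data; most of the path then runs over the carrier backbone rather than hop-by-hop across consumer links.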

Today, many distributed computing projects process transactions by having a fixed master node assign tasks to trusted sub-nodes for computation, then skip the inspection stage, omit verification, and write results directly to the database; unverified information carries the risk of incorrect data and loss of user property. While preserving signature verification and the decentralized design of the master node and committee, the P2P network is more than 6 times faster than other public blockchains that meet this requirement.

At the same time, the P2P network balances secure decentralization with high-speed, stable transaction processing.

The P2P network uses a unique consensus algorithm, SDBFT (Simplified Decentralized Byzantine Fault Tolerance): a Byzantine fault-tolerant algorithm that tolerates certain anomalies through simplified decentralization. Compared with traditional BFT algorithms, it is more efficient and easier to implement, offering high performance and good fault tolerance.
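This document does not spell out SDBFT's internals, so the sketch below shows only the classic Byzantine fault-tolerance arithmetic that any BFT-family algorithm builds on: with n ≥ 3f + 1 nodes, up to f Byzantine nodes can be tolerated, and 2f + 1 matching votes suffice to finalize a block.

```python
def max_tolerable_faults(n_nodes: int) -> int:
    """Classic BFT bound: n >= 3f + 1, so f = floor((n - 1) / 3)."""
    return (n_nodes - 1) // 3

def quorum(n_nodes: int) -> int:
    """Matching votes needed to finalize a block safely."""
    return 2 * max_tolerable_faults(n_nodes) + 1

# e.g. a 10-node committee tolerates 3 Byzantine nodes and finalizes at 7 votes
assert max_tolerable_faults(10) == 3 and quorum(10) == 7
```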

Data compression for communication between GPUs

99.9% of the gradient exchange in distributed SGD is redundant, and the Deep Gradient Compression (DGC) algorithm greatly reduces the required communication bandwidth. DGC can compress the gradient data transmitted during communication to 1/600 of the original volume, allowing a 1 Gbps link to deliver the training effect of the original 10 Gbps bandwidth.
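As an illustration of the sparsification at the heart of DGC, here is a minimal PyTorch sketch of top-k gradient selection at the 1/600 ratio quoted above. The full DGC algorithm also accumulates the untransmitted residual locally (with momentum correction and gradient clipping); this sketch shows only the compression step, and the function names are ours.

```python
import torch

def topk_sparsify(grad: torch.Tensor, ratio: float = 1.0 / 600):
    """Keep only the largest-magnitude fraction of gradient entries.

    Returns (indices, values) to transmit. The residual (everything not
    sent) should be kept and folded into the next step's gradient, as DGC does.
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

def desparsify(indices: torch.Tensor, values: torch.Tensor, shape) -> torch.Tensor:
    """Rebuild a dense gradient on the receiving side."""
    dense = torch.zeros(int(torch.Size(shape).numel()), device=values.device)
    dense.scatter_(0, indices, values)
    return dense.view(shape)
```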

AQ-SGD compresses the change in activations for the same training example across epochs. In a decentralized network with slow connectivity (e.g., 100 Mbps), AQ-SGD is only 18% slower than an uncompressed approach running in a high-speed datacenter network (e.g., 10 Gbps).
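The core trick, sketched below in PyTorch, is to quantize the difference between a sample's current activations and a cached copy from the previous epoch, rather than the activations themselves; both sides keep the same reconstruction so the caches stay in sync. This is our illustrative reading of AQ-SGD, not its reference implementation, and all names here are invented.

```python
import torch

class ActivationDeltaCompressor:
    """Cache each sample's reconstructed activations; transmit only the
    quantized change between epochs (the central idea of AQ-SGD)."""

    def __init__(self, bits: int = 4):
        self.cache: dict[int, torch.Tensor] = {}
        self.qmax = 2 ** (bits - 1) - 1   # e.g. 7 for 4-bit symmetric quantization

    def compress(self, sample_id: int, activation: torch.Tensor):
        prev = self.cache.get(sample_id, torch.zeros_like(activation))
        delta = activation - prev
        scale = delta.abs().max().clamp(min=1e-8) / self.qmax
        q = torch.round(delta / scale).clamp(-self.qmax, self.qmax)
        # Keep the receiver's reconstruction locally so both sides agree.
        self.cache[sample_id] = prev + q * scale
        return q.to(torch.int8), scale
```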

ZeRO++ leverages quantization, in combination with data and communication remapping, to reduce total communication volume by 4x compared with ZeRO, without impacting model quality.

Combined, these compression measures can reduce the amount of data communicated between GPUs for AI training to less than 5% of the original amount.

The main AI training frameworks today are PyTorch and TensorFlow. Within the PyTorch ecosystem, DeepSpeed already handles inter-GPU communication compression relatively well; the TensorFlow ecosystem does not yet have an equivalent. We are developing a memory-splitting and inter-GPU communication compression program for TensorFlow that combines the compression algorithms above, and it will be available soon.
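For the PyTorch/DeepSpeed path, the ZeRO++ communication reductions discussed above are enabled through configuration. A minimal sketch follows; the flag names mirror the public DeepSpeed ZeRO++ tutorial, but treat the exact values (batch size, stage) as placeholders.

```python
import deepspeed  # pip install deepspeed

# Minimal DeepSpeed config sketch enabling the ZeRO++ communication
# reductions: quantized weights (qwZ), hierarchical partitioning (hpZ),
# and quantized gradients (qgZ). Consult the DeepSpeed ZeRO++ tutorial
# for the authoritative options.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,
        "zero_hierarchical_params_gather": True,
        "zero_quantized_gradients": True,
    },
}

# model = ...  # any torch.nn.Module
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```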

Easy-to-use client

We provide a one-click-install client that includes complete local computing resource management, cluster network registration and connection, and k8s container functions. Novice users without any professional computing background can install and use it directly, contributing computing resources to earn rewards. It has privacy protection features to ensure the security of data inside containers: if a host user attempts to read data from the container, the host's permissions are revoked and it is removed from the cluster.

Architecture

Subsystem

  1. P2P Network

The whole system is built on the P2P network, whose blockchain serves the following functions:

  • Recording the complete lifecycle of a task: creation, execution, and output.

  • Maintaining the network state of all nodes and using blockchain public-key addresses as network transmission addresses, breaking through the flat-topology barriers of traditional centralized cluster networks and their power-supply ceilings.

  • Recording the final billing settlement for each task.

  2. Enhanced Peer-to-Peer Network Layer

The enhanced peer-to-peer network layer is the core function of the DEKUBE network. Its purpose is to build a secure and efficient communication infrastructure, enhancing the peer-to-peer network through blockchain technology to provide reliable data transmission and task collaboration support for the entire system. Its key functions are securing communications, maintaining node network state, and providing a solid foundation for task management and distributed computing.

  3. Kubernetes container management system

A flexible and powerful platform for container orchestration, resource management, and task scheduling, supporting large-scale distributed AI training; a minimal job-submission sketch appears after this subsystem list.

  4. Task Creation system

Parses the diverse resources required for computational tasks, configures the relevant resources and environments, and writes task information to the blockchain, ensuring tamper resistance. The task is then delivered to the task allocation and scheduling system.

  5. Task Preparation system

The task preparation system provides a common configuration, allowing users to easily set up the environment required for computing tasks. The compute node client has privacy protection features to ensure the privacy and security of user data.

  6. Task management and scheduling system

Through the task creation system and the task allocation and scheduling system, users can easily submit computing tasks; the system automatically allocates and schedules computing resources in the cluster for efficient task management.

  7. Monitoring system & security architecture

Node monitoring system and secure network architecture to ensure communication security and system stability through encrypted transmission, storage, and node monitoring.

  8. Client & Interaction system

Connects nodes, provides a user-friendly interface, and supports task management.

  9. Billing system

Records task accounts on the blockchain to provide transparent and fair billing.
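As promised under the Kubernetes subsystem above, here is a minimal, hypothetical sketch of submitting a GPU training task as a Kubernetes Job using the official Python client. The image, entrypoint, job name, and namespace are illustrative placeholders, not DEKUBE's actual values.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() on-cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="dekube-train-demo"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
                        command=["python", "train.py"],  # hypothetical entrypoint
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"}  # request one GPU
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```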

Workflow

Tasks are configured by consumers in DEKUBE according to the process below, paid for, and then released to producers. After a producer completes the computation, the result is returned to the consumer; the system verifies it and distributes the reward.

  1. The Task Preparation System provides convenient support for commonly used task environment components such as images, models, datasets, and output buckets.

  2. The Task Creation System parses the resources required for computing tasks, configures related resources and environments, and writes task information to the blockchain. The task is then delivered to the task allocation and scheduling system.

  3. The Task Allocation System checks the task requirements, assigns appropriate online computing nodes to participate in the computation, and organizes resources in the cluster to ensure that tasks are completed on time.

  4. Validation: The platform checks the task's computation results and verifies tags, data items, and training statistics to ensure that the output is compliant and correct.

  5. Incentives: Computational tasks are recorded and settled on the blockchain.
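To tie the workflow together, the sketch below models the on-chain task lifecycle that steps 1-5 walk through (creation, allocation, validation, settlement) as a simple Python record. Every field and state name here is invented for illustration; the actual on-chain schema is not specified in this document.

```python
from dataclasses import dataclass, field
from enum import Enum

class TaskState(Enum):
    CREATED = "created"      # task info written to the blockchain
    ASSIGNED = "assigned"    # scheduled onto online compute nodes
    RUNNING = "running"
    VALIDATED = "validated"  # outputs checked (tags, data items, stats)
    SETTLED = "settled"      # billing recorded and rewards distributed

@dataclass
class TaskRecord:
    task_id: str
    consumer_pubkey: str     # blockchain address of the task creator
    resources: dict          # e.g. {"gpus": 8, "image": "...", "dataset": "..."}
    state: TaskState = TaskState.CREATED
    output_uri: str = ""
    history: list = field(default_factory=list)  # audit trail of state changes

    def advance(self, new_state: TaskState) -> None:
        self.history.append((self.state, new_state))
        self.state = new_state
```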
