The launch of ChatGPT in November 2022 opened the eyes of industry players to large language models (LLMs). The resulting frenzy has spread into the crypto space, and this article introduces the development of AI, its current status, and the industry emerging from the combination of AI and crypto.
Machine learning (ML) is a technology that learns from experience: by training on large datasets, it learns to perform specific tasks such as distinguishing animals or translating languages. Machine learning is currently the most practical way of realizing artificial intelligence. Depending on whether the training data is labeled, it can be divided into supervised learning and unsupervised learning.
Many types of models can accomplish supervised learning, including tree-based models, graph models, and neural networks, which have risen to prominence more recently. With the rapid growth of computing power and data, deep learning has developed further on top of neural network architectures. Current deep learning architectures commonly include, but are not limited to, CNNs, RNNs, and attention mechanisms.
Classification of machine learning, source: HashKey Capital
Different deep learning networks share a basic architecture of input layer, hidden layer, and output layer. The input is usually text, video, audio, or other data after tokenization and embedding. The hidden layer has a different design (model shape) depending on the dataset and the purpose of the task, as shown in the table.
Types of Neural Networks, Source: Organized by HashKey Capital
30 years of neural network development, source: organized by HashKey Capital
The training of neural networks on sequences dates back to the mid-1980s, when Jordan trained a neural network to learn sequential patterns in his 1986 paper Serial Order: A Parallel Distributed Processing Approach. The tiny network had only a few neurons.
In the 1990s, Jeffrey Elman expanded this to a 50-neuron network and discovered that the network spatially clusters words based on meaning. For example, it separated inanimate and animate nouns; within these two categories, animate nouns were subdivided into human and nonhuman, and inanimate nouns into breakable and edible. This indicates that the network can learn hierarchical representations.
He further observed that words can be represented as points in a high-dimensional space, and then a sequence of words or sentences can be viewed as a path. This major breakthrough allows textual datasets to be digitized, vectorized, and processed by computers.
Source: http://3b1b.co/neural-networks
In 2011, Confluence researchers trained larger networks with thousands of neurons and millions of connections, and the study found a bottleneck: the network struggled to maintain coherent context over long sequences.
In 2017, OpenAI built on Karpathy’s work by training on 82 million Amazon reviews, in the course of which a sentiment neuron was discovered. This neuron classified the sentiment of the text with high accuracy.
Source: Learning to Generate Reviews and Discovering Sentiment
The 2017 paper Attention Is All You Need presents a solution to the limitation on context size. The paper creates a dynamic network layer that adapts its connection weights based on the context of the input. It works by allowing each word in the input to look at and compare itself with the other words and find the most relevant ones. The closer words are in concept, the closer they are in space and the higher their connection weights can be. However, the paper focused only on the translation problem.
OpenAI researchers therefore applied the transformer architecture at a much larger scale and launched GPT-3 in 2020, which attracted widespread attention from industries around the world. This time the network reached 175B parameters, 96 layers, and a 2,048-token context window.
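The comparison step described above can be sketched in a few lines. This is a minimal illustration of scaled dot-product attention, not the paper's full architecture: the two-dimensional embeddings are made up, and the learned query/key/value projections of a real transformer are replaced by the raw embeddings for brevity.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Each query compares itself with every key and mixes the values accordingly."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # connection weights for this word
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy embeddings: "cat" and "kitten" point in similar directions.
tokens = ["cat", "kitten", "car"]
emb = [[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0]]
outputs, weights = attention(emb, emb, emb)
```

In this toy setup, the row of weights for "cat" assigns more weight to "kitten" than to "car", which is the sense in which conceptually close words get higher connection weights.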
Take the following 28x28 pixel image of a digit as an example. Each neuron corresponds to one pixel of the 28x28 input image, for a total of 784 neurons. The numbers in the neurons are the activation values, which range from 0 to 1.
28x28 pixel digital image, Source: http://3b1b.co/neural-networks
These 784 neurons form the input layer of the network. The final layer is the output layer, which contains ten neurons representing the numbers 0–9, again with activation values ranging from 0–1. The middle layer is the hidden layer, where the activation value of the previous layer determines the activation value of the next layer as the neural network operates.
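A minimal sketch of this forward pass is given below. The hidden layer size of 16, the sigmoid activation, and the random weights and pixels are all assumptions made for illustration; an untrained network like this one produces a meaningless prediction.

```python
import math
import random

def sigmoid(z):
    # Squashes any real number into the 0-1 activation range.
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    # Each neuron's activation depends on all activations of the previous layer.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
w1, b1 = init_layer(784, 16, rng)  # input layer -> hidden layer (size 16 assumed)
w2, b2 = init_layer(16, 10, rng)   # hidden layer -> ten output neurons

pixels = [rng.random() for _ in range(784)]  # stand-in for a flattened 28x28 image
hidden = dense(pixels, w1, b1)
output = dense(hidden, w2, b2)
predicted_digit = output.index(max(output))  # most activated output neuron
```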
The depth of deep learning lies in the fact that the model learns many “layers” of transformations, each with a different representation. As shown in the figure below, for the digit 9, different layers can recognize different features. The closer a layer is to the input, the lower-level the details of the data it captures; the closer it is to the output, the more specific the concepts it can use to differentiate.
Source: http://3b1b.co/neural-networks
As the model gets bigger, the hidden layers in the middle involve hundreds of billions of weights in total, and it is these weights and biases that determine what the network actually does. The process of machine learning is the process of finding the right parameters, namely the weights and biases.
The transformer architecture used in GPT, a large language model, has hidden layers made up of stacked decoder modules: GPT1, GPT2, and GPT3 have 12, 48, and 96 layers, respectively. Each decoder in turn contains an attention component and a feed-forward neural network component.
The computing or learning process involves defining a cost function (or loss function) that sums the squares of the differences between the network’s computed output predictions and the actual values, and when the sum is small, the model performs within acceptable limits.
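As a small illustration of the cost function described above, the sketch below sums the squared differences between a ten-neuron output and the target for the digit 9. The two prediction vectors are made-up values.

```python
def squared_error_cost(predictions, targets):
    # Sum of squared differences between predicted and actual values.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))

# Target for the digit 9: only the last output neuron should be active.
target = [0.0] * 10
target[9] = 1.0

# Hypothetical outputs from a well-trained and an untrained network.
confident = [0.05] * 10
confident[9] = 0.9
undecided = [0.5] * 10

cost_confident = squared_error_cost(confident, target)
cost_undecided = squared_error_cost(undecided, target)
```

The confident prediction yields a much smaller cost than the undecided one, which is what "performing within acceptable limits" means in practice.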
Training starts by randomly initializing the network’s parameters and ends when parameters that minimize the cost function have been found. The cost function is minimized by gradient descent, which examines how strongly a change in each parameter affects the cost/loss and then adjusts the parameters according to that degree of impact.
The process of calculating the parameter gradient introduces backward propagation or backpropagation, which traverses the network from the output layer to the input layer in reverse order according to the chain rule. The algorithm also requires the storage of any intermediate variables (partial derivatives) needed to compute the gradient.
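The loop below is a minimal sketch of gradient descent with the chain rule, using a one-weight, one-bias model rather than a deep network. The data (points on y = 2x + 1), the learning rate, and the step count are illustrative assumptions, and the parameters start at zero instead of random values so that the run is reproducible.

```python
def train(xs, ys, lr=0.05, steps=2000):
    w, b = 0.0, 0.0  # zeros instead of random initialization, for determinism
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            y_hat = w * x + b                # forward pass
            d_cost_d_yhat = 2 * (y_hat - y)  # derivative of the squared error
            grad_w += d_cost_d_yhat * x      # chain rule: d(y_hat)/dw = x
            grad_b += d_cost_d_yhat          # chain rule: d(y_hat)/db = 1
        # Move each parameter against its average gradient.
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = train(xs, ys)
```

In a deep network the same chain rule is applied layer by layer from the output back to the input, which is why the intermediate values of the forward pass have to be stored.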
Three main factors affect the performance of large language models during training: the number of model parameters, the dataset size, and the amount of compute.
Source: OpenAI report, Scaling Laws for Neural Language Models
This is consistent with the development of datasets and computers (computing power) in reality, but it can also be seen in the table below that computing power is growing faster than available data, while memory is the slowest to develop.
The development of dataset, memory and computing power, Source: https://github.com/d2l-ai
With a large model, overfitting tends to occur when the training dataset is too small, and in general the accuracy of a more complex model improves as the amount of data increases. The data requirement for a large model can be estimated with the rule of 10, which suggests that the amount of data should be 10 times the number of parameters, although some deep learning algorithms apply a 1:1 ratio.
Supervised learning requires the use of labeled + featured datasets to arrive at valid results.
Source: Fashion-MNIST Clothing Categorization Dataset
Despite the rapid growth of data over the past decade or two and the availability of open datasets from sources such as Kaggle, Azure, AWS, and Google, data that is limited, scarce, and expensive is gradually becoming a bottleneck for AI development because of privacy issues, growing model parameter counts, and data reproducibility. Different data solutions have been proposed to alleviate this problem.
Data augmentation techniques may be an effective solution: when data is insufficient, they provide the model with additional samples without acquiring new ones, through operations such as scaling, rotation, reflection, cropping, translation, adding Gaussian noise, and mixup.
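Three of the operations listed above (reflection, translation, and Gaussian noise) can be sketched on a tiny grayscale image stored as nested lists. The 3x3 image, the shift of one pixel, and the noise level are made-up values for illustration.

```python
import random

def flip_horizontal(image):
    # Reflection: mirror every row.
    return [row[::-1] for row in image]

def translate_right(image, shift, fill=0.0):
    # Translation: shift pixels right and pad the left edge.
    return [[fill] * shift + row[:len(row) - shift] for row in image]

def add_gaussian_noise(image, sigma, rng):
    # Add noise, then clamp back into the valid 0-1 pixel range.
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]

image = [
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
]
rng = random.Random(42)
augmented = [
    flip_horizontal(image),
    translate_right(image, 1),
    add_gaussian_noise(image, 0.05, rng),
]
```

One original sample becomes four training samples (the original plus three variants) without any new data being collected.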
Synthetic data is another option. Synthetic data are data that can be artificially generated by computer simulation or algorithms with or without a previous reference dataset. Regarding the development of tools for generating synthetic data, Ian J. Goodfellow invented the Generative Adversarial Network (GAN), which is a deep learning architecture.
A GAN trains two neural networks to compete with each other so that it can generate new, more realistic data from a given training dataset. The architecture supports generating images, filling in missing information, generating training data for other models, generating 3D models based on 2D data, and more.
It is still early in the development of the field, with most of the existing companies doing synthetic data being founded in 2021 or 2022, and a few in 2023.
The state of financing for synthetic data companies. Source: https://frontline.vc/blog/synthetic-data/
The AI training process involves a large number of matrix operations, from word embedding and the transformer QKV matrices to softmax operations. The model parameters themselves are also stored in matrices.
Example of vector database, Source: https://x.com/ProfTomYeh/status/1795076707386360227
Large models bring massive computer hardware demand, which is mainly categorized into training and inference.
Training can be further divided into pre-training and fine-tuning. As mentioned before, building a network model first requires randomly initializing the parameters, then training the network and continuously adjusting the parameters until the network’s loss reaches an acceptable range. The difference between pre-training and fine-tuning is that pre-training starts with every layer’s parameters randomly initialized, while fine-tuning can directly reuse the parameters of a previously trained model as the initial parameters for some layers (freezing the parameters of the earlier layers) and then train on a specific dataset.
Source: https://d2l.ai/chapter_computer-vision/fine-tuning.html
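The difference can be sketched with a toy parameter store. The layer names, weights, and gradients below are hypothetical; the point is only that a fine-tuning step skips frozen layers and updates the rest.

```python
import copy

def sgd_step(layers, grads, lr):
    # Update every layer except those marked as frozen.
    for layer, grad in zip(layers, grads):
        if layer["frozen"]:
            continue
        layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grad)]

# Hypothetical result of pre-training (all layers were trainable).
pretrained = [
    {"name": "features", "weights": [0.4, -0.2], "frozen": False},
    {"name": "head", "weights": [0.1, 0.3], "frozen": False},
]

# Fine-tuning: reuse the pre-trained parameters, freeze the earlier layer,
# and reset the output head for the new task.
finetune = copy.deepcopy(pretrained)
finetune[0]["frozen"] = True
finetune[1]["weights"] = [0.0, 0.0]

grads = [[1.0, 1.0], [1.0, -1.0]]  # made-up gradients from the new dataset
sgd_step(finetune, grads, lr=0.1)
```

After the step, the frozen "features" layer still holds its pre-trained weights, while only the "head" layer has moved.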
Pre-training and fine-tuning both change the model parameters and ultimately produce an optimized model, while inference loads a trained model, runs the computation on the user’s input, and returns the output.
Pre-training, fine-tuning, and inference rank from largest to smallest in their compute requirements. The following table compares the hardware requirements of training and inference. The two differ significantly in computing power, memory, and communication/bandwidth because of differences in the computation process and accuracy requirements, and at the same time computing power, memory, and communication/bandwidth form an Impossible Trilemma.
*The statistical measurements in this table are based on a single model processing a single token, a single parameter.
*FLOPs: floating-point operations, a measure of the amount of matrix computation (not to be confused with FLOPS, floating-point operations per second).
*DP, TP, PP: data parallel, tensor parallel, pipeline parallel.
Computer hardware comparison between training and inferencing, Source: Organized by HashKey Capital
The process of training a neural network requires alternating between forward and backward propagation, using the gradient given by the backward propagation to update the model parameters. Inference, on the other hand, requires only forward propagation. This difference becomes an influencing factor that primarily differentiates the computer hardware resources requirements for training and inference.
In terms of computing power, as shown in the table, there is a simple multiplicative relationship between the number of model parameters and computing power consumption: training requires 6–8 floating-point operations per parameter per token and inference requires 2. This is due to the backpropagation involved in training, which requires twice as much computation as forward propagation, so training consumes much more computing power than inference.
In terms of memory, the backpropagation used in training reuses the intermediate values stored during forward propagation in order to avoid repeated computation. The training process therefore needs to keep the intermediate values until backpropagation is completed. Memory consumption during training mainly consists of model parameters, intermediate activation values generated during the forward computation, gradients generated by backpropagation, and optimizer states. The inference stage needs neither backpropagation nor optimizer states and gradients, so its memory consumption is much smaller than that of training.
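The multiplicative relationship can be turned into a back-of-the-envelope estimate. The sketch below uses the lower bound of 6 operations per parameter per token for training and 2 for inference; the 175B parameter count is GPT-3's, while the 300B training-token figure is an assumption brought in for illustration and does not come from this article.

```python
def training_flops(n_params, n_tokens, ops_per_param_token=6):
    # Forward pass (about 2) plus backward pass (about 4) per parameter per token.
    return ops_per_param_token * n_params * n_tokens

def inference_flops(n_params, n_tokens, ops_per_param_token=2):
    # Forward pass only.
    return ops_per_param_token * n_params * n_tokens

n_params = 175e9          # GPT-3 scale
n_train_tokens = 300e9    # assumed training set size

total_training = training_flops(n_params, n_train_tokens)
per_token_inference = inference_flops(n_params, 1)
ratio = training_flops(n_params, 1) / inference_flops(n_params, 1)
```

Under these assumptions, training comes to roughly 3.15e23 floating-point operations, and processing one token costs three times as much in training as in inference.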
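A rough estimate of the gap is sketched below. The byte counts are assumptions, not figures from the table: 2 bytes per parameter for half-precision weights, 2 bytes for gradients, and 12 bytes for the optimizer states of a mixed-precision Adam setup. Activation memory is left out because it depends on batch size and sequence length.

```python
GB = 1e9  # decimal gigabytes

def training_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    # Parameters + gradients + optimizer states (activations excluded).
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes) / GB

def inference_memory_gb(n_params, weight_bytes=2):
    # Only the parameters need to stay resident (activations excluded).
    return n_params * weight_bytes / GB

n_params = 7e9  # a hypothetical 7B-parameter model
train_gb = training_memory_gb(n_params)
infer_gb = inference_memory_gb(n_params)
```

Under these assumptions a 7B-parameter model needs about 112 GB for training state versus about 14 GB for inference weights, before activations are counted.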
In terms of communication/bandwidth, in order to improve AI training performance, mainstream model training usually uses three parallel strategies: data parallel, tensor parallel, and pipeline parallel.
Source: OpenAI, https://openai.com/index/techniques-for-training-large-neural-networks/
Of these three strategies, TP is estimated to have the highest communication frequency and the largest communication volume, which depend on the number of tokens, the model width, and the number of layers. The communication volume and frequency of PP are smaller than those of TP and depend on the number of tokens and the width of the model. The communication volume and frequency of DP are the smallest and are independent of the input tokens.
Hardware performance for large models is limited mainly by computing power, bandwidth/communication, and memory, and the three constrain one another, resulting in the Impossible Trilemma. For example, because of communication bottlenecks, cluster performance cannot be improved simply by increasing the power of a single machine.
Therefore, although parallel architectures are used to accelerate cluster performance, most of them actually sacrifice communication or storage for computing power.
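The DP case can be sketched with a one-parameter model. Each worker computes a gradient on its own data shard, and the workers then exchange only that gradient; the averaging function below stands in for the all-reduce step of a real cluster. The data and the two-worker split are illustrative assumptions.

```python
def local_gradient(w, shard):
    # Gradient of the mean squared error of y_hat = w * x on one shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    # Stand-in for the collective operation that averages gradients across workers.
    return sum(values) / len(values)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]  # two workers with equally sized shards

w = 0.0
local = [local_gradient(w, shard) for shard in shards]
synced = all_reduce_mean(local)           # what every worker applies
single_device = local_gradient(w, data)   # reference: one device, full batch
```

Each worker sends one number per model parameter no matter how many samples or tokens its shard holds, which is why DP communication does not depend on the input tokens. With equally sized shards, the synchronized gradient matches the single-device one.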
Sacrificing communication and storage for computing power:
In PP, if a GPU is assigned to each transformer layer, computing power per unit of time increases, but so do the communication requirements between layers, resulting in more data transferred and higher latency. The storage required for intermediate states of forward propagation also grows extremely fast.
Sacrificing communication for computing power:
In TP, each transformer layer is split up for parallel computation. Since a transformer layer consists of two components (attention heads and a feed-forward network), the work can be split within the layer for either the attention heads or the feed-forward network. This approach can alleviate the problem of having too many PP stages when the model does not fit on the GPUs. However, it still has serious communication overhead.
In this article, we see the following major categories of AI in the crypto field:
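A minimal sketch of splitting one layer's weight matrix across two devices is shown below. The matrices are made-up values and the "devices" are simply list entries; in a real cluster the final concatenation is a network transfer, which is where the communication overhead comes from.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_columns(matrix, parts):
    # Give each device a contiguous block of columns of the weight matrix.
    width = len(matrix[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in matrix]
            for p in range(parts)]

def concat_columns(blocks):
    # Gather step: stitch the partial outputs back together row by row.
    return [sum((block[i] for block in blocks), []) for i in range(len(blocks[0]))]

x = [[1.0, 2.0], [3.0, 4.0]]            # activations entering the layer
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]              # the layer's weight matrix

shards = split_columns(w, 2)                      # one shard per device
partials = [matmul(x, shard) for shard in shards]  # computed in parallel
y_parallel = concat_columns(partials)
y_reference = matmul(x, w)
```

The sharded result equals the single-device result, but every layer now needs a gather across devices, so the communication happens once per layer rather than once per training step.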
Source: Organized by HashKey Capital
As mentioned earlier, the three most critical components in AI are data, models, and computing power, which serve as the infrastructure that empowers crypto AI.
Combined, they form a computing network, with a large amount of middleware appearing in the computation process to make it efficient and more in line with the crypto spirit. Downstream are Agents built on these verifiable results, which can serve different roles for different user audiences.
Another flowchart can be used to express the basic ecology of crypto AI as follows:
Ecological flowchart, source: organized by HashKey Capital
Of course, tokenomic mechanisms are needed in the crypto space to incentivize coordinating the participation of different players.
For datasets, one can choose between public data sources or one’s own specific private data sources.
Data Source:
Synthetic Data Platform:
Others:
Data labeling service platforms assign labeling tasks to different workers, who receive token incentives after completing them; examples include Cropo and Public AI. However, the current problem is that there are more people doing data labeling than there is data to label, while AI companies already have stable labeling suppliers for their needs. This stickiness makes them reluctant to switch to decentralized platforms, which may only be able to pick up the remaining share of orders from the existing labeling suppliers.
Generalized computing networks aggregate resources such as GPUs and CPUs to provide general-purpose computing services, meaning that they make no distinction between training and inference.
In the crypto space, Gensyn, backed by a16z, proposes a decentralized computing network for training.
The process is as follows: after a user submits a training task, the platform analyzes it, evaluates the required computing power, and splits it into the smallest units of ML work. The verifier periodically picks up the analyzed tasks to generate thresholds for the comparison of downstream learning proofs.
Once the task enters the training phase, it is executed by the Solver, which periodically stores the model weights and response indexes from the training dataset and generates the learning proofs. The verifier also performs computational work, rerunning some of the proofs and performing distance calculations to verify that they match. Whistleblowers perform arbitration based on a graph-based pinpoint challenge program to check whether the verification work was performed correctly.
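The rerun-and-compare idea can be sketched as follows. This is a hypothetical toy and not Gensyn's actual protocol: the training step, the checkpoint format, the distance metric, and the threshold are all assumptions chosen to make the flow visible.

```python
import math

def train_step(weights, batch, lr=0.1):
    # Toy update: move every weight toward the batch mean.
    target = sum(batch) / len(batch)
    return [w + lr * (target - w) for w in weights]

def run_solver(weights, batches):
    # The solver stores a checkpoint after every step as its "proof".
    checkpoints = [list(weights)]
    for batch in batches:
        weights = train_step(weights, batch)
        checkpoints.append(list(weights))
    return checkpoints

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def verify_segment(checkpoints, batches, index, threshold):
    # The verifier reruns one step and checks it lands close to the claimed result.
    recomputed = train_step(checkpoints[index], batches[index])
    return distance(recomputed, checkpoints[index + 1]) <= threshold

batches = [[1.0, 2.0], [3.0, 5.0], [0.0, 4.0]]
honest = run_solver([0.0, 0.0], batches)

forged = [list(c) for c in honest]
forged[2] = [9.0, 9.0]  # a solver claiming work it did not do

honest_ok = verify_segment(honest, batches, 1, threshold=1e-6)
forged_ok = verify_segment(forged, batches, 1, threshold=1e-6)
```

A threshold is used instead of an exact match because reruns on different hardware may differ slightly in their floating-point results.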
Fine-tuning is easier and less costly to implement than directly pre-training a large model, simply by fine-tuning the pre-trained model with a specific dataset, and adapting the model to a specific task while preserving the original model.
Hugging Face can be connected to the distributed platform as a provider of pre-trained language models. The user selects the model to be fine-tuned according to the task requirements and then uses the GPUs and other resources provided by the computing network to fine-tune it. The size of the dataset and the complexity of the model need to be determined from the complexity of the task, which in turn determines whether higher-end resources such as the A100 are needed.
In addition to Gensyn, a platform that can support pre-training, most computing platforms can also support fine-tuning.
Compared to training (pre-training and fine-tuning), which requires tuning of model parameters, the computational process of inference involves only forward propagation and requires less computing power. Most decentralized computing networks currently focus on inference services.
By the time inference is carried out, the model is already in use, and middleware can be introduced at this stage:
On-chain smart contract to retrieve the results of off-chain AI computes:
Another layer of privacy can be added to the computing network, which mainly includes data privacy and model privacy, where data privacy is far more important than model privacy.
Most computing networks build their own validation systems to ensure that the system runs accurately, a component that has not yet been introduced in the traditional AI field.
The main role of ZK proof is the following 2 points:
Modulus Labs has shown that it is possible to create proofs for 18 million parameter models in 60–70 seconds using Polygon’s Plonky2 proof system. For small models, ZKML is usable at this stage, but the cost is still significant:
Source: https://medium.com/@ModulusLabs/chapter-5-the-cost-of-intelligence-da26dbf93307
Given the limitations of ZKML described above, OPML is an alternative. Although weaker than ZKML in terms of security, its memory consumption and proof computation time are significantly better. According to the ORA report, for the same 7B-LLaMA model (with a model size of about 26GB), opML can run within 32GB of memory, whereas the memory consumption of the circuits in zkML can be on the order of terabytes or even petabytes.
Trusted Execution Environment provides hardware-level security and can be an alternative to ZKML and OPML. TEE-proof is generated as a result of internal computation within TEE and its computational cost is much lower than that of zk-proof. Also, the proof size of TEE is usually a fixed constant (signature length) and thus has the advantage of a smaller footprint and lower cost of on-chain validation.
In addition to verification, TEE has the advantage of keeping sensitive data isolated, ensuring that external processes or computations cannot access or alter the data within it.
Projects that use TEE include:
Source: https://arxiv.org/pdf/2401.17555, Marlin Protocol
ORA protocol has also developed opp/ai (Optimistic Privacy-Preserving AI on Blockchain) alongside its own ZKML and OPML validation; it is not included in the above comparison table.
An Agent can analyze incoming information, evaluate the current environmental conditions, and make decisions. Its composition is shown in the following figure. The LLM is the core component; in addition, the appropriate prompt must be fed to the LLM, and Memory stores short-term data and long-term historical data (external data).
Since complex tasks cannot be completed in one step, they need to be split into smaller tasks by the Plan component. The Agent can also call external APIs to obtain additional capabilities, including current information, code execution, and access to proprietary information sources.
Source: A Survey on Large Language Model based Autonomous Agents
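The interplay of planning, tool calls, and memory can be sketched with a stubbed planner. Everything here is hypothetical: `stub_planner` stands in for an LLM call, and the two tools are placeholders for external APIs.

```python
class Agent:
    def __init__(self, planner, tools):
        self.planner = planner  # stand-in for the LLM that produces a plan
        self.tools = tools      # external APIs the agent may call
        self.memory = []        # record of every step taken

    def run(self, goal):
        # Plan: split the goal into smaller steps, then execute them in order.
        for tool_name, argument in self.planner(goal):
            result = self.tools[tool_name](argument)
            self.memory.append((tool_name, argument, result))
        return self.memory[-1][2]

def stub_planner(goal):
    # A real agent would prompt an LLM here; this plan is hard-coded.
    return [("search", goal), ("word_count", goal)]

tools = {
    "search": lambda query: f"stub result for: {query}",
    "word_count": lambda text: len(text.split()),
}

agent = Agent(stub_planner, tools)
answer = agent.run("check lending pool health")
```

Swapping the stub for a real model and real APIs changes the quality of the plan and the results, but not the shape of the loop.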
The decision-making ability of Agents saw no real breakthrough until the emergence of large language models (LLMs) in recent years. A report has collated the number of papers published on Agents from 2021 to 2023, as shown in the figure below: there were only about a dozen research papers in 2021, but hundreds were published in 2023. The paper categorizes Agents into 7 categories.
Source: A Survey on Large Language Model based Autonomous Agents
In web3, the scenarios in which Agents exist are still limited compared to the web2 world, and currently include automated clearing, constructing code components (writing smart contracts, writing zk circuits), real-time risk control, and executing strategies such as arbitrage and yield farming.
Different Agents can be combined, abstracted, or assembled into a specific application, and some coordination platforms let users choose which Agents to use to build a particular type of application. Most of these platforms, however, are limited to the development of Agents.
Some developers use AI to make their platforms smarter. For example, security projects use machine learning to identify attack vulnerabilities, DeFi protocols use AI to build real-time monitoring tools, and data analytics platforms use AI to help with data cleaning and analysis.
In this article, we would like to highlight the following 3 points:
In crypto, the emergence of many computing networks inevitably makes users feel that GPU equals AI. But as analyzed in the previous sections, computing networks face an impossible trilemma of computing power, bandwidth/communication, and memory, and model training relies on three parallel strategies (data parallel, tensor parallel, and pipeline parallel). All of these point to the checks and balances that constrain how a computing network can be set up.
The reason behind the fact that the same model and data do not necessarily yield the same result is the use of floating point computation. This difference in computation also has an impact on the construction of the computing network.
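A two-line experiment shows the effect: floating-point addition is not associative, so the order in which hardware sums the same numbers can change the last bits of the result.

```python
import math

# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

identical = left_to_right == right_to_left
close_enough = math.isclose(left_to_right, right_to_left)
```

The two sums are not bit-for-bit identical, yet they agree within a small tolerance. Parallel hardware may reorder such additions across millions of terms, which is one reason result verification tends to compare within a threshold rather than require exact equality.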
AI Agents have only begun to show more utility in recent years, and we expect more Agents to appear in the market. But how Agents work in crypto or how to find the right token incentives remains a challenge.
AI into Crypto was originally published in HashKey Capital Insights on Medium.
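[Disclaimer] Markets carry risk, and investment requires caution. This article does not constitute investment advice, and readers should consider whether any opinions, views, or conclusions in it suit their particular circumstances. Anyone investing on this basis does so at their own risk.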