
AI & Computing Power

Preface#

In the past six months, the author has had little desire to update after a breakup. I then followed my research group in studying image recognition and algorithm analysis, autonomous-driving path planning, and intelligent vehicle control algorithms, which stalled updates further. Thanks to encouragement from the Dean of Computer Science and some discussion of recently popular directions, I was asked to give a technical talk related to large models, which reminded me to update my blog.

In this blog post, I will focus on three topics:

  • From cloud computing power to edge computing power.
  • Data privacy in the AI era.
  • Computing power demands in the AI era.

The content includes the latest QwQ, large language model training methods, EXO-based network computing power sharing, and global computing power statistics.

From Cloud Computing Power to Edge Computing Power#

Let's introduce this topic using Apple's product, Apple Intelligence.

In June 2024, at the Worldwide Developers Conference, Apple announced:

The personal intelligent system Apple Intelligence introduces powerful generative models for iPhone, iPad, and Mac.


Voice assistants are the most typical application scenario for natural language large models. After ChatGPT's explosive popularity, Huawei, Xiaomi, and many other companies began training and releasing their own voice assistants based on large language models, while Apple announced an integration with GPT. However, for policy reasons, Apple's AI features still cannot function normally in mainland China to this day.


Until February of this year:

Apple will integrate Alibaba's large language model and other AI services into devices such as the iPhone sold in China. Apple has announced that the Chinese versions of its products will integrate Alibaba's AI model.

CNBC

The decision must have puzzled everyone at the time. Deepseek was at the height of its popularity, so why did Apple integrate Alibaba's Qwen series instead of the open-source Deepseek? Apple's stock even fell when it made this decision, and NVIDIA's stock had already dropped sharply because of Deepseek. Why would Apple make this choice?


Before answering this question, let's focus on Deepseek. It delivers powerful performance, and its advantages are that it is open source and relatively lightweight: with 671B parameters it can compete with the world's most advanced closed-source models. But how much GPU memory is needed to deploy it?


A whopping 700 GB! To deploy the Deepseek R1 671B version with reasonable output speed, we might need about 700 GB of GPU memory, roughly ten A100 GPUs, which could cost around 2 million yuan.
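A quick back-of-envelope calculation (my own illustrative arithmetic, not an official figure) shows where numbers like this come from: weight memory is simply parameter count times bytes per parameter.

```python
# Back-of-envelope weight-memory arithmetic (my own illustration; real
# deployments also need KV cache and activation memory on top of this).

def weights_gb(n_params_billion: float, bytes_per_param: float) -> float:
    # 1 billion parameters * bytes per parameter = gigabytes (decimal)
    return n_params_billion * bytes_per_param

print(weights_gb(671, 1.0))  # DeepSeek R1 671B at FP8  -> 671.0 GB
print(weights_gb(32, 2.0))   # QwQ-32B at FP16          -> 64.0 GB
print(weights_gb(32, 0.5))   # QwQ-32B 4-bit quantized  -> 16.0 GB
```

671 GB of FP8 weights plus cache and activation overhead lands close to the roughly 700 GB cited above, while a 4-bit-quantized 32B model fits comfortably on a single 24 GB consumer GPU.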

This is already far less computing power than GPT-3.5 and GPT-4 required in their day. Back then, GPT could only be used in the cloud, relying on OpenAI's massive computing clusters, which are unattainable for ordinary enterprises. Fortunately, OpenAI's products drove a revolution in this area that allowed Deepseek to be born, and thanks to Deepseek's open-source nature, newer models could stand on the shoulders of giants, leading to the birth of QwQ!


On March 6, 2025, Alibaba's Qwen team released the open-source QwQ model, with only 32B parameters!


With only 32B parameters, the powerful QwQ is enough to rival the full Deepseek R1 671B, and even surpass it! QwQ answers the question of why Apple chose Alibaba: an individual user needs only a single RTX 4090 to run it, cutting costs from 2 million yuan to about 20,000 yuan. But is that the whole story? The media have been full of praise recently, but what is the truth?

The Alibaba team describes QwQ as follows:

This is a model with 32 billion parameters, whose performance can rival that of DeepSeek-R1 with 671 billion parameters (of which 37 billion are activated).

"QwQ-32B: Experience the Power of Reinforcement Learning"

Indeed, QwQ only compared its performance with the 37B activated parameters of Deepseek... right?

Architecture#

What I did not mention at the technical seminar is that QwQ only compared its performance with Deepseek's 37B activated parameters. Yet this is a significant part of what I want to discuss: the architecture of Deepseek and of the QwQ model built on its approach. (Because the sharing session invited two non-specialist teachers and a general audience to listen and ask questions, I had to keep it simple and easy to understand, so many details were omitted.)

Thanks to its Dense architecture, consumer-grade computing devices can run QwQ, while Deepseek uses the MoE architecture (more precisely, a hybrid Mixture-of-Experts design). Judging by the results alone, the Dense architecture seems better. In fact, that is not the case; we need to start from the architectural details.

What are the differences between these two architectures?

The Mixture of Experts (MoE) model is an architecture that divides the model into multiple expert sub-networks and dynamically selects the appropriate expert for computation based on the characteristics of the input data. Each "expert" has strong processing capabilities in a specific area, and MoE intelligently selects the appropriate expert for computation based on task requirements. This mechanism significantly enhances the model's expressiveness and flexibility while ensuring a smaller computational overhead. Especially when dealing with large-scale datasets, the MoE model can avoid redundant computations by precisely selecting different experts to handle specific tasks, thus effectively reducing resource consumption.

To put it simply, the MoE architecture specializes the AI's functions, delegating specific tasks to designated "expert" modules, which is also why Deepseek activates only about 37B parameters during inference.
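The routing idea can be sketched in a few lines (a toy illustration of my own, not DeepSeek's actual implementation): a gating network scores every expert for the input, and only the top-k experts actually compute, so most parameters stay idle on any given token.

```python
import numpy as np

# Toy sketch of MoE top-k routing (an illustration, not DeepSeek's actual
# design): a gating network scores every expert, but only the top-k run.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

gate_w = rng.normal(size=(d_model, n_experts))               # gating network
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ gate_w                          # one score per expert
    top = np.argsort(scores)[-top_k:]            # indices of the top-k experts
    w = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over the k
    # Only the selected experts compute; the others are skipped entirely.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (8,)
```

Here only 2 of the 4 expert matrices are multiplied per input, which is exactly why an MoE model's activated parameter count is far below its total.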


In contrast to the MoE model, the Dense model is a traditional deep neural network architecture. The design philosophy of the Dense model is very simple—every neuron (or computational unit) participates in every computation. Regardless of the difficulty of the task, every parameter of the Dense model is involved in each computation. This allows the Dense model to perform relatively stably on simpler tasks, but it struggles with complex problems.

Because the Dense model does not intelligently select suitable computational units the way MoE does, all of its parameters must be computed and updated at every training step, leading to a massive computational load and storage requirement. The Dense model's computational cost is therefore relatively high, and its efficiency drops significantly on large-scale datasets or complex tasks.

In simple terms, the Dense architecture involves all parameters at once, exerting great force but lacking finesse.


Earlier, I mentioned that the Dense architecture seems superior in results, but architecturally, the MoE architecture should be better. So why does QwQ, using the Dense architecture, surpass Deepseek's MoE architecture?

Starting from performance requirements, the MoE architecture benefits from distributing tasks among expert models, which grants it powerful computational reasoning capabilities but also extremely high hardware requirements, since it needs robust parallel computing. The Dense architecture is the opposite: even running in a CPU + system memory mode, its operational efficiency does not drop significantly.

MoE is like a pampered super student, while Dense is like an easy-to-nurture little one.

In summary, small parameter models using Dense architecture can improve quality, while large parameter models using MoE can enhance efficiency.
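The trade-off can be made concrete with the two models' parameter counts quoted in this post (the comparison itself is my own illustration): a Dense model computes with all of its parameters on every token, while an MoE model must hold all parameters in memory but computes with only the active subset.

```python
# Back-of-envelope comparison using the parameter counts quoted in this post
# (the comparison itself is my own illustration): Dense computes with ALL
# parameters per token; MoE holds all but computes with the active subset.

models = {
    "QwQ-32B (Dense)":   {"total_b": 32,  "active_b": 32},
    "DeepSeek-R1 (MoE)": {"total_b": 671, "active_b": 37},
}

for name, m in models.items():
    share = m["active_b"] / m["total_b"]
    print(f"{name}: holds {m['total_b']}B params, "
          f"computes with {m['active_b']}B per token ({share:.0%} active)")
```

DeepSeek-R1 activates only about 6% of its parameters per token, which is efficient compute-wise but still demands enough memory for all 671B; QwQ's 32B must all compute, but the whole model fits in consumer-grade hardware.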


Beyond the small parameter count compensating for the shortcomings of the Dense architecture, what matters even more for QwQ is its training method.

Technical Details#

The title of the QwQ release page reads: Experience the Power of Reinforcement Learning.


The Qwen team explains: Recent research shows that reinforcement learning can significantly enhance a model's reasoning capabilities. For example, DeepSeek R1 achieves state-of-the-art performance by integrating cold start data and multi-stage training, enabling it to engage in deep thinking and complex reasoning.

Since technical reports on QwQ are still scarce, and its training follows Deepseek's approach, we can interpret why QwQ performs so well by examining some of Deepseek's training methods.

The training process of QwQ-32B is divided into three stages: pre-training, supervised fine-tuning, and reinforcement learning, with reinforcement learning further divided into two key stages, allowing QwQ to achieve performance surpassing Deepseek.

To understand why, let's decode things step by step: what is a "cold start"? What is reinforcement learning? And what are the two key stages?

"Cold Start"

To help everyone better understand cold start, let me give a daily life example.

Big Data Recommendation Algorithm

Recommendation algorithms suggest videos, products, etc., based on each user's preferences.

However, a newly registered account lacks prior accumulated data, making it impossible for the system to make accurate recommendations.

This is "cold start."

When large models are in a cold start, they behave like a "child who knows nothing," constantly making mistakes and generating a bunch of illogical answers, and may even fall into meaningless loops.

"Cold start data" addresses this: in the early stage of training, a small batch of high-quality reasoning data is used to fine-tune the model, akin to providing the AI with a "beginner's guide."

Referring to Deepseek's optimized cold start steps:

  • Generate data from large models – researchers use few-shot prompting to elicit detailed reasoning examples from existing large models.
  • Generate data from Deepseek R1 Zero – Since R1-Zero has some reasoning capabilities, researchers select readable reasoning results from it, reorganizing them as cold start data.
  • Manual screening and optimization – Some data is manually reviewed and optimized for clarity and intuitiveness in the reasoning process.

Ultimately, DeepSeek-R1 used thousands of cold start data for initial fine-tuning before proceeding to reinforcement learning training.
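A cold-start sample might look something like the following (the field names and format here are my own hypothetical illustration; the DeepSeek report does not publish its exact schema): a prompt plus curated, readable reasoning and a final answer.

```python
import json

# Hypothetical shape of a cold-start sample (field names and format are my
# own illustration, not DeepSeek's published schema).

cold_start_sample = {
    "prompt": "A train travels 120 km in 1.5 hours. What is its average speed?",
    "reasoning": "Speed is distance divided by time: 120 / 1.5 = 80.",
    "answer": "80 km/h",
}

def to_training_text(sample: dict) -> str:
    # Fold the curated reasoning into the target text, so a few thousand such
    # examples teach the model to "think before answering".
    return (f"User: {sample['prompt']}\n"
            f"Assistant: <think>{sample['reasoning']}</think>\n"
            f"{sample['answer']}")

print(to_training_text(cold_start_sample))
with open("cold_start.jsonl", "w") as f:  # a small, high-quality batch
    f.write(json.dumps(cold_start_sample) + "\n")
```

The point is the scale: only thousands of such manually screened samples are needed before reinforcement learning takes over.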

Reinforcement Learning#

Machine learning methods mainly include supervised learning and unsupervised learning; reinforcement learning is the third paradigm of machine learning, distinct from both.


The characteristics of reinforcement learning can be summarized in four points:

  • There is no supervisor, only a reward signal.
  • Feedback is delayed rather than immediate.
  • It has a time-series nature.
  • The actions of the intelligent agent affect subsequent data.

Four Basic Elements

A reinforcement learning system generally includes four elements: policy, reward, value, and environment (or model). Next, we will introduce these four elements one by one.

Policy
The policy defines the actions taken by the agent in a given state; in other words, it is a mapping from states to actions. In fact, the state includes both the environment state and the agent state. We summarize the characteristics of the policy in three points:

  • The policy defines the agent's behavior.
  • It is a mapping from states to actions.
  • The policy itself can be a concrete mapping or a random distribution.

Reward
The reward signal defines the objective of the reinforcement learning problem. At each time step, the scalar value sent by the environment to the reinforcement learning agent is the reward. We summarize the characteristics of the reward in three points:

  • The reward is a scalar feedback signal.
  • It reflects how well the agent performed at a certain step.
  • The agent's task is to maximize the total reward accumulated over a period.

Value
Value, or value function, is a very important concept in reinforcement learning. Unlike the immediacy of rewards, the value function measures long-term returns. We often say, "One must be grounded while also looking up at the stars." Evaluating the value function is akin to "looking up at the stars," judging the returns of current actions from a long-term perspective, rather than just focusing on immediate rewards. We summarize the characteristics of the value function in three points:

  • The value function predicts future rewards.
  • It can assess the quality of states.
  • Calculating the value function requires analyzing transitions between states.

Environment (Model)

The model simulates the external environment: given a state and an action, it lets us predict the next state and the corresponding reward. We summarize the characteristics of the model in two points:

  • The model can predict the next performance of the environment.
  • The performance can be reflected through the predicted state and rewards.

Reinforcement learning architecture (this section is adapted from CSDN).
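The four elements can be tied together in a minimal toy loop (a tabular Q-learning walk of my own, not any production algorithm): the environment returns states and scalar rewards, an epsilon-greedy policy maps states to actions, and a value table accumulates long-term return estimates.

```python
import random

# Toy tabular Q-learning on a 1-D walk (my own minimal sketch), tying the
# four elements together: environment -> step(), reward -> scalar r,
# policy -> epsilon-greedy, value -> the Q table of long-term estimates.

n_states, goal = 5, 4
alpha, gamma, epsilon = 0.5, 0.9, 0.2
q = {(s, a): 0.0 for s in range(n_states) for a in (-1, 1)}  # value function

def step(state, action):
    """Environment (model): returns next state, reward, and done flag."""
    nxt = max(0, min(n_states - 1, state + action))
    reward = 1.0 if nxt == goal else 0.0   # scalar reward signal
    return nxt, reward, nxt == goal

def policy(state):
    """Epsilon-greedy mapping from states to actions."""
    if random.random() < epsilon:
        return random.choice((-1, 1))      # explore
    return max((-1, 1), key=lambda a: q[(state, a)])  # exploit

random.seed(0)
for _ in range(200):                       # episodes
    s, done = 0, False
    while not done:
        a = policy(s)
        nxt, r, done = step(s, a)
        best_next = max(q[(nxt, -1)], q[(nxt, 1)])
        # Move the value estimate toward reward + discounted future value.
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        s = nxt

# Near the goal, moving right has learned a higher value than moving left.
print(q[(3, 1)] > q[(3, -1)])
```

Note how the reward is delayed: only the final step pays off, yet the value function propagates that payoff backward so earlier actions are judged by long-term return, not immediate reward.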

Two Key Stages#

First Stage of Reinforcement Learning

This stage focuses on enhancing mathematical and programming capabilities. Starting from the cold-start checkpoint, an outcome-based, reward-driven reinforcement learning scaling method is employed.

Mathematical problem training uses a specialized accuracy validator rather than a traditional reward model.

Programming task training evaluates whether the code passes predefined test cases using a code execution server.
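These outcome-based validators can be sketched as simple rule checks (a hedged illustration of my own; the Qwen team's actual validators are not public): a math answer is checked against a known result, and code earns reward only if its test cases pass.

```python
import subprocess
import sys
import tempfile

# Hedged sketch of outcome-based validators (my own illustration, not the
# Qwen team's actual implementation): exact-match math rewards and a tiny
# "code execution server" that rewards code only when its tests pass.

def math_reward(model_answer: str, correct: str) -> float:
    """Accuracy validator: exact match after trimming whitespace."""
    return 1.0 if model_answer.strip() == correct.strip() else 0.0

def code_reward(code: str, tests: str) -> float:
    """Run code + tests in a subprocess; pass = reward 1, fail = reward 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, timeout=10)
    return 1.0 if result.returncode == 0 else 0.0

print(math_reward("42", " 42 "))   # 1.0
print(code_reward("def add(a, b):\n    return a + b",
                  "assert add(2, 3) == 5"))  # 1.0
```

The appeal of such rule-based rewards over a learned reward model is that they cannot be "gamed": either the answer is right and the tests pass, or they do not.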

Second Stage of Reinforcement Learning

Emphasizes enhancing general capabilities.

The model introduces a general reward model and rule validator for training.

Even with a small number of training steps, it significantly improves instruction following, human preference alignment, and agent performance, achieving enhancements in general capabilities without significantly reducing the mathematical and programming abilities gained in the first stage.

Relying on these techniques, the QwQ model achieves remarkable efficiency, allowing it to compete with large-parameter MoE models and bringing computing power from the cloud to the edge (local devices). But why pursue efficient, low-compute models and return to an edge computing mode at all?

Data Privacy in the AI Era#

Every day, each of us leaves a large amount of data on the internet.

Most of this data involves privacy issues.

In the AI era, protecting personal privacy data has become a significant challenge.

Image source: ChatGPT

Apple has been focusing on user data privacy for a long time.

They emphasize protecting user data privacy, and even experts from Huawei have written thousands of words analyzing Apple's data privacy protection strategies.

Individual users cannot own large computing clusters locally; at present, the only way for them to use large-model features is to connect to the cloud, which poses a severe challenge to data privacy.


For instance, some regions have banned Tesla cars out of concern that road data could be processed overseas, which raises a series of sensitive issues. We are deeply concerned that data used by AI could be transmitted and misused in ways that harm national interests, so protecting data security is crucial.

Running large models on the edge can reduce costs and enhance the protection of privacy data, thus requiring models to be more efficient.

If the model is efficient enough, it can achieve more and higher precision tasks on the edge under the same computing power.

One of the future directions for AI development should be, like QwQ, achieving higher efficiency and lower edge computing power demands.

Computing Power Demands in the AI Era#

Thanks to the development of AI technology, we can now run AI models locally and perform computations using a CPU + memory mode. The open-source project EXO shares computing power by distributing multiple layers of AI models across different devices, further reducing the operational costs of AI.
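The core idea behind EXO-style sharing can be sketched as memory-weighted layer partitioning (my own illustration of the concept, not EXO's actual API): each device hosts a contiguous slice of the model's layers, sized in proportion to the memory it contributes to the pool.

```python
# Memory-weighted layer partitioning, the idea behind EXO-style computing
# power sharing (my own sketch, NOT EXO's actual API): each device takes a
# contiguous slice of layers proportional to the memory it contributes.

def partition_layers(n_layers: int, device_mem_gb: dict) -> dict:
    """Map each device to a half-open [start, end) range of layer indices."""
    total = sum(device_mem_gb.values())
    shards, start = {}, 0
    items = list(device_mem_gb.items())
    for i, (device, mem) in enumerate(items):
        if i == len(items) - 1:
            count = n_layers - start          # last device absorbs rounding
        else:
            count = round(n_layers * mem / total)
        shards[device] = (start, start + count)
        start += count
    return shards

# Three hypothetical devices pooling 48 GB for a 32-layer model:
shards = partition_layers(32, {"laptop": 16, "desktop": 24, "phone": 8})
print(shards)  # {'laptop': (0, 11), 'desktop': (11, 27), 'phone': (27, 32)}
```

During inference, activations then flow device to device down the layer pipeline, so no single machine needs to hold the whole model.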


However, this only allows for operation; it cannot achieve efficient operation and AI training. To accomplish these, we still need extremely high computing power.

As mentioned earlier, no large model or training method can escape its dependence on "computing power." Whether it is the powerful Deepseek with its 700 GB memory requirement, or the efficient QwQ on a consumer-grade 4090 or even a CPU, none of them escapes the reliance on computing resources.

Image source: ChatGPT

Global Electricity Demand Data Comparison#

| Year | Global Electricity Demand | Major Changes and Trends |
| --- | --- | --- |
| 2019 | About 25,000 TWh | China's electricity demand grew about 5%, with total electricity consumption around 7.28–7.41 trillion kWh. |
| 2024 | About 30,000 TWh | Mainly driven by data centers and AI training demands. Power consumption of Google's and Microsoft's data centers has reached 24 TWh, doubling since 2019. |

Global AI Computing Power Demand Data Comparison#

| Year | Global AI Computing Power | Major Changes and Trends |
| --- | --- | --- |
| 2019 | About 10¹⁸ FLOP/s | AI computing power was mainly provided by traditional GPUs and TPUs, with relatively stable growth, not yet in an explosive period. |
| 2024 | About 10²¹ FLOP/s | AI computing power has increased significantly: global machine learning hardware performance is growing about 43% per year, top hardware efficiency doubles every 1.9 years, Google holds over 1 million H100-equivalents of computing power, and the computing capacity supported by NVIDIA doubles on average every 10 months. |

Expected Funding Investments by Major Companies in 2025#

| Company | AI Computing Power (Equivalent H100) | Funding Expenditure | Major Changes and Trends |
| --- | --- | --- | --- |
| Google | Over 1 million | $75 billion (about 540 billion yuan) | For AI infrastructure. |
| Microsoft | 750,000–900,000 | $80 billion (about 576 billion yuan) | For AI data center construction, mainly training AI models and deploying AI applications. |
| Alibaba | 230,000 (NVIDIA GPU) | 150 billion yuan | Mainly for AI and cloud computing infrastructure construction. |
| Tencent | 230,000 (NVIDIA GPU) | 82 billion yuan | For intelligent computing centers, mainly GPU servers and network construction. |

It is evident that with the explosive development of AI, the global demand for computing power has reached unprecedented heights, and electricity demand has also increased significantly. Major manufacturers are intensifying their efforts to seize the computing power heights, highlighting the importance of computing power in the AI era.

At the same time, there has even been a notion that "computing power = national strength"!


This article is synchronized and updated to xLog by Mix Space. The original link is https://fmcf.cc/posts/technology/AIComputingPower
