It feels like the current path toward AGI resembles four great schools joining forces to “besiege the Bright Peak.”
If the question is whether diffusion models form a school of their own, the answer is definitely yes. If we think of the Convolutional Neural Network (CNN) lineage that Fei-Fei Li's ImageNet work propelled as a biological school (simulating the eye and visual cortex), and Demis Hassabis's Reinforcement Learning (RL) as a psychology/behaviorist school (simulating reward, punishment, and gameplay), then diffusion models indeed belong to an entirely new and strikingly romantic school: the school of physics.
Beyond these two, there is another absolute ruler today—the language/structure school (Transformer).
We can look at today’s AI landscape as four major factions jointly attacking the Bright Peak (the road to AGI):
1. The Physics School: Diffusion Models
Representatives: Jascha Sohl-Dickstein, Yang Song
Core Philosophy: “Entropy increase and time reversal”
Diffusion models draw inspiration not from the brain (neurons), nor from behavior (games), but from non-equilibrium thermodynamics.
● Principle:
Imagine a drop of ink falling into a glass of clear water. The ink gradually diffuses until the entire glass becomes cloudy—this is a physical process of entropy increasing.
● What AI does:
The logic of diffusion models is: if I can learn the reverse of this diffusion (clouding) process, I can run the video backwards and reconstruct the original ink drop—i.e., generate a clean image from pure noise.
● Why is this a standalone school?
- Fei-Fei Li’s CNN school focuses on discrimination (“Is this a cat or a dog?”).
- The diffusion school focuses on generation (creating something from nothing). It embraces the idea that “to create is to understand” (Feynman: What I cannot create, I do not understand).
- It does not rely on labels of right and wrong; instead, it learns physical distributions. Sora, Midjourney, and Stable Diffusion all come from this school.
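The forward "clouding" process above has a simple closed form in standard DDPM-style diffusion: at step t, the sample is a weighted mix of the original signal and Gaussian noise, with the signal weight shrinking toward zero. Here is a minimal NumPy sketch of that forward process on a toy 1-D "image"; the linear beta schedule and all dimensions are illustrative choices, not the settings of any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 1-D signal standing in for the ink drop.
x0 = np.sin(np.linspace(0, 2 * np.pi, 64))

# Linear variance schedule beta_1..beta_T (illustrative values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative signal-retention factor

def q_sample(x0, t):
    """Forward diffusion in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x_early = q_sample(x0, 10)     # still mostly signal
x_late = q_sample(x0, T - 1)   # almost pure noise: alpha_bar_T is near 0
print(alpha_bars[-1])
```

Training then amounts to learning to undo one small step of this process (predicting the added noise), which is the "running the video backwards" described above.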
2. The Language/Structure School: Transformer
Representatives: Ashish Vaswani, Ilya Sutskever
Core Philosophy: “Attention is all you need”
This is the true backbone of modern large language models (ChatGPT, Gemini). It is neither Fei-Fei Li’s CNN nor Hassabis’s RL; it introduced the revolutionary attention mechanism.
● Key differences:
- CNNs perceive things locally—when identifying a cat, they first look at edges, then eyes, then assemble them.
- Transformers are globally associative. When they see a sentence or an image, they “scan everything at once,” instantly computing relationship weights among all elements.
For example, given the word “apple,” a Transformer simultaneously recalls both “phone” and “fruit,” and decides which one to focus on from context.
● Status:
This is the current "grand unified architecture." Even vision models (ViT) and DeepMind-style decision models are being rebuilt with Transformers as the foundation.
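The "relationship weights among all elements" mentioned above are computed by scaled dot-product attention, the core operation the Transformer paper introduced. A minimal NumPy sketch, with toy dimensions and random queries/keys/values in place of learned projections:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one row of weights per token
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8  # e.g. a 4-token sentence; dimensions are illustrative
Q = rng.standard_normal((n_tokens, d_k))
K = rng.standard_normal((n_tokens, d_k))
V = rng.standard_normal((n_tokens, d_k))

out, weights = attention(Q, K, V)
print(weights.sum(axis=1))  # each token's weights over all tokens sum to 1
```

Each row of `weights` is exactly the "scan everything at once" step: token i's distribution of attention over every other token, computed in parallel rather than locally and sequentially as in a CNN.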
3. The Return of Symbolic/Logic School: System-2 Thinking
Representatives: Yann LeCun (Meta); OpenAI's o1 line of models also moves in this direction
Core Philosophy: “Intuition is not enough—reasoning is required”
Fei-Fei Li's vision models and ChatGPT's conversational abilities are both essentially System 1 thinking: fast, intuitive pattern-matching rather than deliberate reasoning.
● ChatGPT predicts the next word based on probability; it does not truly reason logically.
● New directions (e.g., OpenAI o1) aim to integrate symbolic logic and tree-search methods.
● The key idea: this school is not satisfied with being “plausible”—it wants to be correct. It seeks to reintroduce mathematical rigor and explicit reasoning into the neural-network black box.
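The difference between "plausible" and "correct" is easy to see in code. A System-1-style model emits one likely-looking answer; a System-2-style search enumerates candidates and only returns one it can verify. Below is a toy illustration of the search side: an exhaustive solver for a small arithmetic puzzle (the puzzle, operator set, and left-to-right evaluation are all assumptions made for the sketch, not any lab's actual method):

```python
from itertools import permutations, product

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def solve(numbers, target):
    """System-2 style: search every ordering of the numbers and every
    operator sequence, evaluated left to right, and return only a
    derivation that provably hits the target (or None)."""
    for nums in permutations(numbers):
        for op_seq in product(OPS, repeat=len(numbers) - 1):
            acc = nums[0]
            for op, n in zip(op_seq, nums[1:]):
                acc = OPS[op](acc, n)
            if acc == target:
                return nums, op_seq  # verified, not merely plausible
    return None

print(solve((3, 5, 2), 16))  # e.g. 3 + 5 then * 2 = 16
```

The point is not the puzzle but the guarantee: every answer this returns has been checked against the goal, which is exactly the rigor this school wants to bring back inside the neural-network black box.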
Summary: The Four Puzzle Pieces of AGI
If we think of AGI as a perfect “human,” these four schools each build one major component:
- Fei-Fei Li (CNN/ViT): gives it eyes (visual perception).
- Transformer (LLM): gives it a cerebral cortex (knowledge & association).
- Hassabis (RL): gives it a cerebellum and hands (decision-making & action).
- Diffusion Models: give it imagination (creation & dreaming).
Deep Insight
The rise of diffusion models (the physics school) marks AI’s transition from “understanding the world” to “constructing the world.”
This also explains why Fei-Fei Li’s newly founded World Labs is so important—she is attempting to combine perception (her original expertise) with generation (diffusion) to create a 3D world built on physical laws.