1. On Evaluation Systems
The rapid advancement of AI suggests that in the future, humanity’s core task may shift to defining evaluation systems, while AI handles the rest.
This week, OpenAI released a benchmark report showing that o3 and o4-mini lead all major LLM benchmarks, surpassing Gemini 2.5 Pro and Claude 3.7 Sonnet.

For any system, establishing a clear evaluation framework is crucial—only with measurable metrics can a system form a closed-loop feedback mechanism, enabling rapid self-iteration. The design of these metrics reflects deep strategic thinking about the system’s purpose.
A recent blog post by an OpenAI researcher (The Second Half) argues that AI is entering its “second half”: a shift from “solving problems” to “defining problems.” The core challenge of the new era is that evaluation matters more than training. Instead of asking, “Can we train a model to solve X?”, we must now ask, “What should AI solve, and how do we measure real progress?” This demands a product manager’s mindset when reshaping AI capabilities.
2. Different Metrics Drive Different Business Outcomes
Taobao once dominated China’s e-commerce market, but Pinduoduo emerged to capture nearly half of Taobao’s market share. A key factor? Divergent core metrics.
- Taobao prioritized UV Value (User Visit Value):
  - UV Value = Conversion Rate × Average Order Value
  - At one point, Taobao accepted lower conversion rates if it led to higher order values, optimizing for premium users.
- Pinduoduo, however, focused on order volume:
  - Its strategy maximized conversion rates through aggressive low pricing.
  - While this didn’t maximize UV Value, it fueled rapid user acquisition and scale.
This case exemplifies how metrics define strategy in platform economics:
- UV Value → “Precision operations” (monetizing existing users).
- Order Volume → “Growth hacking” (expanding market share).
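The trade-off between the two metrics can be made concrete with a small sketch. The traffic, conversion rates, and order values below are hypothetical numbers chosen only for illustration; they are not Taobao or Pinduoduo data, and the `Funnel` class is an invented helper.

```python
from dataclasses import dataclass


@dataclass
class Funnel:
    """Hypothetical storefront funnel; all figures are illustrative."""
    visitors: int
    conversion_rate: float   # orders per visit
    avg_order_value: float   # revenue per order

    @property
    def uv_value(self) -> float:
        # UV Value = Conversion Rate x Average Order Value
        return self.conversion_rate * self.avg_order_value

    @property
    def orders(self) -> float:
        # Order volume = visitors x conversion rate
        return self.visitors * self.conversion_rate


# Same traffic, two strategies (made-up numbers).
premium = Funnel(visitors=10_000, conversion_rate=0.02, avg_order_value=300.0)
low_price = Funnel(visitors=10_000, conversion_rate=0.10, avg_order_value=40.0)

for name, funnel in [("premium", premium), ("low_price", low_price)]:
    print(f"{name}: UV value = {funnel.uv_value:.2f}, orders = {funnel.orders:.0f}")
```

Under these assumed inputs the premium funnel earns more per visit, while the low-price funnel produces five times as many orders. Which one counts as "winning" depends entirely on the metric chosen.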
Another example: Tencent’s WeChat Pay.
In 2016, when WeChat Pay expanded, its goal wasn’t GMV (Gross Merchandise Volume) but penetration rate. Zhang Xiaolong (WeChat’s founder) emphasized making payments a daily habit rather than chasing transaction counts. The team prioritized merchant adoption, especially among small vendors, which led to long-term dominance.
In the AI era, Manus introduced a new metric:
- AHPU (Agentic Hours Per User) → Measures actual AI task execution time per user, replacing traditional DAU (Daily Active Users).
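A minimal sketch of how AHPU might be computed next to a plain active-user count. The formula used here (total agent execution hours divided by distinct users) is an assumption; Manus’s exact definition is not given above. The log format and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical one-day task log: (user_id, agent_run_seconds).
task_log = [
    ("alice", 5400),
    ("alice", 1800),
    ("bob", 600),
    ("carol", 0),  # opened the product but ran no agent task
]


def active_users(log):
    """DAU-style count: distinct users appearing in the log."""
    return len({user_id for user_id, _ in log})


def ahpu(log):
    """Assumed definition: total agent execution hours / distinct users."""
    seconds_by_user = defaultdict(int)
    for user_id, seconds in log:
        seconds_by_user[user_id] += seconds
    if not seconds_by_user:
        return 0.0
    total_hours = sum(seconds_by_user.values()) / 3600
    return total_hours / len(seconds_by_user)


print(f"active users: {active_users(task_log)}")
print(f"AHPU: {ahpu(task_log):.2f} hours per user")
```

In this toy log all three users count equally toward the active-user figure, but almost all of the agent time comes from one of them. That difference is the kind of signal a usage-time metric is meant to surface.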
3. Financial Metrics: Measuring Economic Moats
Warren Buffett often speaks of “economic moats,” but how do we quantify them? Gross margin is a key indicator.
- Nvidia: ~72% gross margin (strong pricing power).
- TSMC: ~53% (dominance in semiconductor manufacturing).
A gross margin >50% typically signals a durable competitive advantage.
Nvidia’s case study:
From late 2022 (post-ChatGPT boom) to April 2024, Nvidia’s gross margin surged from <55% to 78%. Why?
- AI-driven GPU demand → Supply shortages → Nvidia raised prices (e.g., $4,000 GPUs vs. previous $1,000 levels).
- AMD couldn’t replicate this—despite cheaper GPUs—because Nvidia’s tech superiority created an unmatched moat.
Why gross margin (not net margin)?
- Gross margin isolates core business competitiveness (pricing power, cost control).
- Net margin can be distorted by non-operational factors (taxes, one-time expenses).
(Source: WeChat Article)
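To illustrate the gross-versus-net distinction, here is a small sketch using the standard definitions: gross margin = (revenue − cost of goods sold) / revenue, and net margin = net income / revenue. The income-statement figures, the flat tax rate, and the function names are hypothetical simplifications, not any company’s actual results.

```python
def gross_margin(revenue, cogs):
    """(revenue - cost of goods sold) / revenue."""
    return (revenue - cogs) / revenue


def net_margin(revenue, cogs, operating_expenses, one_time_items, tax_rate):
    """Net income / revenue, with a simplified flat tax on positive pre-tax profit."""
    pre_tax = revenue - cogs - operating_expenses - one_time_items
    tax = max(pre_tax, 0.0) * tax_rate
    return (pre_tax - tax) / revenue


# Hypothetical company: identical core business in both years,
# but year 2 includes a one-time charge of 30.
year1 = dict(revenue=100.0, cogs=28.0, operating_expenses=20.0,
             one_time_items=0.0, tax_rate=0.2)
year2 = dict(year1, one_time_items=30.0)

for label, year in [("year 1", year1), ("year 2", year2)]:
    gm = gross_margin(year["revenue"], year["cogs"])
    nm = net_margin(**year)
    print(f"{label}: gross margin = {gm:.1%}, net margin = {nm:.1%}")
```

With these made-up inputs the gross margin is identical in both years, while the net margin drops sharply in the year with the one-time charge, even though pricing power and production costs did not change.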
4. Understanding the U.S. Economy Through Data
The U.S. heavily relies on economic indicators for policymaking. Key metrics:
- CPI (Consumer Price Index)
- PPI (Producer Price Index)
- Employment Rate (arguably the most critical—economic resilience hinges on jobs).
Recent strong U.S. employment figures suggest economic stability. (Casual note: Macro isn’t my expertise, just observations.)
Conclusion: Metrics Reflect Strategic Depth
Designing a data-driven evaluation system requires deep business insight. Recently, Tencent simplified publishing on WeChat Official Accounts, lowering the barrier for creators, so I’ll write more.