An Architectural Perspective: The Historical Inevitability of LLM Evolution

The history of large language models is a history of architectural innovation. From the debut of the Transformer in 2017, to the GPT series driving the generative-AI wave, to the flourishing of open-source models, every breakthrough has stemmed from key innovations at the architecture level. This article traces the key milestones of LLM architecture evolution and analyzes the internal logic behind it.

Key Drivers of LLM Architecture Evolution

  • Scaling laws: model performance grows predictably with scale
  • Emergence: qualitative capability jumps appear past a critical scale
  • Efficiency optimization: maximizing model capability within limited resources
  • Engineering practice: optimizing for training stability and inference speed

The Foundational Era: The Birth of the Transformer (2017-2018)

Attention Is All You Need

In 2017, Google's Transformer paper transformed the NLP landscape. Its core innovation, the self-attention mechanism, solved the RNN's problems with parallelization and long-range dependencies:

# The Transformer's core: the self-attention mechanism
import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    """
    Scaled dot-product attention
    This is the Transformer's revolutionary innovation
    """
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
        
        # Project to Q, K, V and split into heads
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        
        # Attention: softmax(QK^T / sqrt(d_k)) V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        attn = torch.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)
        
        # Merge heads and apply the output projection
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        return self.W_o(context)


class OriginalTransformerBlock(nn.Module):
    """
    Original Transformer encoder block (2017)
    Post-LN with residual connections
    """
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = SelfAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        # Post-LN: compute first, then normalize
        attn_out = self.self_attn(x, mask)
        x = self.norm1(x + attn_out)  # residual connection + LayerNorm
        
        ff_out = self.feed_forward(x)
        x = self.norm2(x + ff_out)
        
        return x

BERT: The Pinnacle of Bidirectional Encoders

class BERTArchitecture(nn.Module):
    """
    BERT architecture (2018)
    Encoder-only with bidirectional attention, suited to understanding tasks
    """
    def __init__(self, vocab_size: int, d_model: int = 768, num_layers: int = 12):
        super().__init__()
        
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(512, d_model)  # learned position embeddings
        self.segment_embed = nn.Embedding(2, d_model)  # sentence-A/B segment embeddings
        
        self.encoder_layers = nn.ModuleList([
            OriginalTransformerBlock(d_model, num_heads=12, d_ff=d_model * 4)  # BERT-base: d_ff = 3072
            for _ in range(num_layers)
        ])
        
        self.norm = nn.LayerNorm(d_model)
        
        # Task heads for masked language modeling (MLM) and next-sentence prediction (NSP)
        self.mlm_head = nn.Linear(d_model, vocab_size)
        self.nsp_head = nn.Linear(d_model, 2)
    
    def forward(self, input_ids, segment_ids, attention_mask):
        # Sum of token, position, and segment embeddings
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len).unsqueeze(0).to(input_ids.device)
        
        x = self.token_embed(input_ids) + \
            self.pos_embed(positions) + \
            self.segment_embed(segment_ids)
        
        # Broadcast the (batch, seq) padding mask to attention-score shape,
        # then pass through the encoder layers
        attn_mask = attention_mask[:, None, None, :]
        for layer in self.encoder_layers:
            x = layer(x, attn_mask)
        
        x = self.norm(x)
        
        # Pre-training task outputs
        mlm_logits = self.mlm_head(x)
        nsp_logits = self.nsp_head(x[:, 0])  # [CLS] token
        
        return mlm_logits, nsp_logits
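
A quick smoke test of the sketch above, with illustrative sizes, confirms the output shapes of the two task heads:

# Smoke test for the BERT sketch (sizes are illustrative)
model = BERTArchitecture(vocab_size=30522, d_model=768, num_layers=2)

batch, seq = 2, 16
input_ids = torch.randint(0, 30522, (batch, seq))
segment_ids = torch.zeros(batch, seq, dtype=torch.long)
attention_mask = torch.ones(batch, seq, dtype=torch.long)  # 1 = real token

mlm_logits, nsp_logits = model(input_ids, segment_ids, attention_mask)
print(mlm_logits.shape)  # torch.Size([2, 16, 30522]) - one vocab distribution per position
print(nsp_logits.shape)  # torch.Size([2, 2]) - is-next / not-next, read off the [CLS] position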

The Epoch-Making Significance of the Transformer Architecture

  • Full parallelism: drops the RNN's sequential dependency, enabling efficient GPU parallelism
  • Long-range dependencies: attention directly connects tokens at any distance
  • Unified framework: the encoder-decoder architecture unified diverse NLP tasks
  • Computational complexity: the O(n²) cost of attention limits long-sequence processing

The Generative Era: The Rise of the GPT Series (2018-2020)

GPT-1: The Beginning of Generative Pre-training

class GPT1Architecture(nn.Module):
    """
    GPT-1 architecture (2018)
    Decoder-only, autoregressive generation, unidirectional attention
    """
    def __init__(self, vocab_size: int, d_model: int = 768, num_layers: int = 12):
        super().__init__()
        
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(512, d_model)
        
        # Decoder-only: only the decoder half of the Transformer. Without
        # cross-attention, a decoder block is just the encoder block above
        # driven with a causal mask, so we reuse it here.
        self.blocks = nn.ModuleList([
            OriginalTransformerBlock(d_model, num_heads=12, d_ff=d_model * 4)
            for _ in range(num_layers)
        ])
        
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        
        # Weight tying between the input embedding and the LM head
        self.token_embed.weight = self.lm_head.weight
    
    def forward(self, input_ids):
        batch_size, seq_len = input_ids.shape
        positions = torch.arange(seq_len).unsqueeze(0).to(input_ids.device)
        
        x = self.token_embed(input_ids) + self.pos_embed(positions)
        
        # Causal (lower-triangular) mask
        mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0).unsqueeze(0)
        mask = mask.to(input_ids.device)
        
        for block in self.blocks:
            x = block(x, mask)
        
        x = self.norm(x)
        return self.lm_head(x)
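
The decoder-only design makes text generation a simple loop: run the model, take the logits at the last position, pick a token, append, repeat. A minimal greedy-decoding sketch (note that it re-encodes the whole sequence every step, which is exactly the waste the KV cache in the final section removes):

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens: int = 20):
    """Minimal greedy decoding for the decoder-only sketch above."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                            # (batch, seq, vocab)
        next_token = logits[:, -1].argmax(-1, keepdim=True)  # greedy pick at the last position
        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids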

GPT-2/GPT-3: The Power of Scale

Model   Released  Params  Layers  Context  Key innovation
GPT-1   2018.06   117M    12      512      Generative pre-training
GPT-2   2019.02   1.5B    48      1024     Zero-shot capability
GPT-3   2020.05   175B    96      2048     In-context learning

# GPT-3 in-context learning example:
# no fine-tuning needed, the model learns a new task from examples in the prompt

def gpt3_prompt_example():
    """
    Few-shot prompting gives GPT-3 strong task-adaptation ability
    """
    prompt = """
Translate English to Chinese:
English: Hello, how are you?
Chinese: 你好,你好吗?

English: What is your name?
Chinese: 你叫什么名字?

English: Thank you very much!
Chinese: 非常感谢!

English: The quick brown fox jumps over the lazy dog.
Chinese:"""
    
    # From the preceding examples, GPT-3 infers the translation task
    # Expected completion: 那只敏捷的棕色狐狸跳过了那只懒狗。
    return prompt


class ScalingLaws:
    """
    Scaling laws: loss falls as a power law in compute, parameters, and data
    L(N, D) = A/N^α + B/D^β + E
    """
    @staticmethod
    def estimate_loss(num_params: float, num_tokens: float) -> float:
        """
        Estimate model loss (illustrative)
        num_params: model parameters (billions)
        num_tokens: training tokens (billions)
        """
        # Illustrative constants only; the Chinchilla paper's fit is roughly
        # A≈406.4, B≈410.7, E≈1.69 with N and D in raw counts, not billions
        A, B, E = 1.5, 1.5, 1.0
        alpha, beta = 0.34, 0.28
        
        param_term = A / (num_params ** alpha)
        data_term = B / (num_tokens ** beta)
        
        return param_term + data_term + E
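
Even with these illustrative constants, the formula reproduces the qualitative trend: more parameters and more data both push loss toward the irreducible term E, and neither alone is enough:

# Illustrative only - the constants above are not the published Chinchilla fit
for n_billion, d_billion in [(1, 20), (10, 200), (70, 1400)]:
    loss = ScalingLaws.estimate_loss(n_billion, d_billion)
    print(f"N={n_billion}B params, D={d_billion}B tokens -> L ~ {loss:.3f}")
# Loss decreases monotonically as both N and D grow together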

The Optimization Era: Engineering Innovation in Architecture (2020-2023)

Pre-LayerNorm: A Breakthrough in Training Stability

class PreNormTransformerBlock(nn.Module):
    """
    Pre-LayerNorm architecture (adopted by GPT-2 and GPT-3)
    Trains more stably than Post-LN and tolerates larger learning rates
    """
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = SelfAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
    
    def forward(self, x, mask=None):
        # Pre-Norm: normalize first, then compute
        x = x + self.attn(self.norm1(x), mask)
        x = x + self.ff(self.norm2(x))
        return x

LLaMA: An Engineering Exemplar Among Open-Source Models

class LLaMABlock(nn.Module):
    """
    LLaMA architecture innovations (2023)
    1. RMSNorm instead of LayerNorm
    2. RoPE position encoding
    3. SwiGLU activation
    4. Grouped-query attention (GQA; used in the larger LLaMA 2 models)
    """
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        
        # RMSNorm: a cheaper normalization
        self.input_norm = RMSNorm(d_model)
        self.post_attn_norm = RMSNorm(d_model)
        
        # Grouped-query attention (see the GroupedQueryAttention sketch below)
        self.attn = GroupedQueryAttention(d_model, num_heads, num_kv_heads)
        
        # SwiGLU feed-forward network
        self.feed_forward = SwiGLUFeedForward(d_model)
    
    def forward(self, x, mask=None, kv_cache=None):
        # Self-attention
        h = x + self.attn(self.input_norm(x), mask, kv_cache)
        
        # Feed-forward network
        out = h + self.feed_forward(self.post_attn_norm(h))
        
        return out


class RMSNorm(nn.Module):
    """RMSNorm:LLaMA使用的归一化方式"""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


class RoPE(nn.Module):
    """
    Rotary Position Embedding (RoPE)
    Encodes relative position through rotation matrices
    """
    def __init__(self, dim: int, max_seq_len: int = 2048, base: float = 10000.0):
        super().__init__()
        
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        
        t = torch.arange(max_seq_len).float()  # einsum needs matching float dtypes
        freqs = torch.einsum('i,j->ij', t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        
        self.register_buffer('cos_cached', emb.cos())
        self.register_buffer('sin_cached', emb.sin())
    
    def rotate_half(self, x):
        # Half-split convention, matching the [freqs, freqs] cache layout above
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)
    
    def forward(self, q, k, seq_len):
        cos = self.cos_cached[:seq_len]
        sin = self.sin_cached[:seq_len]
        return q * cos + self.rotate_half(q) * sin, k * cos + self.rotate_half(k) * sin
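
Applied to per-head query and key tensors (shapes here are illustrative), the rotation encodes position without changing vector lengths, which is why the resulting dot products depend only on relative offsets:

# Illustrative usage of the RoPE module above
rope = RoPE(dim=64, max_seq_len=128)
q = torch.randn(2, 8, 16, 64)  # (batch, heads, seq, head_dim)
k = torch.randn(2, 8, 16, 64)
q_rot, k_rot = rope(q, k, seq_len=16)

# Rotations are norm-preserving
print(torch.allclose(q.norm(dim=-1), q_rot.norm(dim=-1), atol=1e-5))  # True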


class SwiGLUFeedForward(nn.Module):
    """SwiGLU激活函数"""
    def __init__(self, d_model: int):
        super().__init__()
        hidden_dim = int(2 / 3 * 4 * d_model)  # 2/3 of 4*d_model keeps parameters comparable to a GELU FFN
        self.w1 = nn.Linear(d_model, hidden_dim, bias=False)
        self.w2 = nn.Linear(d_model, hidden_dim, bias=False)
        self.w3 = nn.Linear(hidden_dim, d_model, bias=False)
    
    def forward(self, x):
        return self.w3(nn.functional.silu(self.w1(x)) * self.w2(x))
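
LLaMABlock above calls a GroupedQueryAttention module that this article never defines. Below is a minimal sketch, assuming the standard GQA formulation in which several query heads share each key/value head; RoPE application and KV-cache handling are omitted for brevity:

class GroupedQueryAttention(nn.Module):
    """Minimal GQA sketch: num_heads query heads share num_kv_heads K/V heads"""
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, num_heads * self.d_k, bias=False)
        self.W_k = nn.Linear(d_model, num_kv_heads * self.d_k, bias=False)
        self.W_v = nn.Linear(d_model, num_kv_heads * self.d_k, bias=False)
        self.W_o = nn.Linear(num_heads * self.d_k, d_model, bias=False)
    
    def forward(self, x, mask=None, kv_cache=None):
        # kv_cache is accepted for interface compatibility but unused in this sketch
        batch, seq, _ = x.shape
        q = self.W_q(x).view(batch, seq, self.num_heads, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(batch, seq, self.num_kv_heads, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(batch, seq, self.num_kv_heads, self.d_k).transpose(1, 2)
        
        # Each K/V head serves num_heads // num_kv_heads query heads
        group = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        
        out = torch.matmul(attn, v).transpose(1, 2).reshape(batch, seq, -1)
        return self.W_o(out)

Fewer K/V heads shrink the KV cache by the same factor, which is where GQA's inference savings come from: num_kv_heads = num_heads recovers standard MHA, and num_kv_heads = 1 is multi-query attention (MQA).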

Architecture Comparison of Mainstream Open-Source Models

Model     Norm     Position  Activation  Attention       Context
GPT-3     Pre-LN   Learned   GELU        MHA             2K
LLaMA 2   RMSNorm  RoPE      SwiGLU      GQA             4K
LLaMA 3   RMSNorm  RoPE      SwiGLU      GQA             128K
Mistral   RMSNorm  RoPE      SwiGLU      Sliding window  32K
Qwen 2    RMSNorm  RoPE      SwiGLU      GQA             128K

The Efficiency Era: Long Context and Inference Optimization (2023-Present)

Long-Context Techniques

class LongContextTechniques:
    """
    Evolution of long-context processing techniques
    """
    
    @staticmethod
    def alibi_attention(scores, seq_len, num_heads):
        """
        ALiBi (Attention with Linear Biases)
        Length extrapolation via a linear distance penalty on attention scores
        """
        # distances[i, j] = j - i, i.e. <= 0 for past (causal) positions
        distances = torch.arange(seq_len).unsqueeze(0) - torch.arange(seq_len).unsqueeze(1)
        # Per-head slopes: the geometric sequence 2^(-8k/num_heads), k = 1..num_heads
        slopes = torch.tensor([2 ** (-8 * (i + 1) / num_heads) for i in range(num_heads)])
        # The bias grows more negative with distance, penalizing far-away tokens
        bias = slopes.view(num_heads, 1, 1) * distances.unsqueeze(0).float()
        return scores + bias
    
    @staticmethod
    def ntk_aware_scaling(rope_base, seq_len, max_train_len, dim):
        """
        NTK-aware position-encoding extension
        Extrapolates context length by enlarging the RoPE base
        """
        if seq_len <= max_train_len:
            return rope_base
        
        # Enlarge the base so rotation frequencies stretch to the longer context
        scale = seq_len / max_train_len
        new_base = rope_base * (scale ** (dim / (dim - 2)))
        return new_base
    
    @staticmethod
    def yarn_scaling(rope_base, seq_len, max_train_len, d_model, beta_fast=32, beta_slow=1):
        """
        YaRN (Yet another RoPE extensioN method), simplified sketch:
        dimensions whose frequency completes many rotations over the
        training context (> beta_fast) are extrapolated as-is, those
        with few rotations (< beta_slow) are fully interpolated, and
        the rest are blended linearly; the full method adds an
        attention temperature correction on top.
        """
        scale = seq_len / max_train_len
        
        freq_factors = []
        for dim in range(0, d_model, 2):
            freq = 1.0 / (rope_base ** (dim / d_model))
            num_rotations = max_train_len * freq / (2 * math.pi)
            
            if num_rotations > beta_fast:
                ramp = 0.0  # high frequency: keep as-is (extrapolate)
            elif num_rotations < beta_slow:
                ramp = 1.0  # low frequency: fully interpolate (freq / scale)
            else:
                ramp = (beta_fast - num_rotations) / (beta_fast - beta_slow)
            
            freq_factors.append(freq * (1 - ramp) + (freq / scale) * ramp)
        
        return freq_factors


class SlidingWindowAttention(nn.Module):
    """
    Sliding-window attention (used by Mistral)
    Restricts the attention span to cut long-sequence compute cost
    """
    def __init__(self, d_model: int, num_heads: int, window_size: int = 4096):
        super().__init__()
        self.attn = SelfAttention(d_model, num_heads)
        self.window_size = window_size
    
    def create_window_mask(self, seq_len):
        """Causal sliding-window mask: each token sees itself and the previous window_size - 1 tokens"""
        mask = torch.zeros(seq_len, seq_len)
        for i in range(seq_len):
            start = max(0, i - self.window_size + 1)
            mask[i, start:i+1] = 1
        return mask
    
    def forward(self, x):
        seq_len = x.size(1)
        mask = self.create_window_mask(seq_len).to(x.device)
        return self.attn(x, mask)

Architectures for Inference Acceleration

class InferenceOptimizations:
    """
    Inference optimization techniques
    """
    
    class KVCache:
        """KV缓存:避免重复计算"""
        def __init__(self, max_batch, max_seq, num_heads, head_dim):
            self.k_cache = torch.zeros(max_batch, num_heads, max_seq, head_dim)
            self.v_cache = torch.zeros(max_batch, num_heads, max_seq, head_dim)
            self.current_len = 0
        
        def update(self, k, v):
            seq_len = k.size(2)
            self.k_cache[:, :, self.current_len:self.current_len+seq_len] = k
            self.v_cache[:, :, self.current_len:self.current_len+seq_len] = v
            self.current_len += seq_len
        
        def get(self):
            return self.k_cache[:, :, :self.current_len], self.v_cache[:, :, :self.current_len]
    
    @staticmethod
    def speculative_decoding(draft_model, target_model, input_ids, gamma=5):
        """
        Speculative decoding (pseudocode): a small draft model proposes,
        the large target model verifies in parallel, for a typical 2-3x
        speedup; accept_token and resample stand in for the accept/reject
        rule of the speculative-sampling algorithm
        """
        # 1. The draft model quickly proposes gamma tokens
        draft_tokens = draft_model.generate(input_ids, max_new_tokens=gamma)
        
        # 2. The target model scores all drafts in a single parallel pass
        target_logits = target_model(draft_tokens).logits
        
        # 3. Accept each draft token, or resample once and stop
        accepted = []
        for i in range(gamma):
            if accept_token(target_logits[i], draft_tokens[i]):
                accepted.append(draft_tokens[i])
            else:
                # Sample from the adjusted (residual) distribution
                accepted.append(resample(target_logits[i]))
                break
        
        return accepted
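
How the cache fits into decoding, as a sketch with illustrative shapes: prefill caches K/V for the whole prompt once, and each decode step then appends a single K/V slice instead of recomputing everything:

# Illustrative decode flow using the KVCache above
cache = InferenceOptimizations.KVCache(max_batch=1, max_seq=2048, num_heads=8, head_dim=64)

# Prefill: compute and cache K/V for the 16-token prompt once
cache.update(torch.randn(1, 8, 16, 64), torch.randn(1, 8, 16, 64))

# Decode step: the new token contributes a single K/V slice
cache.update(torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64))

k_all, v_all = cache.get()
print(k_all.shape)  # torch.Size([1, 8, 17, 64])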

Summary of Architecture Decisions

Decision           Historical choice  Modern recommendation  Reason for the shift
Architecture type  Encoder-decoder    Decoder-only           Stronger generation, more efficient training
Norm placement     Post-LN            Pre-RMSNorm            Stabler training, faster inference
Position encoding  Absolute/learned   RoPE                   Better length extrapolation
Attention          Multi-head         Grouped-query          Higher inference efficiency
Activation         ReLU/GELU          SwiGLU                 More expressive
Context length     2K                 128K+                  Broader application scenarios

Lessons from the Architecture Evolution

  • Blindly scaling up: data quality and training methodology matter just as much
  • Ignoring inference cost: training happens once, inference runs forever
  • Over-engineering: simpler architectures are often more stable and easier to scale
  • Working in isolation: open-source community collaboration accelerates innovation

Conclusion

The evolution of LLM architecture is a history of engineering innovation. From the Transformer's foundational design, through the GPT series' exploration of scale, to LLaMA's efficiency optimizations, every breakthrough has come from a deep understanding of the problem and relentless attention to detail.

Architecture choice is always an art of trade-offs. Even as the field chases ever-greater scale, efficiency optimization has proven just as important. Future LLM architectures will seek a better balance among capability, efficiency, and controllability, evolving toward models that are more general, more reliable, and more broadly accessible.