An Architecture Perspective: The Historical Logic of LLM Evolution
The history of large language models is a history of architectural innovation. From the Transformer's debut in 2017, to the GPT series driving the wave of generative AI, to the flourishing of open-source models, every breakthrough has come from key innovations at the architecture level. This article traces the milestones of LLM architecture evolution and analyzes the internal logic behind it.
Key Drivers of LLM Architecture Evolution
- Scaling laws: model performance grows predictably with scale
- Emergence: qualitative new abilities appear once scale crosses a critical threshold
- Efficiency: maximizing model capability under limited resources
- Engineering practice: optimizing training stability and inference speed
The Foundational Era: The Birth of the Transformer (2017-2018)
Attention Is All You Need
In 2017, Google's Transformer paper transformed the field of NLP. Its core innovation, the self-attention mechanism, solved the parallelization and long-range dependency problems of RNNs:
# The core of the Transformer: the self-attention mechanism
import torch
import torch.nn as nn
import math
class SelfAttention(nn.Module):
"""
    Scaled dot-product (multi-head) self-attention,
    the Transformer's key innovation
"""
def __init__(self, d_model: int = 512, num_heads: int = 8):
super().__init__()
assert d_model % num_heads == 0
self.d_model = d_model
self.num_heads = num_heads
self.d_k = d_model // num_heads
        # Linear projections for Q, K, V and the output
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
def forward(self, x, mask=None):
batch_size, seq_len, _ = x.shape
        # Project to Q, K, V and split into heads
Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Attention: softmax(QK^T / sqrt(d_k)) V
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attn = torch.softmax(scores, dim=-1)
context = torch.matmul(attn, V)
        # Merge the heads and apply the output projection
context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
return self.W_o(context)
class OriginalTransformerBlock(nn.Module):
"""
    The original Transformer encoder block (2017)
    Post-LN + residual connections
"""
def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
super().__init__()
self.self_attn = SelfAttention(d_model, num_heads)
self.feed_forward = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.ReLU(),
nn.Linear(d_ff, d_model)
)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
def forward(self, x, mask=None):
        # Post-LN: apply the sublayer first, then normalize
attn_out = self.self_attn(x, mask)
        x = self.norm1(x + attn_out)  # residual connection + LayerNorm
ff_out = self.feed_forward(x)
x = self.norm2(x + ff_out)
return x
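As a quick sanity check of the block above, a minimal usage sketch (the sizes are arbitrary, not from the original post):
# Shape check: the block maps [batch, seq, d_model] to the same shape.
block = OriginalTransformerBlock(d_model=512, num_heads=8, d_ff=2048)
x = torch.randn(2, 16, 512)
out = block(x)            # no mask: full bidirectional attention
print(out.shape)          # torch.Size([2, 16, 512])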
BERT: The Peak of Bidirectional Encoders
class BERTArchitecture(nn.Module):
"""
    BERT architecture (2018)
    Encoder-only, bidirectional attention, well suited to understanding tasks
"""
def __init__(self, vocab_size: int, d_model: int = 768, num_layers: int = 12):
super().__init__()
self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(512, d_model)    # learned position embeddings
        self.segment_embed = nn.Embedding(2, d_model)  # segment (sentence A/B) embeddings
self.encoder_layers = nn.ModuleList([
            OriginalTransformerBlock(d_model, num_heads=12, d_ff=3072)
for _ in range(num_layers)
])
self.norm = nn.LayerNorm(d_model)
        # Heads for the MLM and NSP pre-training tasks
self.mlm_head = nn.Linear(d_model, vocab_size)
self.nsp_head = nn.Linear(d_model, 2)
def forward(self, input_ids, segment_ids, attention_mask):
        # Sum the three embeddings (token + position + segment)
seq_len = input_ids.size(1)
positions = torch.arange(seq_len).unsqueeze(0).to(input_ids.device)
x = self.token_embed(input_ids) + \
self.pos_embed(positions) + \
self.segment_embed(segment_ids)
        # Pass through the encoder layers (reshape the [batch, seq] mask so it
        # broadcasts over heads and query positions)
        attn_mask = attention_mask[:, None, None, :]
        for layer in self.encoder_layers:
            x = layer(x, attn_mask)
x = self.norm(x)
        # Outputs for the two pre-training tasks
mlm_logits = self.mlm_head(x)
nsp_logits = self.nsp_head(x[:, 0]) # [CLS] token
return mlm_logits, nsp_logits
The Transformer's Epochal Significance
- ✅ Full parallelism: drops the RNN's sequential dependency, enabling efficient GPU parallelism
- ✅ Long-range dependencies: attention directly connects tokens at any distance
- ✅ Unified framework: the encoder-decoder architecture unified a wide range of NLP tasks
- ❌ Computational cost: the O(n²) attention complexity limits long-sequence processing (see the rough sketch below)
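To make the O(n²) cost concrete, here is a back-of-the-envelope sketch; the head count, dtype, and batch size are assumptions for illustration, not measurements:
# Rough memory needed just for the attention score matrices (fp16, 32 heads, batch 1).
def attention_score_memory_gb(seq_len: int, num_heads: int = 32, bytes_per_elem: int = 2) -> float:
    # One (seq_len x seq_len) score matrix per head.
    return num_heads * seq_len * seq_len * bytes_per_elem / 1e9

for n in (2_048, 32_768, 128_000):
    print(f"{n:>7} tokens -> {attention_score_memory_gb(n):8.1f} GB of attention scores")
# 2K tokens is negligible, but 128K tokens naively needs on the order of 1 TB,
# which is why tiled attention kernels and sparse/sliding-window attention matter.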
The Generative Era: The Rise of the GPT Series (2018-2020)
GPT-1: The Beginning of Generative Pre-training
class GPT1Architecture(nn.Module):
"""
    GPT-1 architecture (2018)
    Decoder-only, autoregressive generation, causal (unidirectional) attention
"""
def __init__(self, vocab_size: int, d_model: int = 768, num_layers: int = 12):
super().__init__()
self.token_embed = nn.Embedding(vocab_size, d_model)
self.pos_embed = nn.Embedding(512, d_model)
        # Decoder-only: with a causal mask, the encoder block above behaves as a
        # decoder block, so we reuse OriginalTransformerBlock here
        self.blocks = nn.ModuleList([
            OriginalTransformerBlock(d_model, num_heads=12, d_ff=3072)
            for _ in range(num_layers)
        ])
self.norm = nn.LayerNorm(d_model)
self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: share the embedding matrix with the LM head
self.token_embed.weight = self.lm_head.weight
def forward(self, input_ids):
batch_size, seq_len = input_ids.shape
positions = torch.arange(seq_len).unsqueeze(0).to(input_ids.device)
x = self.token_embed(input_ids) + self.pos_embed(positions)
        # Causal mask (lower-triangular): each token attends only to itself and earlier tokens
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0).unsqueeze(0)
mask = mask.to(input_ids.device)
for block in self.blocks:
x = block(x, mask)
x = self.norm(x)
return self.lm_head(x)
GPT-2 / GPT-3: The Power of Scale
| Model | Released | Parameters | Layers | Context | Key innovation |
|---|---|---|---|---|---|
| GPT-1 | 2018.06 | 117M | 12 | 512 | Generative pre-training |
| GPT-2 | 2019.02 | 1.5B | 48 | 1024 | Zero-shot ability |
| GPT-3 | 2020.05 | 175B | 96 | 2048 | In-context learning |
# An in-context learning example for GPT-3
# No fine-tuning: the model picks up a new task from the examples in the prompt alone
def gpt3_prompt_example():
"""
    Few-shot prompting gives GPT-3 strong task adaptation without fine-tuning
"""
prompt = """
将英文翻译成中文:
English: Hello, how are you?
Chinese: 你好,你好吗?
English: What is your name?
Chinese: 你叫什么名字?
English: Thank you very much!
Chinese: 非常感谢!
English: The quick brown fox jumps over the lazy dog.
Chinese:"""
    # GPT-3 infers the translation task from the preceding examples
    # Expected completion: 那只敏捷的棕色狐狸跳过了那只懒狗。
return prompt
class ScalingLaws:
"""
    Scaling laws: loss falls as a power law in compute, parameters, and data,
    L(N, D) = A/N^α + B/D^β + E  (Chinchilla-style parametric form)
"""
@staticmethod
def estimate_loss(num_params: float, num_tokens: float) -> float:
"""
        Estimate model loss.
        num_params: model parameters, in billions
        num_tokens: training tokens, in billions
"""
        A, B, E = 1.5, 1.5, 1.0       # illustrative constants, not a published fit
        alpha, beta = 0.34, 0.28      # exponents roughly in the Chinchilla range
param_term = A / (num_params ** alpha)
data_term = B / (num_tokens ** beta)
return param_term + data_term + E
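A quick illustration of the qualitative behaviour; remember the constants above are placeholders, so only the trend matters, not the numbers:
# Loss keeps falling as parameters and data grow together, which is what made
# "just scale it up" a viable strategy for GPT-2 and GPT-3.
for n_params, n_tokens in [(1, 20), (10, 200), (70, 1400)]:  # billions
    loss = ScalingLaws.estimate_loss(n_params, n_tokens)
    print(f"{n_params:>3}B params, {n_tokens:>5}B tokens -> estimated loss {loss:.2f}")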
The Optimization Era: Engineering the Architecture (2020-2023)
Pre-LayerNorm: A Breakthrough in Training Stability
class PreNormTransformerBlock(nn.Module):
"""
    Pre-LayerNorm block (adopted from GPT-2 onward)
    Compared to Post-LN, training is more stable and tolerates larger learning rates
"""
def __init__(self, d_model: int, num_heads: int, d_ff: int):
super().__init__()
self.norm1 = nn.LayerNorm(d_model)
self.attn = SelfAttention(d_model, num_heads)
self.norm2 = nn.LayerNorm(d_model)
self.ff = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.GELU(),
nn.Linear(d_ff, d_model)
)
def forward(self, x, mask=None):
        # Pre-Norm: normalize first, then apply the sublayer
x = x + self.attn(self.norm1(x), mask)
x = x + self.ff(self.norm2(x))
return x
LLaMA: An Engineering Exemplar Among Open-Source Models
class LLaMABlock(nn.Module):
"""
    LLaMA architecture innovations (2023)
    1. RMSNorm instead of LayerNorm
    2. RoPE position encoding
    3. SwiGLU activation
    4. Grouped-query attention (GQA, introduced with LLaMA 2)
"""
def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
super().__init__()
        # RMSNorm: a cheaper normalization (no mean subtraction)
self.input_norm = RMSNorm(d_model)
self.post_attn_norm = RMSNorm(d_model)
        # Grouped-query attention (a GroupedQueryAttention sketch follows this code section)
self.attn = GroupedQueryAttention(d_model, num_heads, num_kv_heads)
        # SwiGLU feed-forward network
self.feed_forward = SwiGLUFeedForward(d_model)
def forward(self, x, mask=None, kv_cache=None):
        # Self-attention sublayer
h = x + self.attn(self.input_norm(x), mask, kv_cache)
        # Feed-forward sublayer
out = h + self.feed_forward(self.post_attn_norm(h))
return out
class RMSNorm(nn.Module):
"""RMSNorm:LLaMA使用的归一化方式"""
def __init__(self, dim: int, eps: float = 1e-6):
super().__init__()
self.eps = eps
self.weight = nn.Parameter(torch.ones(dim))
def forward(self, x):
return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight
class RoPE(nn.Module):
"""
    Rotary Position Embedding (RoPE)
    Encodes relative position by rotating the query and key vectors
"""
def __init__(self, dim: int, max_seq_len: int = 2048, base: float = 10000.0):
super().__init__()
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
self.register_buffer('inv_freq', inv_freq)
t = torch.arange(max_seq_len)
freqs = torch.einsum('i,j->ij', t, inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer('cos_cached', emb.cos())
self.register_buffer('sin_cached', emb.sin())
    def rotate_half(self, x):
        # Split into two halves to match the cat((freqs, freqs)) layout of the cached cos/sin
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)
def forward(self, q, k, seq_len):
cos = self.cos_cached[:seq_len]
sin = self.sin_cached[:seq_len]
return q * cos + self.rotate_half(q) * sin, k * cos + self.rotate_half(k) * sin
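# --- Minimal RoPE usage sketch (illustrative, not part of the original LLaMA code) ---
# Assumed shapes: q, k are [batch, num_heads, seq_len, head_dim] with an even head_dim.
# Rotating Q and K before the QK^T product makes the attention score depend only on
# the relative offset between the two positions.
rope = RoPE(dim=64, max_seq_len=2048)
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
q_rot, k_rot = rope(q, k, seq_len=128)
scores = torch.matmul(q_rot, k_rot.transpose(-2, -1)) / math.sqrt(64)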
class SwiGLUFeedForward(nn.Module):
"""SwiGLU激活函数"""
def __init__(self, d_model: int):
super().__init__()
        hidden_dim = int(2 / 3 * 4 * d_model)  # 2/3 of 4*d_model keeps parameters comparable to a standard FFN despite the extra matrix
self.w1 = nn.Linear(d_model, hidden_dim, bias=False)
self.w2 = nn.Linear(d_model, hidden_dim, bias=False)
self.w3 = nn.Linear(hidden_dim, d_model, bias=False)
def forward(self, x):
return self.w3(nn.functional.silu(self.w1(x)) * self.w2(x))
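LLaMABlock above references a GroupedQueryAttention module that this post never defines. Below is a minimal sketch of the idea, assuming the same tensor layout as SelfAttention and ignoring RoPE and the kv_cache argument for brevity:
class GroupedQueryAttention(nn.Module):
    """Minimal GQA sketch: num_heads query heads share num_kv_heads K/V heads,
    shrinking the KV cache by a factor of num_heads / num_kv_heads."""
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, num_heads * self.d_k, bias=False)
        self.W_k = nn.Linear(d_model, num_kv_heads * self.d_k, bias=False)
        self.W_v = nn.Linear(d_model, num_kv_heads * self.d_k, bias=False)
        self.W_o = nn.Linear(num_heads * self.d_k, d_model, bias=False)
    def forward(self, x, mask=None, kv_cache=None):  # kv_cache is ignored in this sketch
        B, S, _ = x.shape
        q = self.W_q(x).view(B, S, self.num_heads, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(B, S, self.num_kv_heads, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(B, S, self.num_kv_heads, self.d_k).transpose(1, 2)
        # Repeat each K/V head so it is shared by a group of query heads
        groups = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        out = torch.matmul(torch.softmax(scores, dim=-1), v)
        out = out.transpose(1, 2).contiguous().view(B, S, -1)
        return self.W_o(out)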
Architecture Comparison of Mainstream Open-Source Models
| Model | Normalization | Position encoding | Activation | Attention | Context |
|---|---|---|---|---|---|
| GPT-3 | Pre-LN | Learned | GELU | MHA | 2K |
| LLaMA 2 | RMSNorm | RoPE | SwiGLU | GQA | 4K |
| LLaMA 3 | RMSNorm | RoPE | SwiGLU | GQA | 8K (128K in 3.1) |
| Mistral | RMSNorm | RoPE | SwiGLU | Sliding window | 32K |
| Qwen 2 | RMSNorm | RoPE | SwiGLU | GQA | 128K |
The Efficiency Era: Long Context and Inference Optimization (2023-Present)
Long-Context Techniques
class LongContextTechniques:
"""
    A tour of long-context techniques (simplified, illustrative sketches)
"""
    @staticmethod
    def alibi_attention(scores, seq_len, num_heads):
        """
        ALiBi (Attention with Linear Biases):
        a distance-based penalty on attention scores enables length extrapolation
        """
        # Each head gets its own slope; the penalty grows linearly with query-key distance
        distances = (torch.arange(seq_len, device=scores.device).unsqueeze(1)
                     - torch.arange(seq_len, device=scores.device).unsqueeze(0)).abs()
        slopes = torch.tensor([2 ** (-8 * (i + 1) / num_heads) for i in range(num_heads)],
                              device=scores.device)
        bias = slopes.view(num_heads, 1, 1) * distances.unsqueeze(0)
        return scores - bias  # farther query-key pairs get lower attention scores
    @staticmethod
    def ntk_aware_scaling(rope_base, seq_len, max_train_len, rotary_dim=128):
        """
        NTK-aware RoPE scaling:
        extend the context window by enlarging the RoPE base
        """
        if seq_len <= max_train_len:
            return rope_base
        # Enlarge the base so low frequencies stretch more than high frequencies
        scale = seq_len / max_train_len
        new_base = rope_base * (scale ** (rotary_dim / (rotary_dim - 2)))
        return new_base
    @staticmethod
    def yarn_scaling(rope_base, seq_len, max_train_len, rotary_dim=128, beta_fast=32, beta_slow=1):
        """
        YaRN (Yet another RoPE extensioN method), heavily simplified:
        keep high-frequency dimensions as-is and interpolate low-frequency ones,
        instead of scaling every dimension uniformly
        """
        scale = seq_len / max_train_len
        freq_factors = []
        for dim in range(0, rotary_dim, 2):
            freq = 1.0 / (rope_base ** (dim / rotary_dim))
            # Crude split between "fast" (short-wavelength) and "slow" dimensions;
            # the real method uses wavelength thresholds with a smooth ramp between regimes
            if dim < rotary_dim * beta_fast / (beta_fast + beta_slow):
                freq_factors.append(freq)            # fast dims: keep, no interpolation
            else:
                freq_factors.append(freq / scale)    # slow dims: interpolate
        return freq_factors
class SlidingWindowAttention(nn.Module):
"""
    Sliding-window attention (used by Mistral)
    Restricts each token's attention span to cut the cost of long sequences
"""
def __init__(self, d_model: int, num_heads: int, window_size: int = 4096):
super().__init__()
self.attn = SelfAttention(d_model, num_heads)
self.window_size = window_size
def create_window_mask(self, seq_len):
"""创建滑动窗口掩码"""
mask = torch.zeros(seq_len, seq_len)
for i in range(seq_len):
start = max(0, i - self.window_size)
mask[i, start:i+1] = 1
return mask
def forward(self, x):
seq_len = x.size(1)
mask = self.create_window_mask(seq_len).to(x.device)
return self.attn(x, mask)
Architectures for Faster Inference
class InferenceOptimizations:
"""
    Inference-time optimization techniques
"""
class KVCache:
"""KV缓存:避免重复计算"""
def __init__(self, max_batch, max_seq, num_heads, head_dim):
self.k_cache = torch.zeros(max_batch, num_heads, max_seq, head_dim)
self.v_cache = torch.zeros(max_batch, num_heads, max_seq, head_dim)
self.current_len = 0
def update(self, k, v):
seq_len = k.size(2)
self.k_cache[:, :, self.current_len:self.current_len+seq_len] = k
self.v_cache[:, :, self.current_len:self.current_len+seq_len] = v
self.current_len += seq_len
def get(self):
return self.k_cache[:, :, :self.current_len], self.v_cache[:, :, :self.current_len]
    @staticmethod
    def speculative_decoding(draft_model, target_model, input_ids, gamma=5):
        """
        Speculative decoding (schematic pseudocode): a small draft model proposes tokens,
        the large target model verifies them in one parallel pass; typical speedup is 2-3x.
        accept_token() and resample() are placeholders for the accept/reject rule and the
        adjusted-distribution sampling of the full algorithm.
        """
        # 1. The draft model quickly proposes gamma tokens
        draft_tokens = draft_model.generate(input_ids, max_new_tokens=gamma)
        # 2. The target model scores all proposed tokens in a single forward pass
        target_logits = target_model(draft_tokens).logits
        # 3. Accept draft tokens left to right until the first rejection
        accepted = []
        for i in range(gamma):
            # Logits at position i - gamma - 1 predict the i-th drafted token
            if accept_token(target_logits[:, i - gamma - 1], draft_tokens[:, i - gamma]):
                accepted.append(draft_tokens[:, i - gamma])
            else:
                # On rejection, sample a replacement from the adjusted distribution and stop
                accepted.append(resample(target_logits[:, i - gamma - 1]))
                break
        return accepted
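To show how the KVCache defined above fits into autoregressive decoding, a minimal usage sketch (the dimensions and the number of steps are illustrative):
# Each decoding step appends only the new token's K/V and attends over what is cached,
# so per-step attention cost is O(current_len) instead of recomputing the full prefix.
cache = InferenceOptimizations.KVCache(max_batch=1, max_seq=2048, num_heads=8, head_dim=64)
for step in range(3):                      # pretend we decode 3 tokens
    k_new = torch.randn(1, 8, 1, 64)       # K for the single new token
    v_new = torch.randn(1, 8, 1, 64)
    cache.update(k_new, v_new)
    k_all, v_all = cache.get()             # shapes: [1, 8, step + 1, 64]
print(cache.current_len)                   # 3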
Summary of Architecture Decisions
| Design decision | Historical choice | Modern recommendation | Why it changed |
|---|---|---|---|
| Architecture type | Encoder-decoder | Decoder-only | Stronger generation, more efficient training |
| Normalization placement | Post-LN | Pre-RMSNorm | More stable training, faster inference |
| Position encoding | Absolute / learned | RoPE | Better length extrapolation |
| Attention | Multi-head | Grouped-query | Better inference efficiency |
| Activation | ReLU / GELU | SwiGLU | More expressive |
| Context length | 2K | 128K+ | Broader application scenarios |
Lessons from the Architecture's Evolution
- ❌ Blindly scaling up: data quality and training methodology matter just as much
- ❌ Ignoring inference cost: training happens once, inference runs continuously
- ❌ Over-engineering: simpler architectures tend to be more stable and easier to scale
- ❌ Working in isolation: open-source collaboration accelerates innovation
Summary
The evolution of LLM architectures is a history of engineering innovation. From the Transformer's foundational design, through the GPT series' exploration of scale, to LLaMA's efficiency-focused refinements, every breakthrough came from a deep understanding of the problem and relentless attention to detail.
Architecture choice is always an art of trade-offs. Alongside the pursuit of ever-larger scale, efficiency optimization has proven just as important. Future LLM architectures will seek a better balance among capability, efficiency, and controllability, evolving toward models that are more general, more reliable, and more broadly accessible.