Day 11: LLM Principles and Architecture
Learning Objectives
- Understand the core principles of the Transformer architecture
- Master the implementation of the Self-Attention mechanism
- Understand the role and types of positional encoding
- Be able to implement Self-Attention from scratch
Course Content
1. Transformer Architecture Overview
1.1 The Birth of the Transformer
Background:
- In 2017, Google published the paper "Attention Is All You Need"
- It fundamentally changed the field of NLP
- It became the foundational architecture of modern large language models
Core ideas:
- Built entirely on the attention mechanism
- Abandons the sequential processing of RNNs
- Enables parallel computation
1.2 Overall Transformer Architecture
Encoder-decoder structure:
Input → Encoder → Decoder → Output
Encoder:
- Multi-head self-attention layer
- Feed-forward network layer
- Residual connections and layer normalization
Decoder:
- Masked multi-head self-attention layer
- Encoder-decoder attention layer
- Feed-forward network layer
- Residual connections and layer normalization
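The decoder's masked self-attention relies on a causal (lower-triangular) mask, so that each position can only attend to itself and earlier positions. A minimal sketch of building such a mask (the helper name `causal_mask` is illustrative, not from the lesson):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend to positions <= i.
    Entries are 1 where attention is allowed, 0 where it is blocked."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))

mask = causal_mask(4)
print(mask)
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```

A mask in this 1/0 convention can be passed directly to code that uses `masked_fill(mask == 0, -1e9)`, as the implementations below do.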
Code example:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        # batch_first=True so inputs are [batch_size, seq_len, d_model]
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src, src_mask=None):
        # Self-attention
        src2, _ = self.self_attn(src, src, src, attn_mask=src_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        # Feed-forward network
        src2 = self.linear2(self.dropout(F.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src
```
2. The Self-Attention Mechanism
2.1 The Core Idea of Attention
Questions:
- How can the model focus on the important parts of the input sequence?
- How can it capture long-range dependencies?
Solution:
- Compute the relationships between queries (Query), keys (Key), and values (Value)
- Assign weights dynamically
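Concretely, attention produces a weighted average of the value vectors, with the weights derived from query-key similarity. A tiny numeric sketch with toy vectors (no learned projections; the input serves as Q, K, and V):

```python
import torch
import torch.nn.functional as F

# Three toy token vectors (d = 2), used as queries, keys, and values
x = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

# Similarity of every token with every other token, scaled by sqrt(d)
scores = x @ x.T / (x.size(-1) ** 0.5)
weights = F.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ x                 # weighted average of the values

print(weights.sum(dim=-1))  # rows of weights sum to 1
```

Each output row is a mixture of all three input vectors, with more weight on the more similar ones; the full mechanism below adds learned projections and scaling over higher-dimensional vectors.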
2.2 The Mathematics of Self-Attention
Step 1: Compute Q, K, and V
```python
def compute_qkv(x, W_q, W_k, W_v):
    """
    Compute queries, keys, and values.
    Args:
        x: input tensor [batch_size, seq_len, d_model]
        W_q, W_k, W_v: weight matrices [d_model, d_k]
    Returns:
        Q, K, V: query, key, and value tensors
    """
    Q = torch.matmul(x, W_q)  # [batch_size, seq_len, d_k]
    K = torch.matmul(x, W_k)  # [batch_size, seq_len, d_k]
    V = torch.matmul(x, W_v)  # [batch_size, seq_len, d_v]
    return Q, K, V
```
Step 2: Compute attention scores
```python
def compute_attention_scores(Q, K):
    """
    Compute attention scores.
    Args:
        Q: queries [batch_size, seq_len, d_k]
        K: keys [batch_size, seq_len, d_k]
    Returns:
        scores: attention scores [batch_size, seq_len, seq_len]
    """
    # Q * K^T
    scores = torch.matmul(Q, K.transpose(-2, -1))
    # Scale by sqrt(d_k)
    d_k = Q.size(-1)
    scores = scores / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    return scores
```
Step 3: Compute attention weights
```python
def compute_attention_weights(scores, mask=None):
    """
    Compute attention weights.
    Args:
        scores: attention scores [batch_size, seq_len, seq_len]
        mask: mask matrix [batch_size, seq_len, seq_len]
    Returns:
        weights: attention weights [batch_size, seq_len, seq_len]
    """
    # Apply the mask
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Softmax normalization
    weights = F.softmax(scores, dim=-1)
    return weights
```
Step 4: Compute the output
```python
def compute_attention_output(weights, V):
    """
    Compute the attention output.
    Args:
        weights: attention weights [batch_size, seq_len, seq_len]
        V: values [batch_size, seq_len, d_v]
    Returns:
        output: attention output [batch_size, seq_len, d_v]
    """
    output = torch.matmul(weights, V)
    return output
```
2.3 A Complete Self-Attention Implementation
```python
class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v):
        super().__init__()
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        # Projection matrices
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_v, bias=False)
        # Output projection
        self.W_o = nn.Linear(d_v, d_model, bias=False)

    def forward(self, x, mask=None):
        """
        Forward pass.
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: mask [batch_size, seq_len, seq_len]
        Returns:
            output: output [batch_size, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.size()
        # Compute Q, K, V
        Q = self.W_q(x)  # [batch_size, seq_len, d_k]
        K = self.W_k(x)  # [batch_size, seq_len, d_k]
        V = self.W_v(x)  # [batch_size, seq_len, d_v]
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / torch.sqrt(torch.tensor(self.d_k, dtype=torch.float32))
        # Apply the mask
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        # Compute attention weights
        weights = F.softmax(scores, dim=-1)
        # Compute the output
        output = torch.matmul(weights, V)
        # Output projection
        output = self.W_o(output)
        return output, weights
```
3. Multi-Head Attention
3.1 The Idea Behind Multi-Head Attention
Why multiple heads?
- Different heads can attend to different kinds of information
- They capture richer semantic relationships
- They increase the model's expressive power
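The head split itself is just a reshape: the model dimension is divided into nhead independent subspaces, and the head axis is moved forward so attention runs per head. The shape bookkeeping can be previewed on its own (shapes only, no learned weights):

```python
import torch

batch_size, seq_len, d_model, nhead = 2, 10, 512, 8
d_k = d_model // nhead  # 64 dimensions per head

x = torch.randn(batch_size, seq_len, d_model)
# Split the last dimension into heads, then move the head axis forward
heads = x.view(batch_size, seq_len, nhead, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 10, 64])

# Merging back is the inverse: transpose, make contiguous, flatten
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
print(torch.equal(merged, x))  # True
```

The `contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires contiguous memory; the implementation below uses exactly this split/merge pattern.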
3.2 Multi-Head Attention Implementation
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        self.d_model = d_model
        self.nhead = nhead
        assert d_model % nhead == 0, "d_model must be divisible by nhead"
        self.d_k = d_model // nhead
        self.d_v = d_model // nhead
        # Combined projection matrices for all heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        # Output projection
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        """
        Forward pass.
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: mask [batch_size, seq_len, seq_len]
        Returns:
            output: output [batch_size, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.size()
        # Compute Q, K, V
        Q = self.W_q(x)  # [batch_size, seq_len, d_model]
        K = self.W_k(x)  # [batch_size, seq_len, d_model]
        V = self.W_v(x)  # [batch_size, seq_len, d_model]
        # Reshape into heads
        Q = Q.view(batch_size, seq_len, self.nhead, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.nhead, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.nhead, self.d_v).transpose(1, 2)
        # [batch_size, nhead, seq_len, d_k/v]
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / torch.sqrt(torch.tensor(self.d_k, dtype=torch.float32))
        # Apply the mask (add a head dimension so it broadcasts over all heads)
        if mask is not None:
            scores = scores.masked_fill(mask.unsqueeze(1) == 0, -1e9)
        # Compute attention weights
        weights = F.softmax(scores, dim=-1)
        # Compute the output
        output = torch.matmul(weights, V)
        # Merge the heads
        output = output.transpose(1, 2).contiguous()
        output = output.view(batch_size, seq_len, self.d_model)
        # Output projection
        output = self.W_o(output)
        return output, weights
```
4. Positional Encoding
4.1 Why Positional Encoding Is Needed
Problem:
- Self-attention is position-agnostic
- It cannot capture the order of the sequence
Solution:
- Add positional information to each position
- Let the model understand sequence order
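The position-independence can be demonstrated directly: shuffling the input tokens merely shuffles the self-attention output in the same way, so without positional information the model cannot distinguish different orderings. A minimal check (the `self_attention` helper is illustrative and uses the input as Q, K, and V):

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    """Plain scaled dot-product self-attention with x as Q, K, and V."""
    scores = x @ x.transpose(-2, -1) / (x.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ x

x = torch.randn(5, 8)     # 5 tokens, d_model = 8
perm = torch.randperm(5)  # a random reordering of the tokens

out = self_attention(x)
out_perm = self_attention(x[perm])

# Permuting the input only permutes the output rows: attention alone
# carries no notion of token order (permutation equivariance).
print(torch.allclose(out[perm], out_perm, atol=1e-5))  # True
```

Adding a position-dependent term to each token, as the encodings below do, breaks this symmetry and lets the model use order.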
4.2 Types of Positional Encoding
4.2.1 Sinusoidal Positional Encoding
```python
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Build the positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Add a batch dimension
        pe = pe.unsqueeze(0)
        # Register as a buffer (saved with the model, but not trained)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        Returns:
            output: input with positional encoding added
        """
        x = x + self.pe[:, :x.size(1), :]
        return x
```
4.2.2 Learnable Positional Encoding
```python
class LearnablePositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Learnable positional embeddings
        self.position_embeddings = nn.Embedding(max_len, d_model)

    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        Returns:
            output: input with positional encoding added
        """
        batch_size, seq_len, _ = x.size()
        # Generate position indices
        positions = torch.arange(seq_len, device=x.device)
        position_embeddings = self.position_embeddings(positions)
        # Add the positional encoding
        x = x + position_embeddings.unsqueeze(0)
        return x
```
4.2.3 Relative Positional Encoding
```python
class RelativePositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len
        # Embeddings for relative offsets in [-max_len, max_len)
        self.relative_position_embeddings = nn.Embedding(2 * max_len, d_model)

    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        Returns:
            x and the relative positional embeddings (consumed inside
            the attention computation rather than added to x directly)
        """
        batch_size, seq_len, _ = x.size()
        # Pairwise relative offsets between positions
        positions = torch.arange(seq_len, device=x.device)
        relative_positions = positions.unsqueeze(1) - positions.unsqueeze(0)
        # Shift offsets into the non-negative index range
        relative_positions = relative_positions + self.max_len
        # Look up the relative positional embeddings
        relative_embeddings = self.relative_position_embeddings(relative_positions)
        return x, relative_embeddings
```
5. Feed-Forward Network
5.1 The Role of the FFN
Purpose:
- Increase the model's expressive power
- Capture non-linear relationships
- Provide a feature transformation
5.2 FFN Implementation
```python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.ReLU()

    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        Returns:
            output: output [batch_size, seq_len, d_model]
        """
        # First layer
        x = self.linear1(x)
        x = self.activation(x)
        x = self.dropout(x)
        # Second layer
        x = self.linear2(x)
        return x
```
6. A Complete Transformer Layer
```python
class TransformerLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        # Multi-head self-attention
        self.self_attn = MultiHeadAttention(d_model, nhead)
        # Feed-forward network
        self.ffn = FeedForward(d_model, d_ff, dropout)
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: mask [batch_size, seq_len, seq_len]
        Returns:
            output: output [batch_size, seq_len, d_model]
        """
        # Self-attention + residual connection + layer normalization
        attn_output, attn_weights = self.self_attn(x, mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        # Feed-forward network + residual connection + layer normalization
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        return x, attn_weights
```
Hands-On Task
Task: Implement Self-Attention from Scratch
Goal: Implement a complete Self-Attention module
Requirements:
- Implement a SelfAttention class
- Implement a MultiHeadAttention class
- Implement positional encoding
- Test the attention mechanism
Code skeleton:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v):
        super().__init__()
        # TODO: initialize the projection matrices

    def forward(self, x, mask=None):
        # TODO: implement the Self-Attention forward pass
        pass

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        # TODO: initialize multi-head attention

    def forward(self, x, mask=None):
        # TODO: implement the multi-head attention forward pass
        pass

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # TODO: implement positional encoding

    def forward(self, x):
        # TODO: add the positional encoding
        pass

# Test code
if __name__ == "__main__":
    batch_size = 2
    seq_len = 10
    d_model = 512
    nhead = 8
    x = torch.randn(batch_size, seq_len, d_model)
    # Test Self-Attention
    self_attn = SelfAttention(d_model, d_model // nhead, d_model // nhead)
    output, weights = self_attn(x)
    print(f"Self-Attention output shape: {output.shape}")
    print(f"Attention weights shape: {weights.shape}")
    # Test Multi-Head Attention
    multi_head_attn = MultiHeadAttention(d_model, nhead)
    output, weights = multi_head_attn(x)
    print(f"Multi-Head Attention output shape: {output.shape}")
    print(f"Attention weights shape: {weights.shape}")
    # Test positional encoding
    pos_encoding = PositionalEncoding(d_model)
    output = pos_encoding(x)
    print(f"Positional encoding output shape: {output.shape}")
```
Homework
Homework 1: Self-Attention Visualization
Topic: Visualize the attention weights of Self-Attention
Requirements:
- Implement attention-weight visualization
- Analyze the attention distribution across different positions
- Understand how the attention mechanism works
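As a starting point for this assignment, attention weights can be inspected even without a plotting library. A minimal text-based sketch (the token list is made up, and random scores stand in for a trained model's Q·K^T; a real solution would typically render the same matrix as a heatmap):

```python
import torch
import torch.nn.functional as F

tokens = ["The", "cat", "sat", "down"]
# Random scores stand in for Q.K^T from a real model
torch.manual_seed(0)
weights = F.softmax(torch.randn(len(tokens), len(tokens)), dim=-1)

# Render each attention weight as a coarse intensity character
chars = " .:-=+*#"
for tok, row in zip(tokens, weights):
    cells = "".join(chars[min(int(w * len(chars)), len(chars) - 1)] for w in row)
    print(f"{tok:>6} | {cells}")
```

Each printed row shows where one query token puts its attention; denser characters mean higher weight.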
Homework 2: Comparing Positional Encodings
Topic: Compare different types of positional encoding
Requirements:
- Implement sinusoidal positional encoding
- Implement learnable positional encoding
- Implement relative positional encoding
- Compare the effects of the three encodings
Homework 3: Transformer Architecture Analysis
Topic: Analyze the Transformer architecture in depth
Requirements:
- Analyze each component of the Transformer
- Understand the role of each component
- Analyze the strengths and limitations of the Transformer
References
Essential Reading
Vaswani, A., et al. (2017). "Attention Is All You Need". NeurIPS.
- The original Transformer paper
Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.
- The BERT paper
Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners". OpenAI Blog.
- The GPT-2 paper
Recommended Reading
The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- A visual guide to the Transformer
The Illustrated Self-Attention: https://jalammar.github.io/illustrated-self-attention/
- A visual guide to Self-Attention
Online Resources
Hugging Face Transformers: https://huggingface.co/docs/transformers/
- Transformers library documentation
PyTorch Transformer Tutorial: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
- PyTorch Transformer tutorial
Further Reading
Transformer Variants
- BART (2019): Denoising Sequence-to-Sequence Pre-training
- T5 (2019): Exploring the Limits of Transfer Learning
- GPT-3 (2020): Language Models are Few-Shot Learners
Attention Mechanism Variants
- Sparse attention (2020): Longformer, BigBird
- Linear attention (2020): Linformer, Performer
- Efficient attention (2022): FlashAttention
Coming Up Next
In the next session we will compare mainstream LLM architectures, taking a closer look at the architectural differences and characteristics of GPT, BERT, T5, LLaMA, PaLM, Gemini, and other major large language models.

