
Day 11: LLM Principles and Architecture

Learning Objectives

  • Understand the core principles of the Transformer architecture
  • Master the implementation of the Self-Attention mechanism
  • Learn the role and types of positional encoding
  • Be able to implement Self-Attention from scratch

Course Content

1. Transformer Architecture Overview

1.1 The Birth of the Transformer

Background

  • In 2017, Google published the paper "Attention Is All You Need"
  • It fundamentally reshaped the field of NLP
  • It became the foundational architecture of modern large language models

Core ideas

  • Built entirely on the attention mechanism
  • Abandons the sequential processing of RNNs
  • Enables parallel computation

1.2 Overall Transformer Architecture

Encoder-decoder structure

Input → Encoder → Decoder → Output

Encoder

  • Multi-head self-attention layer
  • Feed-forward network layer
  • Residual connections and layer normalization

Decoder

  • Masked multi-head self-attention layer
  • Encoder-decoder attention layer
  • Feed-forward network layer
  • Residual connections and layer normalization

Code example

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        # batch_first=True so inputs follow the [batch_size, seq_len, d_model]
        # convention used throughout this lesson
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, src, src_mask=None):
        # Self-attention sub-layer
        src2, _ = self.self_attn(src, src, src, attn_mask=src_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        
        # Feed-forward sub-layer
        src2 = self.linear2(self.dropout(F.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        
        return src

2. The Self-Attention Mechanism

2.1 The Core Idea of Attention

Problems

  • How can the model focus on the important parts of the input sequence?
  • How can it capture long-range dependencies?

Solution

  • Compute the relationships between queries (Query), keys (Key), and values (Value)
  • Assign attention weights dynamically

2.2 The Mathematics of Self-Attention

Step 1: Compute Q, K, V

python
def compute_qkv(x, W_q, W_k, W_v):
    """
    Compute queries, keys, and values.
    
    Args:
        x: input tensor [batch_size, seq_len, d_model]
        W_q, W_k: weight matrices [d_model, d_k]
        W_v: weight matrix [d_model, d_v]
    
    Returns:
        Q, K, V: query, key, and value tensors
    """
    Q = torch.matmul(x, W_q)  # [batch_size, seq_len, d_k]
    K = torch.matmul(x, W_k)  # [batch_size, seq_len, d_k]
    V = torch.matmul(x, W_v)  # [batch_size, seq_len, d_v]
    
    return Q, K, V

Step 2: Compute attention scores

python
def compute_attention_scores(Q, K):
    """
    Compute attention scores.
    
    Args:
        Q: query tensor [batch_size, seq_len, d_k]
        K: key tensor [batch_size, seq_len, d_k]
    
    Returns:
        scores: attention scores [batch_size, seq_len, seq_len]
    """
    # Q * K^T
    scores = torch.matmul(Q, K.transpose(-2, -1))
    
    # Scale by sqrt(d_k) to keep the variance of the scores stable
    d_k = Q.size(-1)
    scores = scores / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    
    return scores
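
To make the scaling step concrete, here is a small self-contained sketch (not from the original lesson): for large d_k, raw dot products of random vectors have variance on the order of d_k, which drives softmax toward a near one-hot distribution; dividing by sqrt(d_k) keeps the weights usefully spread out.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512

# Random unit-variance query and keys: their dot products have
# standard deviation around sqrt(d_k) ~ 22, i.e. very large logits.
q = torch.randn(d_k)
keys = torch.randn(8, d_k)

raw = keys @ q                 # unscaled scores
scaled = raw / d_k ** 0.5      # scaled back toward unit variance

p_raw = F.softmax(raw, dim=-1)
p_scaled = F.softmax(scaled, dim=-1)

# The unscaled distribution is far more peaked than the scaled one.
print(p_raw.max().item(), p_scaled.max().item())
```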

Step 3: Compute attention weights

python
def compute_attention_weights(scores, mask=None):
    """
    Compute attention weights.
    
    Args:
        scores: attention scores [batch_size, seq_len, seq_len]
        mask: mask tensor [batch_size, seq_len, seq_len]
    
    Returns:
        weights: attention weights [batch_size, seq_len, seq_len]
    """
    # Apply the mask: masked positions get a large negative score
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Normalize with softmax so each row sums to 1
    weights = F.softmax(scores, dim=-1)
    
    return weights
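
As a minimal illustration of the masking step, a causal (look-ahead) mask of the kind used in the decoder can be built with torch.tril. The sketch below reimplements the same masked_fill + softmax logic inline so it runs standalone:

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(1, seq_len, seq_len)

# Causal mask: position i may attend only to positions <= i.
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)  # [1, seq_len, seq_len]

masked_scores = scores.masked_fill(mask == 0, -1e9)
weights = F.softmax(masked_scores, dim=-1)

# Future positions (the upper triangle) receive zero weight,
# while each row still sums to 1.
print(weights[0])
```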

Step 4: Compute the output

python
def compute_attention_output(weights, V):
    """
    Compute the attention output.
    
    Args:
        weights: attention weights [batch_size, seq_len, seq_len]
        V: value tensor [batch_size, seq_len, d_v]
    
    Returns:
        output: attention output [batch_size, seq_len, d_v]
    """
    output = torch.matmul(weights, V)
    return output
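
The four steps above compose into a single scaled dot-product attention function. A minimal self-contained sketch:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Chain steps 2-4: scores -> weights -> weighted sum of values."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

batch_size, seq_len, d_k = 2, 5, 64
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```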

2.3 A Complete Self-Attention Implementation

python
class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v):
        super().__init__()
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        
        # Projection matrices
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_v, bias=False)
        
        # Output projection
        self.W_o = nn.Linear(d_v, d_model, bias=False)
    
    def forward(self, x, mask=None):
        """
        Forward pass.
        
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: mask [batch_size, seq_len, seq_len]
        
        Returns:
            output: output [batch_size, seq_len, d_model]
            weights: attention weights [batch_size, seq_len, seq_len]
        """
        batch_size, seq_len, _ = x.size()
        
        # Compute Q, K, V
        Q = self.W_q(x)  # [batch_size, seq_len, d_k]
        K = self.W_k(x)  # [batch_size, seq_len, d_k]
        V = self.W_v(x)  # [batch_size, seq_len, d_v]
        
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / torch.sqrt(torch.tensor(self.d_k, dtype=torch.float32))
        
        # Apply the mask
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Compute attention weights
        weights = F.softmax(scores, dim=-1)
        
        # Weighted sum of values
        output = torch.matmul(weights, V)
        
        # Output projection
        output = self.W_o(output)
        
        return output, weights

3. Multi-Head Attention

3.1 The Idea Behind Multi-Head Attention

Why multiple heads?

  • Different heads can attend to different kinds of information
  • They capture richer semantic relationships
  • They increase the model's expressive capacity

3.2 Multi-Head Attention Implementation

python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        self.d_model = d_model
        self.nhead = nhead
        
        assert d_model % nhead == 0, "d_model must be divisible by nhead"
        
        self.d_k = d_model // nhead
        self.d_v = d_model // nhead
        
        # Projection matrices for all heads at once
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        
        # Output projection
        self.W_o = nn.Linear(d_model, d_model, bias=False)
    
    def forward(self, x, mask=None):
        """
        Forward pass.
        
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: mask [batch_size, seq_len, seq_len]
        
        Returns:
            output: output [batch_size, seq_len, d_model]
            weights: attention weights [batch_size, nhead, seq_len, seq_len]
        """
        batch_size, seq_len, _ = x.size()
        
        # Compute Q, K, V
        Q = self.W_q(x)  # [batch_size, seq_len, d_model]
        K = self.W_k(x)  # [batch_size, seq_len, d_model]
        V = self.W_v(x)  # [batch_size, seq_len, d_model]
        
        # Reshape into multiple heads
        Q = Q.view(batch_size, seq_len, self.nhead, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.nhead, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.nhead, self.d_v).transpose(1, 2)
        # [batch_size, nhead, seq_len, d_k/v]
        
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / torch.sqrt(torch.tensor(self.d_k, dtype=torch.float32))
        
        # Apply the mask (add a head dimension so it broadcasts over heads)
        if mask is not None:
            scores = scores.masked_fill(mask.unsqueeze(1) == 0, -1e9)
        
        # Compute attention weights
        weights = F.softmax(scores, dim=-1)
        
        # Weighted sum of values
        output = torch.matmul(weights, V)
        
        # Merge the heads back together
        output = output.transpose(1, 2).contiguous()
        output = output.view(batch_size, seq_len, self.d_model)
        
        # Output projection
        output = self.W_o(output)
        
        return output, weights
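
The split-and-merge reshaping in the implementation above can be sanity-checked in isolation: the merge (transpose + contiguous + view) is the exact inverse of the split, so a round-trip returns the original tensor unchanged.

```python
import torch

batch_size, seq_len, d_model, nhead = 2, 10, 512, 8
d_k = d_model // nhead

x = torch.randn(batch_size, seq_len, d_model)

# Split: [batch, seq, d_model] -> [batch, nhead, seq, d_k]
heads = x.view(batch_size, seq_len, nhead, d_k).transpose(1, 2)

# Merge: undo the transpose, then flatten the heads back into d_model.
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(torch.equal(x, merged))  # the round-trip is exact
```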

4. Positional Encoding

4.1 Why Positional Encoding Is Needed

Problem

  • Self-Attention is position-agnostic
  • By itself it cannot capture the order of the sequence

Solution

  • Add positional information to each position
  • Let the model understand sequence order
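
A quick self-contained check (not part of the original lesson) makes the position-agnostic claim concrete: with identity projections, permuting the input rows simply permutes the output rows, so attention alone cannot tell position 1 from position 5.

```python
import torch
import torch.nn.functional as F

def attend(x):
    # Minimal self-attention with identity projections (Q = K = V = x).
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(5, 16)
perm = torch.randperm(5)

out = attend(x)
out_permuted = attend(x[perm])

# Permuting the input permutes the output identically:
# the mechanism carries no notion of position.
print(torch.allclose(out[perm], out_permuted, atol=1e-5))
```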

4.2 Types of Positional Encoding

4.2.1 Sinusoidal Positional Encoding

python
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Build the positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add a batch dimension
        pe = pe.unsqueeze(0)
        
        # Register as a buffer: saved with the model, but not trained
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: input with positional encoding added
        """
        x = x + self.pe[:, :x.size(1), :]
        return x
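
As a small sanity check on the sinusoidal encoding, the computation below mirrors the constructor above: the values stay within [-1, 1] (so they can be added to embeddings without overwhelming them), and each position receives a distinct vector.

```python
import math
import torch

max_len, d_model = 100, 64

# Same construction as in PositionalEncoding.__init__
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Bounded values, distinct patterns per position
print(pe.abs().max().item() <= 1.0)
print(not torch.equal(pe[0], pe[1]))
```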

4.2.2 Learnable Positional Encoding

python
class LearnablePositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Learnable position embeddings
        self.position_embeddings = nn.Embedding(max_len, d_model)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: input with positional encoding added
        """
        batch_size, seq_len, _ = x.size()
        
        # Generate position indices
        positions = torch.arange(seq_len, device=x.device)
        position_embeddings = self.position_embeddings(positions)
        
        # Add the positional encoding
        x = x + position_embeddings.unsqueeze(0)
        
        return x

4.2.3 Relative Positional Encoding

python
class RelativePositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len
        
        # Embeddings for relative offsets in [-max_len, max_len)
        self.relative_position_embeddings = nn.Embedding(2 * max_len, d_model)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            x: the unchanged input
            relative_embeddings: pairwise relative position embeddings
                [seq_len, seq_len, d_model], to be consumed inside the
                attention computation rather than added to the input
        """
        batch_size, seq_len, _ = x.size()
        
        # Compute pairwise relative positions
        positions = torch.arange(seq_len, device=x.device)
        relative_positions = positions.unsqueeze(1) - positions.unsqueeze(0)
        
        # Shift relative positions into the non-negative index range
        relative_positions = relative_positions + self.max_len
        
        # Look up the relative position embeddings
        relative_embeddings = self.relative_position_embeddings(relative_positions)
        
        return x, relative_embeddings

5. Feed-Forward Network

5.1 The Role of the FFN

Role

  • Increases the model's expressive capacity
  • Captures non-linear relationships
  • Provides a per-position feature transformation

5.2 FFN Implementation

python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.ReLU()
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: output [batch_size, seq_len, d_model]
        """
        # First layer: expand to d_ff and apply the non-linearity
        x = self.linear1(x)
        x = self.activation(x)
        x = self.dropout(x)
        
        # Second layer: project back to d_model
        x = self.linear2(x)
        
        return x
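
One property worth verifying is that the FFN acts on every position independently ("position-wise"): processing one position alone gives the same result as processing the whole sequence. A minimal check:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)
out_full = ffn(x)
out_single = ffn(x[:, 2:3, :])

# The FFN mixes features within a position but never across positions,
# so slicing before or after the FFN gives the same values.
print(torch.allclose(out_full[:, 2:3, :], out_single, atol=1e-6))
```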

6. A Complete Transformer Layer

python
class TransformerLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Multi-head self-attention
        self.self_attn = MultiHeadAttention(d_model, nhead)
        
        # Feed-forward network
        self.ffn = FeedForward(d_model, d_ff, dropout)
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: mask [batch_size, seq_len, seq_len]
        
        Returns:
            output: output [batch_size, seq_len, d_model]
        """
        # Self-attention + residual connection + layer normalization
        attn_output, attn_weights = self.self_attn(x, mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Feed-forward network + residual connection + layer normalization
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        
        return x, attn_weights

Hands-On Task

Task: Implement Self-Attention from Scratch

Goal: implement a complete Self-Attention module

Requirements

  1. Implement a SelfAttention class
  2. Implement a MultiHeadAttention class
  3. Implement positional encoding
  4. Test the attention mechanism

Code skeleton

python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v):
        super().__init__()
        # TODO: initialize the projection matrices
    
    def forward(self, x, mask=None):
        # TODO: implement the Self-Attention forward pass
        pass

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        # TODO: initialize the multi-head attention
    
    def forward(self, x, mask=None):
        # TODO: implement the multi-head attention forward pass
        pass

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # TODO: build the positional encoding
    
    def forward(self, x):
        # TODO: add the positional encoding to the input
        pass

# Test code
if __name__ == "__main__":
    batch_size = 2
    seq_len = 10
    d_model = 512
    nhead = 8
    
    x = torch.randn(batch_size, seq_len, d_model)
    
    # Test Self-Attention
    self_attn = SelfAttention(d_model, d_model//nhead, d_model//nhead)
    output, weights = self_attn(x)
    print(f"Self-Attention output shape: {output.shape}")
    print(f"Attention weights shape: {weights.shape}")
    
    # Test Multi-Head Attention
    multi_head_attn = MultiHeadAttention(d_model, nhead)
    output, weights = multi_head_attn(x)
    print(f"Multi-Head Attention output shape: {output.shape}")
    print(f"Attention weights shape: {weights.shape}")
    
    # Test positional encoding
    pos_encoding = PositionalEncoding(d_model)
    output = pos_encoding(x)
    print(f"Positional encoding output shape: {output.shape}")

Homework

Assignment 1: Self-Attention Visualization

Task: visualize the attention weights of Self-Attention

Requirements

  1. Implement attention-weight visualization
  2. Analyze the attention distribution across different positions
  3. Build an understanding of how the attention mechanism works

Assignment 2: Comparing Positional Encodings

Task: compare different types of positional encoding

Requirements

  1. Implement sinusoidal positional encoding
  2. Implement learnable positional encoding
  3. Implement relative positional encoding
  4. Compare the effectiveness of the three encodings

Assignment 3: Transformer Architecture Analysis

Task: analyze the Transformer architecture in depth

Requirements

  1. Analyze each component of the Transformer
  2. Understand the role of each component
  3. Analyze the strengths and limitations of the Transformer

References

Required Reading

  1. Vaswani, A., et al. (2017). "Attention Is All You Need". NeurIPS.

    • The original Transformer paper
  2. Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.

    • The BERT paper
  3. Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners". OpenAI Blog.

    • The GPT-2 paper

Recommended Reading

  1. The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/

    • A visual walkthrough of the Transformer
  2. The Illustrated Self-Attention: https://jalammar.github.io/illustrated-self-attention/

    • A visual walkthrough of Self-Attention

Online Resources

  1. Hugging Face Transformers: https://huggingface.co/docs/transformers/

    • Documentation for the Transformers library
  2. PyTorch Transformer Tutorial: https://pytorch.org/tutorials/beginner/transformer_tutorial.html

    • PyTorch's Transformer tutorial

Further Reading

Transformer variants

  • BART (2019): Denoising Sequence-to-Sequence Pre-training
  • T5 (2019): Exploring the Limits of Transfer Learning
  • GPT-3 (2020): Language Models are Few-Shot Learners

Attention mechanism variants

  • Sparse attention (2020): Longformer, BigBird
  • Linear attention (2020): Linformer, Performer
  • Efficient attention (2022): FlashAttention

Next Lesson Preview

In the next lesson we will compare mainstream LLM architectures, taking a closer look at the architectural differences and characteristics of GPT, BERT, T5, LLaMA, PaLM, Gemini, and other major large language models.

