
Day 11: LLM Principles and Architecture

Learning Objectives

  • Understand the core principles of the Transformer architecture
  • Master the implementation of the Self-Attention mechanism
  • Learn the role and types of positional encoding
  • Be able to implement Self-Attention from scratch

Course Content

1. Transformer Architecture Overview

1.1 The Birth of the Transformer

Background

  • In 2017, Google published the paper "Attention Is All You Need"
  • It fundamentally reshaped the field of NLP
  • It became the foundational architecture of modern large language models

Core ideas

  • Built entirely on the attention mechanism
  • Abandons the sequential processing of RNNs
  • Enables parallel computation

1.2 Overall Transformer Architecture

Encoder-decoder structure

Input → Encoder → Decoder → Output

Encoder

  • Multi-head self-attention layer
  • Feed-forward network layer
  • Residual connections and layer normalization

Decoder

  • Masked multi-head self-attention layer
  • Encoder-decoder attention layer
  • Feed-forward network layer
  • Residual connections and layer normalization

Code example

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        # batch_first=True so inputs follow the [batch_size, seq_len, d_model]
        # convention used throughout this lesson
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, src, src_mask=None):
        # Self-attention sub-layer
        src2, _ = self.self_attn(src, src, src, attn_mask=src_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        
        # Feed-forward sub-layer
        src2 = self.linear2(self.dropout(F.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        
        return src

2. The Self-Attention Mechanism

2.1 The Core Idea of Attention

Problems

  • How can the model focus on the important parts of the input sequence?
  • How can it capture long-range dependencies?

Solution

  • Compute the relationships between queries (Query), keys (Key), and values (Value)
  • Assign attention weights dynamically

2.2 The Mathematics of Self-Attention

Step 1: Compute Q, K, V

python
def compute_qkv(x, W_q, W_k, W_v):
    """
    Compute queries, keys, and values.
    
    Args:
        x: input tensor [batch_size, seq_len, d_model]
        W_q, W_k: weight matrices [d_model, d_k]
        W_v: weight matrix [d_model, d_v]
    
    Returns:
        Q, K, V: query, key, and value tensors
    """
    Q = torch.matmul(x, W_q)  # [batch_size, seq_len, d_k]
    K = torch.matmul(x, W_k)  # [batch_size, seq_len, d_k]
    V = torch.matmul(x, W_v)  # [batch_size, seq_len, d_v]
    
    return Q, K, V

Step 2: Compute attention scores

python
def compute_attention_scores(Q, K):
    """
    Compute attention scores.
    
    Args:
        Q: query tensor [batch_size, seq_len, d_k]
        K: key tensor [batch_size, seq_len, d_k]
    
    Returns:
        scores: attention scores [batch_size, seq_len, seq_len]
    """
    # Q * K^T
    scores = torch.matmul(Q, K.transpose(-2, -1))
    
    # Scale by sqrt(d_k) to keep the variance of the scores stable
    d_k = Q.size(-1)
    scores = scores / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    
    return scores
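
To make the scaling step concrete, here is a small self-contained sketch (not from the original lesson): for large d_k, raw dot products of random vectors have variance on the order of d_k, which drives softmax toward a near one-hot distribution; dividing by sqrt(d_k) keeps the weights usefully spread out.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512

# Random unit-variance query and keys: their dot products have
# standard deviation around sqrt(d_k) ~ 22, i.e. very large logits.
q = torch.randn(d_k)
keys = torch.randn(8, d_k)

raw = keys @ q                 # unscaled scores
scaled = raw / d_k ** 0.5      # scaled back toward unit variance

p_raw = F.softmax(raw, dim=-1)
p_scaled = F.softmax(scaled, dim=-1)

# The unscaled distribution is far more peaked than the scaled one.
print(p_raw.max().item(), p_scaled.max().item())
```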

Step 3: Compute attention weights

python
def compute_attention_weights(scores, mask=None):
    """
    Compute attention weights.
    
    Args:
        scores: attention scores [batch_size, seq_len, seq_len]
        mask: mask tensor [batch_size, seq_len, seq_len]
    
    Returns:
        weights: attention weights [batch_size, seq_len, seq_len]
    """
    # Apply the mask: masked positions get a large negative score
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Normalize with softmax so each row sums to 1
    weights = F.softmax(scores, dim=-1)
    
    return weights
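
As a minimal illustration of the masking step, a causal (look-ahead) mask of the kind used in the decoder can be built with torch.tril. The sketch below reimplements the same masked_fill + softmax logic inline so it runs standalone:

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(1, seq_len, seq_len)

# Causal mask: position i may attend only to positions <= i.
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)  # [1, seq_len, seq_len]

masked_scores = scores.masked_fill(mask == 0, -1e9)
weights = F.softmax(masked_scores, dim=-1)

# Future positions (the upper triangle) receive zero weight,
# while each row still sums to 1.
print(weights[0])
```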

Step 4: Compute the output

python
def compute_attention_output(weights, V):
    """
    Compute the attention output.
    
    Args:
        weights: attention weights [batch_size, seq_len, seq_len]
        V: value tensor [batch_size, seq_len, d_v]
    
    Returns:
        output: attention output [batch_size, seq_len, d_v]
    """
    output = torch.matmul(weights, V)
    return output
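
The four steps above compose into a single scaled dot-product attention function. A minimal self-contained sketch:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Chain steps 2-4: scores -> weights -> weighted sum of values."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

batch_size, seq_len, d_k = 2, 5, 64
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```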

2.3 A Complete Self-Attention Implementation

python
class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v):
        super().__init__()
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        
        # Projection matrices
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_v, bias=False)
        
        # Output projection
        self.W_o = nn.Linear(d_v, d_model, bias=False)
    
    def forward(self, x, mask=None):
        """
        Forward pass.
        
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: mask [batch_size, seq_len, seq_len]
        
        Returns:
            output: output [batch_size, seq_len, d_model]
            weights: attention weights [batch_size, seq_len, seq_len]
        """
        batch_size, seq_len, _ = x.size()
        
        # Compute Q, K, V
        Q = self.W_q(x)  # [batch_size, seq_len, d_k]
        K = self.W_k(x)  # [batch_size, seq_len, d_k]
        V = self.W_v(x)  # [batch_size, seq_len, d_v]
        
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / torch.sqrt(torch.tensor(self.d_k, dtype=torch.float32))
        
        # Apply the mask
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Compute attention weights
        weights = F.softmax(scores, dim=-1)
        
        # Weighted sum of values
        output = torch.matmul(weights, V)
        
        # Output projection
        output = self.W_o(output)
        
        return output, weights

3. Multi-Head Attention

3.1 The Idea Behind Multi-Head Attention

Why multiple heads?

  • Different heads can attend to different kinds of information
  • They capture richer semantic relationships
  • They increase the model's expressive capacity

3.2 Multi-Head Attention Implementation

python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        self.d_model = d_model
        self.nhead = nhead
        
        assert d_model % nhead == 0, "d_model must be divisible by nhead"
        
        self.d_k = d_model // nhead
        self.d_v = d_model // nhead
        
        # Projection matrices for all heads at once
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        
        # Output projection
        self.W_o = nn.Linear(d_model, d_model, bias=False)
    
    def forward(self, x, mask=None):
        """
        Forward pass.
        
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: mask [batch_size, seq_len, seq_len]
        
        Returns:
            output: output [batch_size, seq_len, d_model]
            weights: attention weights [batch_size, nhead, seq_len, seq_len]
        """
        batch_size, seq_len, _ = x.size()
        
        # Compute Q, K, V
        Q = self.W_q(x)  # [batch_size, seq_len, d_model]
        K = self.W_k(x)  # [batch_size, seq_len, d_model]
        V = self.W_v(x)  # [batch_size, seq_len, d_model]
        
        # Reshape into multiple heads
        Q = Q.view(batch_size, seq_len, self.nhead, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.nhead, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.nhead, self.d_v).transpose(1, 2)
        # [batch_size, nhead, seq_len, d_k/v]
        
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / torch.sqrt(torch.tensor(self.d_k, dtype=torch.float32))
        
        # Apply the mask (add a head dimension so it broadcasts over heads)
        if mask is not None:
            scores = scores.masked_fill(mask.unsqueeze(1) == 0, -1e9)
        
        # Compute attention weights
        weights = F.softmax(scores, dim=-1)
        
        # Weighted sum of values
        output = torch.matmul(weights, V)
        
        # Merge the heads back together
        output = output.transpose(1, 2).contiguous()
        output = output.view(batch_size, seq_len, self.d_model)
        
        # Output projection
        output = self.W_o(output)
        
        return output, weights
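
The split-and-merge reshaping in the implementation above can be sanity-checked in isolation: the merge (transpose + contiguous + view) is the exact inverse of the split, so a round-trip returns the original tensor unchanged.

```python
import torch

batch_size, seq_len, d_model, nhead = 2, 10, 512, 8
d_k = d_model // nhead

x = torch.randn(batch_size, seq_len, d_model)

# Split: [batch, seq, d_model] -> [batch, nhead, seq, d_k]
heads = x.view(batch_size, seq_len, nhead, d_k).transpose(1, 2)

# Merge: undo the transpose, then flatten the heads back into d_model.
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(torch.equal(x, merged))  # the round-trip is exact
```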

4. Positional Encoding

4.1 Why Positional Encoding Is Needed

Problem

  • Self-Attention is position-agnostic
  • By itself it cannot capture the order of the sequence

Solution

  • Add positional information to each position
  • Let the model understand sequence order
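
A quick self-contained check (not part of the original lesson) makes the position-agnostic claim concrete: with identity projections, permuting the input rows simply permutes the output rows, so attention alone cannot tell position 1 from position 5.

```python
import torch
import torch.nn.functional as F

def attend(x):
    # Minimal self-attention with identity projections (Q = K = V = x).
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(5, 16)
perm = torch.randperm(5)

out = attend(x)
out_permuted = attend(x[perm])

# Permuting the input permutes the output identically:
# the mechanism carries no notion of position.
print(torch.allclose(out[perm], out_permuted, atol=1e-5))
```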

4.2 Types of Positional Encoding

4.2.1 Sinusoidal Positional Encoding

python
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Build the positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add a batch dimension
        pe = pe.unsqueeze(0)
        
        # Register as a buffer: saved with the model, but not trained
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: input with positional encoding added
        """
        x = x + self.pe[:, :x.size(1), :]
        return x
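
As a small sanity check on the sinusoidal encoding, the computation below mirrors the constructor above: the values stay within [-1, 1] (so they can be added to embeddings without overwhelming them), and each position receives a distinct vector.

```python
import math
import torch

max_len, d_model = 100, 64

# Same construction as in PositionalEncoding.__init__
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Bounded values, distinct patterns per position
print(pe.abs().max().item() <= 1.0)
print(not torch.equal(pe[0], pe[1]))
```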

4.2.2 Learnable Positional Encoding

python
class LearnablePositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Learnable position embeddings
        self.position_embeddings = nn.Embedding(max_len, d_model)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: input with positional encoding added
        """
        batch_size, seq_len, _ = x.size()
        
        # Generate position indices
        positions = torch.arange(seq_len, device=x.device)
        position_embeddings = self.position_embeddings(positions)
        
        # Add the positional encoding
        x = x + position_embeddings.unsqueeze(0)
        
        return x

4.2.3 Relative Positional Encoding

python
class RelativePositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len
        
        # Embeddings for relative offsets in [-max_len, max_len)
        self.relative_position_embeddings = nn.Embedding(2 * max_len, d_model)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            x: the unchanged input
            relative_embeddings: pairwise relative position embeddings
                [seq_len, seq_len, d_model], to be consumed inside the
                attention computation rather than added to the input
        """
        batch_size, seq_len, _ = x.size()
        
        # Compute pairwise relative positions
        positions = torch.arange(seq_len, device=x.device)
        relative_positions = positions.unsqueeze(1) - positions.unsqueeze(0)
        
        # Shift relative positions into the non-negative index range
        relative_positions = relative_positions + self.max_len
        
        # Look up the relative position embeddings
        relative_embeddings = self.relative_position_embeddings(relative_positions)
        
        return x, relative_embeddings

5. Feed-Forward Network

5.1 The Role of the FFN

Role

  • Increases the model's expressive capacity
  • Captures non-linear relationships
  • Provides a per-position feature transformation

5.2 FFN Implementation

python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.ReLU()
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: output [batch_size, seq_len, d_model]
        """
        # First layer: expand to d_ff and apply the non-linearity
        x = self.linear1(x)
        x = self.activation(x)
        x = self.dropout(x)
        
        # Second layer: project back to d_model
        x = self.linear2(x)
        
        return x
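
One property worth verifying is that the FFN acts on every position independently ("position-wise"): processing one position alone gives the same result as processing the whole sequence. A minimal check:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)
out_full = ffn(x)
out_single = ffn(x[:, 2:3, :])

# The FFN mixes features within a position but never across positions,
# so slicing before or after the FFN gives the same values.
print(torch.allclose(out_full[:, 2:3, :], out_single, atol=1e-6))
```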

6. A Complete Transformer Layer

python
class TransformerLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Multi-head self-attention
        self.self_attn = MultiHeadAttention(d_model, nhead)
        
        # Feed-forward network
        self.ffn = FeedForward(d_model, d_ff, dropout)
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: mask [batch_size, seq_len, seq_len]
        
        Returns:
            output: output [batch_size, seq_len, d_model]
        """
        # Self-attention + residual connection + layer normalization
        attn_output, attn_weights = self.self_attn(x, mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Feed-forward network + residual connection + layer normalization
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        
        return x, attn_weights

Hands-On Task

Task: Implement Self-Attention from Scratch

Goal: implement a complete Self-Attention module

Requirements

  1. Implement a SelfAttention class
  2. Implement a MultiHeadAttention class
  3. Implement positional encoding
  4. Test the attention mechanism

Code skeleton

python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v):
        super().__init__()
        # TODO: initialize the projection matrices
    
    def forward(self, x, mask=None):
        # TODO: implement the Self-Attention forward pass
        pass

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        # TODO: initialize the multi-head attention
    
    def forward(self, x, mask=None):
        # TODO: implement the multi-head attention forward pass
        pass

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # TODO: build the positional encoding
    
    def forward(self, x):
        # TODO: add the positional encoding to the input
        pass

# Test code
if __name__ == "__main__":
    batch_size = 2
    seq_len = 10
    d_model = 512
    nhead = 8
    
    x = torch.randn(batch_size, seq_len, d_model)
    
    # Test Self-Attention
    self_attn = SelfAttention(d_model, d_model//nhead, d_model//nhead)
    output, weights = self_attn(x)
    print(f"Self-Attention output shape: {output.shape}")
    print(f"Attention weights shape: {weights.shape}")
    
    # Test Multi-Head Attention
    multi_head_attn = MultiHeadAttention(d_model, nhead)
    output, weights = multi_head_attn(x)
    print(f"Multi-Head Attention output shape: {output.shape}")
    print(f"Attention weights shape: {weights.shape}")
    
    # Test positional encoding
    pos_encoding = PositionalEncoding(d_model)
    output = pos_encoding(x)
    print(f"Positional encoding output shape: {output.shape}")

Homework

Assignment 1: Self-Attention Visualization

Task: visualize the attention weights of Self-Attention

Requirements

  1. Implement attention-weight visualization
  2. Analyze the attention distribution across different positions
  3. Build an understanding of how the attention mechanism works

Assignment 2: Comparing Positional Encodings

Task: compare different types of positional encoding

Requirements

  1. Implement sinusoidal positional encoding
  2. Implement learnable positional encoding
  3. Implement relative positional encoding
  4. Compare the effectiveness of the three encodings

Assignment 3: Transformer Architecture Analysis

Task: analyze the Transformer architecture in depth

Requirements

  1. Analyze each component of the Transformer
  2. Understand the role of each component
  3. Analyze the strengths and limitations of the Transformer

References

Required Reading

  1. Vaswani, A., et al. (2017). "Attention Is All You Need". NeurIPS.

    • The original Transformer paper
  2. Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.

    • The BERT paper
  3. Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners". OpenAI Blog.

    • The GPT-2 paper

Recommended Reading

  1. The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/

    • A visual walkthrough of the Transformer
  2. The Illustrated Self-Attention: https://jalammar.github.io/illustrated-self-attention/

    • A visual walkthrough of Self-Attention

Online Resources

  1. Hugging Face Transformers: https://huggingface.co/docs/transformers/

    • Documentation for the Transformers library
  2. PyTorch Transformer Tutorial: https://pytorch.org/tutorials/beginner/transformer_tutorial.html

    • PyTorch's Transformer tutorial

Further Reading

Transformer variants

  • BART (2019): Denoising Sequence-to-Sequence Pre-training
  • T5 (2019): Exploring the Limits of Transfer Learning
  • GPT-3 (2020): Language Models are Few-Shot Learners

Attention mechanism variants

  • Sparse attention (2020): Longformer, BigBird
  • Linear attention (2020): Linformer, Performer
  • Efficient attention (2022): FlashAttention

Next Lesson Preview

In the next lesson we will compare mainstream LLM architectures, taking a closer look at the architectural differences and characteristics of GPT, BERT, T5, LLaMA, PaLM, Gemini, and other major large language models.

