
Day 12: Comparing Mainstream LLM Architectures

Learning Objectives

  • Understand the differences between mainstream LLM architectures
  • Master the core characteristics of GPT, BERT, T5, LLaMA, PaLM, and Gemini
  • Compare the strengths and weaknesses of different models
  • Be able to choose the right model for a given task

Course Content

1. The GPT Family

1.1 GPT Architecture Characteristics

Core characteristics

  • Decoder-only architecture: uses only the Transformer decoder
  • Autoregressive generation: produces output one token at a time
  • Causal masking: each position attends only to earlier positions
  • Left-to-right processing: unidirectional attention

Architecture diagram

Input → Embedding → Position Encoding →
Transformer Decoder Layers →
Output Probabilities → Output

Code example

python
import torch
import torch.nn as nn

class GPTDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Causal self-attention (batch_first so tensors are [batch, seq, feature])
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: causal mask [seq_len, seq_len]
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Self-attention with residual connection (post-norm, as in the original GPT)
        attn_output, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Feed-forward network with residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        
        return x

class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, n_layers, d_ff, max_len=512):
        super().__init__()
        
        # Token embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Learned position embeddings
        self.position_embedding = nn.Embedding(max_len, d_model)
        
        # Decoder layers
        self.decoder_layers = nn.ModuleList([
            GPTDecoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        
        # Output projection to vocabulary logits
        self.output_layer = nn.Linear(d_model, vocab_size)
        
        self.max_len = max_len
        self.d_model = d_model
    
    def forward(self, input_ids):
        """
        Args:
            input_ids: input token IDs [batch_size, seq_len]
        
        Returns:
            logits: output logits [batch_size, seq_len, vocab_size]
        """
        batch_size, seq_len = input_ids.shape
        
        # Token embeddings
        token_embeds = self.token_embedding(input_ids)
        
        # Position embeddings
        positions = torch.arange(seq_len, device=input_ids.device)
        position_embeds = self.position_embedding(positions)
        
        # Combine embeddings
        x = token_embeds + position_embeds
        
        # Additive causal mask: -inf above the diagonal blocks attention
        # to future positions
        mask = torch.triu(
            torch.full((seq_len, seq_len), float('-inf'), device=input_ids.device),
            diagonal=1
        )
        
        # Decoder layers
        for layer in self.decoder_layers:
            x = layer(x, mask)
        
        # Output projection
        logits = self.output_layer(x)
        
        return logits
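
Autoregressive decoding falls straight out of this interface: run the sequence through the model, take the argmax (or a sample) of the last position's logits, append it, and repeat. A minimal greedy-decoding sketch over the class above (random weights, so the output is meaningless; the loop and shapes are the point):

python
@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=20):
    # input_ids: [batch_size, seq_len]; appends one token per step
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(input_ids)                         # [batch, seq, vocab]
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids

model = GPTModel(vocab_size=1000, d_model=128, nhead=4, n_layers=2, d_ff=512)
tokens = greedy_generate(model, torch.randint(0, 1000, (1, 5)))
print(tokens.shape)  # torch.Size([1, 25])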

1.2 Evolution of the GPT Series

GPT-1 (2018)

  • Parameters: 117M
  • Architecture: 12-layer Transformer Decoder
  • Highlights: unsupervised pre-training + supervised fine-tuning

GPT-2 (2019)

  • Parameters: 1.5B
  • Architecture: 48-layer Transformer Decoder
  • Highlights: zero-shot learning

GPT-3 (2020)

  • Parameters: 175B
  • Architecture: 96-layer Transformer Decoder
  • Highlights: few-shot (in-context) learning

GPT-4 (2023)

  • Parameters: not disclosed (unverified estimates around 1.76T)
  • Architecture: multimodal Transformer
  • Highlights: multimodal input, stronger reasoning

1.3 GPT: Strengths and Weaknesses

Strengths

  • Strong generation ability
  • Well suited to text-generation tasks
  • Zero-shot/few-shot learning
  • Strong reasoning ability

Weaknesses

  • Unidirectional attention cannot use future context
  • Less effective for pure understanding (encoding) tasks
  • High training cost

2. The BERT Family

2.1 BERT Architecture Characteristics

Core characteristics

  • Encoder-only architecture: uses only the Transformer encoder
  • Bidirectional attention: every position can see the whole sequence
  • Autoencoding objective: Masked Language Modeling
  • Understanding-focused: well suited to NLU tasks

Architecture diagram

Input → Embedding → Position Encoding →
Transformer Encoder Layers →
Output → Task-specific Head

Code example

python
class BERTEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Bidirectional self-attention (batch_first for [batch, seq, feature] tensors)
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: key padding mask [batch_size, seq_len]; True marks padding
                  positions to be ignored (PyTorch convention)
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Self-attention (no causal mask: every token sees the whole sequence)
        attn_output, _ = self.self_attn(x, x, x, key_padding_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Feed-forward network
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        
        return x

class BERTModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, n_layers, d_ff, max_len=512):
        super().__init__()
        
        # Token embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Learned position embeddings
        self.position_embedding = nn.Embedding(max_len, d_model)
        
        # Encoder layers
        self.encoder_layers = nn.ModuleList([
            BERTEncoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        
        self.max_len = max_len
        self.d_model = d_model
    
    def forward(self, input_ids, attention_mask=None):
        """
        Args:
            input_ids: input token IDs [batch_size, seq_len]
            attention_mask: 1 for real tokens, 0 for padding [batch_size, seq_len]
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        batch_size, seq_len = input_ids.shape
        
        # Token embeddings
        token_embeds = self.token_embedding(input_ids)
        
        # Position embeddings
        positions = torch.arange(seq_len, device=input_ids.device)
        position_embeds = self.position_embedding(positions)
        
        # Combine embeddings
        x = token_embeds + position_embeds
        
        # Convert the 1/0 attention mask to PyTorch's key_padding_mask,
        # which expects True at positions to ignore
        key_padding_mask = (attention_mask == 0) if attention_mask is not None else None
        
        # Encoder layers
        for layer in self.encoder_layers:
            x = layer(x, key_padding_mask)
        
        return x
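
BERT's masked-language-modeling objective is easy to sketch on top of this encoder. The snippet below is illustrative: the 15% masking rate follows the BERT paper, but the mask token ID is an arbitrary assumption, and real BERT replaces only 80% of selected tokens with [MASK] (10% random, 10% left unchanged):

python
model = BERTModel(vocab_size=1000, d_model=128, nhead=4, n_layers=2, d_ff=512)
mlm_head = nn.Linear(128, 1000)  # projects hidden states back onto the vocabulary

def mlm_loss(input_ids, mask_token_id=103, mask_prob=0.15):
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob   # choose ~15% of positions
    labels[~masked] = -100                             # ignored by cross_entropy
    corrupted = input_ids.masked_fill(masked, mask_token_id)
    
    logits = mlm_head(model(corrupted))                # [batch, seq, vocab]
    return nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )

print(mlm_loss(torch.randint(0, 1000, (2, 16))).item())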

2.2 Evolution of the BERT Series

BERT-Base (2018)

  • Parameters: 110M
  • Architecture: 12-layer Transformer Encoder
  • Highlights: bidirectional pre-training

BERT-Large (2018)

  • Parameters: 340M
  • Architecture: 24-layer Transformer Encoder
  • Highlights: a larger model

RoBERTa (2019)

  • Parameters: 355M
  • Architecture: 24-layer Transformer Encoder
  • Highlights: optimized training recipe

DeBERTa (2020)

  • Parameters: 1.5B
  • Architecture: 48-layer Transformer Encoder
  • Highlights: disentangled attention

2.3 BERT: Strengths and Weaknesses

Strengths

  • Bidirectional attention gives strong understanding
  • Well suited to NLU tasks
  • Effective pre-training
  • Can be fine-tuned for many tasks

Weaknesses

  • Weak generation ability
  • Not suited to text generation
  • Limited reasoning ability

3. The T5 Family

3.1 T5 Architecture Characteristics

Core characteristics

  • Encoder-Decoder architecture: encoder plus decoder
  • Text-to-text framework: every task is cast as text in, text out
  • Bidirectional encoding + unidirectional decoding: combines the strengths of both
  • Unified task format: simplifies task handling

Architecture diagram

Input → Embedding → Position Encoding →
Encoder Layers →
Decoder Layers →
Output Probabilities → Output

Code example

python
class T5EncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Self-attention (batch_first for [batch, seq, feature] tensors)
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: key padding mask [batch_size, seq_len]; True marks padding
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Self-attention
        attn_output, _ = self.self_attn(x, x, x, key_padding_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Feed-forward network
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        
        return x

class T5DecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Causal self-attention
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # Encoder-decoder (cross) attention
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                                batch_first=True)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
    
    def forward(self, x, encoder_output, self_mask=None, cross_mask=None):
        """
        Args:
            x: decoder input [batch_size, tgt_len, d_model]
            encoder_output: encoder output [batch_size, src_len, d_model]
            self_mask: causal mask [tgt_len, tgt_len]
            cross_mask: encoder key padding mask [batch_size, src_len];
                        True marks source padding positions
        
        Returns:
            output: [batch_size, tgt_len, d_model]
        """
        # Causal self-attention
        attn_output, _ = self.self_attn(x, x, x, attn_mask=self_mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Cross-attention: queries from the decoder, keys/values from the encoder
        cross_output, _ = self.cross_attn(x, encoder_output, encoder_output,
                                          key_padding_mask=cross_mask)
        x = x + self.dropout2(cross_output)
        x = self.norm2(x)
        
        # Feed-forward network
        ffn_output = self.ffn(x)
        x = x + self.dropout3(ffn_output)
        x = self.norm3(x)
        
        return x

class T5Model(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, n_layers, d_ff, max_len=512):
        super().__init__()
        
        # Token embeddings (shared between encoder and decoder)
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Learned position embeddings (the real T5 uses relative position biases)
        self.position_embedding = nn.Embedding(max_len, d_model)
        
        # Encoder layers
        self.encoder_layers = nn.ModuleList([
            T5EncoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        
        # Decoder layers
        self.decoder_layers = nn.ModuleList([
            T5DecoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        
        # Output projection to vocabulary logits
        self.output_layer = nn.Linear(d_model, vocab_size)
        
        self.max_len = max_len
        self.d_model = d_model
    
    def forward(self, input_ids, decoder_input_ids,
                encoder_attention_mask=None, decoder_attention_mask=None):
        """
        Args:
            input_ids: encoder input [batch_size, src_len]
            decoder_input_ids: decoder input [batch_size, tgt_len]
            encoder_attention_mask: 1 for real tokens, 0 for padding [batch_size, src_len]
            decoder_attention_mask: same convention for the decoder [batch_size, tgt_len]
        
        Returns:
            logits: output logits [batch_size, tgt_len, vocab_size]
        """
        batch_size, src_len = input_ids.shape
        _, tgt_len = decoder_input_ids.shape
        
        # Encoder embeddings
        token_embeds = self.token_embedding(input_ids)
        positions = torch.arange(src_len, device=input_ids.device)
        position_embeds = self.position_embedding(positions)
        encoder_input = token_embeds + position_embeds
        
        # Convert the 1/0 mask to PyTorch's key_padding_mask (True = ignore)
        enc_padding_mask = (encoder_attention_mask == 0) \
            if encoder_attention_mask is not None else None
        
        # Encoder
        encoder_output = encoder_input
        for layer in self.encoder_layers:
            encoder_output = layer(encoder_output, enc_padding_mask)
        
        # Decoder embeddings
        decoder_token_embeds = self.token_embedding(decoder_input_ids)
        decoder_positions = torch.arange(tgt_len, device=decoder_input_ids.device)
        decoder_position_embeds = self.position_embedding(decoder_positions)
        decoder_input = decoder_token_embeds + decoder_position_embeds
        
        # Additive causal mask for decoder self-attention
        causal_mask = torch.triu(
            torch.full((tgt_len, tgt_len), float('-inf'),
                       device=decoder_input_ids.device),
            diagonal=1
        )
        
        # Decoder: cross-attention must mask *encoder* padding, so it receives
        # the encoder padding mask (decoder padding is usually handled in the
        # loss, so decoder_attention_mask is not applied here)
        decoder_output = decoder_input
        for layer in self.decoder_layers:
            decoder_output = layer(decoder_output, encoder_output,
                                   causal_mask, enc_padding_mask)
        
        # Output projection
        logits = self.output_layer(decoder_output)
        
        return logits
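
The text-to-text framing is easiest to see through the task prefixes T5 was trained with: translation, summarization, classification, and more all become plain "text in, text out". A quick sketch using the public t5-small checkpoint (requires the Hugging Face transformers and sentencepiece packages):

python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is expressed as plain text via a prefix
for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: The Transformer relies entirely on attention, dispensing "
    "with recurrence and convolutions entirely.",
]:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = t5.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))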

3.2 Evolution of the T5 Series

T5-Small (2019)

  • Parameters: 60M
  • Architecture: 6-layer Encoder + 6-layer Decoder
  • Highlights: compact model

T5-Base (2019)

  • Parameters: 220M
  • Architecture: 12-layer Encoder + 12-layer Decoder
  • Highlights: base model

T5-Large (2019)

  • Parameters: 770M
  • Architecture: 24-layer Encoder + 24-layer Decoder
  • Highlights: large model

T5-11B (2020)

  • Parameters: 11B
  • Architecture: 24-layer Encoder + 24-layer Decoder (much wider layers)
  • Highlights: very large-scale model

3.3 T5: Strengths and Weaknesses

Strengths

  • Unified text-to-text framework
  • Suits a wide range of tasks
  • Encoder-decoder architecture
  • Can both understand and generate

Weaknesses

  • High compute cost
  • Complex to train
  • Slow inference

4. The LLaMA Family

4.1 LLaMA Architecture Characteristics

Core characteristics

  • Decoder-only architecture: similar to GPT
  • RMSNorm: more stable normalization
  • SwiGLU activation: better performance
  • Rotary position embeddings (RoPE): better positional representation

Architecture improvements

python
import torch.nn.functional as F  # for F.silu below (also used by later examples)

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: normalized tensor (scales by the RMS only; unlike LayerNorm,
            no mean subtraction and no bias)
        """
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Gated feed-forward: SiLU(gate(x)) elementwise-multiplied with up(x)
        gate = self.gate(x)
        up = self.up(x)
        x = F.silu(gate) * up
        x = self.down(x)
        return x

class RotaryPositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len
        
        # Per-dimension rotation frequencies
        inv_freq = 1.0 / (10000 ** (torch.arange(0, d_model, 2).float() / d_model))
        self.register_buffer('inv_freq', inv_freq)
    
    def forward(self, seq_len):
        """
        Args:
            seq_len: sequence length
        
        Returns:
            cos: cosine table [seq_len, d_model]
            sin: sine table [seq_len, d_model]
        """
        t = torch.arange(seq_len, device=self.inv_freq.device,
                         dtype=self.inv_freq.dtype)
        freqs = torch.einsum('i,j->ij', t, self.inv_freq)
        freqs = torch.cat((freqs, freqs), dim=-1)
        
        cos = freqs.cos()
        sin = freqs.sin()
        
        return cos, sin
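
# The class above only builds the cos/sin tables; applying the rotation is a
# separate step. A common helper (GPT-NeoX-style half-rotation, matching the
# cat((freqs, freqs)) layout above; a sketch, not LLaMA's exact source code):

def rotate_half(x):
    # Swap the two halves of the last dimension, negating the second half
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # q, k: [batch, seq_len, d_model]; cos, sin: [seq_len, d_model],
    # broadcast over the batch dimension
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin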

class LLaMADecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # RMSNorm, applied before each sublayer (pre-norm, unlike the
        # post-norm GPT example earlier)
        self.norm1 = RMSNorm(d_model)
        self.norm2 = RMSNorm(d_model)
        
        # Multi-head attention
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # SwiGLU feed-forward network
        self.ffn = SwiGLU(d_model, d_ff)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: causal mask [seq_len, seq_len]
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Pre-norm + self-attention
        x_norm = self.norm1(x)
        attn_output, _ = self.self_attn(x_norm, x_norm, x_norm, attn_mask=mask)
        x = x + self.dropout1(attn_output)
        
        # Pre-norm + SwiGLU feed-forward
        x_norm = self.norm2(x)
        ffn_output = self.ffn(x_norm)
        x = x + self.dropout2(ffn_output)
        
        return x
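
A quick shape check of the layer (illustrative sizes; mask=None here, so no causal masking is applied):

python
layer = LLaMADecoderLayer(d_model=128, nhead=4, d_ff=256)
x = torch.randn(2, 10, 128)
print(layer(x).shape)  # torch.Size([2, 10, 128])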

4.2 Evolution of the LLaMA Series

LLaMA-7B (2023)

  • Parameters: 7B
  • Architecture: 32-layer Transformer Decoder
  • Highlights: small open-weight model

LLaMA-13B (2023)

  • Parameters: 13B
  • Architecture: 40-layer Transformer Decoder
  • Highlights: mid-size open-weight model

LLaMA-33B (2023)

  • Parameters: 33B
  • Architecture: 60-layer Transformer Decoder
  • Highlights: large open-weight model

LLaMA-65B (2023)

  • Parameters: 65B
  • Architecture: 80-layer Transformer Decoder
  • Highlights: very large open-weight model

4.3 LLaMA: Strengths and Weaknesses

Strengths

  • Openly available weights
  • Excellent performance
  • High-quality training data
  • Optimized architecture

Weaknesses

  • Requires substantial compute
  • License restrictions on commercial use
  • Limited Chinese-language support

5. The PaLM Family

5.1 PaLM Architecture Characteristics

Core characteristics

  • Decoder-only architecture: dense, similar to GPT (PaLM itself does not use MoE)
  • Parallel blocks: attention and feed-forward are computed in parallel within each layer
  • Pathways system: large-scale parallel training across TPU pods
  • Multilingual support: trained on a heavily multilingual corpus

MoE architecture

Mixture-of-Experts comes up often around models at this scale, but note that PaLM is dense; Google's MoE work appears in models such as GLaM and Switch Transformer. The sketch below illustrates the top-k expert-routing technique itself.

python
class MoELayer(nn.Module):
    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        
        # Gating network: one score per expert per token
        self.gate = nn.Linear(d_model, n_experts)
        
        # Expert networks (here: up-projections sharing one down-projection)
        self.experts = nn.ModuleList([
            nn.Linear(d_model, d_model * 4)
            for _ in range(n_experts)
        ])
        
        # Shared output projection back to d_model
        self.output_proj = nn.Linear(d_model * 4, d_model)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        batch_size, seq_len, d_model = x.shape
        
        # Gate weights over experts; keep only the top-k per token and renormalize
        gate_weights = F.softmax(self.gate(x), dim=-1)  # [batch, seq, n_experts]
        top_k_weights, top_k_indices = torch.topk(gate_weights, self.top_k, dim=-1)
        top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)
        
        # Scatter the kept weights back into a dense [batch, seq, n_experts]
        # tensor (zero weight = expert unused for that token)
        dense_weights = torch.zeros_like(gate_weights)
        dense_weights.scatter_(-1, top_k_indices, top_k_weights)
        
        # Didactic dense version: evaluate every expert on every token and take
        # the gate-weighted sum. Real MoE layers route tokens to experts so that
        # only the top-k experts' FLOPs are spent per token.
        hidden = x.new_zeros(batch_size, seq_len, d_model * 4)
        for i, expert in enumerate(self.experts):
            hidden = hidden + dense_weights[..., i:i+1] * expert(x)
        
        # Output projection
        output = self.output_proj(hidden)
        
        return output
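
A quick shape check of the layer above (illustrative sizes, random input):

python
moe = MoELayer(d_model=64, n_experts=8, top_k=2)
x = torch.randn(2, 10, 64)
print(moe(x).shape)  # torch.Size([2, 10, 64])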

5.2 Evolution of the PaLM Series

PaLM-540B (2022)

  • Parameters: 540B
  • Architecture: 118-layer Transformer Decoder (dense)
  • Highlights: ultra-large-scale model

PaLM-2 (2023)

  • Parameters: not disclosed
  • Architecture: improved dense Transformer
  • Highlights: better multilingual support

5.3 PaLM: Strengths and Weaknesses

Strengths

  • Ultra-large scale
  • Multilingual support
  • Efficient training via the Pathways system
  • Strong reasoning ability

Weaknesses

  • Extremely high training cost
  • Closed-source
  • Difficult to deploy

6. The Gemini Family

6.1 Gemini Architecture Characteristics

Core characteristics

  • Multimodal architecture: text, images, audio, and video
  • Mixture of experts: Gemini 1.5 is reported to use an MoE Transformer
  • Long context: up to around a million tokens (Gemini 1.5)
  • Reasoning: strong reasoning capability

Multimodal processing

The sketch below is a toy illustration of modality fusion (encode each modality, then fuse via cross-attention); Gemini's actual architecture has not been published.

python
class MultiModalEncoder(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        
        # Text: embed token IDs, then run a Transformer encoder
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6
        )
        
        # Image encoder: small CNN reduced to a single d_model vector
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(512, d_model)
        )
        
        # Audio encoder: 1D CNN reduced to a single d_model vector
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(256, d_model)
        )
        
        # Modality fusion: text queries attend over all modality tokens
        self.fusion = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
    
    def forward(self, text, image=None, audio=None):
        """
        Args:
            text: token IDs [batch_size, seq_len]
            image: image input [batch_size, 3, H, W]
            audio: audio input [batch_size, 1, audio_len]
        
        Returns:
            output: fused representation [batch_size, seq_len, d_model]
        """
        # Text encoding
        text_output = self.text_encoder(self.text_embedding(text))
        
        # Image encoding -> one "token" per image
        image_output = None
        if image is not None:
            image_output = self.image_encoder(image).unsqueeze(1)  # [batch, 1, d_model]
        
        # Audio encoding -> one "token" per clip
        audio_output = None
        if audio is not None:
            audio_output = self.audio_encoder(audio).unsqueeze(1)  # [batch, 1, d_model]
        
        # Fusion: concatenate all modality tokens, then cross-attend from the text
        extra = [m for m in (image_output, audio_output) if m is not None]
        if extra:
            multimodal_output = torch.cat([text_output] + extra, dim=1)
            output, _ = self.fusion(text_output, multimodal_output,
                                    multimodal_output)
        else:
            output = text_output
        
        return output
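
A shape check with random inputs (illustrative sizes):

python
encoder = MultiModalEncoder(vocab_size=1000, d_model=256)
fused = encoder(
    text=torch.randint(0, 1000, (2, 12)),
    image=torch.randn(2, 3, 224, 224),
    audio=torch.randn(2, 1, 16000),
)
print(fused.shape)  # torch.Size([2, 12, 256])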

6.2 Evolution of the Gemini Series

Gemini Nano (2023)

  • Parameters: 1.8B (Nano-1)
  • Architecture: lightweight multimodal
  • Highlights: on-device deployment

Gemini Pro (2023)

  • Parameters: not disclosed
  • Architecture: mid-size multimodal
  • Highlights: balances performance and cost

Gemini Ultra (2023)

  • Parameters: not disclosed
  • Architecture: large multimodal
  • Highlights: strongest performance

6.3 Gemini: Strengths and Weaknesses

Strengths

  • Strong multimodal capability
  • Long-context support
  • Strong reasoning ability
  • Multilingual support

Weaknesses

  • Closed-source
  • High API cost
  • Complex to deploy

Hands-on Task

Task: compare the outputs of different models

Goal: call several model APIs with the same input and compare the results

Requirements

  1. Prepare the same input for every model
  2. Call each model's API
  3. Compare the outputs
  4. Analyze the differences

Code skeleton

python
import openai
import anthropic
from transformers import AutoTokenizer, AutoModelForCausalLM

# Prepare the input (the API keys below are placeholders, and the local
# Llama 2 checkpoint requires accepting its license on Hugging Face)
prompt = "Please explain the Transformer architecture."

# Call GPT-4
def call_gpt4(prompt):
    client = openai.OpenAI(api_key="your-api-key")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Call Claude
def call_claude(prompt):
    client = anthropic.Anthropic(api_key="your-api-key")
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Call LLaMA (locally)
def call_llama(prompt):
    model_name = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Compare the outputs
gpt4_output = call_gpt4(prompt)
claude_output = call_claude(prompt)
llama_output = call_llama(prompt)

print("GPT-4 output:")
print(gpt4_output)
print("\nClaude output:")
print(claude_output)
print("\nLLaMA output:")
print(llama_output)

Homework

Assignment 1: Architecture comparison table

Task: build a comparison table of mainstream LLM architectures

Requirements

  1. Compare GPT, BERT, T5, LLaMA, PaLM, and Gemini
  2. Cover architecture, parameter count, characteristics, and pros/cons
  3. Produce a comparison report

Assignment 2: Model selection analysis

Task: choose the right model for specific scenarios

Requirements

  1. Analyze several application scenarios
  2. Select a suitable model for each
  3. Justify each choice

Assignment 3: Model performance testing

Task: benchmark different models

Requirements

  1. Design the test tasks
  2. Measure each model's performance
  3. Analyze the performance differences

References

Required reading

  1. Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners". OpenAI Blog.

    • The GPT-2 paper
  2. Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.

    • The BERT paper
  3. Raffel, C., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". JMLR.

    • The T5 paper

Recommended reading

  1. Touvron, H., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models". arXiv.

    • The LLaMA paper
  2. Chowdhery, A., et al. (2022). "PaLM: Scaling Language Modeling with Pathways". arXiv.

    • The PaLM paper

Online resources

  1. Hugging Face Model Hub: https://huggingface.co/models

    • Model repository
  2. Papers with Code: https://paperswithcode.com/

    • Papers paired with code implementations

Extended Reading

Model compression

  • Distillation: knowledge distillation
  • Quantization: model quantization (see the sketch after this list)
  • Pruning: model pruning
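
As a taste of how lightweight compression can be, PyTorch's dynamic quantization converts Linear layers to int8 weights in one call. A sketch reusing the GPTModel class from Section 1 (the accuracy impact should always be measured):

python
model = GPTModel(vocab_size=1000, d_model=128, nhead=4, n_layers=2, d_ff=512)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # only Linear layers are converted
)
print(quantized)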

Efficient architectures

  • Mixture of Experts: expert mixtures (see the MoE sketch in Section 5)
  • Sparse Attention: sparse attention (see the mask sketch after this list)
  • Linear Attention: linear attention
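
For intuition, sparse attention can be expressed as an additive mask over a standard attention layer, e.g. a sliding-window causal mask (a sketch of the general idea, not any specific model's implementation):

python
def sliding_window_causal_mask(seq_len, window):
    # Position i may attend to positions j with i - window < j <= i
    idx = torch.arange(seq_len)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)  # dist[i, j] = i - j
    allowed = (dist >= 0) & (dist < window)
    return torch.zeros(seq_len, seq_len).masked_fill(~allowed, float('-inf'))

print(sliding_window_causal_mask(5, 2))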

Coming Up Next

In the next session we will take a detailed look at major Chinese LLMs, covering the characteristics and applications of 文心一言, 通义千问, 混元, 豆包, GLM, Kimi, DeepSeek, Yi, and others.

