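Day 12: Comparing Mainstream LLM Architectures

Learning Objectives

This lesson examines how the mainstream LLM architectures differ. It covers the core characteristics of GPT, BERT, T5, LLaMA, PaLM, and Gemini, compares their strengths and weaknesses, and shows how to choose a suitable model for a given task.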

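Course Content

1. GPT Series

1.1 GPT Architecture

The GPT series uses a decoder-only architecture. It generates text autoregressively, one token at a time, and a causal mask restricts each position to the tokens before it, so attention is unidirectional (left to right). This design makes GPT strong at text generation and well suited to dialogue, writing, and other tasks that require continuous generation.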
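
The autoregressive loop described above can be sketched without any deep-learning library. The bigram table `BIGRAM_SCORES` and the helper names below are hypothetical stand-ins for a trained decoder; only the control flow (predict, append, feed back) is the point.

```python
# Toy next-token scores: a hypothetical bigram table, not a trained model.
BIGRAM_SCORES = {
    "<bos>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "<eos>": 0.2},
    "sat": {"<eos>": 1.0},
    "mat": {"<eos>": 1.0},
}


def next_token_scores(prefix):
    """Stand-in for a decoder forward pass; it only ever sees the prefix."""
    return BIGRAM_SCORES[prefix[-1]]


def greedy_generate(max_new_tokens=10):
    tokens = ["<bos>"]
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)   # no access to future tokens
        best = max(scores, key=scores.get)   # greedy: take the highest score
        tokens.append(best)                  # the choice becomes new input
        if best == "<eos>":
            break
    return tokens


print(greedy_generate())
```

A real model replaces the table lookup with a forward pass over the whole prefix, and usually samples from the distribution instead of always taking the maximum.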

Architecture diagram:

Input → Embedding → Position Encoding → 
Transformer Decoder Layers → 
Output Probabilities → Output

Code example:

python

import torch
import torch.nn as nn

class GPTDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Causal self-attention; batch_first=True matches [batch, seq, d_model] inputs
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # Feed-forward network (GPT models use GELU)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: causal mask [seq_len, seq_len], -inf where attention is blocked
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Self-attention with residual connection.
        # Post-LayerNorm as in GPT-1; GPT-2 and later normalize before each sub-layer.
        attn_output, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Feed-forward network with residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        
        return x

class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, n_layers, d_ff, max_len=512):
        super().__init__()
        
        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Learned position embedding
        self.position_embedding = nn.Embedding(max_len, d_model)
        
        # Decoder layers
        self.decoder_layers = nn.ModuleList([
            GPTDecoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        
        # Output projection to vocabulary logits
        self.output_layer = nn.Linear(d_model, vocab_size)
        
        self.max_len = max_len
        self.d_model = d_model
    
    def forward(self, input_ids):
        """
        Args:
            input_ids: input token IDs [batch_size, seq_len], seq_len <= max_len
        
        Returns:
            logits: [batch_size, seq_len, vocab_size]
        """
        batch_size, seq_len = input_ids.shape
        
        # Token embedding
        token_embeds = self.token_embedding(input_ids)
        
        # Position embedding (broadcast over the batch dimension)
        positions = torch.arange(seq_len, device=input_ids.device)
        position_embeds = self.position_embedding(positions)
        
        # Combine embeddings
        x = token_embeds + position_embeds
        
        # Causal mask on the same device as the input:
        # 0 on and below the diagonal, -inf above it
        mask = torch.triu(
            torch.full((seq_len, seq_len), float('-inf'), device=input_ids.device),
            diagonal=1,
        )
        
        # Decoder layers
        for layer in self.decoder_layers:
            x = layer(x, mask)
        
        # Output logits
        logits = self.output_layer(x)
        
        return logits

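1.2 Evolution of the GPT Series

GPT-1 (2018):

GPT-1 has 117M parameters and a 12-layer Transformer decoder. Its defining idea is unsupervised pre-training followed by supervised fine-tuning.

GPT-2 (2019):

GPT-2 has 1.5B parameters and a 48-layer Transformer decoder. It is known for zero-shot learning.

GPT-3 (2020):

GPT-3 has 175B parameters and a 96-layer Transformer decoder. It is known for few-shot (in-context) learning.

GPT-4 (2023):

OpenAI has not disclosed GPT-4's parameter count or architecture details; figures such as 1.76T are unverified third-party estimates. The model is multimodal (it accepts image and text input) and shows stronger reasoning than its predecessors.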
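
The parameter counts listed for GPT-1 through GPT-3 can be roughly reproduced from layer count and hidden size. The sketch below uses the common approximation of about 12 × d_model² weights per layer; the `d_model` and vocabulary values are assumptions taken from the published model descriptions and are not stated in this lesson, and `estimate_params` is an illustrative helper name.

```python
def estimate_params(n_layers, d_model, vocab_size):
    """Rough decoder-only parameter count.

    Each layer holds about 4*d_model^2 attention weights (Q, K, V, output)
    plus 8*d_model^2 feed-forward weights (assuming d_ff = 4*d_model).
    Biases, LayerNorm and position embeddings are ignored.
    """
    per_layer = 12 * d_model ** 2
    embedding = vocab_size * d_model
    return n_layers * per_layer + embedding


# (n_layers, d_model, vocab_size); d_model and vocab_size are assumptions
# from the published model descriptions, not from this lesson.
CONFIGS = {
    "GPT-1": (12, 768, 40478),
    "GPT-2": (48, 1600, 50257),
    "GPT-3": (96, 12288, 50257),
}

for name, cfg in CONFIGS.items():
    print(f"{name}: ~{estimate_params(*cfg) / 1e9:.2f}B parameters")
```

The estimates land close to the official 117M, 1.5B, and 175B figures, which shows that almost all parameters sit in the attention and feed-forward matrices.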

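1.3 Strengths and Weaknesses of GPT

Strengths:

GPT generates text well, handles zero-shot and few-shot tasks from a prompt alone, and reasons well at larger scales. These strengths show up in text generation, dialogue, and code generation.

Weaknesses:

Unidirectional attention cannot use context to the right of a token, so GPT is less efficient on encoding-style tasks that need a view of the whole text. Training is also expensive.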

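2. BERT Series

2.1 BERT Architecture

Core characteristics:

BERT uses an encoder-only architecture with bidirectional attention, so every position can see the whole sequence. It is pre-trained with a self-encoding objective, Masked Language Modeling (MLM), and is oriented toward understanding, which makes it a good fit for NLU tasks.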
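
The Masked Language Modeling objective mentioned above can be sketched in a few lines. The version below follows the 80/10/10 replacement rule from the BERT paper; the function name `mask_tokens`, the whitespace tokenization, and the tiny vocabulary are illustrative assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using BERT's 80/10/10 rule.

    labels[i] holds the original token where a prediction is required
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)                 # untouched, no loss here
            labels.append(None)
            continue
        labels.append(tok)                        # model must recover this
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: random token
        else:
            corrupted.append(tok)                 # 10%: keep the original
    return corrupted, labels


tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, sorted(set(tokens)), mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

Because the encoder attends in both directions, it can use the words on either side of a masked position to predict it, which a causal decoder cannot do.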

Architecture diagram:

Input → Embedding → Position Encoding → 
Transformer Encoder Layers → 
Output → Task-specific Head

Code example:

python

import torch
import torch.nn as nn

class BERTEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Bidirectional self-attention; batch_first=True matches [batch, seq, d_model]
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # Feed-forward network (BERT uses GELU)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: key padding mask [batch_size, seq_len], True at padded positions
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Self-attention over the full sequence (no causal mask)
        attn_output, _ = self.self_attn(x, x, x, key_padding_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Feed-forward network
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        
        return x

class BERTModel(nn.Module):
    # Simplified: the real BERT also adds segment embeddings and
    # applies LayerNorm and dropout to the summed embeddings.
    def __init__(self, vocab_size, d_model, nhead, n_layers, d_ff, max_len=512):
        super().__init__()
        
        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Learned position embedding
        self.position_embedding = nn.Embedding(max_len, d_model)
        
        # Encoder layers
        self.encoder_layers = nn.ModuleList([
            BERTEncoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        
        self.max_len = max_len
        self.d_model = d_model
    
    def forward(self, input_ids, attention_mask=None):
        """
        Args:
            input_ids: input token IDs [batch_size, seq_len]
            attention_mask: [batch_size, seq_len], 1 for real tokens, 0 for padding
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        batch_size, seq_len = input_ids.shape
        
        # Token embedding
        token_embeds = self.token_embedding(input_ids)
        
        # Position embedding
        positions = torch.arange(seq_len, device=input_ids.device)
        position_embeds = self.position_embedding(positions)
        
        # Combine embeddings
        x = token_embeds + position_embeds
        
        # nn.MultiheadAttention expects True where a key should be ignored,
        # which is the inverse of the usual 1 = keep attention mask
        padding_mask = (attention_mask == 0) if attention_mask is not None else None
        
        # Encoder layers
        for layer in self.encoder_layers:
            x = layer(x, padding_mask)
        
        return x

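2.2 Evolution of the BERT Series

BERT-Base (2018):

BERT-Base has 110M parameters and a 12-layer Transformer encoder. It introduced bidirectional pre-training.

BERT-Large (2018):

BERT-Large has 340M parameters and a 24-layer Transformer encoder. It is a scaled-up version of the same design.

RoBERTa (2019):

RoBERTa (large) has 355M parameters and a 24-layer Transformer encoder. It keeps the architecture and improves the training recipe.

DeBERTa (2020):

The largest DeBERTa model has 1.5B parameters and a 48-layer Transformer encoder. Its main change is disentangled attention, which represents content and position separately.

2.3 Strengths and Weaknesses of BERT

Strengths:

Bidirectional attention gives BERT strong language understanding. It suits NLU tasks, pre-trains effectively, and can be fine-tuned for a wide range of downstream tasks.

Weaknesses:

BERT is weak at generation and is not suited to producing free-form text. Its reasoning ability is also limited.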

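3. T5 Series

3.1 T5 Architecture

Core characteristics:

T5 uses an encoder-decoder architecture inside a text-to-text framework: every task is cast as text in, text out. The encoder reads the input bidirectionally and the decoder generates left to right, which combines the strengths of both designs, and the single task format simplifies how tasks are handled.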
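
The text-to-text idea is easiest to see as string formatting. The sketch below renders three task types as prefixed input strings in the style used in the T5 paper; the helper name `to_text_to_text` and its keyword arguments are hypothetical.

```python
def to_text_to_text(task, **fields):
    """Render one task instance as a single prefixed input string."""
    if task == "translate":
        return f"translate {fields['src']} to {fields['tgt']}: {fields['text']}"
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "classify":
        # The label is produced as text too, e.g. "positive" or "negative".
        return f"{fields['dataset']} sentence: {fields['text']}"
    raise ValueError(f"unknown task: {task}")


examples = [
    to_text_to_text("translate", src="English", tgt="German", text="That is good."),
    to_text_to_text("summarize", text="state authorities dispatched emergency crews"),
    to_text_to_text("classify", dataset="sst2", text="a charming and affecting journey."),
]
for line in examples:
    print(line)
```

Since inputs and targets are both plain strings, one model, one loss, and one decoding procedure cover translation, summarization, and classification alike.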

Architecture diagram:

Input → Embedding → Position Encoding → 
Encoder Layers → 
Decoder Layers → 
Output Probabilities → Output

Code example:

python

import torch
import torch.nn as nn

# Simplified sketch: the real T5 uses relative position biases instead of
# absolute position embeddings and a bias-free, RMS-style layer norm applied
# before each sub-layer.

class T5EncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Bidirectional self-attention
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: key padding mask [batch_size, seq_len], True at padded positions
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Self-attention
        attn_output, _ = self.self_attn(x, x, x, key_padding_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Feed-forward network
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        
        return x

class T5DecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # Causal self-attention over the target sequence
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # Encoder-decoder (cross) attention
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                                batch_first=True)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
    
    def forward(self, x, encoder_output, self_mask=None, cross_mask=None,
                self_padding_mask=None):
        """
        Args:
            x: decoder input [batch_size, tgt_len, d_model]
            encoder_output: encoder output [batch_size, src_len, d_model]
            self_mask: boolean causal mask [tgt_len, tgt_len], True = blocked
            cross_mask: encoder padding mask [batch_size, src_len], True = padded
            self_padding_mask: decoder padding mask [batch_size, tgt_len], True = padded
        
        Returns:
            output: [batch_size, tgt_len, d_model]
        """
        # Causal self-attention
        attn_output, _ = self.self_attn(x, x, x, attn_mask=self_mask,
                                        key_padding_mask=self_padding_mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        
        # Cross-attention: the keys are encoder positions, so the padding
        # mask must come from the encoder input, not the decoder input
        cross_output, _ = self.cross_attn(x, encoder_output, encoder_output, 
                                          key_padding_mask=cross_mask)
        x = x + self.dropout2(cross_output)
        x = self.norm2(x)
        
        # Feed-forward network
        ffn_output = self.ffn(x)
        x = x + self.dropout3(ffn_output)
        x = self.norm3(x)
        
        return x

class T5Model(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, n_layers, d_ff, max_len=512):
        super().__init__()
        
        # Token embedding (shared by encoder and decoder)
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Position embedding
        self.position_embedding = nn.Embedding(max_len, d_model)
        
        # Encoder layers
        self.encoder_layers = nn.ModuleList([
            T5EncoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        
        # Decoder layers
        self.decoder_layers = nn.ModuleList([
            T5DecoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        
        # Output projection to vocabulary logits
        self.output_layer = nn.Linear(d_model, vocab_size)
        
        self.max_len = max_len
        self.d_model = d_model
    
    def forward(self, input_ids, decoder_input_ids, 
                encoder_attention_mask=None, decoder_attention_mask=None):
        """
        Args:
            input_ids: encoder input [batch_size, src_len]
            decoder_input_ids: decoder input [batch_size, tgt_len]
            encoder_attention_mask: [batch_size, src_len], 1 = token, 0 = padding
            decoder_attention_mask: [batch_size, tgt_len], 1 = token, 0 = padding
        
        Returns:
            logits: [batch_size, tgt_len, vocab_size]
        """
        batch_size, src_len = input_ids.shape
        _, tgt_len = decoder_input_ids.shape
        device = input_ids.device
        
        # Convert 1 = keep masks into the True = ignore form PyTorch expects
        enc_padding = (encoder_attention_mask == 0) if encoder_attention_mask is not None else None
        dec_padding = (decoder_attention_mask == 0) if decoder_attention_mask is not None else None
        
        # Encoder embeddings
        token_embeds = self.token_embedding(input_ids)
        positions = torch.arange(src_len, device=device)
        position_embeds = self.position_embedding(positions)
        encoder_input = token_embeds + position_embeds
        
        # Encoder
        encoder_output = encoder_input
        for layer in self.encoder_layers:
            encoder_output = layer(encoder_output, enc_padding)
        
        # Decoder embeddings
        decoder_token_embeds = self.token_embedding(decoder_input_ids)
        decoder_positions = torch.arange(tgt_len, device=device)
        decoder_position_embeds = self.position_embedding(decoder_positions)
        decoder_input = decoder_token_embeds + decoder_position_embeds
        
        # Boolean causal mask: True above the diagonal blocks future positions
        causal_mask = torch.triu(
            torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=device),
            diagonal=1,
        )
        
        # Decoder
        decoder_output = decoder_input
        for layer in self.decoder_layers:
            decoder_output = layer(decoder_output, encoder_output, 
                                   causal_mask, enc_padding, dec_padding)
        
        # Output logits
        logits = self.output_layer(decoder_output)
        
        return logits

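3.2 Evolution of the T5 Series

T5-Small (2019):

T5-Small has 60M parameters, with a 6-layer encoder and a 6-layer decoder. It is the small model in the family.

T5-Base (2019):

T5-Base has 220M parameters, with a 12-layer encoder and a 12-layer decoder. It is the baseline model.

T5-Large (2019):

T5-Large has 770M parameters, with a 24-layer encoder and a 24-layer decoder.

T5-11B (2019):

T5-11B has 11B parameters. It keeps 24 encoder and 24 decoder layers and grows mainly through much wider feed-forward layers and more attention heads.

3.3 Strengths and Weaknesses of T5

Strengths:

T5 offers a single text-to-text framework that fits many tasks. Its encoder-decoder architecture can both understand and generate text.

Weaknesses:

T5 is computationally expensive, more complex to train, and slower at inference.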

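4. LLaMA Series

4.1 LLaMA Architecture

Core characteristics:

LLaMA uses a decoder-only architecture similar to GPT, with three changes: RMSNorm applied before each sub-layer for more stable training, the SwiGLU activation for better quality, and rotary position embeddings (RoPE) for a better representation of position.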
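
To make the difference between LayerNorm and RMSNorm concrete, the following sketch computes both on a small vector in plain Python. The learned scale and shift parameters are left out for brevity, and the function names are illustrative.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale/shift: subtract mean, divide by std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """RMSNorm without learned scale: divide by the root mean square only."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 3) for v in layer_norm(x)])   # centered around zero
print([round(v, 3) for v in rms_norm(x)])     # rescaled, not centered
```

RMSNorm skips the mean subtraction, so it needs one fewer reduction per call and keeps the direction of the input vector unchanged.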

Architecture improvements:

python

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: normalized tensor with the same shape
        """
        # Mean of squares over the feature dimension (no mean subtraction)
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # SiLU-activated gate multiplied elementwise with the up projection
        gate = self.gate(x)
        up = self.up(x)
        x = F.silu(gate) * up
        x = self.down(x)
        return x

class RotaryPositionalEmbedding(nn.Module):
    def __init__(self, dim, max_len=512):
        """dim is the size of the vectors being rotated; in practice this is
        the per-head dimension (d_model // nhead), not d_model."""
        super().__init__()
        self.dim = dim
        self.max_len = max_len
        
        # Inverse frequencies, one per pair of dimensions
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
    
    def forward(self, seq_len):
        """
        Args:
            seq_len: sequence length
        
        Returns:
            cos: cosine table [seq_len, dim]
            sin: sine table [seq_len, dim]
        """
        # Positions as floats so they share a dtype with inv_freq
        t = torch.arange(seq_len, device=self.inv_freq.device,
                         dtype=self.inv_freq.dtype)
        freqs = torch.outer(t, self.inv_freq)       # [seq_len, dim / 2]
        freqs = torch.cat((freqs, freqs), dim=-1)   # [seq_len, dim]
        
        cos = freqs.cos()
        sin = freqs.sin()
        
        return cos, sin

class LLaMADecoderLayer(nn.Module):
    # Simplified: nn.MultiheadAttention offers no hook for rotating queries and
    # keys, so this layer does not apply the rotary tables defined above.
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        
        # RMSNorm
        self.norm1 = RMSNorm(d_model)
        self.norm2 = RMSNorm(d_model)
        
        # Multi-head attention
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                               batch_first=True)
        
        # SwiGLU feed-forward network
        self.ffn = SwiGLU(d_model, d_ff)
        
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: causal mask [seq_len, seq_len]
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Pre-norm: RMSNorm, then self-attention, then residual
        x_norm = self.norm1(x)
        attn_output, _ = self.self_attn(x_norm, x_norm, x_norm, attn_mask=mask)
        x = x + self.dropout1(attn_output)
        
        # Pre-norm: RMSNorm, then SwiGLU feed-forward, then residual
        x_norm = self.norm2(x)
        ffn_output = self.ffn(x_norm)
        x = x + self.dropout2(ffn_output)
        
        return x
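
The rotary module above only builds the cosine and sine tables. As a complement, here is a minimal plain-Python sketch of how a rotation is applied to a query or key vector, and why the resulting attention score depends only on the relative offset between positions. It pairs adjacent dimensions, which is one common convention and an assumption here; `rope_rotate` and the sample vectors are illustrative.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to one head vector (even length).

    Dimensions (i, i+1) for even i are treated as a 2-D point and rotated
    by the angle position * base**(-i / d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# The same offset (3) at different absolute positions gives the same score.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair of dimensions is only rotated, vector norms are preserved, and the dot product of a rotated query and key is a function of their position difference.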

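4.2 Evolution of the LLaMA Series

LLaMA-7B (2023):

LLaMA-7B has 7B parameters and a 32-layer Transformer decoder. It is the small open model in the family.

LLaMA-13B (2023):

LLaMA-13B has 13B parameters and a 40-layer Transformer decoder. It is the mid-sized open model.

LLaMA-33B (2023):

LLaMA-33B has 33B parameters and a 60-layer Transformer decoder. It is the large open model.

LLaMA-65B (2023):

LLaMA-65B has 65B parameters and an 80-layer Transformer decoder. It is the largest model in the first LLaMA release.

4.3 Strengths and Weaknesses of LLaMA

Strengths:

LLaMA's weights are openly available, its performance is strong for its size, its training data is of high quality, and its architecture is well optimized.

Weaknesses:

LLaMA still needs substantial compute, the original release restricts commercial use, and its Chinese-language support is limited.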

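5. PaLM Series

5.1 PaLM Architecture

Core characteristics:

PaLM is a dense decoder-only Transformer, similar in outline to GPT. Its distinguishing choices include SwiGLU activations, multi-query attention, rotary position embeddings, and a "parallel" layer formulation in which the attention and feed-forward branches are computed side by side. It was trained with large-scale parallelism on Google's Pathways system and supports many languages. PaLM itself does not use Mixture of Experts (MoE). MoE appears in other large models, such as Switch Transformer and GLaM, and is sketched here because it is a common way to grow parameter count without growing per-token compute.

MoE architecture (illustrative, not part of PaLM):

python

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        
        # Gating (router) network: one logit per expert
        self.gate = nn.Linear(d_model, n_experts)
        
        # Expert networks: each expands d_model to a 4x hidden size
        self.experts = nn.ModuleList([
            nn.Linear(d_model, d_model * 4)
            for _ in range(n_experts)
        ])
        
        # Shared output projection back to d_model
        self.output_proj = nn.Linear(d_model * 4, d_model)
    
    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        batch_size, seq_len, d_model = x.shape
        
        # Gate probabilities
        gate_logits = self.gate(x)  # [batch_size, seq_len, n_experts]
        gate_weights = F.softmax(gate_logits, dim=-1)
        
        # Keep the top-k experts per token and renormalize their weights
        top_k_weights, top_k_indices = torch.topk(gate_weights, self.top_k, dim=-1)
        top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)
        
        # Weighted sum of expert outputs. For clarity every expert runs on
        # every token and unselected experts get weight 0; real systems
        # dispatch each token only to its selected experts.
        hidden = x.new_zeros(batch_size, seq_len, d_model * 4)
        for i, expert in enumerate(self.experts):
            # Weight of expert i per token: [batch_size, seq_len, 1]
            weight = (top_k_weights * (top_k_indices == i)).sum(dim=-1, keepdim=True)
            hidden = hidden + weight * F.gelu(expert(x))
        
        # Output projection
        output = self.output_proj(hidden)
        
        return output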
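
The routing step is the heart of MoE and can be shown for a single token in plain Python. The sketch below picks the top-k experts from gate logits, renormalizes their weights, and reports what share of expert parameters is active per token; the function names and the example logits are illustrative assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def active_fraction(n_experts, top_k):
    """Share of expert parameters used per token (equal-sized experts)."""
    return top_k / n_experts


weights = route_top_k([2.0, 0.5, 1.0, -1.0], top_k=2)
print(weights)                  # experts 0 and 2 are selected
print(active_fraction(8, 2))    # 0.25
```

With 8 experts and top-2 routing, each token touches only a quarter of the expert parameters, which is why an MoE model can hold many more parameters than a dense model at similar per-token cost.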

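5.2 Evolution of the PaLM Series

PaLM-540B (2022):

PaLM-540B has 540B parameters and a 118-layer dense Transformer decoder. It is a very large-scale model.

PaLM 2 (2023):

Google has not officially disclosed PaLM 2's parameter count; its technical report describes the largest PaLM 2 model as significantly smaller than PaLM-540B. It uses an improved Transformer architecture and training mixture, with better multilingual support.

5.3 Strengths and Weaknesses of PaLM

Strengths:

PaLM combines very large scale, multilingual support, efficient large-scale parallel training, and strong reasoning.

Weaknesses:

PaLM is extremely expensive to train, its weights are not open, and it is difficult to deploy.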

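6. Gemini Series

6.1 Gemini Architecture

Core characteristics:

Gemini is built as a multimodal model that handles text, images, audio, and video. Gemini 1.5 adopts a sparse Mixture-of-Experts design and supports long contexts of up to about one million tokens. The family is also known for strong reasoning.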

Multimodal processing (a generic illustration, not Gemini's actual implementation):

python

import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        
        # Text encoder; expects embedded text, batch dimension first
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6
        )
        
        # Image encoder: a small CNN pooled to one d_model vector per image
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(512, d_model)
        )
        
        # Audio encoder: a small 1-D CNN pooled to one d_model vector per clip
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(256, d_model)
        )
        
        # Modality fusion via attention (the keyword is num_heads, not nhead)
        self.fusion = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
    
    def forward(self, text, image=None, audio=None):
        """
        Args:
            text: embedded text [batch_size, seq_len, d_model]
            image: image input [batch_size, 3, H, W]
            audio: audio input [batch_size, 1, audio_len]
        
        Returns:
            output: fused output [batch_size, seq_len, d_model]
        """
        # Text encoding
        text_output = self.text_encoder(text)
        
        # Image encoding
        if image is not None:
            image_output = self.image_encoder(image)
            image_output = image_output.unsqueeze(1)  # [batch_size, 1, d_model]
        else:
            image_output = None
        
        # Audio encoding
        if audio is not None:
            audio_output = self.audio_encoder(audio)
            audio_output = audio_output.unsqueeze(1)  # [batch_size, 1, d_model]
        else:
            audio_output = None
        
        # Modality fusion
        if image_output is not None or audio_output is not None:
            # Concatenate all modalities along the sequence dimension
            multimodal_outputs = [text_output]
            if image_output is not None:
                multimodal_outputs.append(image_output)
            if audio_output is not None:
                multimodal_outputs.append(audio_output)
            
            multimodal_output = torch.cat(multimodal_outputs, dim=1)
            
            # Text positions (queries) attend over every modality (keys/values)
            fused_output, _ = self.fusion(text_output, multimodal_output, 
                                        multimodal_output)
            output = fused_output
        else:
            output = text_output
        
        return output

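6.2 Evolution of the Gemini Series

Gemini Nano (2023):

Gemini Nano comes in two sizes, 1.8B and 3.25B parameters. It is a lightweight multimodal model intended for on-device (mobile) deployment.

Gemini Pro (2023):

Gemini Pro's parameter count is undisclosed. It is a mid-sized multimodal model that balances performance and cost.

Gemini Ultra (2023):

Gemini Ultra's parameter count is undisclosed. It is the largest multimodal model of the first generation, aimed at the highest performance.

6.3 Strengths and Weaknesses of Gemini

Strengths:

Gemini has strong multimodal capabilities, long-context support, strong reasoning, and multilingual support.

Weaknesses:

Gemini's weights are not open, API usage can be costly, and deployment is complex.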

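Practice Task

Task: Compare the outputs of different models

Goal: Call the APIs of different models and compare their outputs.

Requirements:

Prepare the same input for every model
Call each model's API
Compare the outputs
Analyze the differences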

Code skeleton:

python

import os

import openai
import anthropic
from transformers import AutoTokenizer, AutoModelForCausalLM

# Shared input for every model
prompt = "Please explain what the Transformer architecture is."

# Call GPT-4 (reads the key from the OPENAI_API_KEY environment variable)
def call_gpt4(prompt):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Call Claude (reads the key from the ANTHROPIC_API_KEY environment variable)
def call_claude(prompt):
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Call LLaMA locally (the checkpoint is gated and needs access approval)
def call_llama(prompt):
    model_name = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=500)
    # Drop the prompt tokens so only the newly generated text is returned
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return response

# Compare outputs
gpt4_output = call_gpt4(prompt)
claude_output = call_claude(prompt)
llama_output = call_llama(prompt)

print("GPT-4 output:")
print(gpt4_output)
print("\nClaude output:")
print(claude_output)
print("\nLLaMA output:")
print(llama_output)
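
For the "compare" and "analyze" steps, one possible starting point is a small report of response lengths and pairwise word overlap. This is a rough sketch: the helper names are hypothetical, the sample responses are placeholders instead of real model output, and word overlap is only a crude signal that does not segment Chinese text.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Word-set overlap of two texts, from 0.0 (disjoint) to 1.0 (same set)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def compare_outputs(outputs):
    """outputs maps a model name to its response text."""
    names = sorted(outputs)
    report = {"length": {n: len(tokenize(outputs[n])) for n in names}, "overlap": {}}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            report["overlap"][(a, b)] = round(jaccard(outputs[a], outputs[b]), 3)
    return report


# Placeholder responses; in practice pass the strings returned by the API calls.
sample = {
    "model_a": "A Transformer uses self-attention to process tokens in parallel.",
    "model_b": "The Transformer relies on self-attention and processes tokens in parallel.",
}
print(compare_outputs(sample))
```

Numbers like these only flag where responses diverge; judging correctness and depth still requires reading the answers.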

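Homework

Assignment 1: Model architecture comparison table

Topic: Build a comparison table of mainstream LLM architectures.

Requirements:

Compare GPT, BERT, T5, LLaMA, PaLM, and Gemini
Compare them by architecture, parameter count, characteristics, strengths, and weaknesses
Produce a comparison report

Assignment 2: Model selection analysis

Topic: Choose a suitable model for specific scenarios.

Requirements:

Analyze different application scenarios
Choose a suitable model for each scenario
Explain the reasons for each choice

Assignment 3: Model performance testing

Topic: Test the performance of different models.

Requirements:

Design the test tasks
Test how different models perform
Analyze the performance differences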

References

Required Reading

Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners". OpenAI Blog.
- The GPT-2 paper
Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.
- The BERT paper
Raffel, C., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". JMLR.
- The T5 paper

Online Resources

Hugging Face Model Hub: https://huggingface.co/models
- Model repository
Papers with Code: https://paperswithcode.com/
- Papers and code

Further Reading

Model Compression

Model compression techniques include knowledge distillation, quantization, and pruning.
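
Of the compression techniques listed, quantization is the simplest to show in a few lines. The sketch below applies symmetric per-tensor int8 quantization to a short list of weights and measures the round-trip error; this is one basic scheme among many, and the function names are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(round(max_error, 5))
```

Each weight is stored in 8 bits instead of 32, and the error per weight is bounded by half the scale, which is why quantization shrinks models with limited quality loss.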

Efficient Architectures

Efficient architecture techniques include Mixture of Experts (MoE), sparse attention, and linear attention.
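
As one concrete instance of sparse attention, a sliding-window pattern lets each token attend only to itself and a fixed number of preceding tokens. The sketch below builds such a mask in plain Python and counts how many query-key pairs remain compared with full causal attention; the function names and the window size are illustrative assumptions.

```python
def causal_pairs(seq_len):
    """Number of (query, key) pairs under full causal attention."""
    return seq_len * (seq_len + 1) // 2


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Each query sees itself and at most window - 1 previous tokens.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def count_pairs(mask):
    return sum(sum(row) for row in mask)


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(count_pairs(mask), "of", causal_pairs(6), "causal pairs kept")
```

Full causal attention grows quadratically with sequence length, while the windowed pattern grows linearly (roughly seq_len × window), which is the source of the savings.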

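Next Lesson

The next lesson covers large models developed in China in detail, including the characteristics and applications of ERNIE Bot (文心一言), Tongyi Qianwen (通义千问), Hunyuan (混元), Doubao (豆包), GLM, Kimi, DeepSeek, and Yi.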
