Day 12: Comparing Mainstream LLM Architectures
Learning Objectives
- Understand the differences between mainstream LLM architectures
- Master the core characteristics of GPT, BERT, T5, LLaMA, PaLM, and Gemini
- Compare the strengths and weaknesses of different models
- Be able to choose an appropriate model for a given task
Course Content
1. The GPT Series
1.1 GPT Architecture
Core characteristics:
- Decoder-only architecture: uses only the Transformer decoder
- Autoregressive generation: produces output one token at a time
- Causal masking: each position can only attend to earlier positions (see the mask sketch below)
- Left-to-right processing: unidirectional attention
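The causal mask is just an additive matrix: zero on and below the diagonal, negative infinity above it, so each query position can only attend to itself and earlier positions. A quick sketch for a length-4 sequence:

```python
import torch

# Additive causal mask: -inf above the diagonal blocks attention to the future
mask = torch.triu(torch.full((4, 4), float('-inf')), diagonal=1)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```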
Architecture diagram:
Input → Embedding → Position Encoding →
Transformer Decoder Layers →
Output Probabilities → Output

Code example:
```python
import torch
import torch.nn as nn

class GPTDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        # Causal self-attention (batch_first so inputs are [batch, seq, dim])
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: causal mask [seq_len, seq_len]
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Self-attention with residual connection
        attn_output, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        # Feed-forward network with residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        return x

class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, n_layers, d_ff, max_len=512):
        super().__init__()
        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Learned position embedding
        self.position_embedding = nn.Embedding(max_len, d_model)
        # Stack of decoder layers
        self.decoder_layers = nn.ModuleList([
            GPTDecoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        # Output projection to vocabulary logits
        self.output_layer = nn.Linear(d_model, vocab_size)
        self.max_len = max_len
        self.d_model = d_model

    def forward(self, input_ids):
        """
        Args:
            input_ids: token IDs [batch_size, seq_len]
        Returns:
            logits: [batch_size, seq_len, vocab_size]
        """
        batch_size, seq_len = input_ids.shape
        # Token embeddings
        token_embeds = self.token_embedding(input_ids)
        # Position embeddings
        positions = torch.arange(seq_len, device=input_ids.device)
        position_embeds = self.position_embedding(positions)
        # Combine embeddings
        x = token_embeds + position_embeds
        # Causal mask: -inf above the diagonal blocks attention to future tokens
        mask = torch.triu(torch.ones(seq_len, seq_len, device=input_ids.device), diagonal=1)
        mask = mask.masked_fill(mask == 1, float('-inf'))
        # Run the decoder stack
        for layer in self.decoder_layers:
            x = layer(x, mask)
        # Project to vocabulary logits
        logits = self.output_layer(x)
        return logits
```
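To make the autoregressive loop concrete, here is a minimal greedy-decoding sketch built on the `GPTModel` above. The vocabulary size, prompt token IDs, and `eos_id` are illustrative assumptions, not values from any real GPT checkpoint.

```python
# Minimal greedy decoding loop (toy hyperparameters, hypothetical token IDs)
model = GPTModel(vocab_size=1000, d_model=128, nhead=4, n_layers=2, d_ff=512)
model.eval()

input_ids = torch.tensor([[1, 5, 42]])  # hypothetical prompt token IDs
eos_id = 2                              # hypothetical end-of-sequence ID

with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids)                  # [1, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1)  # greedy: most likely next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(1)], dim=1)
        if next_id.item() == eos_id:
            break

print(input_ids)  # prompt plus generated token IDs
```

Each step re-runs the whole prefix and keeps only the last position's logits; production implementations cache keys and values instead of recomputing them.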
1.2 Evolution of the GPT Series
GPT-1 (2018):
- Parameters: 117M
- Architecture: 12-layer Transformer decoder
- Key idea: unsupervised pre-training followed by supervised fine-tuning
GPT-2 (2019):
- Parameters: 1.5B
- Architecture: 48-layer Transformer decoder
- Key idea: zero-shot task transfer
GPT-3 (2020):
- Parameters: 175B
- Architecture: 96-layer Transformer decoder
- Key idea: few-shot (in-context) learning
GPT-4 (2023):
- Parameters: not disclosed (unconfirmed estimates around 1.76T)
- Architecture: multimodal Transformer
- Key ideas: multimodal input and stronger reasoning
1.3 Strengths and Weaknesses of GPT
Strengths:
- Strong generation ability
- Well suited to text-generation tasks
- Zero-shot/few-shot learning
- Strong reasoning ability
Weaknesses:
- Unidirectional attention cannot use future context
- Less efficient for pure encoding tasks
- High training cost
2. The BERT Series
2.1 BERT Architecture
Core characteristics:
- Encoder-only architecture: uses only the Transformer encoder
- Bidirectional attention: every position can attend to the whole sequence
- Autoencoding objective: Masked Language Modeling (see the sketch after the code below)
- Understanding-oriented: well suited to NLU tasks
Architecture diagram:
Input → Embedding → Position Encoding →
Transformer Encoder Layers →
Output → Task-specific Head

Code example:
```python
class BERTEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        # Bidirectional self-attention (batch_first so inputs are [batch, seq, dim])
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: padding mask [batch_size, seq_len], True at padded positions
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Self-attention with residual connection (no causal mask: fully bidirectional)
        attn_output, _ = self.self_attn(x, x, x, key_padding_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        # Feed-forward network with residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        return x

class BERTModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, n_layers, d_ff, max_len=512):
        super().__init__()
        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Learned position embedding
        self.position_embedding = nn.Embedding(max_len, d_model)
        # Stack of encoder layers
        self.encoder_layers = nn.ModuleList([
            BERTEncoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        self.max_len = max_len
        self.d_model = d_model

    def forward(self, input_ids, attention_mask=None):
        """
        Args:
            input_ids: token IDs [batch_size, seq_len]
            attention_mask: padding mask [batch_size, seq_len], True at padded positions
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        batch_size, seq_len = input_ids.shape
        # Token embeddings
        token_embeds = self.token_embedding(input_ids)
        # Position embeddings
        positions = torch.arange(seq_len, device=input_ids.device)
        position_embeds = self.position_embedding(positions)
        # Combine embeddings
        x = token_embeds + position_embeds
        # Run the encoder stack
        for layer in self.encoder_layers:
            x = layer(x, attention_mask)
        return x
```
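BERT's pre-training objective, Masked Language Modeling, replaces a fraction of the input tokens (15% in the original paper) and trains the model to recover them. A minimal sketch of the input-corruption step, with a hypothetical `[MASK]` token ID:

```python
# Minimal MLM masking sketch (mask_token_id is illustrative, not a real vocab ID)
mask_token_id = 103
input_ids = torch.randint(5, 1000, (2, 16))  # random "sentences"
labels = input_ids.clone()

# Mask roughly 15% of positions, as in the BERT paper
mask_positions = torch.rand(input_ids.shape) < 0.15
input_ids[mask_positions] = mask_token_id
labels[~mask_positions] = -100  # ignore unmasked positions in the loss

# A real setup adds a vocabulary-sized MLM head over BERTModel's output and
# trains with nn.CrossEntropyLoss(ignore_index=-100) on the masked positions.
```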
2.2 Evolution of the BERT Series
BERT-Base (2019):
- Parameters: 110M
- Architecture: 12-layer Transformer encoder
- Key idea: bidirectional pre-training
BERT-Large (2019):
- Parameters: 340M
- Architecture: 24-layer Transformer encoder
- Key idea: a scaled-up BERT
RoBERTa (2019):
- Parameters: 355M
- Architecture: 24-layer Transformer encoder
- Key idea: an optimized training recipe (more data, dynamic masking, no NSP)
DeBERTa (2020):
- Parameters: up to 1.5B (DeBERTa-V2-XXLarge)
- Architecture: 48-layer Transformer encoder
- Key idea: disentangled attention
2.3 Strengths and Weaknesses of BERT
Strengths:
- Bidirectional attention gives strong understanding
- Well suited to NLU tasks
- Effective pre-training
- Fine-tunes well to many downstream tasks
Weaknesses:
- Weak generation ability
- Not suited to text generation
- Limited reasoning ability
3. The T5 Series
3.1 T5 Architecture
Core characteristics:
- Encoder-decoder architecture: a full encoder plus decoder
- Text-to-text framework: every task is cast as text in, text out (see the examples below)
- Bidirectional encoding + unidirectional decoding: combines the strengths of both
- Unified task format: simplifies task handling
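To make the text-to-text idea concrete, here are a few tasks phrased as plain strings in the task-prefix style of the T5 paper (the prefixes follow the paper's convention; the example sentences are illustrative):

```python
# Every task becomes string-to-string; the prefix tells the model what to do
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("summarize: state authorities dispatched emergency crews tuesday ...", "six people hospitalized after a storm ..."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
]
for source, target in examples:
    print(f"input:  {source}")
    print(f"output: {target}\n")
```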
Architecture diagram:
Input → Embedding → Position Encoding →
Encoder Layers →
Decoder Layers →
Output Probabilities → Output

Code example:
```python
class T5EncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        # Bidirectional self-attention (batch_first so inputs are [batch, seq, dim])
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: padding mask [batch_size, seq_len], True at padded positions
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Self-attention with residual connection
        attn_output, _ = self.self_attn(x, x, x, key_padding_mask=mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        # Feed-forward network with residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)
        x = self.norm2(x)
        return x

class T5DecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        # Causal self-attention
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # Encoder-decoder (cross) attention
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, encoder_output, self_mask=None, cross_mask=None):
        """
        Args:
            x: decoder input [batch_size, seq_len, d_model]
            encoder_output: encoder output [batch_size, src_len, d_model]
            self_mask: causal mask [seq_len, seq_len]
            cross_mask: encoder padding mask [batch_size, src_len]
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Causal self-attention with residual connection
        attn_output, _ = self.self_attn(x, x, x, attn_mask=self_mask)
        x = x + self.dropout1(attn_output)
        x = self.norm1(x)
        # Cross-attention over the encoder output
        cross_output, _ = self.cross_attn(x, encoder_output, encoder_output,
                                          key_padding_mask=cross_mask)
        x = x + self.dropout2(cross_output)
        x = self.norm2(x)
        # Feed-forward network with residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout3(ffn_output)
        x = self.norm3(x)
        return x

class T5Model(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, n_layers, d_ff, max_len=512):
        super().__init__()
        # Token embedding, shared between encoder and decoder
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Learned position embedding (real T5 uses relative position biases instead)
        self.position_embedding = nn.Embedding(max_len, d_model)
        # Encoder stack
        self.encoder_layers = nn.ModuleList([
            T5EncoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        # Decoder stack
        self.decoder_layers = nn.ModuleList([
            T5DecoderLayer(d_model, nhead, d_ff)
            for _ in range(n_layers)
        ])
        # Output projection to vocabulary logits
        self.output_layer = nn.Linear(d_model, vocab_size)
        self.max_len = max_len
        self.d_model = d_model

    def forward(self, input_ids, decoder_input_ids,
                encoder_attention_mask=None, decoder_attention_mask=None):
        """
        Args:
            input_ids: encoder input [batch_size, src_len]
            decoder_input_ids: decoder input [batch_size, tgt_len]
            encoder_attention_mask: encoder padding mask [batch_size, src_len]
            decoder_attention_mask: decoder padding mask [batch_size, tgt_len]
                (unused in this simplified sketch)
        Returns:
            logits: [batch_size, tgt_len, vocab_size]
        """
        batch_size, src_len = input_ids.shape
        _, tgt_len = decoder_input_ids.shape
        # Encoder embeddings
        token_embeds = self.token_embedding(input_ids)
        positions = torch.arange(src_len, device=input_ids.device)
        position_embeds = self.position_embedding(positions)
        encoder_input = token_embeds + position_embeds
        # Encoder stack
        encoder_output = encoder_input
        for layer in self.encoder_layers:
            encoder_output = layer(encoder_output, encoder_attention_mask)
        # Decoder embeddings
        decoder_token_embeds = self.token_embedding(decoder_input_ids)
        decoder_positions = torch.arange(tgt_len, device=decoder_input_ids.device)
        decoder_position_embeds = self.position_embedding(decoder_positions)
        decoder_input = decoder_token_embeds + decoder_position_embeds
        # Causal mask for decoder self-attention
        causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, device=decoder_input_ids.device), diagonal=1)
        causal_mask = causal_mask.masked_fill(causal_mask == 1, float('-inf'))
        # Decoder stack; cross-attention masks padded encoder positions
        decoder_output = decoder_input
        for layer in self.decoder_layers:
            decoder_output = layer(decoder_output, encoder_output,
                                   causal_mask, encoder_attention_mask)
        # Project to vocabulary logits
        logits = self.output_layer(decoder_output)
        return logits
```
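A quick smoke test of the model above with toy dimensions, using the standard teacher-forcing setup where the decoder sees the target shifted right by one position (the start-token ID of 0 is an illustrative assumption):

```python
# Toy dimensions, not real T5 hyperparameters
model = T5Model(vocab_size=1000, d_model=128, nhead=4, n_layers=2, d_ff=512)
src = torch.randint(1, 1000, (2, 10))  # encoder input token IDs
tgt = torch.randint(1, 1000, (2, 7))   # target token IDs

# Teacher forcing: prepend a start token (0 here) and drop the last target token
decoder_input = torch.cat([torch.zeros(2, 1, dtype=torch.long), tgt[:, :-1]], dim=1)
logits = model(src, decoder_input)
print(logits.shape)  # torch.Size([2, 7, 1000])
```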
3.2 Evolution of the T5 Series
T5-Small (2019):
- Parameters: 60M
- Architecture: 6-layer encoder + 6-layer decoder
- The smallest variant
T5-Base (2019):
- Parameters: 220M
- Architecture: 12-layer encoder + 12-layer decoder
- The baseline variant
T5-Large (2019):
- Parameters: 770M
- Architecture: 24-layer encoder + 24-layer decoder
- The large variant
T5-11B (2019):
- Parameters: 11B
- Architecture: 24-layer encoder + 24-layer decoder (with much wider feed-forward layers)
- The largest variant in the original release
3.3 Strengths and Weaknesses of T5
Strengths:
- Unified text-to-text framework
- Handles many kinds of tasks
- Encoder-decoder architecture
- Can both understand and generate
Weaknesses:
- High computational cost
- More complex to train
- Slower inference, since both encoder and decoder must run
4. The LLaMA Series
4.1 LLaMA Architecture
Core characteristics:
- Decoder-only architecture, similar to GPT
- RMSNorm: simpler, more stable normalization than LayerNorm
- SwiGLU activation: better quality than ReLU at similar cost
- Rotary position embeddings (RoPE): better positional representation
Architectural improvements:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F  # F.silu is used by SwiGLU below

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        Returns:
            output: normalized tensor of the same shape
        """
        # RMSNorm rescales by the root mean square; no mean subtraction, no bias
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # SwiGLU: silu(gate(x)) * up(x), then project back down
        gate = self.gate(x)
        up = self.up(x)
        x = F.silu(gate) * up
        x = self.down(x)
        return x

class RotaryPositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len
        # Per-dimension rotation frequencies
        inv_freq = 1.0 / (10000 ** (torch.arange(0, d_model, 2).float() / d_model))
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, seq_len):
        """
        Args:
            seq_len: sequence length
        Returns:
            cos: cosine table [seq_len, d_model]
            sin: sine table [seq_len, d_model]
        """
        # .float() is required: einsum needs both operands in the same dtype
        t = torch.arange(seq_len, device=self.inv_freq.device).float()
        freqs = torch.einsum('i,j->ij', t, self.inv_freq)
        freqs = torch.cat((freqs, freqs), dim=-1)
        cos = freqs.cos()
        sin = freqs.sin()
        return cos, sin

class LLaMADecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        # RMSNorm, applied before each sublayer (pre-norm, as in LLaMA)
        self.norm1 = RMSNorm(d_model)
        self.norm2 = RMSNorm(d_model)
        # Multi-head attention (batch_first so inputs are [batch, seq, dim]);
        # rotary embeddings would be applied to q/k inside a custom attention,
        # see the sketch after this block
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        # SwiGLU feed-forward network
        self.ffn = SwiGLU(d_model, d_ff)
        # Dropout
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: causal mask [seq_len, seq_len]
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Pre-norm + self-attention with residual connection
        x_norm = self.norm1(x)
        attn_output, _ = self.self_attn(x_norm, x_norm, x_norm, attn_mask=mask)
        x = x + self.dropout1(attn_output)
        # Pre-norm + SwiGLU feed-forward with residual connection
        x_norm = self.norm2(x)
        ffn_output = self.ffn(x_norm)
        x = x + self.dropout2(ffn_output)
        return x
```
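The rotary module above only builds the cos/sin tables; in LLaMA they are applied to the query and key vectors inside attention, which `nn.MultiheadAttention` does not expose. A minimal sketch of that rotation step (a standalone helper, assuming an even head dimension and the concatenated-halves layout produced above):

```python
def rotate_half(x):
    # Split the last dimension in half and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary(q, k, cos, sin):
    # q, k: [batch, seq_len, head_dim]; cos, sin: [seq_len, head_dim]
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

# Usage with the module above (head_dim = 64 is illustrative)
rope = RotaryPositionalEmbedding(d_model=64)
cos, sin = rope(seq_len=10)
q = torch.randn(2, 10, 64)
k = torch.randn(2, 10, 64)
q_rot, k_rot = apply_rotary(q, k, cos, sin)
```

Because only relative angles survive the dot product, the attention score between two rotated vectors depends on the distance between their positions, which is the property that makes RoPE attractive for long sequences.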
4.2 Evolution of the LLaMA Series
LLaMA-7B (2023):
- Parameters: 7B
- Architecture: 32-layer Transformer decoder
- The smallest open-weight variant
LLaMA-13B (2023):
- Parameters: 13B
- Architecture: 40-layer Transformer decoder
- The mid-size variant
LLaMA-33B (2023):
- Parameters: 33B
- Architecture: 60-layer Transformer decoder
- The large variant
LLaMA-65B (2023):
- Parameters: 65B
- Architecture: 80-layer Transformer decoder
- The largest variant of the first release
4.3 Strengths and Weaknesses of LLaMA
Strengths:
- Weights are openly available
- Strong performance for the parameter count
- High-quality training data
- Architectural refinements (RMSNorm, SwiGLU, RoPE)
Weaknesses:
- Still requires substantial compute
- License restricts commercial use (for the original LLaMA release)
- Limited Chinese-language coverage
5. The PaLM Series
5.1 PaLM Architecture
Core characteristics:
- Decoder-only architecture, similar to GPT
- Dense Transformer trained with Google's Pathways system (note: PaLM itself is dense, not a mixture-of-experts model; MoE appears in related Google models such as GLaM and Switch Transformer)
- Large-scale parallel training across accelerators
- Multilingual support
Since MoE comes up constantly for models at this scale, here is a sketch of a top-k gated MoE layer:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        # Gating network: scores each token against each expert
        self.gate = nn.Linear(d_model, n_experts)
        # Expert networks (simple up-projections here)
        self.experts = nn.ModuleList([
            nn.Linear(d_model, d_model * 4)
            for _ in range(n_experts)
        ])
        # Shared down-projection back to d_model
        self.output_proj = nn.Linear(d_model * 4, d_model)

    def forward(self, x):
        """
        Args:
            x: input [batch_size, seq_len, d_model]
        Returns:
            output: [batch_size, seq_len, d_model]
        """
        # Gate weights per token: [batch_size, seq_len, n_experts]
        gate_logits = self.gate(x)
        gate_weights = F.softmax(gate_logits, dim=-1)
        # Keep only the top-k experts per token and renormalize their weights
        top_k_weights, top_k_indices = torch.topk(gate_weights, self.top_k, dim=-1)
        top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)
        dense_weights = torch.zeros_like(gate_weights).scatter(-1, top_k_indices, top_k_weights)
        # Run every expert on every token (a dense reference implementation;
        # production MoE dispatches each token only to its selected experts).
        # The GELU between the two projections keeps them from collapsing
        # into a single linear map.
        expert_outputs = torch.stack([
            self.output_proj(F.gelu(expert(x))) for expert in self.experts
        ], dim=-1)  # [batch_size, seq_len, d_model, n_experts]
        # Weighted combination of the expert outputs
        output = torch.einsum('bsdn,bsn->bsd', expert_outputs, dense_weights)
        return output
```
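A quick shape check of the layer above with toy dimensions:

```python
moe = MoELayer(d_model=64, n_experts=4, top_k=2)
x = torch.randn(2, 10, 64)
print(moe(x).shape)  # torch.Size([2, 10, 64])
```

Only the `top_k` gate weights per token are nonzero, so a sparse implementation costs each token just `top_k` expert evaluations regardless of `n_experts`; that is what lets MoE models grow parameter count much faster than compute.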
5.2 Evolution of the PaLM Series
PaLM-540B (2022):
- Parameters: 540B
- Architecture: 118-layer dense Transformer decoder
- Key idea: scaling via the Pathways training system
PaLM-2 (2023):
- Parameters: not disclosed
- Architecture: improved Transformer
- Key idea: better multilingual ability from a more compute-optimal design
5.3 Strengths and Weaknesses of PaLM
Strengths:
- Very large scale
- Multilingual support
- Efficient large-scale training through Pathways
- Strong reasoning ability
Weaknesses:
- Extremely high training cost
- Not open source
- Hard to deploy
6. The Gemini Series
6.1 Gemini Architecture
Core characteristics:
- Multimodal architecture: text, images, audio, and video
- Mixture of experts: Gemini 1.5 is reported to use a sparse MoE Transformer
- Long context: up to millions of tokens (Gemini 1.5)
- Strong reasoning capabilities
Multimodal processing:
```python
# A toy multimodal encoder to illustrate the idea;
# Gemini's actual architecture has not been published.
class MultiModalEncoder(nn.Module):
    def __init__(self, d_model, vocab_size=30000):
        super().__init__()
        self.d_model = d_model
        # Text path: token embedding + Transformer encoder
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6
        )
        # Image path: a small CNN that pools to a single d_model vector
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(512, d_model)
        )
        # Audio path: a small 1-D CNN that pools to a single d_model vector
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(256, d_model)
        )
        # Modality fusion: text attends over all modality tokens
        self.fusion = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, text, image=None, audio=None):
        """
        Args:
            text: token IDs [batch_size, seq_len]
            image: images [batch_size, 3, H, W]
            audio: waveforms [batch_size, 1, audio_len]
        Returns:
            output: fused representation [batch_size, seq_len, d_model]
        """
        # Encode text tokens (embedding first: the encoder expects vectors, not IDs)
        text_output = self.text_encoder(self.text_embedding(text))
        # Encode image into a single modality token, if present
        image_output = self.image_encoder(image).unsqueeze(1) if image is not None else None
        # Encode audio into a single modality token, if present
        audio_output = self.audio_encoder(audio).unsqueeze(1) if audio is not None else None
        # Fuse: concatenate all modality tokens, then let the text attend over them
        multimodal_outputs = [text_output]
        if image_output is not None:
            multimodal_outputs.append(image_output)
        if audio_output is not None:
            multimodal_outputs.append(audio_output)
        if len(multimodal_outputs) > 1:
            multimodal_output = torch.cat(multimodal_outputs, dim=1)
            output, _ = self.fusion(text_output, multimodal_output, multimodal_output)
        else:
            output = text_output
        return output
```
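A toy forward pass through the encoder above (all dimensions are illustrative):

```python
encoder = MultiModalEncoder(d_model=256, vocab_size=1000)
text = torch.randint(0, 1000, (2, 12))  # token IDs
image = torch.randn(2, 3, 224, 224)     # RGB images
audio = torch.randn(2, 1, 16000)        # 1-second waveforms at 16 kHz
out = encoder(text, image=image, audio=audio)
print(out.shape)  # torch.Size([2, 12, 256])
```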
6.2 Evolution of the Gemini Series
Gemini Nano (2023):
- Parameters: 1.8B (Nano-1)
- Architecture: lightweight multimodal model
- Designed for on-device deployment
Gemini Pro (2023):
- Parameters: not disclosed
- Architecture: mid-size multimodal model
- Balances performance and cost
Gemini Ultra (2023):
- Parameters: not disclosed
- Architecture: large multimodal model
- The strongest tier
6.3 Strengths and Weaknesses of Gemini
Strengths:
- Strong multimodal capabilities
- Long-context support
- Strong reasoning
- Multilingual support
Weaknesses:
- Not open source
- High API costs
- Complex to deploy (API access only)
Hands-On Task
Task: compare the outputs of different models
Goal: call several model APIs with the same input and compare the results
Requirements:
- Prepare an identical input
- Call each model's API
- Compare the outputs
- Analyze the differences
Code skeleton:
```python
import openai
import anthropic
from transformers import AutoTokenizer, AutoModelForCausalLM

# Shared prompt
prompt = "Explain what the Transformer architecture is."

# Call GPT-4
def call_gpt4(prompt):
    client = openai.OpenAI(api_key="your-api-key")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Call Claude
def call_claude(prompt):
    client = anthropic.Anthropic(api_key="your-api-key")
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Call LLaMA locally via Hugging Face (requires accepting the model license)
def call_llama(prompt):
    model_name = "meta-llama/Llama-2-7b-chat-hf"
    # Note: this reloads the model on every call; cache it in real use
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=500)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Compare the outputs
gpt4_output = call_gpt4(prompt)
claude_output = call_claude(prompt)
llama_output = call_llama(prompt)

print("GPT-4 output:")
print(gpt4_output)
print("\nClaude output:")
print(claude_output)
print("\nLLaMA output:")
print(llama_output)
```
Homework
Assignment 1: Architecture Comparison Table
Task: create a comparison table of mainstream LLM architectures
Requirements:
- Compare GPT, BERT, T5, LLaMA, PaLM, and Gemini
- Cover architecture, parameter count, key characteristics, and strengths/weaknesses
- Produce a comparison report
Assignment 2: Model Selection Analysis
Task: choose an appropriate model for specific scenarios
Requirements:
- Analyze several different application scenarios
- Select a suitable model for each scenario
- Explain the reasoning behind each choice
Assignment 3: Model Performance Testing
Task: test the performance of different models
Requirements:
- Design the test tasks
- Evaluate how each model performs on them
- Analyze the performance differences
References
Required Reading
Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Blog.
- The GPT-2 paper
Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL.
- The BERT paper
Raffel, C., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR.
- The T5 paper
Recommended Reading
Touvron, H., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv.
- The LLaMA paper
Chowdhery, A., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv.
- The PaLM paper
Online Resources
Hugging Face Model Hub: https://huggingface.co/models
- A hub of pretrained models
Papers with Code: https://paperswithcode.com/
- Papers paired with their code
Further Reading
Model compression
- Distillation: knowledge distillation
- Quantization: model quantization
- Pruning: model pruning
Efficient architectures
- Mixture of Experts
- Sparse Attention
- Linear Attention
Coming Up Next
In the next session we will take a detailed look at major Chinese LLMs, covering ERNIE Bot (文心一言), Qwen (通义千问), Hunyuan (混元), Doubao (豆包), GLM, Kimi, DeepSeek, and Yi, including their characteristics and applications.

