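Day 4: Transformers and Large Language Models (2017-2022)

Learning Objectives

This lesson covers the background and core ideas behind the Transformer architecture, how the Self-Attention mechanism works, how the BERT and GPT families evolved, the pre-training plus fine-tuning paradigm, and what characterizes the era of large language models. This material lays the groundwork for the later lessons on generative AI and AI agents.

Course Content

1. The Birth of the Transformer

1.1 Limitations of RNNs

Before the Transformer, recurrent neural networks (RNNs) were the dominant approach to sequence data. RNNs have three key limitations. First, they are sequential: each step depends on the previous hidden state, so the time steps of a sequence cannot be computed in parallel and training is slow. Second, they struggle with long-term dependencies: even with improved cells such as LSTM and GRU, information fades as it is passed along a long sequence, which makes long-range relationships hard to capture. Third, they rely on a fixed-length representation: in an RNN encoder-decoder, the final hidden state has to encode the entire sequence, and this compression loses information and makes complex relationships hard to represent.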
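
To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass in NumPy. The function name `rnn_forward`, the weight names, and the toy sizes are illustrative assumptions and do not come from any library. Each hidden state depends on the previous one, so the loop over time steps cannot be parallelized.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, b_h):
    """Vanilla RNN forward pass (illustrative sketch).

    X: (seq_len, d_in), W_xh: (d_in, d_h), W_hh: (d_h, d_h), b_h: (d_h,)
    Returns all hidden states with shape (seq_len, d_h).
    """
    seq_len = X.shape[0]
    d_h = W_hh.shape[0]
    h = np.zeros(d_h)
    states = []
    for t in range(seq_len):
        # h_t needs h_{t-1}: step t cannot start before step t-1 has finished
        h = np.tanh(X[t] @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
X_seq = rng.standard_normal((6, 8))        # 6 time steps, 8 input features
W_xh = rng.standard_normal((8, 4)) * 0.1   # toy sizes, chosen arbitrarily
W_hh = rng.standard_normal((4, 4)) * 0.1
b_h = np.zeros(4)

H = rnn_forward(X_seq, W_xh, W_hh, b_h)
print("Hidden states shape:", H.shape)
```

The last row of `H` is the single fixed-length vector that an RNN encoder would hand to a decoder, which is the compression problem described above.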

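1.2 The Transformer Breakthrough

In 2017, researchers at Google published the landmark paper "Attention Is All You Need", which introduced the Transformer architecture. Its core idea is to rely entirely on attention mechanisms and drop recurrence and convolution. All positions of a sequence can then be processed in parallel during training, and any two positions can interact directly, which makes long-range dependencies easier to capture. The Transformer opened a new era in natural language processing, became the foundation of large language models, and helped drive the development of multimodal AI.

2. Transformer Architecture

2.1 Overall Architecture

The Transformer uses an Encoder-Decoder structure. The input sequence is processed by the Encoder, and the Decoder then generates the output sequence. The Encoder is a stack of layers, each containing Self-Attention and a Feed-Forward Network, and is responsible for processing the input sequence. The Decoder is also a stack of layers, each containing Self-Attention, Encoder-Decoder Attention, and a Feed-Forward Network, and is responsible for generating the output sequence.

2.2 The Self-Attention Mechanism

2.2.1 Basic Idea

The basic idea of Self-Attention is to compute how relevant every element of a sequence is to every element, including itself. Take the sentence "The cat sat on the mat". For the word "cat", the model computes its relevance to "The", "cat", "sat", "on", "the", and "mat", and then builds the representation of "cat" as a weighted combination according to those relevance scores. Every word can therefore attend to every word in the sequence, which is how long-range dependencies are captured.

2.2.2 Mathematical Formulation

Self-Attention is defined in terms of three quantities: Query, Key, and Value. Q, K, and V are first obtained from the input by linear projections, and the output is Attention(Q, K, V) = softmax(QK^T / √d_k) V. The steps are: compute the dot products of queries and keys, divide by √d_k, apply softmax to obtain the attention weights, and use those weights to take a weighted sum of the values. The scaling by √d_k keeps the dot products from growing with the dimension d_k, which would otherwise push softmax into regions with very small gradients.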

Example:

python

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Q: (seq_len, d_k)
    K: (seq_len, d_k)
    V: (seq_len, d_v)
    """
    # Attention scores: scaled dot products between queries and keys
    scores = np.dot(Q, K.T) / np.sqrt(Q.shape[-1])

    # Softmax over the key axis; subtracting the row max avoids overflow in exp
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)

    # Weighted sum of the values
    output = np.dot(attention_weights, V)

    return output, attention_weights

# Example
X = np.random.randn(6, 512)  # 6 tokens, 512 dimensions
# Scale the random weights by 1/sqrt(512) so Q, K, V keep roughly unit variance
W_Q = np.random.randn(512, 64) / np.sqrt(512)
W_K = np.random.randn(512, 64) / np.sqrt(512)
W_V = np.random.randn(512, 64) / np.sqrt(512)

Q = np.dot(X, W_Q)
K = np.dot(X, W_K)
V = np.dot(X, W_V)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Output shape:", output.shape)
print("Attention weights shape:", weights.shape)

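2.2.3 Multi-Head Attention

Multi-Head Attention runs several attention heads in parallel so that different heads can capture different kinds of relationships. Each head computes attention independently in its own subspace of dimension d_k = d_model / num_heads. The outputs of all heads are concatenated and passed through a final linear projection. This increases the expressive power of the layer compared with a single head of the same total width.

Example:

python

import numpy as np

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Projection matrices, scaled so activations keep roughly unit variance
        scale = 1.0 / np.sqrt(d_model)
        self.W_Q = np.random.randn(d_model, d_model) * scale
        self.W_K = np.random.randn(d_model, d_model) * scale
        self.W_V = np.random.randn(d_model, d_model) * scale
        self.W_O = np.random.randn(d_model, d_model) * scale

    def forward(self, X):
        batch_size, seq_len, d_model = X.shape

        # Project the input to Q, K, V
        Q = X @ self.W_Q
        K = X @ self.W_K
        V = X @ self.W_V

        # Split into heads: (batch, heads, seq_len, d_k)
        Q = Q.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
        K = K.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
        V = V.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)

        # Attention per head; @ batches over the leading axes (np.dot does not)
        scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(self.d_k)
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
        output = attention_weights @ V

        # Merge heads back to (batch, seq_len, d_model)
        output = output.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

        # Output projection
        output = output @ self.W_O

        return output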

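2.3 Position Encoding

Self-Attention by itself carries no position information: permuting the input tokens simply permutes the outputs, so the mechanism cannot tell word order. The solution is to add a position encoding. The original Transformer uses sine and cosine functions to generate a distinct encoding for each position, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), and adds this encoding to the token embedding. This lets the model take the positions of words in the sequence into account.

Example:

python

import numpy as np

def positional_encoding(seq_len, d_model):
    """
    seq_len: sequence length
    d_model: model dimension
    """
    PE = np.zeros((seq_len, d_model))

    for pos in range(seq_len):
        for i in range(d_model):
            if i % 2 == 0:
                # Even dimension 2k: sin(pos / 10000^(2k / d_model))
                PE[pos, i] = np.sin(pos / (10000 ** (i / d_model)))
            else:
                # Odd dimension 2k+1 shares the frequency of dimension 2k
                PE[pos, i] = np.cos(pos / (10000 ** ((i - 1) / d_model)))

    return PE

# Example
PE = positional_encoding(100, 512)
print("Positional encoding shape:", PE.shape)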

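2.4 Feed-Forward Network

The Feed-Forward Network (FFN) is a two-layer fully connected network with a ReLU activation in between: FFN(x) = max(0, xW1 + b1)W2 + b2. It applies a non-linear transformation that increases the expressive power of each layer. The same FFN is applied to every position independently, with no interaction between positions.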
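
The position-wise behaviour can be checked with a short NumPy sketch. The function name `feed_forward` and the toy sizes are assumptions for illustration. The inner dimension is set to four times the model dimension, which follows the ratio used in the original Transformer paper (512 and 2048) but is not stated in the text above.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise FFN: max(0, X W1 + b1) W2 + b2.

    X: (seq_len, d_model); each row (position) is transformed independently.
    """
    hidden = np.maximum(0.0, X @ W1 + b1)  # ReLU
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5          # toy sizes; d_ff = 4 * d_model
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

X_ffn = rng.standard_normal((seq_len, d_model))
out_all = feed_forward(X_ffn, W1, b1, W2, b2)

# Feeding one position alone gives the same result as that row of the full output
out_row = feed_forward(X_ffn[2:3], W1, b1, W2, b2)
print("Output shape:", out_all.shape)
print("Position-wise:", np.allclose(out_all[2], out_row[0]))
```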

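2.5 Residual Connections and Layer Normalization

Residual connections and layer normalization are key components of the Transformer. A residual connection adds the input of a sublayer directly to its output, and the sum is then normalized: Output = LayerNorm(x + Sublayer(x)). Layer normalization normalizes the feature vector of each position independently: LayerNorm(x) = γ * (x - μ) / √(σ² + ε) + β, where μ and σ² are the mean and variance over the feature dimension, γ and β are learned parameters, and ε is a small constant for numerical stability. Together these two techniques speed up training, stabilize gradients, and mitigate vanishing gradients in deep stacks.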
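
The two formulas above translate almost directly into NumPy. This is a minimal sketch; the helper names `layer_norm` and `add_and_norm` and the stand-in sublayer (a single matrix multiplication) are assumptions chosen for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm(x) = gamma * (x - mu) / sqrt(var + eps) + beta, over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Output = LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
x_in = rng.standard_normal((4, 8))          # 4 positions, 8 features
gamma = np.ones(8)                          # learned scale (identity here)
beta = np.zeros(8)                          # learned shift (zero here)
W_sub = rng.standard_normal((8, 8)) * 0.1   # stand-in for attention or FFN

y = add_and_norm(x_in, lambda h: h @ W_sub, gamma, beta)
print("Per-position mean:", y.mean(axis=-1).round(6))
print("Per-position std: ", y.std(axis=-1).round(3))
```

With γ = 1 and β = 0, every position ends up with mean close to 0 and standard deviation close to 1, regardless of the scale of the sublayer output.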

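2.6 The Complete Transformer Block

An Encoder Block applies four components in order: Multi-Head Self-Attention, Add & Norm, Feed-Forward Network, and Add & Norm. A Decoder Block applies six components in order: Masked Multi-Head Self-Attention, Add & Norm, Multi-Head Encoder-Decoder Attention, Add & Norm, Feed-Forward Network, and Add & Norm. The mask in the decoder's self-attention prevents each position from attending to later positions, so the prediction for a token depends only on the tokens before it.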
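
The masking step in the Decoder Block can be sketched for a single head as follows. The function name `causal_self_attention` and the toy sizes are assumptions; the mask simply sets the scores of future positions to negative infinity before the softmax.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Single-head masked self-attention.

    Q, K: (seq_len, d_k), V: (seq_len, d_v).
    Position i may only attend to positions j <= i.
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)

    # True above the diagonal, i.e. where the key position is in the future
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Stable softmax; the diagonal is never masked, so each row max is finite
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q_m = rng.standard_normal((5, 8))
K_m = rng.standard_normal((5, 8))
V_m = rng.standard_normal((5, 8))

out_m, w_m = causal_self_attention(Q_m, K_m, V_m)
print(np.round(w_m, 2))  # lower-triangular matrix whose rows sum to 1
```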

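3. The BERT Family

3.1 BERT (Bidirectional Encoder Representations from Transformers)

In 2018, Google published BERT. Its core idea is to use the Transformer Encoder to build representations that draw on context from both directions, trained under the pre-training plus fine-tuning paradigm. BERT is pre-trained on two tasks.

Task 1: Masked Language Model (MLM)

15% of the tokens in the input sequence are selected at random, and the model is trained to predict the original tokens at those positions. Of the selected tokens, 80% are replaced with the [MASK] token, 10% are replaced with a random token, and 10% are left unchanged.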
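
The 80/10/10 rule can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: the function name `mask_tokens`, the toy vocabulary, and the choice to select each token independently with probability 0.15 (rather than picking exactly 15% of positions) are all assumptions.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # token not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK_TOKEN        # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab) # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = "the cat sat on the mat".split()
toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # hypothetical vocabulary
masked, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=0)
print(masked)
print(labels)
```

The example call uses `mask_prob=0.5` only so that a six-token sentence is likely to show some masking; BERT uses 0.15.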

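Task 2: Next Sentence Prediction (NSP)

Given two sentences, the model predicts whether the second one follows the first in the original text. For example, if sentence A is "The cat sat on the mat" and sentence B is "It was very comfortable", the expected prediction is that B follows A.

3.1.2 Model Architecture

BERT-Base has 12 Transformer Encoder layers, a hidden size of 768, 12 attention heads, and 110M parameters. BERT-Large has 24 Transformer Encoder layers, a hidden size of 1024, 16 attention heads, and 340M parameters.

3.1.3 Fine-tuning

Fine-tuning starts from the pre-trained BERT weights, adds a task-specific layer on top for the downstream task, and then trains either the whole model or only some of its layers on labeled data for that task.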

Example:

python

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the pre-trained model. The classification head is newly initialized,
# so the prediction is not meaningful until the model has been fine-tuned.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.eval()

# Input text
text = "This movie is great!"
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

# Predict (no gradients are needed at inference time)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)

print("Prediction:", predictions.item())

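3.2 BERT Variants

RoBERTa (Robustly optimized BERT approach) makes several changes to the training recipe: more training data, longer training, removal of the NSP task, larger batch sizes, and dynamic masking. ALBERT (A Lite BERT) reduces the number of parameters through parameter sharing and replaces NSP with sentence order prediction (SOP). DistilBERT uses knowledge distillation to train a model with fewer layers and faster inference. ELECTRA uses replaced token detection (RTD) for more sample-efficient pre-training.

4. The GPT Family

4.1 GPT-1 (2018)

The core idea of GPT-1 is to use a Transformer Decoder for unidirectional, autoregressive language modeling, trained under the pre-training plus fine-tuning paradigm. The model has 12 Transformer Decoder layers, a hidden size of 768, 12 attention heads, and 117M parameters. The pre-training task is to predict the next token given the preceding tokens. For example, given the input "The cat sat on the", the target prediction is "mat".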
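
The next-token objective turns a single sentence into several training examples, one per prefix. The sketch below shows this in plain Python; the function name `next_token_examples` is an assumption, and whole words stand in for the subword tokens a real model would use.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs for autoregressive language modeling."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = ["The", "cat", "sat", "on", "the", "mat"]
examples = next_token_examples(sentence)
for context, target in examples:
    print(" ".join(context), "->", target)
```

In practice all of these predictions are computed in one forward pass, because the causal mask ensures that each position only sees its own prefix.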

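4.2 GPT-2 (2019)

GPT-2 scales up GPT-1: a larger model (up to 1.5B parameters), more training data (WebText), and better zero-shot ability. It comes in four sizes: GPT-2 Small (117M parameters), GPT-2 Medium (345M parameters), GPT-2 Large (774M parameters), and GPT-2 XL (1.5B parameters). The two smallest checkpoints are often listed as 124M and 355M in later documentation. GPT-2 showed strong text generation and story writing, along with rudimentary zero-shot translation and code generation.

4.3 GPT-3 (2020)

GPT-3 was a major step forward: a very large model (175B parameters) with strong few-shot learning ability, and results consistent with scaling laws. The GPT-3 paper trained eight model sizes, from 125M to 175B parameters. The OpenAI API later offered models named Ada, Babbage, Curie, and Davinci. OpenAI did not publish the sizes of these API models, and the commonly cited figures (about 350M, 1.3B, 6.7B, and 175B parameters respectively) are community estimates. GPT-3 showed capabilities in text generation, question answering, translation, code generation, and simple arithmetic reasoning.

Few-shot learning is a defining feature of GPT-3. In all three settings below the model weights are not updated, and the task is specified entirely in the prompt. Zero-shot gives only a task description, one-shot adds a single example, and few-shot adds several examples. For instance, a zero-shot prompt "Translate into Chinese: Hello world" should produce "你好世界". With one example in the prompt the model can pick up a new task format, and with several examples it usually performs better.

4.4 GPT-3.5 (2022)

GPT-3.5 improved instruction following and reasoning, and became the basis of ChatGPT. ChatGPT was fine-tuned from a GPT-3.5 series model using RLHF (reinforcement learning from human feedback), which gives it its conversational and coding abilities. RLHF has three stages. In SFT (supervised fine-tuning), the model is fine-tuned on human-written demonstrations. In RM (reward modeling), a reward model is trained on human rankings of model outputs. In PPO (proximal policy optimization), the model is optimized with reinforcement learning against the reward model.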
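
The reward-modeling stage can be illustrated with the pairwise ranking loss commonly described for RLHF reward models, -log σ(r_chosen - r_rejected). The text above does not specify a loss, so treat this choice, the function name `pairwise_reward_loss`, and the hand-written reward values as assumptions. This is a minimal NumPy sketch.

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over comparison pairs.

    r_chosen:   rewards assigned to the responses humans preferred
    r_rejected: rewards assigned to the responses humans rejected
    """
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(d)) = log(1 + exp(-d)), computed stably with logaddexp
    return float(np.mean(np.logaddexp(0.0, -diff)))

# Hypothetical reward scores for two comparison pairs
agrees = pairwise_reward_loss([2.0, 1.5], [0.0, -0.5])     # ranks like the humans
disagrees = pairwise_reward_loss([0.0, -0.5], [2.0, 1.5])  # ranks the opposite way
print("Loss when the reward model agrees with humans:   ", round(agrees, 4))
print("Loss when the reward model disagrees with humans:", round(disagrees, 4))
```

The loss is small when the reward model scores the preferred response higher, so minimizing it teaches the model to reproduce human rankings.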

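5. The Pre-training + Fine-tuning Paradigm

5.1 Pre-training

The goal of pre-training is to learn general-purpose representations from large amounts of unlabeled data. Training data includes text (Wikipedia, Common Crawl, books), code (GitHub, Stack Overflow), and multimodal data (image-text pairs). Pre-training tasks include the Masked Language Model (MLM), Next Sentence Prediction (NSP), and autoregressive language modeling.

5.2 Fine-tuning

The goal of fine-tuning is to adapt a pre-trained model to a downstream task. There are three main approaches: full fine-tuning updates all parameters, partial fine-tuning updates only some layers, and parameter-efficient fine-tuning (PEFT) updates only a small number of parameters.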

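Example:

python

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, BertTokenizer

# Load the pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Toy placeholder data; replace with a real labeled dataset
texts = ["This movie is great!", "This movie is terrible."]
labels = [1, 0]

def collate(batch):
    # Tokenize a list of (text, label) pairs into one padded batch
    batch_texts, batch_labels = zip(*batch)
    enc = tokenizer(list(batch_texts), return_tensors='pt', padding=True, truncation=True)
    enc['labels'] = torch.tensor(batch_labels)
    return enc

train_dataloader = DataLoader(list(zip(texts, labels)), batch_size=2, shuffle=True, collate_fn=collate)

# Optimizer (torch.optim.AdamW; the AdamW class in transformers is deprecated)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop
model.train()
for epoch in range(3):
    for batch in train_dataloader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()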
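
Parameter-efficient fine-tuning from section 5.2 can take several forms. One widely used form is low-rank adaptation (LoRA), which the text above does not name, so this is an assumed example rather than the course's method. The pre-trained weight matrix stays frozen and only two small matrices are trained. The class name `LoRALinear`, the rank, and the scaling factor are illustrative choices in this NumPy sketch of the forward pass.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen weight W plus a trainable low-rank update A @ B."""

    def __init__(self, W, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = W.shape
        self.W = W                                         # frozen pre-trained weight
        self.A = rng.standard_normal((d_in, rank)) * 0.01  # trainable, small random init
        self.B = np.zeros((rank, d_out))                   # trainable, zero init
        self.scale = alpha / rank

    def forward(self, X):
        # Frozen path plus scaled low-rank path
        return X @ self.W + self.scale * (X @ self.A @ self.B)

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W0 = rng.standard_normal((64, 64))   # stands in for one pre-trained projection matrix
layer = LoRALinear(W0, rank=4)
X_in = rng.standard_normal((3, 64))

print("Same output as the frozen layer at init:", np.allclose(layer.forward(X_in), X_in @ W0))
print("Trainable parameters:", layer.trainable_params(), "of", W0.size)
```

Because B starts at zero, the adapted layer initially behaves exactly like the pre-trained one, and training only has to learn the task-specific difference.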

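5.3 Prompt Engineering

The core idea of prompt engineering is to steer a model toward the desired output by designing its prompt. Common techniques include few-shot prompting (providing examples), Chain-of-Thought (eliciting step-by-step reasoning), role-playing (assigning the model a role), and format specification (stating the required output format). Few-shot examples help the model understand the task, and Chain-of-Thought prompting has the model reason step by step, which improves its performance on complex problems.

6. The Rise of Multimodal AI

6.1 CLIP (Contrastive Language-Image Pre-training)

In 2021, OpenAI published CLIP. Its core idea is to train an image encoder and a text encoder jointly so that matching images and texts are aligned in a shared embedding space, which enables zero-shot image classification. Training uses contrastive learning: within a batch, each matching image-text pair is a positive pair, and every other image-text combination is a negative pair. Applications include zero-shot image classification, image retrieval, and text retrieval.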

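Example:

python

# Requires OpenAI's CLIP package (https://github.com/openai/CLIP) and a local image file cat.jpg
import clip
import torch
from PIL import Image

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Prepare the image and the candidate labels
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)

# Compute features
with torch.no_grad():
    # Embeddings, shown for reference; the similarity call below recomputes them
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Image-text similarity scores, turned into probabilities over the labels
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)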
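
The contrastive training objective described in section 6.1 can be sketched in NumPy as a symmetric cross-entropy over the image-text similarity matrix. The function names, the random stand-in embeddings, and the fixed temperature of 0.07 are assumptions for illustration; in CLIP the temperature is a learned parameter and the embeddings come from the two encoders.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for N matching (image, text) pairs.

    image_emb, text_emb: (N, d); row i of each belongs to the same pair.
    """
    # Cosine similarity: L2-normalize, then take dot products
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # (N, N)

    idx = np.arange(logits.shape[0])                # positives lie on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 16))   # stand-in embeddings for 4 pairs

aligned = clip_contrastive_loss(emb, emb)          # every pair matches perfectly
shuffled = clip_contrastive_loss(emb, emb[::-1])   # every pair is mismatched
print("Loss with aligned pairs: ", round(aligned, 4))
print("Loss with shuffled pairs:", round(shuffled, 4))
```

The loss is low when each image is most similar to its own text and high when the pairing is wrong, which is the signal that pulls matching pairs together during training.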

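6.2 DALL-E

In 2021, OpenAI released DALL-E, a text-to-image model based on the Transformer architecture that demonstrated the potential of generative AI. DALL-E 2 followed in 2022 with higher image quality and better understanding of prompts. Its gains came mainly from a different design built on CLIP embeddings and diffusion models rather than from a larger parameter count.

6.3 Stable Diffusion

In 2022, Stability AI released Stable Diffusion, a text-to-image model based on diffusion models. Its weights are openly available under a license that permits commercial use subject to usage restrictions. Its main advantages are that it is open, can run locally on consumer hardware, and has an active community, all of which helped bring generative AI to a wide audience.

7. Characteristics of the Large Language Model Era

7.1 Technical Characteristics

Scaling laws state that model performance improves predictably, roughly as a power law, as the number of parameters, the amount of data, and the amount of compute increase. Very large models have also been reported to show capabilities that smaller models lack, known as emergent abilities. Examples include in-context learning, instruction following, reasoning, and code generation.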
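
Scaling laws are usually written as power laws. The sketch below evaluates a loss curve of the form L(N) = (N_c / N)^α for a few model sizes mentioned in this lesson. The constants `n_c` and `alpha` are arbitrary placeholder values chosen for illustration, not fitted results, and the function name `power_law_loss` is an assumption.

```python
def power_law_loss(n_params, n_c=1e13, alpha=0.1):
    """Illustrative power law L(N) = (n_c / N) ** alpha.

    n_c and alpha are placeholder constants, not measured values.
    """
    return (n_c / n_params) ** alpha

model_sizes = [117e6, 1.5e9, 175e9]   # parameter counts of GPT-1/2/3 scale
curve = [power_law_loss(n) for n in model_sizes]
for n, loss in zip(model_sizes, curve):
    print(f"{n:>16,.0f} parameters -> illustrative loss {loss:.3f}")

# Under a power law, every 10x increase in size shrinks the loss by the same factor
ratio = power_law_loss(1e9) / power_law_loss(1e10)
print("Loss ratio per 10x parameters:", round(ratio, 4))
```

The constant ratio is what makes the curve a straight line on a log-log plot, and it is why returns diminish in absolute terms as models grow.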

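7.2 Application Characteristics

Large language models are general-purpose: a single model can handle many tasks without being trained separately for each one, and can be used through prompting alone. Generative models are also creative: they can produce text, images, code, music, and more.

7.3 Industry Impact

Large language models and related generative models triggered a wave of AI applications, including ChatGPT, GitHub Copilot, Midjourney, and DALL-E. The changes reach software development, content creation, customer service, education, and other fields, and are reshaping established ways of working.

Practice Tasks

Task 1: Implement Self-Attention

Goal: implement the Self-Attention mechanism from scratch.

Requirements:

Implement Scaled Dot-Product Attention
Implement Multi-Head Attention
Visualize the attention weights
Test on a simple task

Code skeleton: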

python

import numpy as np
import matplotlib.pyplot as plt

class ScaledDotProductAttention:
    def __init__(self, d_k):
        self.d_k = d_k

    def forward(self, Q, K, V):
        """
        Q: (..., seq_len, d_k)
        K: (..., seq_len, d_k)
        V: (..., seq_len, d_v)
        Leading axes (batch, heads) are handled by the batched @ operator.
        """
        # Attention scores; swap the last two axes of K instead of using np.dot,
        # which does not batch over leading axes
        scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(self.d_k)

        # Softmax over the key axis, with the row max subtracted for stability
        scores = scores - scores.max(axis=-1, keepdims=True)
        attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)

        # Weighted sum of the values
        output = attention_weights @ V

        return output, attention_weights

class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.attention = ScaledDotProductAttention(self.d_k)

        # Projection matrices, scaled so activations keep roughly unit variance
        scale = 1.0 / np.sqrt(d_model)
        self.W_Q = np.random.randn(d_model, d_model) * scale
        self.W_K = np.random.randn(d_model, d_model) * scale
        self.W_V = np.random.randn(d_model, d_model) * scale
        self.W_O = np.random.randn(d_model, d_model) * scale

    def forward(self, X):
        """
        X: (batch_size, seq_len, d_model)
        """
        batch_size, seq_len, d_model = X.shape

        # Project the input to Q, K, V
        Q = X @ self.W_Q
        K = X @ self.W_K
        V = X @ self.W_V

        # Split into heads: (batch, heads, seq_len, d_k)
        Q = Q.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
        K = K.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
        V = V.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)

        # Attention per head
        output, attention_weights = self.attention.forward(Q, K, V)

        # Merge heads back to (batch, seq_len, d_model)
        output = output.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

        # Output projection
        output = output @ self.W_O

        return output, attention_weights

# Test
batch_size = 2
seq_len = 10
d_model = 512
num_heads = 8

X = np.random.randn(batch_size, seq_len, d_model)
mha = MultiHeadAttention(d_model, num_heads)
output, weights = mha.forward(X)

print("Output shape:", output.shape)              # (2, 10, 512)
print("Attention weights shape:", weights.shape)  # (2, 8, 10, 10)

# Visualize the attention weights of the first head for the first sample
plt.figure(figsize=(10, 8))
plt.imshow(weights[0, 0], cmap='viridis')
plt.colorbar()
plt.title('Attention Weights (Head 1, Batch 1)')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.show()

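Task 2: Use Pre-trained Models

Goal: use pre-trained models from Hugging Face.

Requirements:

Use BERT for text classification
Use GPT-2 for text generation
Use CLIP for image classification
Analyze the model outputs

Code skeleton: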

python

from transformers import BertTokenizer, BertForSequenceClassification, GPT2LMHeadModel, GPT2Tokenizer
import torch

# Text classification with BERT
# Note: the classification head is newly initialized, so the predicted label
# is not meaningful until the model has been fine-tuned.
def bert_classification(text):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
    model.eval()

    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)

    return predictions.item()

# Text generation with GPT-2
def gpt2_generation(prompt, max_length=100):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')

    inputs = tokenizer(prompt, return_tensors='pt')
    # GPT-2 has no padding token; reuse the end-of-sequence token for padding
    outputs = model.generate(**inputs, max_length=max_length, num_return_sequences=1,
                             pad_token_id=tokenizer.eos_token_id)

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Test
text = "This movie is great!"
print("BERT Classification:", bert_classification(text))

prompt = "Once upon a time"
print("GPT-2 Generation:", gpt2_generation(prompt))

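Task 3: Prompt Engineering

Goal: study the effect of different prompting strategies.

Requirements:

Design different prompts
Test zero-shot, one-shot, and few-shot prompting
Test Chain-of-Thought prompting
Analyze the effect of each prompt

Example: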

python

def zero_shot(prompt):
    # Zero-shot prompt: the question on its own
    return prompt

def one_shot(prompt, example):
    # One-shot prompt: one worked example, then the question
    return f"{example}\n\n{prompt}"

def few_shot(prompt, examples):
    # Few-shot prompt: several worked examples, then the question
    examples_str = "\n\n".join(examples)
    return f"{examples_str}\n\n{prompt}"

def chain_of_thought(prompt):
    # Chain-of-Thought prompt: ask the model to reason step by step
    return f"{prompt}\n\nLet's think step by step:"

# Test
question = "If I have 3 apples, eat 1, and then buy 2 more, how many apples do I have now?"

# Zero-shot
print("Zero-shot:")
print(zero_shot(question))

# One-shot
example = "Q: 1+1=?\nA: 2"
print("\nOne-shot:")
print(one_shot(question, example))

# Few-shot
examples = [
    "Q: 1+1=?\nA: 2",
    "Q: 2+2=?\nA: 4",
    "Q: 3+3=?\nA: 6"
]
print("\nFew-shot:")
print(few_shot(question, examples))

# Chain-of-Thought
print("\nChain-of-Thought:")
print(chain_of_thought(question))

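Homework

Assignment 1: BERT vs GPT

Topic: compare the architectures and capabilities of BERT and GPT.

Requirements:

Analyze the architectural differences between BERT and GPT
Compare their pre-training tasks
Compare the scenarios each is suited to
Write an analysis report of about 1,500 words

Assignment 2: Fine-tuning a Pre-trained Model

Topic: fine-tune a pre-trained model on a downstream task.

Requirements:

Choose a pre-trained model (for example, BERT)
Choose a downstream task (for example, sentiment analysis)
Fine-tune the model
Evaluate its performance

Assignment 3: Prompt Engineering Study

Topic: study the effect of prompt engineering.

Requirements:

Design several prompting strategies
Test them on different tasks
Compare their effects
Summarize best practices

References

Required Reading

Vaswani, A., et al. (2017). "Attention Is All You Need". NeurIPS.
- The original Transformer paper
Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.
- The original BERT paper
Brown, T., et al. (2020). "Language Models are Few-Shot Learners". NeurIPS.
- The original GPT-3 paper
Radford, A., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision". ICML.
- The original CLIP paper

Online Resources

Hugging Face Transformers: https://huggingface.co/docs/transformers/
- Transformers library documentation
The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- A visual explanation of the Transformer
OpenAI API: https://platform.openai.com/docs/
- OpenAI API documentation

Further Reading

Transformer Variants

Child, R., et al. (2019). "Generating Long Sequences with Sparse Transformers". arXiv.
- Sparse Transformer
Beltagy, I., Peters, M. E., & Cohan, A. (2020). "Longformer: The Long-Document Transformer". arXiv.
- Longformer

Multimodal AI

Ramesh, A., et al. (2021). "Zero-Shot Text-to-Image Generation". ICML.
- The DALL-E paper
Rombach, R., et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models". CVPR.
- The Stable Diffusion paper

Next Lesson

The next lesson covers the generative AI boom (2022-2023): the ChatGPT phenomenon, AIGC technology, Prompt Engineering, the surge of AI applications, and how together they opened the era of generative AI.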
