Day 17: LLM Evaluation and Selection
Learning Objectives
- Understand why LLM evaluation matters
- Master common LLM evaluation metrics
- Master benchmark testing methods
- Be able to evaluate and select an appropriate LLM
Course Content
1. Overview of LLM Evaluation
1.1 Why Evaluate
Why evaluation matters:
- Choosing the right model for the job
- Optimizing model performance
- Reducing usage costs
- Improving user experience
Evaluation dimensions:
1. Performance: accuracy, response speed, etc.
2. Cost: API fees, deployment costs, etc.
3. Usability: API ergonomics, documentation quality, etc.
4. Reliability: stability, availability, etc.
1.2 Evaluation Workflow
Standard workflow:
1. Define evaluation goals
2. Choose evaluation metrics
3. Prepare test data
4. Run the evaluation
5. Analyze the results
6. Make a decision
2. LLM Evaluation Metrics
2.1 Accuracy Metrics
2.1.1 Text Generation Quality
BLEU score:
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, candidate):
    """
    Compute the BLEU score.
    Args:
        reference: reference text
        candidate: candidate text
    Returns:
        bleu_score: BLEU score
    """
    reference_tokens = reference.split()
    candidate_tokens = candidate.split()
    smoothing = SmoothingFunction().method1
    bleu_score = sentence_bleu(
        [reference_tokens],
        candidate_tokens,
        smoothing_function=smoothing
    )
    return bleu_score

# Example
reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumps over the lazy dog"
bleu = calculate_bleu(reference, candidate)
print(f"BLEU score: {bleu:.4f}")
```
ROUGE score:
```python
from rouge import Rouge

def calculate_rouge(reference, candidate):
    """
    Compute ROUGE scores.
    Args:
        reference: reference text
        candidate: candidate text
    Returns:
        rouge_scores: ROUGE scores
    """
    rouge = Rouge()
    scores = rouge.get_scores(candidate, reference)
    return scores[0]

# Example
reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumps over the lazy dog"
rouge = calculate_rouge(reference, candidate)
print(f"ROUGE-L score: {rouge['rouge-l']['f']:.4f}")
```
2.1.2 Task Accuracy
```python
def calculate_accuracy(predictions, labels):
    """
    Compute accuracy.
    Args:
        predictions: predicted results
        labels: ground-truth labels
    Returns:
        accuracy: accuracy
    """
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    accuracy = correct / len(predictions)
    return accuracy

# Example
predictions = ["positive", "negative", "neutral", "positive"]
labels = ["positive", "negative", "neutral", "negative"]
accuracy = calculate_accuracy(predictions, labels)
print(f"Accuracy: {accuracy:.2%}")
```
2.1.3 Code Generation Quality
```python
import ast

def validate_code(code):
    """
    Validate Python syntax.
    Args:
        code: code string
    Returns:
        is_valid: whether the code is syntactically valid
        error: error message, if any
    """
    try:
        ast.parse(code)
        return True, None
    except SyntaxError as e:
        return False, str(e)

# Example
code = """
def add(a, b):
    return a + b
"""
is_valid, error = validate_code(code)
if is_valid:
    print("Code syntax is valid")
else:
    print(f"Syntax error: {error}")
```
2.2 Efficiency Metrics
2.2.1 Response Time
```python
import time

def measure_response_time(client, model, prompt):
    """
    Measure response time.
    Assumes a simple client wrapper exposing get_text(model, prompt).
    Args:
        client: LLM client
        model: model name
        prompt: input prompt
    Returns:
        response_time: response time in seconds
        response: response content
    """
    start_time = time.time()
    response = client.get_text(model, prompt)
    end_time = time.time()
    response_time = end_time - start_time
    return response_time, response

# Example
response_time, response = measure_response_time(client, "gpt-4", "Hello")
print(f"Response time: {response_time:.2f}s")
```
2.2.2 Throughput
```python
def measure_throughput(client, model, prompts, duration=60):
    """
    Measure throughput.
    Args:
        client: LLM client
        model: model name
        prompts: list of prompts
        duration: test duration in seconds
    Returns:
        throughput: throughput in requests per second
    """
    start_time = time.time()
    count = 0
    for prompt in prompts:
        if time.time() - start_time > duration:
            break
        client.get_text(model, prompt)
        count += 1
    elapsed_time = time.time() - start_time
    throughput = count / elapsed_time
    return throughput

# Example
prompts = ["Hello"] * 100
throughput = measure_throughput(client, "gpt-4", prompts)
print(f"Throughput: {throughput:.2f} requests/s")
```
2.3 Cost Metrics
2.3.1 API Cost
```python
def calculate_cost(input_tokens, output_tokens, model_pricing):
    """
    Compute API cost.
    Args:
        input_tokens: number of input tokens
        output_tokens: number of output tokens
        model_pricing: model pricing
    Returns:
        cost: cost in USD
    """
    input_cost = (input_tokens / 1000) * model_pricing['input_price']
    output_cost = (output_tokens / 1000) * model_pricing['output_price']
    total_cost = input_cost + output_cost
    return total_cost

# Example
model_pricing = {
    'input_price': 0.03,  # price per 1,000 tokens
    'output_price': 0.06
}
input_tokens = 1000
output_tokens = 500
cost = calculate_cost(input_tokens, output_tokens, model_pricing)
print(f"Cost: ${cost:.4f}")
```
2.3.2 Deployment Cost
```python
def calculate_deployment_cost(gpu_hours, gpu_cost_per_hour):
    """
    Compute deployment cost.
    Args:
        gpu_hours: GPU hours used
        gpu_cost_per_hour: GPU cost per hour
    Returns:
        cost: cost in USD
    """
    cost = gpu_hours * gpu_cost_per_hour
    return cost

# Example
gpu_hours = 24 * 30  # 30 days, 24 hours a day
gpu_cost_per_hour = 0.5  # $0.50 per hour
cost = calculate_deployment_cost(gpu_hours, gpu_cost_per_hour)
print(f"Monthly deployment cost: ${cost:.2f}")
```
3. Benchmark Testing
3.1 Common Benchmarks
3.1.1 MMLU (Massive Multitask Language Understanding)
```python
def evaluate_mmlu(client, model, questions):
    """
    Evaluate on an MMLU-style benchmark.
    Args:
        client: LLM client
        model: model name
        questions: list of questions
    Returns:
        accuracy: accuracy
    """
    correct = 0
    total = len(questions)
    for question in questions:
        prompt = f"""
Answer the following multiple-choice question. Output only the option letter.
Question: {question['question']}
A. {question['A']}
B. {question['B']}
C. {question['C']}
D. {question['D']}
Answer:
"""
        response = client.get_text(model, prompt).strip()
        if response == question['answer']:
            correct += 1
    accuracy = correct / total
    return accuracy

# Example
questions = [
    {
        'question': 'Which keyword defines a function in Python?',
        'A': 'class',
        'B': 'def',
        'C': 'function',
        'D': 'func',
        'answer': 'B'
    },
    # more questions...
]
accuracy = evaluate_mmlu(client, "gpt-4", questions)
print(f"MMLU accuracy: {accuracy:.2%}")
```
3.1.2 HumanEval (Code Generation)
````python
import re

def evaluate_human_eval(client, model, problems):
    """
    Evaluate on a HumanEval-style benchmark.
    Args:
        client: LLM client
        model: model name
        problems: list of problems
    Returns:
        pass_rate: pass rate
    """
    passed = 0
    total = len(problems)
    for problem in problems:
        prompt = f"""
Complete the following Python function:
```python
{problem['prompt']}
```
"""
        response = client.get_text(model, prompt)
        # Extract the code
        code = extract_code(response)
        # Validate and test the code
        if validate_and_test(code, problem['tests']):
            passed += 1
    pass_rate = passed / total
    return pass_rate

def extract_code(text):
    """Extract the first fenced code block; fall back to the raw text."""
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

def validate_and_test(code, tests):
    """Run the code and its test assertions; return True if all pass.
    Warning: exec() on model output is unsafe outside a sandbox."""
    namespace = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
        return True
    except Exception:
        return False

# Example
problems = [
    {
        'prompt': 'def add(a, b):\n    """Return the sum of two numbers."""\n    ',
        'tests': ['assert add(1, 2) == 3', 'assert add(-1, 1) == 0']
    },
    # more problems...
]
pass_rate = evaluate_human_eval(client, "gpt-4", problems)
print(f"HumanEval pass rate: {pass_rate:.2%}")
````
3.1.3 GSM8K (Mathematical Reasoning)
```python
def evaluate_gsm8k(client, model, problems):
    """
    Evaluate on a GSM8K-style benchmark.
    Args:
        client: LLM client
        model: model name
        problems: list of problems
    Returns:
        accuracy: accuracy
    """
    correct = 0
    total = len(problems)
    for problem in problems:
        prompt = f"""
Solve the following math problem step by step and give the final answer.
Problem: {problem['question']}
"""
        response = client.get_text(model, prompt)
        # Extract the answer. Exact string matching is brittle; the official
        # GSM8K evaluation extracts and compares the final number instead.
        answer = extract_answer(response)
        if answer == problem['answer']:
            correct += 1
    accuracy = correct / total
    return accuracy

def extract_answer(text):
    """Naive extraction: take the last non-empty line as the final answer."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

# Example
problems = [
    {
        'question': 'A farm has 20 chickens and rabbits in total, with 56 legs altogether. How many chickens and how many rabbits are there?',
        'answer': '12 chickens and 8 rabbits'
    },
    # more problems...
]
accuracy = evaluate_gsm8k(client, "gpt-4", problems)
print(f"GSM8K accuracy: {accuracy:.2%}")
```
3.2 Custom Benchmarks
```python
class CustomBenchmark:
    """Custom benchmark harness."""

    def __init__(self, client, model):
        self.client = client
        self.model = model
        self.test_cases = []
        self.results = []

    def add_test_case(self, test_case):
        """
        Add a test case.
        Args:
            test_case: test case
        """
        self.test_cases.append(test_case)

    def run(self):
        """
        Run the benchmark.
        Returns:
            results: test results
        """
        results = []
        for test_case in self.test_cases:
            result = self._run_test_case(test_case)
            results.append(result)
        self.results = results
        return results

    def _run_test_case(self, test_case):
        """
        Run a single test case.
        Args:
            test_case: test case
        Returns:
            result: test result
        """
        start_time = time.time()
        response = self.client.get_text(
            self.model,
            test_case['prompt']
        )
        end_time = time.time()
        response_time = end_time - start_time
        # Evaluate the response
        is_correct = self._evaluate(response, test_case)
        result = {
            'test_case': test_case,
            'response': response,
            'response_time': response_time,
            'is_correct': is_correct
        }
        return result

    def _evaluate(self, response, test_case):
        """
        Evaluate a response.
        Naive check: the response counts as correct if it contains the
        expected text. Replace with task-specific logic as needed.
        Args:
            response: response
            test_case: test case
        Returns:
            is_correct: whether the response is correct
        """
        return test_case['expected'] in response

    def get_accuracy(self):
        """Return accuracy."""
        correct = sum(1 for r in self.results if r['is_correct'])
        accuracy = correct / len(self.results)
        return accuracy

    def get_average_response_time(self):
        """Return the average response time."""
        total_time = sum(r['response_time'] for r in self.results)
        avg_time = total_time / len(self.results)
        return avg_time

# Usage example
benchmark = CustomBenchmark(client, "gpt-4")
# Add a test case
benchmark.add_test_case({
    'prompt': 'What is Python?',
    'expected': 'Python is a programming language'
})
# Run the benchmark
results = benchmark.run()
# Collect the results
accuracy = benchmark.get_accuracy()
avg_time = benchmark.get_average_response_time()
print(f"Accuracy: {accuracy:.2%}")
print(f"Average response time: {avg_time:.2f}s")
```
4. Model Selection
4.1 Selection Criteria
4.1.1 Performance
```python
def evaluate_performance(client, models, test_cases):
    """
    Evaluate model performance.
    Args:
        client: LLM client
        models: list of models
        test_cases: test cases
    Returns:
        results: evaluation results
    """
    results = {}
    for model in models:
        benchmark = CustomBenchmark(client, model)
        for test_case in test_cases:
            benchmark.add_test_case(test_case)
        benchmark.run()
        results[model] = {
            'accuracy': benchmark.get_accuracy(),
            'avg_response_time': benchmark.get_average_response_time()
        }
    return results

# Example
models = ["gpt-4", "gpt-3.5-turbo", "claude-3-opus"]
test_cases = [
    {'prompt': 'What is Python?', 'expected': 'Python is a programming language'},
    # more test cases...
]
results = evaluate_performance(client, models, test_cases)
for model, metrics in results.items():
    print(f"{model}:")
    print(f"  Accuracy: {metrics['accuracy']:.2%}")
    print(f"  Average response time: {metrics['avg_response_time']:.2f}s")
```
4.1.2 Cost
```python
def calculate_total_cost(client, models, test_cases, model_pricing):
    """
    Compute total cost.
    Assumes the client also exposes call(model, prompt), which returns the
    raw response with token usage, unlike the text-only get_text() above.
    Args:
        client: LLM client
        models: list of models
        test_cases: test cases
        model_pricing: model pricing
    Returns:
        costs: cost per model
    """
    costs = {}
    for model in models:
        total_cost = 0
        for test_case in test_cases:
            response = client.call(model, test_case['prompt'])
            input_tokens = response['usage']['prompt_tokens']
            output_tokens = response['usage']['completion_tokens']
            cost = calculate_cost(
                input_tokens,
                output_tokens,
                model_pricing[model]
            )
            total_cost += cost
        costs[model] = total_cost
    return costs

# Example (illustrative prices per 1,000 tokens; check current provider pricing)
model_pricing = {
    'gpt-4': {'input_price': 0.03, 'output_price': 0.06},
    'gpt-3.5-turbo': {'input_price': 0.0015, 'output_price': 0.002},
    'claude-3-opus': {'input_price': 0.015, 'output_price': 0.075}
}
costs = calculate_total_cost(client, models, test_cases, model_pricing)
for model, cost in costs.items():
    print(f"{model}: ${cost:.4f}")
```
4.1.3 Combined Evaluation
```python
def comprehensive_evaluation(client, models, test_cases, model_pricing):
    """
    Combined evaluation.
    Args:
        client: LLM client
        models: list of models
        test_cases: test cases
        model_pricing: model pricing
    Returns:
        evaluation: evaluation results
    """
    # Performance evaluation
    performance = evaluate_performance(client, models, test_cases)
    # Cost evaluation
    costs = calculate_total_cost(client, models, test_cases, model_pricing)
    # Composite score
    evaluation = {}
    for model in models:
        accuracy = performance[model]['accuracy']
        response_time = performance[model]['avg_response_time']
        cost = costs[model]
        # Rough composite score: higher accuracy, lower latency and lower
        # cost are better. The reciprocal terms are not normalized and the
        # 0.5/0.3/0.2 weights are illustrative; tune both for your use case.
        score = (accuracy * 0.5) + ((1 / response_time) * 0.3) + ((1 / cost) * 0.2)
        evaluation[model] = {
            'accuracy': accuracy,
            'response_time': response_time,
            'cost': cost,
            'score': score
        }
    # Sort by score, best first
    evaluation = dict(sorted(evaluation.items(), key=lambda x: x[1]['score'], reverse=True))
    return evaluation

# Example
evaluation = comprehensive_evaluation(client, models, test_cases, model_pricing)
print("Combined evaluation results:")
for rank, (model, metrics) in enumerate(evaluation.items(), 1):
    print(f"\nRank {rank}: {model}")
    print(f"  Accuracy: {metrics['accuracy']:.2%}")
    print(f"  Response time: {metrics['response_time']:.2f}s")
    print(f"  Cost: ${metrics['cost']:.4f}")
    print(f"  Composite score: {metrics['score']:.4f}")
```
4.2 Selection Recommendations
Scenario 1: Code Generation
```python
# Recommended models
code_generation_models = [
    {
        'model': 'gpt-4',
        'reason': 'Strong code generation, high accuracy',
        'cost': 'high'
    },
    {
        'model': 'claude-3-opus',
        'reason': 'High code quality, detailed comments',
        'cost': 'high'
    },
    {
        'model': 'codellama',
        'reason': 'Open source, can be deployed locally',
        'cost': 'low'
    }
]
```
Scenario 2: Text Generation
```python
# Recommended models
text_generation_models = [
    {
        'model': 'gpt-4',
        'reason': 'High generation quality, strong creativity',
        'cost': 'high'
    },
    {
        'model': 'claude-3-opus',
        'reason': 'Fluent text, clear logic',
        'cost': 'high'
    },
    {
        'model': 'gpt-3.5-turbo',
        'reason': 'Fast and inexpensive',
        'cost': 'low'
    }
]
```
Scenario 3: Q&A Systems
```python
# Recommended models
qa_models = [
    {
        'model': 'gpt-4',
        'reason': 'Strong comprehension, accurate answers',
        'cost': 'high'
    },
    {
        'model': 'claude-3-opus',
        'reason': 'Long context window, suited to long documents',
        'cost': 'high'
    },
    {
        'model': 'gemini-pro',
        'reason': 'Good multilingual support',
        'cost': 'medium'
    }
]
```
Hands-on Task
Task: evaluate three models on a specific task.
Requirements:
- Pick a task (code generation, text generation, Q&A, etc.)
- Pick three models
- Design test cases
- Run the evaluation
- Analyze the results
Code skeleton:
```python
from openai import OpenAI

client = OpenAI(api_key="your-api-key")
# Choose the models
# Note: a single OpenAI client cannot call Anthropic models such as
# claude-3-opus; in practice you need one client per provider.
models = ["gpt-4", "gpt-3.5-turbo", "claude-3-opus"]
# Design the test cases
test_cases = [
    # TODO: design test cases
]
# Evaluate the models
results = comprehensive_evaluation(client, models, test_cases, model_pricing)
# Analyze the results
# TODO: analyze the results
```
Homework
Assignment 1: Model Evaluation Report
Topic: choose the best model for a specific task.
Requirements:
- Pick a real task
- Pick 3 to 5 models
- Design test cases
- Run the evaluation
- Write an evaluation report
Assignment 2: Cost Optimization Analysis
Topic: analyze how to reduce LLM usage costs.
Requirements:
- Research each model's pricing
- Analyze cost optimization strategies
- Provide optimization recommendations
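As a starting point for this assignment, per-request costs can be compared across models with a small sketch like the one below. The price table is illustrative, not current provider pricing:

```python
# Illustrative prices per 1,000 tokens in USD -- real prices change often,
# so always check each provider's pricing page before relying on them.
PRICING = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
}

def request_cost(model, input_tokens, output_tokens):
    """Cost of a single request in USD."""
    p = PRICING[model]
    return input_tokens / 1000 * p["input"] + output_tokens / 1000 * p["output"]

# One common strategy: route simple requests to a cheaper model and
# quantify the savings.
expensive = request_cost("gpt-4", 1200, 400)
cheap = request_cost("gpt-3.5-turbo", 1200, 400)
print(f"gpt-4: ${expensive:.4f}, gpt-3.5-turbo: ${cheap:.4f}, "
      f"savings: {1 - cheap / expensive:.0%}")
```

Extending this with your own traffic profile (average token counts, requests per day) turns it into a monthly cost projection.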
Assignment 3: Performance Optimization Analysis
Topic: analyze how to improve LLM performance.
Requirements:
- Analyze the factors that affect performance
- Propose an optimization plan
- Implement the optimization in code
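One concrete optimization worth exploring for this assignment is caching: identical prompts can be served from memory instead of triggering a new API call. A minimal sketch, where `fake_llm` is a hypothetical stand-in for a real client call:

```python
from functools import lru_cache

# Stand-in for a real LLM call; replace with your client's method.
def fake_llm(prompt):
    fake_llm.calls += 1
    return f"response to: {prompt}"
fake_llm.calls = 0

@lru_cache(maxsize=1024)
def cached_llm(prompt):
    """Identical prompts hit the in-memory cache instead of the paid API."""
    return fake_llm(prompt)

cached_llm("What is Python?")
cached_llm("What is Python?")  # served from cache, no second call
print(fake_llm.calls)  # → 1
```

Caching helps only when prompts repeat exactly; for near-duplicate prompts you would need semantic caching, which is a different trade-off.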
References
Evaluation Benchmarks
MMLU: https://github.com/hendrycks/test
- Multitask language understanding benchmark
HumanEval: https://github.com/openai/human-eval
- Code generation benchmark
GSM8K: https://github.com/openai/grade-school-math
- Grade-school math reasoning benchmark
Evaluation Tools
EleutherAI LM Evaluation Harness: https://github.com/EleutherAI/lm-evaluation-harness
- LLM evaluation toolkit
Promptfoo: https://promptfoo.dev/
- Prompt evaluation tool
Online Resources
Hugging Face Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- Open LLM leaderboard
Papers with Code: https://paperswithcode.com/
- Papers and code
Further Reading
Evaluation Methods
- Automatic evaluation: scoring with automated metrics
- Human evaluation: manual rating by annotators
- Hybrid evaluation: combining automatic and human evaluation
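In practice, a hybrid setup is often just a weighted blend of an automatic metric and spot-check human ratings. A minimal sketch, where the 0.6/0.4 split is an arbitrary project choice, not a standard:

```python
def hybrid_score(auto_score, human_scores, auto_weight=0.6):
    """Blend an automatic metric (0-1) with the mean of human ratings (0-1).

    The weighting is a project-specific choice; raise auto_weight when the
    automatic metric is known to correlate well with human judgment.
    """
    human_avg = sum(human_scores) / len(human_scores)
    return auto_weight * auto_score + (1 - auto_weight) * human_avg

# Example: automatic score 0.72, three human ratings
print(f"{hybrid_score(0.72, [0.8, 0.9, 0.7]):.3f}")
```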
Evaluation Challenges
- Subjectivity: evaluation criteria can be subjective
- Cost: thorough evaluation is expensive
- Time: evaluation takes a long time
Next Lesson
In the next lesson we will wrap up the LLM module with a summary and project: reviewing this week's material and building a hands-on intelligent Q&A system.

