
Day 17: LLM Evaluation and Selection

Learning Objectives

  • Understand why LLM evaluation matters
  • Master LLM evaluation metrics
  • Master benchmarking methods
  • Be able to evaluate and select a suitable LLM

Course Content

1. Overview of LLM Evaluation

1.1 Why Evaluate

Why evaluation matters

  • Choose the right model
  • Optimize model performance
  • Reduce usage cost
  • Improve user experience

Evaluation dimensions

1. Performance: accuracy, response speed, etc.
2. Cost: API fees, deployment cost, etc.
3. Usability: API ergonomics, documentation quality, etc.
4. Reliability: stability, availability, etc.

1.2 Evaluation Workflow

Standard workflow

1. Define the evaluation goal
2. Choose evaluation metrics
3. Prepare test data
4. Run the evaluation
5. Analyze the results
6. Make a choice
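The workflow above can be wired together in a small driver. This is a minimal sketch under stated assumptions: `generate` stands in for the model under test, `exact_match` is the chosen metric, and all names are illustrative rather than a fixed API.

```python
def run_evaluation(generate, test_data, metric):
    """Steps 3-5: run the model on the test data and score the outputs."""
    outputs = [generate(case["prompt"]) for case in test_data]  # step 4
    labels = [case["expected"] for case in test_data]
    return metric(outputs, labels)                              # step 5

def exact_match(outputs, labels):
    """Step 2: the chosen metric -- fraction of exact matches."""
    hits = sum(1 for o, l in zip(outputs, labels) if o == l)
    return hits / len(labels)

# Step 3: prepare test data (toy example)
test_data = [
    {"prompt": "2+2=?", "expected": "4"},
    {"prompt": "capital of France?", "expected": "Paris"},
]

# A fake model standing in for a real LLM call
fake_model = lambda prompt: {"2+2=?": "4"}.get(prompt, "unknown")

score = run_evaluation(fake_model, test_data, exact_match)
print(f"exact-match accuracy: {score:.2%}")  # 50.00%
```

Steps 1 and 6 (defining the goal and making the choice) stay human decisions; the code only automates the middle of the loop.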

2. LLM Evaluation Metrics

2.1 Accuracy Metrics

2.1.1 Text Generation Quality

BLEU score

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, candidate):
    """
    Compute the BLEU score.

    Args:
        reference: reference text
        candidate: candidate text

    Returns:
        bleu_score: BLEU score
    """
    reference_tokens = reference.split()
    candidate_tokens = candidate.split()

    # Smoothing avoids zero scores when higher-order n-grams do not match
    smoothing = SmoothingFunction().method1
    bleu_score = sentence_bleu(
        [reference_tokens],
        candidate_tokens,
        smoothing_function=smoothing
    )

    return bleu_score

# Example (identical sentences, so the score is 1.0)
reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumps over the lazy dog"
bleu = calculate_bleu(reference, candidate)
print(f"BLEU score: {bleu:.4f}")

ROUGE score

python
from rouge import Rouge

def calculate_rouge(reference, candidate):
    """
    Compute ROUGE scores.

    Args:
        reference: reference text
        candidate: candidate text

    Returns:
        rouge_scores: ROUGE scores
    """
    rouge = Rouge()
    scores = rouge.get_scores(candidate, reference)
    return scores[0]

# Example (identical sentences, so the F-score is 1.0)
reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumps over the lazy dog"
rouge = calculate_rouge(reference, candidate)
print(f"ROUGE-L score: {rouge['rouge-l']['f']:.4f}")

2.1.2 Task Accuracy

python
def calculate_accuracy(predictions, labels):
    """
    Compute accuracy.

    Args:
        predictions: predicted results
        labels: ground-truth labels

    Returns:
        accuracy: accuracy
    """
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    accuracy = correct / len(predictions)
    return accuracy

# Example (sentiment classification)
predictions = ["positive", "negative", "neutral", "positive"]
labels = ["positive", "negative", "neutral", "negative"]
accuracy = calculate_accuracy(predictions, labels)
print(f"Accuracy: {accuracy:.2%}")

2.1.3 Code Generation Quality

python
import ast

def validate_code(code):
    """
    Validate code syntax.

    Args:
        code: code string

    Returns:
        is_valid: whether the code is syntactically valid
        error: error message, if any
    """
    try:
        ast.parse(code)
        return True, None
    except SyntaxError as e:
        return False, str(e)

# Example
code = """
def add(a, b):
    return a + b
"""

is_valid, error = validate_code(code)
if is_valid:
    print("Code syntax is valid")
else:
    print(f"Code syntax error: {error}")

2.2 Efficiency Metrics

2.2.1 Response Time

python
import time

def measure_response_time(client, model, prompt):
    """
    Measure response time.

    Args:
        client: LLM client
        model: model name
        prompt: input prompt

    Returns:
        response_time: response time in seconds
        response: response content
    """
    start_time = time.time()
    response = client.get_text(model, prompt)
    end_time = time.time()

    response_time = end_time - start_time
    return response_time, response

# Example (assumes an initialized `client`)
response_time, response = measure_response_time(client, "gpt-4", "Hello")
print(f"Response time: {response_time:.2f}s")

2.2.2 Throughput

python
def measure_throughput(client, model, prompts, duration=60):
    """
    Measure throughput.

    Args:
        client: LLM client
        model: model name
        prompts: list of prompts
        duration: test duration in seconds

    Returns:
        throughput: throughput in requests per second
    """
    start_time = time.time()
    count = 0

    for prompt in prompts:
        if time.time() - start_time > duration:
            break

        client.get_text(model, prompt)
        count += 1

    elapsed_time = time.time() - start_time
    throughput = count / elapsed_time

    return throughput

# Example
prompts = ["Hello"] * 100
throughput = measure_throughput(client, "gpt-4", prompts)
print(f"Throughput: {throughput:.2f} requests/s")

2.3 Cost Metrics

2.3.1 API Cost

python
def calculate_cost(input_tokens, output_tokens, model_pricing):
    """
    Compute API cost.

    Args:
        input_tokens: number of input tokens
        output_tokens: number of output tokens
        model_pricing: model pricing (per 1,000 tokens)

    Returns:
        cost: cost in USD
    """
    input_cost = (input_tokens / 1000) * model_pricing['input_price']
    output_cost = (output_tokens / 1000) * model_pricing['output_price']
    total_cost = input_cost + output_cost

    return total_cost

# Example
model_pricing = {
    'input_price': 0.03,  # price per 1,000 tokens
    'output_price': 0.06
}

input_tokens = 1000
output_tokens = 500

cost = calculate_cost(input_tokens, output_tokens, model_pricing)
print(f"Cost: ${cost:.4f}")

2.3.2 Deployment Cost

python
def calculate_deployment_cost(gpu_hours, gpu_cost_per_hour):
    """
    Compute deployment cost.

    Args:
        gpu_hours: number of GPU hours used
        gpu_cost_per_hour: GPU cost per hour

    Returns:
        cost: cost in USD
    """
    cost = gpu_hours * gpu_cost_per_hour
    return cost

# Example
gpu_hours = 24 * 30  # 30 days, 24 hours a day
gpu_cost_per_hour = 0.5  # $0.50 per hour

cost = calculate_deployment_cost(gpu_hours, gpu_cost_per_hour)
print(f"Monthly deployment cost: ${cost:.2f}")

3. Benchmarking

3.1 Common Benchmarks

3.1.1 MMLU (Massive Multitask Language Understanding)

python
def evaluate_mmlu(client, model, questions):
    """
    Evaluate on an MMLU-style benchmark.

    Args:
        client: LLM client
        model: model name
        questions: list of questions

    Returns:
        accuracy: accuracy
    """
    correct = 0
    total = len(questions)

    for question in questions:
        prompt = f"""
        Answer the following multiple-choice question. Output only the option letter.

        Question: {question['question']}
        A. {question['A']}
        B. {question['B']}
        C. {question['C']}
        D. {question['D']}

        Answer:
        """

        response = client.get_text(model, prompt).strip()

        # Compare the first character so trailing text does not break the match
        if response[:1] == question['answer']:
            correct += 1

    accuracy = correct / total
    return accuracy

# Example
questions = [
    {
        'question': 'Which keyword is used to define a function in Python?',
        'A': 'class',
        'B': 'def',
        'C': 'function',
        'D': 'func',
        'answer': 'B'
    },
    # more questions...
]

accuracy = evaluate_mmlu(client, "gpt-4", questions)
print(f"MMLU accuracy: {accuracy:.2%}")

3.1.2 HumanEval (Code Generation)

python
import re

def evaluate_human_eval(client, model, problems):
    """
    Evaluate on a HumanEval-style benchmark.

    Args:
        client: LLM client
        model: model name
        problems: list of problems

    Returns:
        pass_rate: pass rate
    """
    passed = 0
    total = len(problems)

    for problem in problems:
        prompt = f"""
        Complete the following Python function:

        ```python
        {problem['prompt']}
        ```
        """

        response = client.get_text(model, prompt)

        # Extract the code
        code = extract_code(response)

        # Validate and test the code
        if validate_and_test(code, problem['tests']):
            passed += 1

    pass_rate = passed / total
    return pass_rate

def extract_code(text):
    """Extract the first fenced code block; fall back to the raw text."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

def validate_and_test(code, tests):
    """Run the code and its assert-style tests; True only if all pass.

    Warning: exec on model-generated code is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
        return True
    except Exception:
        return False

# Example
problems = [
    {
        'prompt': 'def add(a, b):\n    """Return the sum of two numbers."""\n    ',
        'tests': ['assert add(1, 2) == 3', 'assert add(-1, 1) == 0']
    },
    # more problems...
]

pass_rate = evaluate_human_eval(client, "gpt-4", problems)
print(f"HumanEval pass rate: {pass_rate:.2%}")

3.1.3 GSM8K (Mathematical Reasoning)

python
def evaluate_gsm8k(client, model, problems):
    """
    Evaluate on a GSM8K-style benchmark.

    Args:
        client: LLM client
        model: model name
        problems: list of problems

    Returns:
        accuracy: accuracy
    """
    correct = 0
    total = len(problems)

    for problem in problems:
        prompt = f"""
        Solve the following math problem step by step, then give the final
        answer on a line starting with "Answer:".

        Problem: {problem['question']}
        """

        response = client.get_text(model, prompt)

        # Extract the answer
        answer = extract_answer(response)

        if answer == problem['answer']:
            correct += 1

    accuracy = correct / total
    return accuracy

def extract_answer(text):
    """Naive heuristic: take the text after the last "Answer:" marker."""
    marker = "Answer:"
    if marker in text:
        return text.rsplit(marker, 1)[1].strip()
    # Fall back to the last non-empty line
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

# Example
problems = [
    {
        'question': 'A farm has 20 chickens and rabbits in total, with 56 legs altogether. How many chickens and how many rabbits are there?',
        'answer': '12 chickens and 8 rabbits'
    },
    # more problems...
]

accuracy = evaluate_gsm8k(client, "gpt-4", problems)
print(f"GSM8K accuracy: {accuracy:.2%}")

3.2 Custom Benchmarks

python
class CustomBenchmark:
    """Custom benchmark class."""

    def __init__(self, client, model):
        self.client = client
        self.model = model
        self.test_cases = []  # fix: initialize the list that add_test_case appends to
        self.results = []

    def add_test_case(self, test_case):
        """
        Add a test case.

        Args:
            test_case: test case
        """
        self.test_cases.append(test_case)

    def run(self):
        """
        Run the benchmark.

        Returns:
            results: test results
        """
        results = []

        for test_case in self.test_cases:
            result = self._run_test_case(test_case)
            results.append(result)

        self.results = results
        return results

    def _run_test_case(self, test_case):
        """
        Run a single test case.

        Args:
            test_case: test case

        Returns:
            result: test result
        """
        start_time = time.time()

        response = self.client.get_text(
            self.model,
            test_case['prompt']
        )

        end_time = time.time()
        response_time = end_time - start_time

        # Evaluate the result
        is_correct = self._evaluate(response, test_case)

        result = {
            'test_case': test_case,
            'response': response,
            'response_time': response_time,
            'is_correct': is_correct
        }

        return result

    def _evaluate(self, response, test_case):
        """
        Evaluate a response.

        Args:
            response: response
            test_case: test case

        Returns:
            is_correct: whether the response is correct
        """
        # Simple check: the expected text appears in the response
        return test_case['expected'] in response

    def get_accuracy(self):
        """Return the accuracy."""
        correct = sum(1 for r in self.results if r['is_correct'])
        accuracy = correct / len(self.results)
        return accuracy

    def get_average_response_time(self):
        """Return the average response time."""
        total_time = sum(r['response_time'] for r in self.results)
        avg_time = total_time / len(self.results)
        return avg_time

# Usage example
benchmark = CustomBenchmark(client, "gpt-4")

# Add a test case (expected text is matched as a substring of the response)
benchmark.add_test_case({
    'prompt': 'What is Python?',
    'expected': 'programming language'
})

# Run the benchmark
results = benchmark.run()

# Collect the results
accuracy = benchmark.get_accuracy()
avg_time = benchmark.get_average_response_time()

print(f"Accuracy: {accuracy:.2%}")
print(f"Average response time: {avg_time:.2f}s")

4. Model Selection

4.1 Selection Criteria

4.1.1 Performance

python
def evaluate_performance(client, models, test_cases):
    """
    Evaluate model performance.

    Args:
        client: LLM client
        models: list of models
        test_cases: test cases

    Returns:
        results: evaluation results
    """
    results = {}

    for model in models:
        benchmark = CustomBenchmark(client, model)

        for test_case in test_cases:
            benchmark.add_test_case(test_case)

        benchmark.run()

        results[model] = {
            'accuracy': benchmark.get_accuracy(),
            'avg_response_time': benchmark.get_average_response_time()
        }

    return results

# Example
models = ["gpt-4", "gpt-3.5-turbo", "claude-3-opus"]
test_cases = [
    {'prompt': 'What is Python?', 'expected': 'programming language'},
    # more test cases...
]

results = evaluate_performance(client, models, test_cases)

for model, metrics in results.items():
    print(f"{model}:")
    print(f"  Accuracy: {metrics['accuracy']:.2%}")
    print(f"  Average response time: {metrics['avg_response_time']:.2f}s")

4.1.2 Cost

python
def calculate_total_cost(client, models, test_cases, model_pricing):
    """
    Compute the total cost.

    Args:
        client: LLM client
        models: list of models
        test_cases: test cases
        model_pricing: model pricing

    Returns:
        costs: cost per model
    """
    costs = {}

    for model in models:
        total_cost = 0

        for test_case in test_cases:
            # Assumes client.call returns a response dict that includes token usage
            response = client.call(model, test_case['prompt'])

            input_tokens = response['usage']['prompt_tokens']
            output_tokens = response['usage']['completion_tokens']

            cost = calculate_cost(
                input_tokens,
                output_tokens,
                model_pricing[model]
            )

            total_cost += cost

        costs[model] = total_cost

    return costs

# Example
model_pricing = {
    'gpt-4': {'input_price': 0.03, 'output_price': 0.06},
    'gpt-3.5-turbo': {'input_price': 0.0015, 'output_price': 0.002},
    'claude-3-opus': {'input_price': 0.015, 'output_price': 0.075}
}

costs = calculate_total_cost(client, models, test_cases, model_pricing)

for model, cost in costs.items():
    print(f"{model}: ${cost:.4f}")

4.1.3 Combined Evaluation

python
def comprehensive_evaluation(client, models, test_cases, model_pricing):
    """
    Combined evaluation.

    Args:
        client: LLM client
        models: list of models
        test_cases: test cases
        model_pricing: model pricing

    Returns:
        evaluation: evaluation results
    """
    # Performance evaluation
    performance = evaluate_performance(client, models, test_cases)

    # Cost evaluation
    costs = calculate_total_cost(client, models, test_cases, model_pricing)

    # Combined score
    evaluation = {}

    for model in models:
        accuracy = performance[model]['accuracy']
        response_time = performance[model]['avg_response_time']
        cost = costs[model]

        # Weighted score: inverse time and cost reward speed and cheapness.
        # The weights are illustrative; normalize the metrics for a fairer comparison.
        score = (accuracy * 0.5) + ((1 / response_time) * 0.3) + ((1 / cost) * 0.2)

        evaluation[model] = {
            'accuracy': accuracy,
            'response_time': response_time,
            'cost': cost,
            'score': score
        }

    # Sort by score, best first
    evaluation = dict(sorted(evaluation.items(), key=lambda x: x[1]['score'], reverse=True))

    return evaluation

# Example
evaluation = comprehensive_evaluation(client, models, test_cases, model_pricing)

print("Combined evaluation results:")
for rank, (model, metrics) in enumerate(evaluation.items(), 1):
    print(f"\nRank {rank}: {model}")
    print(f"  Accuracy: {metrics['accuracy']:.2%}")
    print(f"  Response time: {metrics['response_time']:.2f}s")
    print(f"  Cost: ${metrics['cost']:.4f}")
    print(f"  Combined score: {metrics['score']:.4f}")

4.2 Recommendations

Scenario 1: Code Generation

python
# Recommended models
code_generation_models = [
    {
        'model': 'gpt-4',
        'reason': 'strong code generation, high accuracy',
        'cost': 'high'
    },
    {
        'model': 'claude-3-opus',
        'reason': 'high code quality, detailed comments',
        'cost': 'high'
    },
    {
        'model': 'codellama',
        'reason': 'open source, can be deployed locally',
        'cost': 'low'
    }
]

Scenario 2: Text Generation

python
# Recommended models
text_generation_models = [
    {
        'model': 'gpt-4',
        'reason': 'high generation quality, creative',
        'cost': 'high'
    },
    {
        'model': 'claude-3-opus',
        'reason': 'fluent text, clear logic',
        'cost': 'high'
    },
    {
        'model': 'gpt-3.5-turbo',
        'reason': 'fast and cheap',
        'cost': 'low'
    }
]

Scenario 3: Question Answering

python
# Recommended models
qa_models = [
    {
        'model': 'gpt-4',
        'reason': 'strong comprehension, accurate answers',
        'cost': 'high'
    },
    {
        'model': 'claude-3-opus',
        'reason': 'long context, good for long documents',
        'cost': 'high'
    },
    {
        'model': 'gemini-pro',
        'reason': 'good multilingual support',
        'cost': 'medium'
    }
]
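Recommendation lists in the format above can feed a simple filter, e.g. shortlisting models within a cost budget. A minimal sketch; the `shortlist` helper and the tier ordering are illustrative assumptions, not part of any API:

```python
# A recommendation list in the same shape as the scenario examples
qa_models = [
    {'model': 'gpt-4', 'reason': 'strong comprehension, accurate answers', 'cost': 'high'},
    {'model': 'claude-3-opus', 'reason': 'long context, good for long documents', 'cost': 'high'},
    {'model': 'gemini-pro', 'reason': 'good multilingual support', 'cost': 'medium'},
]

def shortlist(recommendations, max_cost):
    """Keep only models whose cost tier is at or below max_cost."""
    tiers = {'low': 0, 'medium': 1, 'high': 2}
    return [r['model'] for r in recommendations
            if tiers[r['cost']] <= tiers[max_cost]]

print(shortlist(qa_models, 'medium'))  # ['gemini-pro']
```

The shortlisted models can then go through the performance and cost evaluation above before a final pick.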

Hands-on Task

Task: evaluate how 3 models perform on a specific task

Requirements

  1. Choose a task (code generation, text generation, question answering, etc.)
  2. Choose 3 models
  3. Design test cases
  4. Run the evaluation
  5. Analyze the results

Code skeleton

python
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Choose models (non-OpenAI models need their own clients)
models = ["gpt-4", "gpt-3.5-turbo", "claude-3-opus"]

# Design test cases
test_cases = [
    # TODO: design test cases
]

# Evaluate the models
results = comprehensive_evaluation(client, models, test_cases, model_pricing)

# Analyze the results
# TODO: analyze the results

Homework

Assignment 1: Model Evaluation Report

Topic: select the best model for a specific task

Requirements

  1. Choose a real task
  2. Choose 3-5 models
  3. Design test cases
  4. Run the evaluation
  5. Write an evaluation report

Assignment 2: Cost Optimization Analysis

Topic: analyze how to reduce LLM usage costs

Requirements

  1. Research pricing across models
  2. Analyze cost-optimization strategies
  3. Give optimization recommendations

Assignment 3: Performance Optimization Analysis

Topic: analyze how to improve LLM performance

Requirements

  1. Analyze the factors that affect performance
  2. Propose an optimization plan
  3. Implement the optimization code

References

Benchmarks

  1. MMLU: https://github.com/hendrycks/test

    • Multitask language understanding benchmark
  2. HumanEval: https://github.com/openai/human-eval

    • Code generation benchmark
  3. GSM8K: https://github.com/openai/grade-school-math

    • Mathematical reasoning benchmark

Evaluation Tools

  1. EleutherAI LM Evaluation Harness: https://github.com/EleutherAI/lm-evaluation-harness

    • LLM evaluation toolkit
  2. Promptfoo: https://promptfoo.dev/

    • Prompt evaluation tool

Online Resources

  1. Hugging Face Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

    • Open-source model leaderboard
  2. Papers with Code: https://paperswithcode.com/

    • Papers and code

Further Reading

Evaluation Methods

  • Automatic evaluation: scoring with automated metrics
  • Human evaluation: scoring by human raters
  • Hybrid evaluation: combining automatic and human evaluation
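A hybrid evaluation can be as simple as a weighted blend of an automatic metric and a human rating, both normalized to [0, 1]. A minimal sketch; the 0.4 weight and the score scales are illustrative assumptions:

```python
def hybrid_score(auto_score, human_score, auto_weight=0.4):
    """Blend an automatic metric with a human rating; both in [0, 1]."""
    return auto_weight * auto_score + (1 - auto_weight) * human_score

# Example: ROUGE-L of 0.62 from automatic evaluation, 4/5 from a human rater
score = hybrid_score(0.62, 4 / 5)
print(f"hybrid score: {score:.3f}")
```

In practice the weight is tuned to how much the automatic metric agrees with human judgment on a held-out sample.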

Evaluation Challenges

  • Subjectivity: evaluation criteria can be subjective
  • Cost: evaluation is expensive
  • Time: evaluation is time-consuming

Coming Up Next

In the next lesson we will wrap up the LLM module with a summary and project, review this week's material, and build an intelligent Q&A system hands-on.

