
Day 17: LLM Evaluation and Selection

Learning Objectives

  • Understand why LLM evaluation matters
  • Master LLM evaluation metrics
  • Master benchmarking methods
  • Be able to evaluate and select a suitable LLM

Course Content

1. Overview of LLM Evaluation

1.1 Why Evaluate

Why evaluation matters

  • Choose the right model
  • Optimize model performance
  • Reduce usage cost
  • Improve user experience

Evaluation dimensions

1. Performance: accuracy, response speed, etc.
2. Cost: API fees, deployment cost, etc.
3. Usability: API ergonomics, documentation quality, etc.
4. Reliability: stability, availability, etc.

1.2 Evaluation Workflow

Standard workflow

1. Define the evaluation goal
2. Choose evaluation metrics
3. Prepare test data
4. Run the evaluation
5. Analyze the results
6. Make a choice
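The workflow above can be wired together in a small driver. This is a minimal sketch under stated assumptions: `generate` stands in for the model under test, `exact_match` is the chosen metric, and all names are illustrative rather than a fixed API.

```python
def run_evaluation(generate, test_data, metric):
    """Steps 3-5: run the model on the test data and score the outputs."""
    outputs = [generate(case["prompt"]) for case in test_data]  # step 4
    labels = [case["expected"] for case in test_data]
    return metric(outputs, labels)                              # step 5

def exact_match(outputs, labels):
    """Step 2: the chosen metric -- fraction of exact matches."""
    hits = sum(1 for o, l in zip(outputs, labels) if o == l)
    return hits / len(labels)

# Step 3: prepare test data (toy example)
test_data = [
    {"prompt": "2+2=?", "expected": "4"},
    {"prompt": "capital of France?", "expected": "Paris"},
]

# A fake model standing in for a real LLM call
fake_model = lambda prompt: {"2+2=?": "4"}.get(prompt, "unknown")

score = run_evaluation(fake_model, test_data, exact_match)
print(f"exact-match accuracy: {score:.2%}")  # 50.00%
```

Steps 1 and 6 (defining the goal and making the choice) stay human decisions; the code only automates the middle of the loop.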

2. LLM Evaluation Metrics

2.1 Accuracy Metrics

2.1.1 Text Generation Quality

BLEU score

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, candidate):
    """
    Compute the BLEU score.

    Args:
        reference: reference text
        candidate: candidate text

    Returns:
        bleu_score: BLEU score
    """
    reference_tokens = reference.split()
    candidate_tokens = candidate.split()

    # Smoothing avoids zero scores when higher-order n-grams do not match
    smoothing = SmoothingFunction().method1
    bleu_score = sentence_bleu(
        [reference_tokens],
        candidate_tokens,
        smoothing_function=smoothing
    )

    return bleu_score

# Example (identical sentences, so the score is 1.0)
reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumps over the lazy dog"
bleu = calculate_bleu(reference, candidate)
print(f"BLEU score: {bleu:.4f}")

ROUGE score

python
from rouge import Rouge

def calculate_rouge(reference, candidate):
    """
    Compute ROUGE scores.

    Args:
        reference: reference text
        candidate: candidate text

    Returns:
        rouge_scores: ROUGE scores
    """
    rouge = Rouge()
    scores = rouge.get_scores(candidate, reference)
    return scores[0]

# Example (identical sentences, so the F-score is 1.0)
reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumps over the lazy dog"
rouge = calculate_rouge(reference, candidate)
print(f"ROUGE-L score: {rouge['rouge-l']['f']:.4f}")

2.1.2 Task Accuracy

python
def calculate_accuracy(predictions, labels):
    """
    Compute accuracy.

    Args:
        predictions: predicted results
        labels: ground-truth labels

    Returns:
        accuracy: accuracy
    """
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    accuracy = correct / len(predictions)
    return accuracy

# Example (sentiment classification)
predictions = ["positive", "negative", "neutral", "positive"]
labels = ["positive", "negative", "neutral", "negative"]
accuracy = calculate_accuracy(predictions, labels)
print(f"Accuracy: {accuracy:.2%}")

2.1.3 Code Generation Quality

python
import ast

def validate_code(code):
    """
    Validate code syntax.

    Args:
        code: code string

    Returns:
        is_valid: whether the code is syntactically valid
        error: error message, if any
    """
    try:
        ast.parse(code)
        return True, None
    except SyntaxError as e:
        return False, str(e)

# Example
code = """
def add(a, b):
    return a + b
"""

is_valid, error = validate_code(code)
if is_valid:
    print("Code syntax is valid")
else:
    print(f"Code syntax error: {error}")

2.2 Efficiency Metrics

2.2.1 Response Time

python
import time

def measure_response_time(client, model, prompt):
    """
    Measure response time.

    Args:
        client: LLM client
        model: model name
        prompt: input prompt

    Returns:
        response_time: response time in seconds
        response: response content
    """
    start_time = time.time()
    response = client.get_text(model, prompt)
    end_time = time.time()

    response_time = end_time - start_time
    return response_time, response

# Example (assumes an initialized `client`)
response_time, response = measure_response_time(client, "gpt-4", "Hello")
print(f"Response time: {response_time:.2f}s")

2.2.2 Throughput

python
def measure_throughput(client, model, prompts, duration=60):
    """
    Measure throughput.

    Args:
        client: LLM client
        model: model name
        prompts: list of prompts
        duration: test duration in seconds

    Returns:
        throughput: throughput in requests per second
    """
    start_time = time.time()
    count = 0

    for prompt in prompts:
        if time.time() - start_time > duration:
            break

        client.get_text(model, prompt)
        count += 1

    elapsed_time = time.time() - start_time
    throughput = count / elapsed_time

    return throughput

# Example
prompts = ["Hello"] * 100
throughput = measure_throughput(client, "gpt-4", prompts)
print(f"Throughput: {throughput:.2f} requests/s")

2.3 Cost Metrics

2.3.1 API Cost

python
def calculate_cost(input_tokens, output_tokens, model_pricing):
    """
    Compute API cost.

    Args:
        input_tokens: number of input tokens
        output_tokens: number of output tokens
        model_pricing: model pricing (per 1,000 tokens)

    Returns:
        cost: cost in USD
    """
    input_cost = (input_tokens / 1000) * model_pricing['input_price']
    output_cost = (output_tokens / 1000) * model_pricing['output_price']
    total_cost = input_cost + output_cost

    return total_cost

# Example
model_pricing = {
    'input_price': 0.03,  # price per 1,000 tokens
    'output_price': 0.06
}

input_tokens = 1000
output_tokens = 500

cost = calculate_cost(input_tokens, output_tokens, model_pricing)
print(f"Cost: ${cost:.4f}")

2.3.2 Deployment Cost

python
def calculate_deployment_cost(gpu_hours, gpu_cost_per_hour):
    """
    Compute deployment cost.

    Args:
        gpu_hours: number of GPU hours used
        gpu_cost_per_hour: GPU cost per hour

    Returns:
        cost: cost in USD
    """
    cost = gpu_hours * gpu_cost_per_hour
    return cost

# Example
gpu_hours = 24 * 30  # 30 days, 24 hours a day
gpu_cost_per_hour = 0.5  # $0.50 per hour

cost = calculate_deployment_cost(gpu_hours, gpu_cost_per_hour)
print(f"Monthly deployment cost: ${cost:.2f}")

3. Benchmarking

3.1 Common Benchmarks

3.1.1 MMLU (Massive Multitask Language Understanding)

python
def evaluate_mmlu(client, model, questions):
    """
    Evaluate on an MMLU-style benchmark.

    Args:
        client: LLM client
        model: model name
        questions: list of questions

    Returns:
        accuracy: accuracy
    """
    correct = 0
    total = len(questions)

    for question in questions:
        prompt = f"""
        Answer the following multiple-choice question. Output only the option letter.

        Question: {question['question']}
        A. {question['A']}
        B. {question['B']}
        C. {question['C']}
        D. {question['D']}

        Answer:
        """

        response = client.get_text(model, prompt).strip()

        # Compare the first character so trailing text does not break the match
        if response[:1] == question['answer']:
            correct += 1

    accuracy = correct / total
    return accuracy

# Example
questions = [
    {
        'question': 'Which keyword is used to define a function in Python?',
        'A': 'class',
        'B': 'def',
        'C': 'function',
        'D': 'func',
        'answer': 'B'
    },
    # more questions...
]

accuracy = evaluate_mmlu(client, "gpt-4", questions)
print(f"MMLU accuracy: {accuracy:.2%}")

3.1.2 HumanEval (Code Generation)

python
import re

def evaluate_human_eval(client, model, problems):
    """
    Evaluate on a HumanEval-style benchmark.

    Args:
        client: LLM client
        model: model name
        problems: list of problems

    Returns:
        pass_rate: pass rate
    """
    passed = 0
    total = len(problems)

    for problem in problems:
        prompt = f"""
        Complete the following Python function:

        ```python
        {problem['prompt']}
        ```
        """

        response = client.get_text(model, prompt)

        # Extract the code
        code = extract_code(response)

        # Validate and test the code
        if validate_and_test(code, problem['tests']):
            passed += 1

    pass_rate = passed / total
    return pass_rate

def extract_code(text):
    """Extract the first fenced code block; fall back to the raw text."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

def validate_and_test(code, tests):
    """Run the code and its assert-style tests; True only if all pass.

    Warning: exec on model-generated code is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
        return True
    except Exception:
        return False

# Example
problems = [
    {
        'prompt': 'def add(a, b):\n    """Return the sum of two numbers."""\n    ',
        'tests': ['assert add(1, 2) == 3', 'assert add(-1, 1) == 0']
    },
    # more problems...
]

pass_rate = evaluate_human_eval(client, "gpt-4", problems)
print(f"HumanEval pass rate: {pass_rate:.2%}")

3.1.3 GSM8K (Mathematical Reasoning)

python
def evaluate_gsm8k(client, model, problems):
    """
    Evaluate on a GSM8K-style benchmark.

    Args:
        client: LLM client
        model: model name
        problems: list of problems

    Returns:
        accuracy: accuracy
    """
    correct = 0
    total = len(problems)

    for problem in problems:
        prompt = f"""
        Solve the following math problem step by step, then give the final
        answer on a line starting with "Answer:".

        Problem: {problem['question']}
        """

        response = client.get_text(model, prompt)

        # Extract the answer
        answer = extract_answer(response)

        if answer == problem['answer']:
            correct += 1

    accuracy = correct / total
    return accuracy

def extract_answer(text):
    """Naive heuristic: take the text after the last "Answer:" marker."""
    marker = "Answer:"
    if marker in text:
        return text.rsplit(marker, 1)[1].strip()
    # Fall back to the last non-empty line
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

# Example
problems = [
    {
        'question': 'A farm has 20 chickens and rabbits in total, with 56 legs altogether. How many chickens and how many rabbits are there?',
        'answer': '12 chickens and 8 rabbits'
    },
    # more problems...
]

accuracy = evaluate_gsm8k(client, "gpt-4", problems)
print(f"GSM8K accuracy: {accuracy:.2%}")

3.2 Custom Benchmarks

python
class CustomBenchmark:
    """Custom benchmark class."""

    def __init__(self, client, model):
        self.client = client
        self.model = model
        self.test_cases = []  # fix: initialize the list that add_test_case appends to
        self.results = []

    def add_test_case(self, test_case):
        """
        Add a test case.

        Args:
            test_case: test case
        """
        self.test_cases.append(test_case)

    def run(self):
        """
        Run the benchmark.

        Returns:
            results: test results
        """
        results = []

        for test_case in self.test_cases:
            result = self._run_test_case(test_case)
            results.append(result)

        self.results = results
        return results

    def _run_test_case(self, test_case):
        """
        Run a single test case.

        Args:
            test_case: test case

        Returns:
            result: test result
        """
        start_time = time.time()

        response = self.client.get_text(
            self.model,
            test_case['prompt']
        )

        end_time = time.time()
        response_time = end_time - start_time

        # Evaluate the result
        is_correct = self._evaluate(response, test_case)

        result = {
            'test_case': test_case,
            'response': response,
            'response_time': response_time,
            'is_correct': is_correct
        }

        return result

    def _evaluate(self, response, test_case):
        """
        Evaluate a response.

        Args:
            response: response
            test_case: test case

        Returns:
            is_correct: whether the response is correct
        """
        # Simple check: the expected text appears in the response
        return test_case['expected'] in response

    def get_accuracy(self):
        """Return the accuracy."""
        correct = sum(1 for r in self.results if r['is_correct'])
        accuracy = correct / len(self.results)
        return accuracy

    def get_average_response_time(self):
        """Return the average response time."""
        total_time = sum(r['response_time'] for r in self.results)
        avg_time = total_time / len(self.results)
        return avg_time

# Usage example
benchmark = CustomBenchmark(client, "gpt-4")

# Add a test case (expected text is matched as a substring of the response)
benchmark.add_test_case({
    'prompt': 'What is Python?',
    'expected': 'programming language'
})

# Run the benchmark
results = benchmark.run()

# Collect the results
accuracy = benchmark.get_accuracy()
avg_time = benchmark.get_average_response_time()

print(f"Accuracy: {accuracy:.2%}")
print(f"Average response time: {avg_time:.2f}s")

4. Model Selection

4.1 Selection Criteria

4.1.1 Performance

python
def evaluate_performance(client, models, test_cases):
    """
    Evaluate model performance.

    Args:
        client: LLM client
        models: list of models
        test_cases: test cases

    Returns:
        results: evaluation results
    """
    results = {}

    for model in models:
        benchmark = CustomBenchmark(client, model)

        for test_case in test_cases:
            benchmark.add_test_case(test_case)

        benchmark.run()

        results[model] = {
            'accuracy': benchmark.get_accuracy(),
            'avg_response_time': benchmark.get_average_response_time()
        }

    return results

# Example
models = ["gpt-4", "gpt-3.5-turbo", "claude-3-opus"]
test_cases = [
    {'prompt': 'What is Python?', 'expected': 'programming language'},
    # more test cases...
]

results = evaluate_performance(client, models, test_cases)

for model, metrics in results.items():
    print(f"{model}:")
    print(f"  Accuracy: {metrics['accuracy']:.2%}")
    print(f"  Average response time: {metrics['avg_response_time']:.2f}s")

4.1.2 Cost

python
def calculate_total_cost(client, models, test_cases, model_pricing):
    """
    Compute the total cost.

    Args:
        client: LLM client
        models: list of models
        test_cases: test cases
        model_pricing: model pricing

    Returns:
        costs: cost per model
    """
    costs = {}

    for model in models:
        total_cost = 0

        for test_case in test_cases:
            # Assumes client.call returns a response dict that includes token usage
            response = client.call(model, test_case['prompt'])

            input_tokens = response['usage']['prompt_tokens']
            output_tokens = response['usage']['completion_tokens']

            cost = calculate_cost(
                input_tokens,
                output_tokens,
                model_pricing[model]
            )

            total_cost += cost

        costs[model] = total_cost

    return costs

# Example
model_pricing = {
    'gpt-4': {'input_price': 0.03, 'output_price': 0.06},
    'gpt-3.5-turbo': {'input_price': 0.0015, 'output_price': 0.002},
    'claude-3-opus': {'input_price': 0.015, 'output_price': 0.075}
}

costs = calculate_total_cost(client, models, test_cases, model_pricing)

for model, cost in costs.items():
    print(f"{model}: ${cost:.4f}")

4.1.3 Combined Evaluation

python
def comprehensive_evaluation(client, models, test_cases, model_pricing):
    """
    Combined evaluation.

    Args:
        client: LLM client
        models: list of models
        test_cases: test cases
        model_pricing: model pricing

    Returns:
        evaluation: evaluation results
    """
    # Performance evaluation
    performance = evaluate_performance(client, models, test_cases)

    # Cost evaluation
    costs = calculate_total_cost(client, models, test_cases, model_pricing)

    # Combined score
    evaluation = {}

    for model in models:
        accuracy = performance[model]['accuracy']
        response_time = performance[model]['avg_response_time']
        cost = costs[model]

        # Weighted score: inverse time and cost reward speed and cheapness.
        # The weights are illustrative; normalize the metrics for a fairer comparison.
        score = (accuracy * 0.5) + ((1 / response_time) * 0.3) + ((1 / cost) * 0.2)

        evaluation[model] = {
            'accuracy': accuracy,
            'response_time': response_time,
            'cost': cost,
            'score': score
        }

    # Sort by score, best first
    evaluation = dict(sorted(evaluation.items(), key=lambda x: x[1]['score'], reverse=True))

    return evaluation

# Example
evaluation = comprehensive_evaluation(client, models, test_cases, model_pricing)

print("Combined evaluation results:")
for rank, (model, metrics) in enumerate(evaluation.items(), 1):
    print(f"\nRank {rank}: {model}")
    print(f"  Accuracy: {metrics['accuracy']:.2%}")
    print(f"  Response time: {metrics['response_time']:.2f}s")
    print(f"  Cost: ${metrics['cost']:.4f}")
    print(f"  Combined score: {metrics['score']:.4f}")

4.2 Recommendations

Scenario 1: Code Generation

python
# Recommended models
code_generation_models = [
    {
        'model': 'gpt-4',
        'reason': 'strong code generation, high accuracy',
        'cost': 'high'
    },
    {
        'model': 'claude-3-opus',
        'reason': 'high code quality, detailed comments',
        'cost': 'high'
    },
    {
        'model': 'codellama',
        'reason': 'open source, can be deployed locally',
        'cost': 'low'
    }
]

Scenario 2: Text Generation

python
# Recommended models
text_generation_models = [
    {
        'model': 'gpt-4',
        'reason': 'high generation quality, creative',
        'cost': 'high'
    },
    {
        'model': 'claude-3-opus',
        'reason': 'fluent text, clear logic',
        'cost': 'high'
    },
    {
        'model': 'gpt-3.5-turbo',
        'reason': 'fast and cheap',
        'cost': 'low'
    }
]

Scenario 3: Question Answering

python
# Recommended models
qa_models = [
    {
        'model': 'gpt-4',
        'reason': 'strong comprehension, accurate answers',
        'cost': 'high'
    },
    {
        'model': 'claude-3-opus',
        'reason': 'long context, good for long documents',
        'cost': 'high'
    },
    {
        'model': 'gemini-pro',
        'reason': 'good multilingual support',
        'cost': 'medium'
    }
]
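Recommendation lists in the format above can feed a simple filter, e.g. shortlisting models within a cost budget. A minimal sketch; the `shortlist` helper and the tier ordering are illustrative assumptions, not part of any API:

```python
# A recommendation list in the same shape as the scenario examples
qa_models = [
    {'model': 'gpt-4', 'reason': 'strong comprehension, accurate answers', 'cost': 'high'},
    {'model': 'claude-3-opus', 'reason': 'long context, good for long documents', 'cost': 'high'},
    {'model': 'gemini-pro', 'reason': 'good multilingual support', 'cost': 'medium'},
]

def shortlist(recommendations, max_cost):
    """Keep only models whose cost tier is at or below max_cost."""
    tiers = {'low': 0, 'medium': 1, 'high': 2}
    return [r['model'] for r in recommendations
            if tiers[r['cost']] <= tiers[max_cost]]

print(shortlist(qa_models, 'medium'))  # ['gemini-pro']
```

The shortlisted models can then go through the performance and cost evaluation above before a final pick.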

Hands-on Task

Task: evaluate how 3 models perform on a specific task

Requirements

  1. Choose a task (code generation, text generation, question answering, etc.)
  2. Choose 3 models
  3. Design test cases
  4. Run the evaluation
  5. Analyze the results

Code skeleton

python
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Choose models (non-OpenAI models need their own clients)
models = ["gpt-4", "gpt-3.5-turbo", "claude-3-opus"]

# Design test cases
test_cases = [
    # TODO: design test cases
]

# Evaluate the models
results = comprehensive_evaluation(client, models, test_cases, model_pricing)

# Analyze the results
# TODO: analyze the results

Homework

Assignment 1: Model Evaluation Report

Topic: select the best model for a specific task

Requirements

  1. Choose a real task
  2. Choose 3-5 models
  3. Design test cases
  4. Run the evaluation
  5. Write an evaluation report

Assignment 2: Cost Optimization Analysis

Topic: analyze how to reduce LLM usage costs

Requirements

  1. Research pricing across models
  2. Analyze cost-optimization strategies
  3. Give optimization recommendations

Assignment 3: Performance Optimization Analysis

Topic: analyze how to improve LLM performance

Requirements

  1. Analyze the factors that affect performance
  2. Propose an optimization plan
  3. Implement the optimization code

References

Benchmarks

  1. MMLU: https://github.com/hendrycks/test

    • Multitask language understanding benchmark
  2. HumanEval: https://github.com/openai/human-eval

    • Code generation benchmark
  3. GSM8K: https://github.com/openai/grade-school-math

    • Mathematical reasoning benchmark

Evaluation Tools

  1. EleutherAI LM Evaluation Harness: https://github.com/EleutherAI/lm-evaluation-harness

    • LLM evaluation toolkit
  2. Promptfoo: https://promptfoo.dev/

    • Prompt evaluation tool

Online Resources

  1. Hugging Face Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

    • Open-source model leaderboard
  2. Papers with Code: https://paperswithcode.com/

    • Papers and code

Further Reading

Evaluation Methods

  • Automatic evaluation: scoring with automated metrics
  • Human evaluation: scoring by human raters
  • Hybrid evaluation: combining automatic and human evaluation
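A hybrid evaluation can be as simple as a weighted blend of an automatic metric and a human rating, both normalized to [0, 1]. A minimal sketch; the 0.4 weight and the score scales are illustrative assumptions:

```python
def hybrid_score(auto_score, human_score, auto_weight=0.4):
    """Blend an automatic metric with a human rating; both in [0, 1]."""
    return auto_weight * auto_score + (1 - auto_weight) * human_score

# Example: ROUGE-L of 0.62 from automatic evaluation, 4/5 from a human rater
score = hybrid_score(0.62, 4 / 5)
print(f"hybrid score: {score:.3f}")
```

In practice the weight is tuned to how much the automatic metric agrees with human judgment on a held-out sample.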

Evaluation Challenges

  • Subjectivity: evaluation criteria can be subjective
  • Cost: evaluation is expensive
  • Time: evaluation is time-consuming

Coming Up Next

In the next lesson we will wrap up the LLM module with a summary and project, review this week's material, and build an intelligent Q&A system hands-on.

