Evaluations are a quantitative way to measure the performance of an LLM application. LLM behavior can be non-deterministic, and even small changes to a prompt, model, or input can significantly affect results. Evaluations provide a structured way to identify failures, compare versions, and build more reliable AI applications.
Running an evaluation in LangSmith requires three key components: a dataset of test inputs paired with reference outputs, a target function that wraps the application logic you want to evaluate, and one or more evaluators that score the target function's outputs.
This guide uses the LangSmith SDK to walk you through a basic evaluation that checks the correctness of LLM responses.
pip install -U langsmith openevals openai
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
export LANGSMITH_WORKSPACE_ID="<your-workspace-id>"
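(Optional) Before continuing, you can confirm that the keys are visible to Python. This is a minimal sketch using only the standard library; the file name check_env.py and the printed message are illustrative, not part of the LangSmith SDK:
# check_env.py - optional sanity check for the environment variables set above
import os

required = ["LANGSMITH_API_KEY", "OPENAI_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit("Missing environment variables: " + ", ".join(missing))
print("Environment looks good; LANGSMITH_TRACING =", os.environ.get("LANGSMITH_TRACING"))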
Create a file and add the following code, which programmatically creates a dataset in LangSmith and adds example questions paired with reference answers:
# dataset.py
from langsmith import Client
def main():
    client = Client()

    # Programmatically create a dataset in LangSmith
    dataset = client.create_dataset(
        dataset_name="Sample dataset",
        description="A sample dataset in LangSmith."
    )

    # Create examples
    examples = [
        {
            "inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
            "outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."},
        },
        {
            "inputs": {"question": "What is Earth's lowest point?"},
            "outputs": {"answer": "Earth's lowest point is The Dead Sea."},
        },
    ]

    # Add examples to the dataset
    client.create_examples(dataset_id=dataset.id, examples=examples)
    print("Created dataset:", dataset.name)

if __name__ == "__main__":
    main()
Run the file to create the dataset:
python dataset.py
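If you want to confirm the dataset was created, or to avoid an error from re-creating a dataset with the same name on a second run, you can read it back by name. A short sketch, assuming your installed langsmith SDK version exposes read_dataset and list_examples (the file name verify_dataset.py is illustrative):
# verify_dataset.py - optional check that "Sample dataset" exists and has examples
from langsmith import Client

client = Client()
dataset = client.read_dataset(dataset_name="Sample dataset")  # raises if the dataset does not exist
examples = list(client.list_examples(dataset_id=dataset.id))
print(f"Dataset '{dataset.name}' contains {len(examples)} examples")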
Define a target function that contains whatever you want to evaluate. In this guide, the target function makes a single LLM call to answer a question.
# eval.py
from langsmith import Client, wrappers
from openai import OpenAI
# Wrap the OpenAI client for LangSmith tracing
openai_client = wrappers.wrap_openai(OpenAI())
# Define the application logic you want to evaluate inside a target function
# The SDK will automatically send the inputs from the dataset to your target function
def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": "Answer the following question accurately"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content.strip()}
In this step, you tell LangSmith how to grade the answers your application generates. Import a prebuilt evaluation prompt (CORRECTNESS_PROMPT) from openevals, along with a helper that wraps it into an LLM-as-a-judge evaluator, which will score your application's output. The evaluator compares three pieces of information:
inputs: what was passed to the target function (for example, the question text).
outputs: what the target function returned (for example, the model-generated answer).
reference_outputs: the ground-truth answers you attached to each dataset example when you created the dataset.
from langsmith import Client, wrappers
from openai import OpenAI
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT
# Wrap the OpenAI client for LangSmith tracing
openai_client = wrappers.wrap_openai(OpenAI())
# Define the application logic you want to evaluate inside a target function
# The SDK will automatically send the inputs from the dataset to your target function
def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": "Answer the following question accurately"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content.strip()}

def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini",
        feedback_key="correctness",
    )
    return evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs,
    )
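You can also sanity-check the evaluator on its own with a hand-written example before running a full experiment. The snippet below is illustrative and could be appended temporarily to eval.py; the exact shape of the returned result (typically a score plus a comment explaining the judgment) depends on your openevals version:
# Optional: run the judge once on a hand-crafted example
result = correctness_evaluator(
    inputs={"question": "Which country is Mount Kilimanjaro located in?"},
    outputs={"answer": "Mount Kilimanjaro is in Tanzania."},
    reference_outputs={"answer": "Mount Kilimanjaro is located in Tanzania."},
)
print(result)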
To run the evaluation experiment, call the evaluate(...) function, which runs your target function on every example in the dataset, applies each evaluator to the resulting outputs, and logs the results as an experiment in LangSmith:
from langsmith import Client, wrappers
from openai import OpenAI
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT
# Wrap the OpenAI client for LangSmith tracing
openai_client = wrappers.wrap_openai(OpenAI())
# Define the application logic you want to evaluate inside a target function
# The SDK will automatically send the inputs from the dataset to your target function
def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": "Answer the following question accurately"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content.strip()}

def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini",
        feedback_key="correctness",
    )
    return evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs,
    )

# After running the evaluation, a link will be provided to view the results in LangSmith
def main():
    client = Client()
    experiment_results = client.evaluate(
        target,
        data="Sample dataset",
        evaluators=[
            correctness_evaluator,
            # can add multiple evaluators here
        ],
        experiment_prefix="first-eval-in-langsmith",
        max_concurrency=2,
    )
    print(experiment_results)

if __name__ == "__main__":
    main()
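Run the file with python eval.py; the printed results include a link to view the experiment in LangSmith. As the comment inside the evaluators list notes, you can pass more than one evaluator. As an illustration, here is a sketch of a simple heuristic (non-LLM) evaluator you could add alongside the judge; the name keyword_overlap_evaluator is hypothetical, and returning a dict with "key" and "score" is one commonly used result shape, though accepted formats may vary by SDK version:
def keyword_overlap_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    # Crude heuristic: fraction of reference-answer words that also appear in the generated answer
    ref_words = set(reference_outputs["answer"].lower().split())
    out_words = set(outputs["answer"].lower().split())
    score = len(ref_words & out_words) / len(ref_words) if ref_words else 0.0
    return {"key": "keyword_overlap", "score": score}

# To use it, extend the evaluators list in main():
# evaluators=[correctness_evaluator, keyword_overlap_evaluator]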