Customizable Auto Evaluation Framework
How to Use It
You just need a function call and you will get the output!

In the example above, the function get_retrieval_relevancy_output_async() evaluates retrieval relevancy, that is, how closely the retrieved content matches the user's query. The output includes a numerical score that indicates the level of relevance, along with the AI's reasoning explaining why it made that judgment. This type of evaluation is commonly known as AI-as-Judge.
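The original code example is not reproduced here; the sketch below shows how a call to such a function might look. The parameter names and the fields on the returned object are assumptions based on the description above, not the framework's actual signature.

```python
# Illustrative sketch only: get_retrieval_relevancy_output_async() is the function
# described above; its parameters and return fields are assumed here.
import asyncio

async def main():
    result = await get_retrieval_relevancy_output_async(
        query="What is the refund policy for annual plans?",
        retrieved_content="Annual plans can be refunded within 30 days of purchase.",
    )
    print(result.score)      # numerical score indicating the level of relevance
    print(result.reasoning)  # the AI-as-Judge's explanation for that score

asyncio.run(main())
```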

Let's look at another example that measures answer usefulness:
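That example is not shown here either; assuming the framework exposes a similarly named function for this metric (the name below is hypothetical), the call follows the same pattern:

```python
# Hypothetical function name, following the naming pattern of the relevancy example.
result = await get_answer_usefulness_output_async(
    query="What is the refund policy for annual plans?",
    answer="Annual plans can be refunded within 30 days of purchase.",
)
print(result.score, result.reasoning)
```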

The Framework
Now you can see how simple it is to call the auto evaluation functions. But are you curious about what's happening at the back end of each of these functions?
They share the same framework built on Langchain, which means it can be used on any platform that supports Langchain.
This framework has 3 major components. Still using retrieval relevancy as an example, let's look at each component in detail.
Component 1 - Specify Output Data Structure
A successful auto evaluation framework should be able to output any user-defined data structure. This is where many open-source libraries fall short: they often only produce string or binary outputs, which limits their real-world usefulness. In practical applications, evaluation metrics are frequently integers or other numerical values, requiring greater flexibility in how results are represented.
Such flexibility can be implemented as below:
Define output data structure as a class: The RetrievalRelevancy class defines a score (an integer with a value range from 1 to 3, each representing a specific level of relevance) and a reasoning field (a string that stores the AI-as-Judge's explanation for its decision).
Langchain follows format instructions: The evaluate_retrieval_relevancy_async() function is the core AI-as-Judge logic for retrieval relevancy. The sections highlighted in pink indicate the Langchain functions that ensure the output strictly follows the structure defined in the RetrievalRelevancy class.
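As a minimal sketch of this component, assuming Pydantic for the data class and Langchain's PydanticOutputParser to generate and enforce the format instructions (the field descriptions are illustrative):

```python
# A sketch of Component 1: the output data structure and the parser that enforces it.
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser

class RetrievalRelevancy(BaseModel):
    # Relevance level from 1 (not relevant) to 3 (highly relevant)
    score: int = Field(ge=1, le=3, description="Relevance level, from 1 to 3")
    # The AI-as-Judge's explanation for its decision
    reasoning: str = Field(description="Explanation for the assigned score")

# The parser produces format instructions for the prompt and validates that the
# LLM's output matches the RetrievalRelevancy class.
output_parser = PydanticOutputParser(pydantic_object=RetrievalRelevancy)
print(output_parser.get_format_instructions())
```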

Component 2 - Fully Customizable Prompt
As we know, to create real business value, AI solutions must be closely aligned with real business needs, which can differ greatly from what's described in most research papers, textbooks, or online tutorials. Therefore, having full control over the prompt is essential: it allows you to provide transparency to the business and to write customized instructions that precisely match your business needs. Most open-source libraries fail in real business scenarios due to limited customization options or a lack of transparency.
If evaluate_retrieval_relevancy_async() is the core function, then inside it, chain = prompt | llm | output_parser is the key logic. The output_parser controls the output format; the llm is the large language model used as the AI-as-Judge, such as OpenAI's GPT models or Google's Gemini models; and the prompt provides all the details that guide the LLM on how to make its judgment.
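A rough sketch of how that chain could be assembled, reusing the RetrievalRelevancy class and output_parser from Component 1. The model choice and the short prompt below are assumptions, not the original implementation:

```python
# A sketch of chain = prompt | llm | output_parser; swap in any Langchain-supported model.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed model choice

prompt = ChatPromptTemplate.from_template(
    "You are an evaluator judging retrieval relevancy.\n"
    "{format_instructions}\n"
    "Query: {query}\n"
    "Retrieved content: {retrieved_content}"
)

chain = prompt | llm | output_parser  # output_parser from Component 1

async def evaluate_retrieval_relevancy_async(query: str, retrieved_content: str) -> RetrievalRelevancy:
    # ainvoke runs the chain asynchronously and returns a parsed RetrievalRelevancy object.
    return await chain.ainvoke({
        "query": query,
        "retrieved_content": retrieved_content,
        "format_instructions": output_parser.get_format_instructions(),
    })
```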
As shown in the example prompt below, the prompt includes 4 parts:
AI's role definition
Detailed instructions
Examples of output
Input and output variables
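The original example prompt is not reproduced here; the template below is an illustrative sketch of how those 4 parts could be laid out. The wording is an assumption, and the doubled braces escape the literal JSON example so the string can still be used with ChatPromptTemplate.from_template:

```python
# Illustrative prompt template covering the four parts listed above.
RETRIEVAL_RELEVANCY_PROMPT = """
# 1. AI's role definition
You are an impartial judge evaluating how relevant retrieved content is to a user's query.

# 2. Detailed instructions
Score the relevancy on a scale of 1 to 3:
1 = not relevant, 2 = partially relevant, 3 = highly relevant.
Explain your reasoning before giving the final score.

# 3. Examples of output
{{"score": 3, "reasoning": "The retrieved passage directly answers the query."}}

# 4. Input and output variables
Query: {query}
Retrieved content: {retrieved_content}
{format_instructions}
"""
```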

When customizing your own prompts, you can explore Opik, which offers a variety of ready-to-use templates for common evaluation metrics. It's a great starting point!
Component 3 - Ensure Code Efficiency
Defining the functions as async makes a difference. It allows functions to run without blocking others, letting the program handle multiple tasks efficiently.
In Lady H.'s case, processing 300 records took 1 hour without asynchronous functions, but with asynchronous execution, the same task was completed in just 4 minutes.
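A minimal sketch of the concurrent pattern behind that speedup, reusing the evaluate_retrieval_relevancy_async() sketch from Component 2; the record fields are illustrative:

```python
# Run all evaluations concurrently instead of one at a time.
import asyncio

async def evaluate_all(records: list[dict]) -> list[RetrievalRelevancy]:
    tasks = [
        evaluate_retrieval_relevancy_async(r["query"], r["retrieved_content"])
        for r in records
    ]
    # asyncio.gather awaits all tasks concurrently, so total runtime is driven by
    # the slowest single call rather than the sum of all calls.
    return await asyncio.gather(*tasks)

# results = asyncio.run(evaluate_all(records))
```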
