Iterations to a Trustworthy Pipeline
AI-as-Judge isn't perfect, but it can deliver more consistent, objective, detailed, and faster results than manual evaluation. Therefore, it's worth investing the effort to turn this framework into a trustworthy pipeline rather than ignoring its potential.
Development Stage
To build a trustworthy auto evaluation pipeline, expert-in-the-loop and iterative improvements during the development stage are essential.

The auto evaluation pipeline takes as input the user query, the reference answer (the expected correct answer), and multiple versions of generated answers. It then compares each version against the reference answer and outputs an evaluation score along with the reasoning behind it. The number of versions is not limited to two.
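As a concrete illustration, here is a minimal sketch of how such a pipeline could be wired up. The helper call_llm, the prompt wording, and the 1-to-5 scoring rubric are all assumptions for illustration, not a fixed implementation.

```python
import json

# Hypothetical helper: assumed to wrap whatever LLM client you use
# (OpenAI, Anthropic, a local model, ...) and return the raw text reply.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

# Illustrative judge prompt; the wording and 1-5 rubric are assumptions.
JUDGE_PROMPT = """\
You are an impartial judge. Compare the generated answer with the reference
answer for the given user query.

User query:
{query}

Reference answer:
{reference}

Generated answer:
{candidate}

Return a JSON object with two keys:
  "score": an integer from 1 (useless) to 5 (fully correct and useful)
  "reasoning": a short explanation of the score
"""

def evaluate_versions(query: str, reference: str, candidates: dict[str, str]) -> dict:
    """Judge every candidate version against the reference answer.

    `candidates` maps a version label (e.g. "v1", "v2", "v3") to its
    generated answer -- the pipeline is not limited to two versions.
    """
    results = {}
    for version, answer in candidates.items():
        raw = call_llm(JUDGE_PROMPT.format(query=query,
                                           reference=reference,
                                           candidate=answer))
        # In practice the reply may need stricter parsing or retries;
        # here we assume the model returns clean JSON.
        results[version] = json.loads(raw)
    return results
```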
In each iteration, business experts review the auto evaluation's output, providing detailed feedback on what worked well, what didn't, and why. Based on this feedback, we adjust the evaluation prompt to improve performance. These iterations continue until the auto evaluation achieves a satisfactory level of accuracy.
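One lightweight way to fold that feedback into the prompt is to keep expert-derived rubric rules as data and append them on each iteration. The sample rule and function below are purely illustrative, not a prescribed approach.

```python
# Hypothetical example of encoding expert feedback as explicit rubric rules.
# Suppose the expert reports: "answers that omit the mandatory disclaimer
# were scored too high" -- we turn that into a rule the judge must follow.
EXPERT_RULES = [
    "If the generated answer omits a legally required disclaimer that the "
    "reference answer contains, the score must not exceed 2.",
]

def build_judge_prompt(base_prompt: str, rules: list[str]) -> str:
    """Append expert-derived scoring rules to the base evaluation prompt."""
    if not rules:
        return base_prompt
    rule_block = "\n".join(f"- {r}" for r in rules)
    return f"{base_prompt}\n\nAdditional scoring rules from domain experts:\n{rule_block}"
```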

During this stage, having diligent business experts who are willing to share their knowledge openly is both incredibly fortunate and critically important. Lady H. once collaborated with such a professional, while others tended to complain about AI or delivered slow, inconsistent evaluation results... So expert-in-the-loop also takes luck!
Deployment Stage
When the auto evaluation pipeline is ready for use, the input and output data formats stay the same.

You can also integrate chart generation into the pipeline, allowing it to summarize results in visuals that make version comparison and selection more straightforward.
For example, the chart below shows the answer usefulness of two versions of answers:
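A minimal sketch of how such a chart could be produced, assuming matplotlib and made-up average scores aggregated from the judge's output (the numbers are placeholders, not real results):

```python
import matplotlib.pyplot as plt

# Hypothetical average usefulness per version, e.g. aggregated from the
# evaluate_versions() output sketched above. Placeholder values only.
avg_scores = {"v1": 3.4, "v2": 4.1}

plt.bar(list(avg_scores.keys()), list(avg_scores.values()))
plt.ylabel("Average usefulness score (1-5)")
plt.title("Answer usefulness by version")
plt.ylim(0, 5)
plt.show()
```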

See the full notebook of version comparison using auto evaluation >>