🍃 Experiments with Small Dataset
The small financial dataset contains 30 records in total. As the table below shows, the evaluation scores for DSPy and AdalFlow are not directly comparable, even though both use the "G-Eval" score (the higher, the better). This is fine: since LLM usage costs money 💸, the main goal of these experiments is simply to get runnable code for both tools and a general sense of their efficiency and outputs.
In addition, Lady H. wanted to quickly compare DSPy's MIPROv2 with BootstrapFewShot. BootstrapFewShot is a lightweight DSPy optimizer that selects top-performing examples to form a few-shot prompt. DSPy recommends BootstrapFewShot for small datasets, so Lady H. checked whether it actually performs better in this case. It didn't: not only was its post-optimization score lower than MIPROv2's, but the optimized prompt was also left unchanged.
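In spirit, BootstrapFewShot runs the unoptimized program over the training examples, keeps the ones whose outputs the metric scores highly, and prepends those as few-shot demonstrations. Below is a minimal pure-Python sketch of that selection idea, not DSPy's actual implementation; the function names, the threshold, and the toy program/metric are all illustrative:

```python
def select_demos(examples, run_program, metric, max_demos=4, threshold=0.5):
    """Keep training examples whose program output scores at or above
    a metric threshold, up to max_demos, to use as few-shot demos."""
    demos = []
    for ex in examples:
        prediction = run_program(ex["question"], ex["context"])
        if metric(prediction, ex["answer"]) >= threshold:
            # Store the example together with the model's own output,
            # so the demo shows a full input -> output trace.
            demos.append({**ex, "prediction": prediction})
        if len(demos) >= max_demos:
            break
    return demos

# Toy usage: a stub "program" and an exact-match metric.
examples = [
    {"question": "q1", "context": "c1", "answer": "a1"},
    {"question": "q2", "context": "c2", "answer": "wrong"},
]
run = lambda q, c: "a1" if q == "q1" else "a2"
metric = lambda pred, gold: 1.0 if pred == gold else 0.0
demos = select_demos(examples, run, metric)  # only q1 passes the metric
```

This also hints at why the optimizer can leave the instruction text untouched on a small dataset: it only curates demonstrations, it never rewrites the prompt itself.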
| Tool | Data Split | Prompt Before | Prompt After | G-Eval |
| --- | --- | --- | --- | --- |
| DSPy with MIPROv2 | 10 training, 20 testing | "Answer questions based on retrieved context." | "Given a question related to financial or banking procedures for businesses and a set of retrieved contextual information, generate a comprehensive and precise response. Your answer should include a clear step-by-step reasoning process that explains how the information relates to the question, followed by a concise final answer. Ensure that your reasoning demonstrates an understanding of procedural details, legal considerations, and potential pitfalls. The goal is to produce an explanation that is both informative and transparent, enabling users to understand the basis for your conclusion. Use the retrieved context effectively to support your reasoning and answer formulation." | Before Opt: 0.46<br>After Opt: 0.53 |
| DSPy with BootstrapFewShot | 10 training, 20 testing | "Answer questions based on retrieved context." | "Answer questions based on retrieved context." | Before Opt: 0.47<br>After Opt: 0.51 |
| AdalFlow | 10 training, 10 validation, 10 testing | "Answer questions with short factoid answers. You will receive context(contain relevant facts). Think step by step." | "Have the cheque reissued to the proper payee, such as the business name, and then deposit it into your business account." | Validation: 0.8<br>Testing: 0.72 |
As highlighted in the table, AdalFlow's optimized prompt is an answer to one specific question rather than a general prompt applicable to all questions. This is a clear case of overfitting, which persisted even after Lady H. added regularization to the code.

Based on the insights gained here, Lady H. decided to apply DSPy’s MIPROv2 and AdalFlow to larger datasets.