🍃 Experiments with Small Dataset
The small financial dataset contains 30 records in total. As the table below shows, the evaluation scores for DSPy and AdalFlow are not directly comparable, even though both use the "G-Eval" score (the higher, the better). This is fine: since LLM usage costs money 💸, the main goal of these experiments is simply to get runnable code for both tools and a general sense of their efficiency and outputs.
In addition, Lady H. wanted to quickly compare DSPy's MIPROv2 with BootstrapFewShot. BootstrapFewShot is a lightweight DSPy optimizer that selects top-performing examples to form a few-shot prompt. DSPy recommends BootstrapFewShot for small datasets, so Lady H. checked whether it actually performs better in this case. It didn't: not only was its post-optimization score lower than MIPROv2's, but the optimized prompt was also left unchanged.
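In spirit, BootstrapFewShot runs the unoptimized program over the training examples, keeps the ones whose outputs the metric scores highly, and prepends those as few-shot demonstrations. Below is a minimal pure-Python sketch of that selection idea, not DSPy's actual implementation; the function names, the threshold, and the toy program/metric are all illustrative:

```python
def select_demos(examples, run_program, metric, max_demos=4, threshold=0.5):
    """Keep training examples whose program output scores at or above
    a metric threshold, up to max_demos, to use as few-shot demos."""
    demos = []
    for ex in examples:
        prediction = run_program(ex["question"], ex["context"])
        if metric(prediction, ex["answer"]) >= threshold:
            # Store the example together with the model's own output,
            # so the demo shows a full input -> output trace.
            demos.append({**ex, "prediction": prediction})
        if len(demos) >= max_demos:
            break
    return demos

# Toy usage: a stub "program" and an exact-match metric.
examples = [
    {"question": "q1", "context": "c1", "answer": "a1"},
    {"question": "q2", "context": "c2", "answer": "wrong"},
]
run = lambda q, c: "a1" if q == "q1" else "a2"
metric = lambda pred, gold: 1.0 if pred == gold else 0.0
demos = select_demos(examples, run, metric)  # only q1 passes the metric
```

This also hints at why the optimizer can leave the instruction text untouched on a small dataset: it only curates demonstrations, it never rewrites the prompt itself.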
| Tool | Data Split | Prompt Before | Prompt After | G-Eval |
| --- | --- | --- | --- | --- |
| DSPy with MIPROv2 | 10 training, 20 testing | "Answer questions based on retrieved context." | "Given a question related to financial or banking procedures for businesses and a set of retrieved contextual information, generate a comprehensive and precise response. Your answer should include a clear step-by-step reasoning process that explains how the information relates to the question, followed by a concise final answer. Ensure that your reasoning demonstrates an understanding of procedural details, legal considerations, and potential pitfalls. The goal is to produce an explanation that is both informative and transparent, enabling users to understand the basis for your conclusion. Use the retrieved context effectively to support your reasoning and answer formulation." | Before Opt: 0.46<br>After Opt: 0.53 |
| DSPy with BootstrapFewShot | 10 training, 20 testing | "Answer questions based on retrieved context." | "Answer questions based on retrieved context." | Before Opt: 0.47<br>After Opt: 0.51 |
| AdalFlow | 10 training, 10 validation, 10 testing | "Answer questions with short factoid answers. You will receive context(contain relevant facts). Think step by step." | "Have the cheque reissued to the proper payee, such as the business name, and then deposit it into your business account." | Validation: 0.8<br>Testing: 0.72 |
As highlighted in the table, AdalFlow's optimized prompt is an answer to one specific question rather than a general prompt applicable to all questions. This is a clear case of overfitting, which persisted even after Lady H. added regularization to the code.

Based on the insights gained here, Lady H. decided to apply DSPy’s MIPROv2 and AdalFlow to larger datasets.