๐ŸƒExperiments with Bigger Datasets

The results of the experiments with the bigger datasets are summarized below:

These two datasets come from a different source than the previous small financial dataset. That source offers a much larger data pool, but many of its financial Q&A pairs feature similar questions and answers.

๐ŸŒป Click to see the data sampling code >>

Looking into this table, you might notice several problems:

  • Problem 1 - DSPy's Unchanged Instructions: In both datasets, DSPy's optimized instruction stays the same as the baseline instruction.

  • Problem 2 - AdalFlow's Overfitting: AdalFlow overfits as the training data grows larger.

  • Problem 3 - DSPy's Lower Test Performance: DSPy scores below the baseline on the test set even though they share the same instruction.

A likely reason for problem 3 is that DSPy loads the entire JSON file as its prompt, making the instruction only one part of it. Therefore, even though the instructions appear identical, DSPy's overall prompt differs from the baseline's.

The main cause of problems 1 and 2 is likely related to the data input. The source data contains many similar questions and answers, resulting in low diversity within the two samples. DSPy relies on data variance when updating prompts, but such variance was lacking in both datasets. AdalFlow performed better on the smaller dataset; however, as the dataset size increased, the lack of diversity caused all candidate prompts to yield similar losses. Consequently, text gradients amplified small random differences, leading to overfitting.
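One cheap way to diagnose the low-diversity problem described above is to measure average pairwise token overlap within a sample. The sketch below uses Jaccard similarity over word sets as an assumed, illustrative metric (it is not the measure used in the experiments); a mean close to 1.0 signals a sample of near-duplicate questions:

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def mean_pairwise_similarity(questions):
    """Average Jaccard similarity over all question pairs.

    Values near 1.0 indicate low diversity; values near 0.0 indicate
    that the questions share almost no vocabulary.
    """
    pairs = [(i, j) for i in range(len(questions))
             for j in range(i + 1, len(questions))]
    return sum(jaccard(questions[i], questions[j]) for i, j in pairs) / len(pairs)
```

Running such a check on a candidate training sample before optimization would flag the variance problem early, instead of discovering it through flat losses and overfit prompts.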

Key Takeaways ๐Ÿ’–

To ensure better prompt optimization results in the future, we can take a few actions:

  1. High-quality data input is key.

  2. The data input doesn't need to be large, but it must contain sufficient variance.

  3. DSPy remains the more mature framework in terms of cost efficiency (both time and money), evaluation flexibility, user experience, and its ability to reduce overfitting. Of course, it's even better if you can find improved open-source libraries in the future!

  4. If the evaluation step identifies what worked well and what didnโ€™t, then using those insights to guide prompt revisions might lead to more effective optimization results.
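Takeaway 2 can be acted on with a simple greedy filter that drops near-duplicate questions before sampling. This is a hedged sketch, not a recommendation from the experiments: the token-overlap measure and the 0.5 threshold are illustrative assumptions you would tune for real data:

```python
def diverse_subset(questions, max_sim=0.5):
    """Greedily keep a question only if it is not too similar
    (by word-set Jaccard overlap) to any already-kept question."""
    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb)

    kept = []
    for q in questions:
        if all(jaccard(q, k) < max_sim for k in kept):
            kept.append(q)
    return kept
```

A smaller but varied subset produced this way would give optimizers like DSPy and AdalFlow the data variance they need, in line with takeaways 1 and 2.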

Normally, Lady H. would continue experimenting until achieving stable and decent performance, but she was called by the Cosmos Banking Union for an urgent event and had to wrap up the experiments here. That's life ๐Ÿ˜‰. Now it's your turn to explore more!
