There's always a reason to run a model comparison: a new model or finetune drops, a new prompting methodology comes out, or it's time to evaluate an in-house model. Oxen's diff tool lets you evaluate and compare model outputs, from small tests all the way to large benchmarks. To follow along with this example, we'll use data from the BoolQ repo, which was generated with this notebook.
oxen clone https://www.oxen.ai/datasets/boolq-llama-gemma
cd boolq-llama-gemma

Our Data

In this repo, we're comparing the outputs of the Gemma-2b-Instruct model and the Llama-7b-chat-hf model on the BoolQ benchmark. Let's check out the structure of these datasets:
oxen df gemma.jsonl
Each dataframe has columns for the index, the context for the question, the prompt, and the ground truth label (validation_response). We can see here that our models didn't output exactly "True" or "False" as they were told to, so we added a processed_response column to show a clean difference between the outputs.
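The source does not show how processed_response was derived, so the following is only a minimal sketch of one way such a cleanup step could work. The function name normalize_response and the "Unknown" fallback label are hypothetical; only the "True", "False", and "Both" values appear in the data.

```python
import re

def normalize_response(raw: str) -> str:
    """Map a free-form model answer to 'True', 'False', 'Both', or 'Unknown'.

    Hypothetical helper: the actual cleanup used to build the
    processed_response column is not shown in this example.
    """
    # Split the lowercased answer into alphabetic words so that
    # punctuation ("True.") and whitespace do not affect matching.
    words = set(re.findall(r"[a-z]+", raw.lower()))
    has_true = "true" in words
    has_false = "false" in words
    if has_true and has_false:
        return "Both"      # the model hedged and gave both answers
    if has_true:
        return "True"
    if has_false:
        return "False"
    return "Unknown"       # assumed label for answers with neither word

print(normalize_response("The answer is True."))
```

Matching on whole words rather than substrings keeps an answer such as "untrue" from being counted as "True".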

Comparing Model Results

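What we mainly care about is how these models compare to each other, so we want to know where the processed_response values differ between the two files. In the command below, id, context, prompt, and validation_response are the key columns (-k) used to match rows, and processed_response is the column being compared (-c).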
oxen diff gemma.jsonl llama_chat.jsonl -k id,context,prompt,validation_response -c processed_response
Row changes: 
   Δ 360 (modified)

shape: (360, 7)
+------+-----------------------------------+-----------------------------------+---------------------+-------------------------+--------------------------+-------------------+
| id   | context                           | prompt                            | validation_response | processed_response.left | processed_response.right | .oxen.diff.status |
| ---  | ---                               | ---                               | ---                 | ---                     | ---                      | ---               |
| i64  | str                               | str                               | str                 | str                     | str                      | str               |
+------+-----------------------------------+-----------------------------------+---------------------+-------------------------+--------------------------+-------------------+
| 0    | All biomass goes through at leas… | does ethanol take more energy ma… | False               | False                   | True                     | modified          |
| 12   | Shower gels for men may contain … | is it bad to wash your hair with… | True                | False                   | True                     | modified          |
| 25   | The drinking age in Wisconsin is… | can you drink alcohol with your … | True                | True                    | Both                     | modified          |
| 38   | The carbon-hydrogen bond (C--H b… | can carbon form polar covalent b… | False               | False                   | True                     | modified          |
| …    | …                                 | …                                 | …                   | …                       | …                        | …                 |
| 3213 | It is illegal to sell packaged l… | are liquor stores in oklahoma op… | False               | False                   | True                     | modified          |
| 3216 | Flash memory cards, e.g., Secure… | is a memory card the same as a f… | False               | False                   | True                     | modified          |
| 3220 | Rumors of this chemical's existe… | can pool water change color if y… | False               | False                   | True                     | modified          |
| 3224 | Before the 1999--2000 season awa… | do away goals count in the leagu… | False               | False                   | True                     | modified          |
+------+-----------------------------------+-----------------------------------+---------------------+-------------------------+--------------------------+-------------------+
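To make the idea of a keyed diff concrete, here is a rough Python sketch of the same comparison using only the standard library. It is not how oxen diff is implemented; load_jsonl and keyed_diff are hypothetical helper names, and this version only reports modified rows (it ignores rows that exist in just one file).

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts (one per non-empty line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def keyed_diff(left_rows, right_rows, keys, target):
    """Return rows whose key columns match but whose target column differs."""
    # Index the right-hand rows by their key tuple for constant-time lookup.
    right_by_key = {tuple(r[k] for k in keys): r for r in right_rows}
    modified = []
    for left in left_rows:
        right = right_by_key.get(tuple(left[k] for k in keys))
        if right is not None and left[target] != right[target]:
            row = {k: left[k] for k in keys}
            row[f"{target}.left"] = left[target]
            row[f"{target}.right"] = right[target]
            modified.append(row)
    return modified

# Tiny inline example; with the real files you would instead call
# load_jsonl("gemma.jsonl") and load_jsonl("llama_chat.jsonl").
left = [
    {"id": 0, "validation_response": "False", "processed_response": "False"},
    {"id": 1, "validation_response": "True", "processed_response": "True"},
]
right = [
    {"id": 0, "validation_response": "False", "processed_response": "True"},
    {"id": 1, "validation_response": "True", "processed_response": "True"},
]
diff = keyed_diff(left, right, ["id", "validation_response"], "processed_response")
print(len(diff), diff[0]["processed_response.left"], diff[0]["processed_response.right"])
```

The output column names mirror the processed_response.left and processed_response.right columns in the table above.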

View Results in Oxen UI

These results are also available in the Oxen UI, which makes it a bit easier to grok what's going on than the command line. You can view them by going to the compare tab in this repository. From this, we can see that out of the 3270 total samples, our models disagreed on 360, or roughly 11% of the dataset. In some cases, like the row with id 25, the model on the right (llama_chat in this case) didn't really provide an answer: it responded with both "True" and "False".
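As a quick check of the figures above, the disagreement rate follows directly from the two counts reported in the diff. The variable names here are only illustrative.

```python
total_samples = 3270   # rows in each model's output file
disagreements = 360    # rows reported as modified by the diff

disagreement_rate = disagreements / total_samples
agreement_rate = 1 - disagreement_rate

print(f"disagreed on {disagreement_rate:.1%} of samples")  # 11.0%
print(f"agreed on {agreement_rate:.1%} of samples")        # 89.0%
```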

Takeaways

Some potential takeaways are:
  1. Gemma-2b was better at following these instructions (text formatting) than Llama-7b-chat, despite its smaller size.
  2. The models agreed on roughly 89% of the validation set (2910 of 3270 samples) without any finetuning on the training set.
  3. Gemma-2b is a candidate to replace Llama-7b-chat as a base model for this task, but we will need to explore further to confirm.

Next Steps

We will use the oxen diff tool to dive deeper into these results by comparing accuracies. We will also explore the trends in these differences and how to use Oxen to take the next steps in our data science workflow.
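As a preview of the accuracy comparison, here is a small sketch that scores one model's rows against the ground truth. It assumes each row is a dict with the processed_response and validation_response columns described earlier; the accuracy function itself is a hypothetical helper, not part of Oxen.

```python
def accuracy(rows, prediction="processed_response", label="validation_response"):
    """Fraction of rows whose prediction column matches the label column."""
    if not rows:
        return 0.0
    # Compare as strings so "True" and True are treated alike; answers such
    # as "Both" never match a label and therefore count as incorrect.
    correct = sum(1 for r in rows if str(r[prediction]) == str(r[label]))
    return correct / len(rows)

# Made-up rows, purely to show the call; real rows would come from the JSONL files.
sample_rows = [
    {"processed_response": "True", "validation_response": "True"},
    {"processed_response": "False", "validation_response": "True"},
    {"processed_response": "Both", "validation_response": "True"},
    {"processed_response": "False", "validation_response": "False"},
]
print(accuracy(sample_rows))  # 0.5
```

Running this once per model file gives two numbers that can be compared directly, alongside the row-level differences from the diff.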