Predicting Which Chatbot Response Humans Prefer

Large Language Models (LLMs) are rapidly becoming part of everyday life. As these systems interact with users, an important question emerges:

Which responses do people actually prefer?

In this Kaggle competition, each sample contains a prompt and two responses generated by anonymous chatbots. Human annotators select the response they prefer: Model A, Model B, or a tie. The challenge is to build a model that can predict these preferences.

This was my first Kaggle competition. Because the competition runs indefinitely and has no prize pool, it provided a low-stakes environment to experiment and learn.

After several iterations, my final model achieved a log-loss score of 1.02, placing 36th out of 213 submissions at the time of writing.

Competition link: https://www.kaggle.com/competitions/llm-classification-finetuning

Code: https://www.kaggle.com/code/danman1432/llm-classification-finetuning

Data Preparation and Feature Engineering

Problem Setup

The goal of the competition was to predict which of two chatbot responses a user would prefer for a given prompt. Each example contains:

a prompt
response A
response B
the human preference:
- Model A wins
- Model B wins
- Tie

This makes the problem a three-class classification task.

Dataset Overview

The competition provided three files:

Training dataset

Contains the prompt, both model responses, and labels indicating whether the user preferred Model A, Model B, or a tie.

Test dataset

Has the same structure as the training data but without preference labels. The goal is to predict probabilities for each possible outcome.

Sample submission

Shows the required format for predictions, with probabilities for winner_model_a, winner_model_b, and winner_tie.

Feature Engineering

The raw dataset consists mostly of text, which has to be converted into features that a model can learn from.

Target Variable

The original dataset included three indicator columns representing the winning response. These were converted into a single multi-class target variable:

0 -> Model A wins
1 -> Model B wins
2 -> Tie

This simplifies the task into a standard three-class classification problem.
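A minimal sketch of this conversion, assuming the indicator columns carry the same names as the sample submission (winner_model_a, winner_model_b, winner_tie) and that exactly one of them is set per row. The helper name to_target is hypothetical.

```python
# Map the three one-hot indicator columns to a single class id.
LABEL_COLUMNS = ["winner_model_a", "winner_model_b", "winner_tie"]

def to_target(row):
    """Return 0 (Model A wins), 1 (Model B wins), or 2 (tie)."""
    flags = [int(row[col]) for col in LABEL_COLUMNS]
    if sum(flags) != 1:
        raise ValueError("expected exactly one winning column to be set")
    return flags.index(1)

rows = [
    {"winner_model_a": 1, "winner_model_b": 0, "winner_tie": 0},
    {"winner_model_a": 0, "winner_model_b": 0, "winner_tie": 1},
]
targets = [to_target(r) for r in rows]
print(targets)  # [0, 2]
```

With the data in a pandas DataFrame, taking the argmax across the three columns should give the same result in one vectorized step.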

Structural Features

Before working with the text itself, I created numerical features describing the structure of the responses.

Length-based features

Word count
Character count
Sentence count
Difference in word count
Difference in character count
Ratio of response lengths

These features help capture situations where one model tends to produce longer, more detailed answers, while the other produces shorter or more concise responses.
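One way these length features could be computed is sketched below. The sentence splitter is a rough regular expression and the feature names are hypothetical, so treat this as an illustration rather than the exact competition code.

```python
import re

def length_stats(text):
    """Word, character, and sentence counts for one response."""
    words = text.split()
    # Rough sentence count: a run of ., !, or ? ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {"words": len(words), "chars": len(text), "sentences": len(sentences)}

def length_features(response_a, response_b):
    """Pairwise length comparisons between the two responses."""
    a, b = length_stats(response_a), length_stats(response_b)
    return {
        "word_diff": a["words"] - b["words"],
        "char_diff": a["chars"] - b["chars"],
        # +1 in the denominator avoids division by zero for empty responses.
        "char_ratio": a["chars"] / (b["chars"] + 1),
    }

response_a = "Tokyo is the capital. It is in Japan."
response_b = "Tokyo."
features = length_features(response_a, response_b)
print(features)
```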

Stylistic Features

I also added simple indicators of writing style:

Average word length
Number of exclamation points
Number of question marks
Number of line breaks

These features capture stylistic differences such as formatting, rhetorical questions, or expressive language.
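The stylistic indicators are simple counts. A sketch, with a hypothetical function name and whitespace-based word splitting as an assumption:

```python
def style_features(text):
    """Simple writing-style indicators for one response."""
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "avg_word_len": avg_word_len,
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "line_breaks": text.count("\n"),
    }

style = style_features("Great question!\nWhy not try both?")
print(style)
```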

Prompt alignment features

Another useful signal is how closely each response relates to the original prompt. To measure this, I computed:

Prompt overlap with response A
Prompt overlap with response B
The difference between the two

This helps the model detect when one response is more aligned with the question being asked.
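Prompt overlap can be defined in several ways. The sketch below assumes one plausible definition: the fraction of distinct prompt words that also appear in the response. Both the definition and the function names are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Set of lowercase word tokens; punctuation is dropped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def prompt_overlap(prompt, response):
    """Fraction of distinct prompt words that also appear in the response."""
    prompt_words = tokenize(prompt)
    if not prompt_words:
        return 0.0
    return len(prompt_words & tokenize(response)) / len(prompt_words)

prompt = "What is the capital of Japan?"
overlap_a = prompt_overlap(prompt, "The capital of Japan is Tokyo.")
overlap_b = prompt_overlap(prompt, "I like sushi.")
overlap_diff = overlap_a - overlap_b
print(overlap_a, overlap_b, overlap_diff)
```

Here response A shares five of the six prompt words while response B shares none, so the difference is strongly positive.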

Constructing the Final Text Fields

Finally, each response was paired with the original prompt to create two combined text fields:

pair_a: prompt + response A
pair_b: prompt + response B

This allows later stages of the pipeline to evaluate each response in the context of the question it is answering, rather than analyzing the response in isolation.

Converting Text into Features with TF-IDF

The next step was to transform the raw text into numerical features that a machine learning model could use. Because each example consists of a prompt and two competing responses, the challenge is comparing two pieces of text written in response to the same prompt.

Why TF-IDF?

Machine learning models cannot work directly with raw text, so the responses first need to be converted into numerical features. To do this, I used TF-IDF. TF-IDF assigns higher weights to words that are frequent in a specific document but relatively rare across the entire dataset. Words that appear everywhere (like “the” or “is”) receive lower weights. This helps highlight informative terms rather than common filler words. Because each response was already paired with its prompt (pair_a and pair_b), the TF-IDF representation allows the model to learn relationships between the instruction and the generated answer.
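To make the weighting concrete, here is a small textbook version of TF-IDF (term frequency multiplied by the log of the inverse document frequency). It is a simplified sketch: library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf(documents):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the capital is tokyo", "the answer is unclear", "the sky is blue"]
weights = tfidf(docs)
print(weights[0])
```

In this toy corpus, “the” and “is” appear in every document and receive a weight of zero, while “tokyo” and “capital” appear in only one document and receive positive weights.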

Capturing Phrases with N-grams

Single words sometimes miss important context. For example, the phrase “not allowed” carries a different meaning than the individual words “not” and “allowed”. To capture these patterns, I used n-grams, which represent sequences of words.

Unigrams (1-grams) – individual words
Bigrams (2-grams) – two-word combinations

For example, the sentence:

“The capital of Japan is Tokyo”

would produce

Unigrams:

“The”
“capital”
“of”
“Japan”
“is”
“Tokyo”

Bigrams:

“The capital”
“capital of”
“of Japan”
“Japan is”
“is Tokyo”

Including bigrams allows the model to capture short phrases and local context, which can be important when comparing responses.
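The unigram and bigram lists above can be reproduced with a few lines of code. This is a sketch using plain whitespace tokenization; the function name is hypothetical.

```python
def ngrams(text, n):
    """All contiguous runs of n whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The capital of Japan is Tokyo"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
print(unigrams)
print(bigrams)
```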

Limiting the Vocabulary Size

Using bigrams dramatically increases the number of possible features. To keep the model manageable, I limited the TF-IDF vocabulary to 30,000 features. Here, a feature refers to a word or phrase in the vocabulary. Each document is represented as a vector of length 30,000, where each position corresponds to the TF-IDF weight of a specific term. This keeps the representation expressive while preventing the feature space from blowing up.

Comparing the Two Responses

After generating TF-IDF vectors for both responses, I computed their difference. This difference vector highlights which words or phrases appear more strongly in one response than the other. This representation gives the model direct information about how the two answers differ, which is often more useful than analyzing them independently.

Summary of Text and Structural Features

All features were combined into a single dataset before training the model. These features included:

TF-IDF features for response A
TF-IDF features for response B
TF-IDF difference vector (A-B)
Engineered numerical features

Together, these features capture both semantic information from the text and structural signals from the responses.
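The sketch below shows how such a combined feature matrix could be assembled with scikit-learn and SciPy. The unigram-plus-bigram range and the 30,000-term cap follow the description above; fitting one shared vocabulary on both text fields, the toy texts, and the two numeric columns are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the pair_a / pair_b text fields (prompt + response).
pair_a = [
    "What is the capital of Japan? The capital of Japan is Tokyo.",
    "Tell me a joke. I am sorry, I cannot provide that.",
]
pair_b = [
    "What is the capital of Japan? Tokyo.",
    "Tell me a joke. Why did the chicken cross the road?",
]
# Hypothetical engineered numeric features, one row per example.
numeric = np.array([[6.0, 1.5], [-3.0, 0.8]])

# Unigrams and bigrams, vocabulary capped at 30,000 terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=30000)
vectorizer.fit(pair_a + pair_b)  # assumption: one vocabulary shared by both sides

tfidf_a = vectorizer.transform(pair_a)
tfidf_b = vectorizer.transform(pair_b)
tfidf_diff = tfidf_a - tfidf_b  # A-B difference, still a sparse matrix

# Column order: A block, B block, difference block, numeric block.
X = hstack([tfidf_a, tfidf_b, tfidf_diff, csr_matrix(numeric)]).tocsr()
print(X.shape)
```

Sharing one vocabulary keeps the columns of the A and B blocks aligned, which is what makes the element-wise difference meaningful.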

Modeling, Training, and Evaluation

After transforming the dataset into numerical features, the next step was training a model capable of predicting whether Response A would be preferred, Response B would be preferred, or the two would tie. For this task, I used LightGBM, a gradient boosting algorithm that works well with large, sparse feature spaces like those produced by TF-IDF.

Why LightGBM?

LightGBM was a good fit for this problem for several reasons:

It performs well with high-dimensional sparse data
It can handle mixed feature types (text features and numeric features) without additional scaling
It captures non-linear relationships efficiently
It trains quickly even on large datasets

Training Setup

To evaluate the model during development, I created an 80/20 train-validation split from the training dataset. Because gradient boosting models can easily overfit if allowed to train indefinitely, I used early stopping. Training stops when the validation loss stops improving for a specified number of iterations. This prevents the model from memorizing training data and identifies the optimal number of boosting rounds.
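The early stopping rule itself is easy to express in code. The sketch below applies it to a hypothetical list of per-round validation losses; gradient boosting libraries implement the same idea internally, and the patience value here is an arbitrary choice.

```python
def best_iteration(val_losses, patience):
    """Return (best_round, best_loss) using 1-based round numbers.

    Scanning stops once `patience` consecutive rounds fail to improve
    on the best validation loss seen so far.
    """
    best_round, best_loss = 0, float("inf")
    for round_number, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_round, best_loss = round_number, loss
        elif round_number - best_round >= patience:
            break
    return best_round, best_loss

# Hypothetical validation losses, one per boosting round.
losses = [1.09, 1.05, 1.03, 1.02, 1.025, 1.03, 1.04, 1.01]
print(best_iteration(losses, patience=3))  # (4, 1.02)
```

Note the trade-off: with a patience of 3 the scan stops after round 7 and never sees the lower loss at round 8, so a larger patience costs more training time but is less likely to stop on a temporary plateau.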

To further reduce overfitting, I applied several regularization techniques:

leaf constraints
feature subsampling
L1 and L2 regularization penalties
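As an illustration, these techniques map onto LightGBM parameters roughly as follows. The parameter names are real LightGBM options, but every value below is a hypothetical placeholder, not the configuration used for the submission.

```python
# Hypothetical LightGBM settings; the values are illustrative only.
params = {
    "objective": "multiclass",
    "num_class": 3,              # Model A wins, Model B wins, tie
    "metric": "multi_logloss",   # multi-class log-loss
    "learning_rate": 0.05,
    "num_leaves": 31,            # leaf constraint: cap on leaves per tree
    "min_data_in_leaf": 50,      # leaf constraint: minimum samples per leaf
    "feature_fraction": 0.5,     # feature subsampling for each tree
    "lambda_l1": 1.0,            # L1 regularization penalty
    "lambda_l2": 1.0,            # L2 regularization penalty
}

# With the lightgbm package installed, training with early stopping would
# look roughly like this (train_set and valid_set are lightgbm.Dataset
# objects; stopping_rounds=100 is an arbitrary example value):
#
#   import lightgbm as lgb
#   model = lgb.train(
#       params,
#       train_set,
#       num_boost_round=2000,
#       valid_sets=[valid_set],
#       callbacks=[lgb.early_stopping(stopping_rounds=100)],
#   )
```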

Validation Results

On the 20% validation set, the model achieved:

Log-loss: 1.019
Accuracy: 48.8%
Best iteration: 167 boosting rounds

For a three-class problem, random guessing would produce an accuracy of roughly 33%, so the model performs substantially better than chance. Although training was allowed to run up to 2000 boosting rounds, early stopping identified 167 as the optimal point. This suggests that most of the useful learning happened early in training.

Predicted Class Distribution

The model’s predictions were distributed as follows:

Response A wins: 38.4%
Response B wins: 36.2%
Tie: 25.4%

This distribution is reasonably balanced and reflects the fact that many examples in the dataset involve clear preferences between the two responses.

Confusion Matrix Insights

Examining the confusion matrix reveals an interesting pattern.

The model performs reasonably well distinguishing between Response A and Response B. Many of these examples are classified correctly. However, ties are significantly harder to predict.

For cases where the true label was a tie, the model predicted:

A win: 1157
B win: 1067
Tie: 1328

This means the model correctly identifies a tie about 37% of the time. The result is not surprising. Ties often represent ambiguous cases where both responses are similarly good, which makes them harder for the model to distinguish.
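The 37% figure follows directly from the tie row of the confusion matrix:

```python
# Predictions for validation examples whose true label was a tie.
tie_row = {"A win": 1157, "B win": 1067, "Tie": 1328}

total_ties = sum(tie_row.values())
tie_recall = tie_row["Tie"] / total_ties
print(total_ties, round(tie_recall, 4))  # 3552 0.3739
```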

Classification Metrics

Performance by class:

Response A

Precision: 0.4776
Recall: 0.5472
F1: 0.5212

Response B

Precision: 0.5008
Recall: 0.5299
F1: 0.5150

Tie

Precision: 0.4542
Recall: 0.3739
F1: 0.4101

As expected, performance on the tie class is noticeably weaker, reflecting the ambiguity of these examples.

Another interesting observation is that the model’s average top predicted probability was only 47.3%. This suggests that many examples were inherently uncertain rather than easy high-confidence decisions.

Feature importance

Looking at feature importance provides insight into what signals the model relied on most.

Top numeric features

Character difference
Sentence difference
Character ratio
Average word length difference
Prompt words
Chatbot B lines
Word difference
Chatbot A average word length
Prompt overlap difference
Chatbot A lines

Among the numeric features, the most important ones relate to response length and structure, which suggests that length and formatting are the structural signals most strongly associated with human preference.

Importance by Feature Group

Breaking down importance by feature category:

A-B difference in TF-IDF: 40.9%
Pair_b TF-IDF: 26.8%
Pair_a TF-IDF: 26.1%
Numeric: 6.2%

This supports the feature design: comparing the two responses directly brings more value than looking at each response independently.
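A group breakdown like this can be computed by summing per-feature importances over the column range of each block. The sketch below assumes the feature matrix stacks the blocks in a known order and that a flat list of importances is available (for example from a LightGBM booster's feature_importance method). The toy numbers are hypothetical.

```python
def importance_share(importances, groups):
    """Percentage of total importance per named column range.

    `groups` maps a group name to a (start, stop) slice of the feature vector.
    """
    total = sum(importances)
    return {
        name: 100.0 * sum(importances[start:stop]) / total
        for name, (start, stop) in groups.items()
    }

# Toy vector: two TF-IDF columns per block plus one numeric column.
importances = [3, 1, 2, 2, 5, 5, 2]
groups = {"pair_a": (0, 2), "pair_b": (2, 4), "diff": (4, 6), "numeric": (6, 7)}
shares = importance_share(importances, groups)
print(shares)
```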

What the Model Learned From Text

Looking at the most influential TF-IDF terms provides additional insight. Some of the top terms included: “provide”, “cannot”, “an ai”, “sorry”, “apologize”, and “appropriate”. These terms suggest the model learned stylistic patterns associated with helpfulness or refusal behavior. Apology-heavy wording tends to appear in refusals, while words like “provide” may correlate with more helpful responses. These patterns indicate that the model is picking up on stylistic cues that influence human preferences.

Training on the Full Dataset

After validating the model, I retrained it on the entire training dataset. Using all available labeled examples allows the model to learn from the full set of patterns before generating predictions for the competition test set. The final model outputs probabilities for each of the three possible outcomes: Response A wins, Response B wins, or a tie. Each row of the submission file holds these three probabilities for one test example.
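A minimal sketch of writing predictions in the submission format. The three probability column names come from the sample submission described earlier; the id column, the example ids, and the probability values are assumptions made for illustration.

```python
import csv
import io

# Hypothetical ids and predicted probabilities (each row sums to 1).
predictions = {
    101: (0.41, 0.35, 0.24),
    102: (0.30, 0.45, 0.25),
}

# An in-memory buffer keeps the example self-contained; replace it with
# open("submission.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "winner_model_a", "winner_model_b", "winner_tie"])
for example_id, probs in predictions.items():
    writer.writerow([example_id, *probs])

print(buffer.getvalue())
```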

Final Result

My final submission achieved a log-loss score of 1.02, placing 26th out of 213 submissions at the time of writing. Not bad!

Because the evaluation metric is a multi-class log-loss, the model is rewarded for assigning high probability to the correct outcome rather than simply predicting the correct label.

For reference, a model that always predicts equal probabilities (1/3, 1/3, 1/3) would achieve a log-loss of ln 3 ≈ 1.10, so a score of 1.02 indicates the model is learning meaningful patterns from the data.
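The baseline is easy to verify. Multi-class log-loss is the average negative log of the probability assigned to the true class, so a uniform prediction scores ln 3 regardless of the labels. The labels and the second set of predictions below are hypothetical.

```python
import math

def multiclass_log_loss(y_true, probabilities):
    """Mean negative log of the probability assigned to the true class."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, probabilities)]
    return sum(losses) / len(losses)

y_true = [0, 1, 2]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
# Always betting heavily on class 1 is right once and badly wrong twice.
overconfident = [[0.05, 0.90, 0.05]] * 3

uniform_loss = multiclass_log_loss(y_true, uniform)
overconfident_loss = multiclass_log_loss(y_true, overconfident)
print(round(uniform_loss, 4))  # 1.0986
print(round(overconfident_loss, 4))
```

The second case shows why the metric rewards calibrated probabilities: confident wrong predictions are penalized far more heavily than hedged ones.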

Key Takeaways

Comparing the two responses directly was the most valuable signal.
Structural features also mattered.
Tie cases are inherently difficult.
Human preference is subjective.

This was my first Kaggle competition, and it was a great opportunity to experiment with feature engineering, text representations, and gradient boosting models.
