This solution was developed for the LMSYS - Chatbot Arena Human Preference Predictions competition on Kaggle, where participants were challenged to predict user preferences in head-to-head conversations between chatbots powered by large language models (LLMs). The task involved utilizing a dataset from Chatbot Arena, in which users interact with two anonymous LLMs and choose their preferred response. By creating a machine learning model that accurately predicts these preferences, we aimed to contribute to improving the alignment of chatbot responses with human preferences.
Our team placed 4th out of 1849 teams, earning a Gold Medal for our solution and a prize of $20,000!
First, we used the official dataset (55k) together with 33k deduplicated samples. We set up 20-fold cross-validation (n_splits=20) but trained on only one fold, so that roughly 95% of the data went into training. Additionally, we created pseudo-labels for 30,000 entries from the ultrafeedback dataset to supplement the training data further.
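To make the single-fold setup concrete, here is a minimal pure-Python sketch of a 20-way split where only one fold is held out. The function name `single_fold_split`, the seed, and the plain shuffled (non-stratified) assignment are assumptions for illustration; the row count of 88,000 is just the approximate sum of the 55k and 33k figures above, not the exact dataset size.

```python
import random


def single_fold_split(n_rows, n_splits=20, fold=0, seed=42):
    """Return (train_idx, valid_idx) for one fold of a shuffled K-fold split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # The row at shuffled position `pos` belongs to fold `pos % n_splits`.
    valid_idx = [idx for pos, idx in enumerate(indices) if pos % n_splits == fold]
    train_idx = [idx for pos, idx in enumerate(indices) if pos % n_splits != fold]
    return train_idx, valid_idx


# Hypothetical row count: roughly 55k official + 33k extra samples.
train_idx, valid_idx = single_fold_split(n_rows=88_000)
print(len(train_idx), len(valid_idx))  # 83600 4400
```

With 20 splits, a single run holds out only 1/20 of the rows for validation, which is why training one fold keeps almost all of the data available for fitting.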
We designed a custom prompt format. Its main benefit is that when the dialogue exceeds the maximum token length (max_length), the final round of the conversation can be truncated in a balanced way: the remaining token budget is split 20% to the prompt and 40% to each response. This ensures that the prompt, response A, and response B all remain visible, instead of only one part being cut off. If fewer than 80 tokens remain for the final round, that round and all subsequent rounds are discarded. These thresholds and proportions were determined by inspecting the training set.
def tokenize_cls_p3(example, tokenizer, max_length, is_train):
    input_ids = []
    attention_mask = []
    # Marker appended to any segment that gets truncated.
    dot_tokens = tokenizer("......", add_special_tokens=False)["input_ids"]
    # Closing question appended after the last round.
    final_p_tokens = tokenizer("\n\n---\nWhich response is better? [A or B or tie]\nAnswer: ", add_special_tokens=False)["input_ids"]
    for ps, ras, rbs in zip(example['prompt'], example['response_a'], example['response_b']):
        one_input_ids = [tokenizer.bos_token_id]
        prev_tokens_num = 2 + len(final_p_tokens)  # 2 for bos_token and eos_token
        for idx, (p, ra, rb) in enumerate(zip(ps, ras, rbs)):
            r_tokens = tokenizer(f'\n\n## Round {idx+1}:' if idx else f'## Round {idx+1}:', add_special_tokens=False)["input_ids"]
            p_tokens = tokenizer(f'\n### Prompt:\n{p}', add_special_tokens=False)["input_ids"]
            ra_tokens = tokenizer(f'\n\n### Response A:\n{ra}', add_special_tokens=False)["input_ids"]
            rb_tokens = tokenizer(f'\n\n### Response B:\n{rb}', add_special_tokens=False)["input_ids"]
            all_tokens_num = prev_tokens_num + len(r_tokens) + len(p_tokens) + len(ra_tokens) + len(rb_tokens)
            if all_tokens_num > max_length:
                # This round does not fit: truncate it if enough budget remains, then stop.
                remain_tokens_num = max_length - prev_tokens_num - len(r_tokens) - 3 * len(dot_tokens)
                if remain_tokens_num >= 80:
                    # Split the remaining budget 20% / 40% / 40% across prompt, response A, response B.
                    p_tokens = p_tokens[:int(remain_tokens_num * 0.2)] + dot_tokens if len(p_tokens) > int(remain_tokens_num * 0.2) else p_tokens
                    ra_tokens = ra_tokens[:int(remain_tokens_num * 0.4)] + dot_tokens if len(ra_tokens) > int(remain_tokens_num * 0.4) else ra_tokens
                    rb_tokens = rb_tokens[:int(remain_tokens_num * 0.4)] + dot_tokens if len(rb_tokens) > int(remain_tokens_num * 0.4) else rb_tokens
                    one_input_ids += r_tokens + p_tokens + ra_tokens + rb_tokens
                # Drop this round (if not truncated) and all later rounds.
                break
            else:
                prev_tokens_num = all_tokens_num
                one_input_ids += r_tokens + p_tokens + ra_tokens + rb_tokens
        one_input_ids += final_p_tokens + [tokenizer.eos_token_id]
        one_attention_mask = [1] * len(one_input_ids)
        input_ids.append(one_input_ids)
        attention_mask.append(one_attention_mask)
    if is_train:
        # Class labels: 0 = model A wins, 1 = model B wins, 2 = tie.
        labels = [0 if a_win else 1 if b_win else 2 for a_win, b_win, tie in zip(example['winner_model_a'], example['winner_model_b'], example['winner_tie'])]
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }
    else:
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
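The budget arithmetic in the tokenization function above can be checked in isolation. The following is a small sketch that works on plain lists of integers instead of real tokenizer output; the helper name `truncate_final_round` and the dummy token values are hypothetical, while the 20/40/40 shares and the 80-token threshold come from the function above.

```python
def truncate_final_round(p_tokens, ra_tokens, rb_tokens, remain, dot_tokens, min_remain=80):
    """Split `remain` tokens 20/40/40 across prompt, response A, and response B.

    Returns None when the budget is below `min_remain`, meaning the round
    should be dropped instead of truncated.
    """
    if remain < min_remain:
        return None

    def clip(tokens, share):
        limit = int(remain * share)
        # Only segments longer than their share are cut and marked with dots.
        return tokens[:limit] + dot_tokens if len(tokens) > limit else tokens

    return clip(p_tokens, 0.2), clip(ra_tokens, 0.4), clip(rb_tokens, 0.4)


dots = [-1]  # stand-in for the tokenized "......" marker
p, ra, rb = truncate_final_round(
    list(range(50)), list(range(300)), list(range(10)), remain=100, dot_tokens=dots
)
print(len(p), len(ra), len(rb))  # 21 41 10
```

Note that a segment shorter than its share is kept whole and its unused budget is not redistributed, which matches the behavior of the original function.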
We selected gemma-2-9b-it as the starting model, which significantly outperformed other models such as Llama3 8b and Llama3.1 8b. We used Gemma2ForSequenceClassification for a three-class classification task and fine-tuned the model using LoRA with bf16 precision. The best experimental results were achieved on four A100 GPUs.
Each experiment took approximately 10 hours for the first phase and 15 hours for the second phase on a system with 4 A100 GPUs (40 GB each).
The inference phase uses a similar code structure to the training phase, with some key differences: max_length is increased to 3072, and response_a and response_b are swapped as part of a test-time augmentation (TTA) strategy. The final result is the average of the two outputs.
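The swap-and-average step can be sketched as follows. This is an illustration, not the team's inference code: the function name `average_swap_tta` is hypothetical, and it assumes each prediction is a `[p_a, p_b, p_tie]` list and that the swapped-input prediction is mapped back to the original A/B order before averaging.

```python
def average_swap_tta(probs_original, probs_swapped):
    """Average two [p_a, p_b, p_tie] predictions, the second made on swapped inputs."""
    a1, b1, t1 = probs_original
    # In the swapped pass, the model's "A" is the original response B and vice versa.
    a2, b2, t2 = probs_swapped
    # Realign the swapped prediction to the original order, then average.
    return [(a1 + b2) / 2, (b1 + a2) / 2, (t1 + t2) / 2]


# Dummy probabilities for one test row.
avg = average_swap_tta([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
print(avg)
```

Averaging the two passes reduces any bias the model may have toward whichever response appears first in the prompt.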
Post-processing was applied for two specific scenarios (which may overlap): one of the responses is empty or null, and response_a and response_b are identical.
import pandas as pd

# submission_df is assumed to already hold the model predictions for the test set.
df2 = pd.read_csv('/kaggle/input/lmsys-chatbot-arena/test.csv')
df2['id'] = df2['id'].astype(str)
target_cols = ['winner_model_a', 'winner_model_b', 'winner_tie']
# String forms of an empty or null response list.
null_values = ['[null]', '[]', '[ ]', '[""]', '["",""]']

# Response A is empty: strongly favor response B.
a_null_id_list = df2.loc[df2['response_a'].isin(null_values), 'id'].tolist()
submission_df.loc[submission_df['id'].isin(a_null_id_list), target_cols] = [0.04, 0.88, 0.08]

# Response B is empty: strongly favor response A.
b_null_id_list = df2.loc[df2['response_b'].isin(null_values), 'id'].tolist()
submission_df.loc[submission_df['id'].isin(b_null_id_list), target_cols] = [0.88, 0.04, 0.08]

# Identical responses: strongly favor a tie. Applied last, so it overrides the rules above.
same_a_b_id_list = df2.loc[df2['response_a'] == df2['response_b'], 'id'].tolist()
submission_df.loc[submission_df['id'].isin(same_a_b_id_list), target_cols] = [0.06, 0.06, 0.88]
Overview: Developed and optimized a human preference prediction model based on gemma-2-9b-it that predicts which of two chatbot responses a user will prefer in a dialogue.
Key Techniques:
Daoyuan Li - Kaggle Profile
For any questions, please contact Daoyuan Li at [email protected].