This is the repo for KokoMind, a dataset with multi-party social interactions to evaluate LLMs' social understanding abilities. The repo contains:
KokoMind contains 150 complex multi-party social interactions (50 per source) with free-text questions and answers. To ensure diversity and scalability and to avoid data contamination, all the social interactions, questions, and answers were generated by GPT-4 and later verified by human experts. These generations are based on three different sources:
For each social interaction, we ask various questions designed to probe the following aspects of social understanding.
`question_nonverbal_yes_v0.1.json` contains 770 samples in total. This JSON Lines file is a list of dictionaries, each of which contains the following fields:

- `question_id`: int, the unique ID of the question.
- `text`: str, the social interaction context and question.
- `answer`: str, the GPT-4 answer, further verified by humans.
- `source`: str, one of the three data sources: `gpt-4`, `movie`, `tomi`.
- `category`: str, one of six question categories: `ToM`, `Social Norm`, `Emotion Recognition`, `Social Relation`, `Counterfactual`, `Social Advice`.

`question_nonverbal_no_v0.1.json` contains the same social interactions and questions, but with the non-verbal cues in parentheses (e.g., nervously sipping coffee) removed from the context.
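Each line of the file can be parsed with Python's standard `json` module. The snippet below is a minimal illustration (the sample record is made up; only the field names follow the schema described above):

```python
import json

# An illustrative record following the documented schema
# (not an actual dataset entry).
sample_line = json.dumps({
    "question_id": 1,
    "text": "(Sarah nervously sips her coffee.) Why might Sarah be nervous?",
    "answer": "She may be anxious about the upcoming conversation.",
    "source": "gpt-4",
    "category": "Emotion Recognition",
})

def load_jsonl(lines):
    """Parse an iterable of JSON Lines strings into dictionaries."""
    return [json.loads(line) for line in lines if line.strip()]

records = load_jsonl([sample_line])
print(records[0]["category"])  # Emotion Recognition
```

In practice you would pass `open("data/question_nonverbal_yes_v0.1.json")` (one JSON object per line) instead of the sample list.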
pip install -r requirements.txt
export OPENAI_API_KEY=<your_api_key>
export ANTHROPIC_API_KEY=<your_api_key>
# Generate local model answers
# Use vicuna-7b as an example
python eval/get_model_answer.py --model-path ${PATH_TO_LOCAL_HF_MODEL} --model-id vicuna-7b --question-file data/question_nonverbal_yes_v0.1.jsonl --answer-file data/answer/answer_vicuna-7b.jsonl --num-gpus 8
# GPT-3 answer (reference model used by alpaca-eval)
python eval/qa_baseline_gpt3.py -q data/question_nonverbal_yes_v0.1.jsonl -o data/answer/answer_gpt3.jsonl
# GPT-3.5 answer
python eval/qa_baseline_gpt35.py -q data/question_nonverbal_yes_v0.1.jsonl -o data/answer/answer_gpt35.jsonl
# GPT-4 answer
python eval/qa_baseline_gpt4.py -q data/question_nonverbal_yes_v0.1.jsonl -o data/answer/answer_gpt4.jsonl
# Claude answer
python eval/qa_baseline_claude.py -q data/question_nonverbal_yes_v0.1.jsonl -o data/answer/answer_claude.jsonl
Our evaluation is based on Alpaca-Eval.
# Convert to alpaca_eval input format
python eval/generate_alpaca_eval.py -q data/question_nonverbal_yes_v0.1.jsonl -a data/answer/answer_gpt3.jsonl -o data/alpaca_eval/answer_gpt3.json
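The conversion maps each answer record to the list-of-dicts layout that alpaca_eval reads for model outputs. A minimal sketch of that mapping is below; the input field names (`text`, `answer`, `model_id`) are assumptions about what the answer files contain, not taken from the repo's code:

```python
import json

# Hypothetical answer record, as it might appear in an answer .jsonl file.
answer = {
    "question_id": 1,
    "text": "Why might Sarah be nervous?",
    "answer": "She may be anxious.",
    "model_id": "gpt3",
}

def to_alpaca_eval(records):
    """Map answer records to alpaca_eval's model-output layout:
    one dict per sample with instruction, output, and generator keys."""
    return [
        {
            "instruction": r["text"],
            "output": r["answer"],
            "generator": r["model_id"],
        }
        for r in records
    ]

converted = to_alpaca_eval([answer])
print(json.dumps(converted, indent=2))
```

The resulting list would then be written as a `.json` file under `data/alpaca_eval/` for the leaderboard step.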
alpaca_eval make_leaderboard --leaderboard_path data/alpaca_results/leaderboard.csv --all_model_outputs "./data/alpaca_eval/answer_*" --reference_outputs data/alpaca_eval/answer_gpt3.json --is_overwrite_leaderboard True
This project is an early-stage research showcase, designed solely for non-commercial purposes. It adheres to OpenAI's data usage terms and ShareGPT's privacy practices. Let us know if you spot any potential violations. The code is available under the Apache License 2.0.
We would like to thank Yejin Choi from UW, Louis-Philippe Morency from CMU, Jason Weston from Meta, and Diyi Yang from Stanford for their enlightening discussions and constructive input. The theoretical foundation of KokoMind is based on Liang's PhD research with Song-Chun Zhu (Peking University, Tsinghua University, and the Beijing Institute for General Artificial Intelligence (BIGAI)) and Ying Nian Wu (UCLA).
Please cite our work if you find it useful.
@misc{Shi_KokoMind_Can_Large_2023,
author = {Shi, Weiyan and Qiu, Liang and Xu, Dehong and Sui, Pengwei and Lu, Pan and Yu, Zhou},
title = {{KokoMind: Can Large Language Models Understand Social Interactions?}},
month = jul,
year = {2023},
url = {https://chats-lab.github.io/KokoMind/}
}