This repository hosts the code for running experiments on the DOSA dataset.
Create the dosa conda environment by running create_env.py
Activate the environment by running conda activate dosa
Set the following environment variables in the .env file:
OPENAI_API_KEY
HF_TOKEN
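As a quick sanity check before running experiments, the sketch below reports which of the two keys listed above are missing from the .env file. It is only an illustration: the helper names parse_env_file and missing_keys are hypothetical, and it assumes the file sits in the current working directory and uses plain KEY=VALUE lines. The repository may load the file differently.

```python
from pathlib import Path

# Key names come from the README; everything else here is an assumption.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HF_TOKEN")


def parse_env_file(text):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed location: current working directory
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_keys(parse_env_file(text))
    print("Missing keys:", missing if missing else "none")
```

Run it from the directory that contains the .env file. It only reads the file and never prints the secret values.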
Also export the PYTHONPATH variable so that all packages can be imported correctly. To extend PYTHONPATH, run the following command in a terminal: export PYTHONPATH=$PYTHONPATH:
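To confirm that the export took effect in the current shell, a small check along these lines can help. This is a sketch only: pythonpath_entries and is_on_pythonpath are hypothetical helpers, not part of this repository, and the directory you pass in is whichever path you appended to PYTHONPATH.

```python
import os


def pythonpath_entries(environ=None):
    """Split PYTHONPATH into its entries, dropping empty ones."""
    environ = os.environ if environ is None else environ
    raw = environ.get("PYTHONPATH", "")
    return [entry for entry in raw.split(os.pathsep) if entry]


def is_on_pythonpath(directory, environ=None):
    """Return True if the directory appears among the PYTHONPATH entries."""
    target = os.path.abspath(directory)
    return any(
        os.path.abspath(entry) == target
        for entry in pythonpath_entries(environ)
    )


if __name__ == "__main__":
    # Prints the entries visible to the current Python process.
    print(pythonpath_entries())
```

Empty entries are filtered out because appending to an unset PYTHONPATH with the command above leaves a leading separator.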
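Note: Make sure you have requested access to the Llama 2 models. We download the Llama 2 models through HuggingFace, so the email ID on your HuggingFace account must be the same as the one you used when requesting access to Llama 2. Generate an HF_TOKEN and store it in the .env file.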
If you use the dataset or code, please cite it with the following BibTeX entry:
@inproceedings{seth-etal-2024-dosa-dataset,
title = "{DOSA}: A Dataset of Social Artifacts from Different {I}ndian Geographical Subcultures",
author = "Seth, Agrima and
Ahuja, Sanchit and
Bali, Kalika and
Sitaram, Sunayana",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.474",
pages = "5323--5337",
abstract = "Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering. To be effective globally, these models must be aware of and account for local socio-cultural contexts, making it necessary to have benchmarks to evaluate the models for their cultural familiarity. Since the training data for LLMs is web-based and the Web is limited in its representation of information, it does not capture knowledge present within communities that are not on the Web. Thus, these models exacerbate the inequities, semantic misalignment, and stereotypes from the Web. There has been a growing call for community-centered participatory research methods in NLP. In this work, we respond to this call by using participatory research methods to introduce DOSA, the first community-generated Dataset of 615 Social Artifacts, by engaging with 260 participants from 19 different Indian geographic subcultures. We use a gamified framework that relies on collective sensemaking to collect the names and descriptions of these artifacts such that the descriptions semantically align with the shared sensibilities of the individuals from those cultures. Next, we benchmark four popular LLMs and find that they show significant variation across regional sub-cultures in their ability to infer the artifacts.",
}
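This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You only need to do this once across all repositories that use our CLA.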
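This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.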
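This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.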
Please refer to our data license here.
You can read more about the Microsoft Privacy Statement here.