storm下載 - storm原始碼下載

STORM：透過檢索與多視角提問合成主題大綱

|研究預覽|風暴紙|共同風暴紙|網站|

最新消息

[2024/09] Co-STORM程式碼庫現已發布並整合到knowledge-storm python套件v1.0.0中。運行pip install knowledge-storm --upgrade進行檢查。
[2024/09] 我們引進協作風暴（Co-STORM）來支援人機協作知識管理！ Co-STORM論文已被EMNLP 2024主會議接收。
[2024/07] 您現在可以使用pip install knowledge-storm安裝我們的軟體包！
[2024/07] 我們新增了VectorRM以支援基於使用者提供的文檔，補充了搜尋引擎（ YouRM 、 BingSearch ）的現有支援。（查看＃58）
[2024/07] 我們為開發人員發布了 demo light，這是一個使用 Python 中的 Streamlit 框架構建的最小用戶介面，方便本地開發和演示託管（查看 #54）
[2024/06] 我們將在NAACL 2024上呈現STORM！歡迎參加 6 月 17 日的海報會議 2 或查看我們的簡報資料。
[2024/05] 我們在 rm.py 中新增了 Bing 搜尋支援。使用GPT-4o測試 STORM - 我們現在使用GPT-4o模型在演示中配置文章生成部分。
[2024/04] 我們發布了STORM程式碼庫的重構版本！我們定義 STORM 管道的介面並重新實作 STORM-wiki（查看src/storm_wiki ）以演示如何實例化管道。我們提供API來支援不同語言模型的客製化和檢索/搜尋整合。

概述（立即嘗試 STORM！）

STORM 是一個法學碩士系統，可以根據網路搜尋從頭開始撰寫類似維基百科的文章。 Co-STORM 透過使人與協作的 LLM 系統支援更一致且首選的資訊搜尋和知識管理，進一步增強了其功能。

雖然該系統無法產生通常需要大量編輯的可發表文章，但經驗豐富的維基百科編輯發現它在預寫作階段很有幫助。

超過 70,000 人嘗試過我們的即時研究預覽。試試一下，看看 STORM 如何幫助您的知識探索之旅，並請提供回饋以幫助我們改進系統！

STORM 和 Co-STORM 的工作原理

風暴

STORM 將產生帶有引用的長文章分為兩個步驟：

預寫階段：系統進行基於互聯網的研究，收集參考文獻並產生大綱。
寫作階段：系統使用大綱和參考文獻產生帶有引文的全文文章。

STORM 認為研究過程自動化的核心是自動提出好的問題。直接提示語言模型提出問題效果並不好。為了提高問題的深度和廣度，STORM 採用了兩種策略：

觀點引導提問：給定輸入主題，STORM 透過調查類似主題的現有文章來發現不同的觀點，並使用它們來控制提問過程。
模擬對話：STORM 模擬維基百科作者和基於網路資源的主題專家之間的對話，使語言模型能夠更新其對主題的理解並提出後續問題。

共同風暴

Co-STORM提出了一種協作對話協議，該協議實施輪流管理策略以支援之間的順利協作

Co-STORM LLM 專家：此類代理人根據外部知識來源產生答案和/或根據話語歷史提出後續問題。
主持人：該代理商會根據檢索器發現的資訊產生發人深省的問題，但在先前的回合中並未直接使用。問題生成也可以接地氣！
人類使用者：人類使用者將主動（1）觀察對話以獲得對主題的更深入的理解，或（2）透過注入話語來引導討論焦點來積極參與對話。

Co-STORM也維護一個動態更新的心智圖，它將收集到的資訊組織成分層概念結構，旨在在人類使用者和系統之間建立共享的概念空間。事實證明，心智圖有助於減輕長篇深入的演講時的精神負擔。

STORM 和 Co-STORM 都是使用 dspy 以高度模組化的方式實現的。

安裝

若要安裝知識風暴庫，請使用pip install knowledge-storm 。

您也可以安裝原始程式碼，它允許您直接修改 STORM 引擎的行為。

克隆 git 儲存庫。

git clone https://github.com/stanford-oval/storm.git
cd storm

安裝所需的軟體包。

conda create -n storm python=3.11
conda activate storm
pip install -r requirements.txt

應用程式介面

目前，我們的套餐支援：

OpenAIModel 、 AzureOpenAIModel 、 ClaudeModel 、 VLLMClient 、 TGIClient 、 TogetherClient 、 OllamaClient 、 GoogleModel 、 DeepSeekModel 、 GroqModel作為語言模型元件
YouRM 、 BingSearch 、 VectorRM 、 SerperRM 、 BraveRM 、 SearXNG 、 DuckDuckGoSearchRM 、 TavilySearchRM 、 GoogleSearch和AzureAISearch作為檢索模組元件

？將更多語言模型整合到knowledge_storm/lm.py 並將搜尋引擎/檢索器整合到knowledge_storm/rm.py 的 PR 受到高度讚賞！

STORM和Co-STORM都工作在資訊管理階層，需要設定資訊檢索模組和語言模型模組來分別建立它們的Runner類別。

風暴

STORM 知識管理引擎定義為一個簡單的 Python STORMWikiRunner類別。以下是使用 You.com 搜尋引擎和 OpenAI 模型的範例。

 import os
from knowledge_storm import STORMWikiRunnerArguments , STORMWikiRunner , STORMWikiLMConfigs
from knowledge_storm . lm import OpenAIModel
from knowledge_storm . rm import YouRM

lm_configs = STORMWikiLMConfigs ()
openai_kwargs = {
    'api_key' : os . getenv ( "OPENAI_API_KEY" ),
    'temperature' : 1.0 ,
    'top_p' : 0.9 ,
}
# STORM is a LM system so different components can be powered by different models to reach a good balance between cost and quality.
# For a good practice, choose a cheaper/faster model for `conv_simulator_lm` which is used to split queries, synthesize answers in the conversation.
# Choose a more powerful model for `article_gen_lm` to generate verifiable text with citations.
gpt_35 = OpenAIModel ( model = 'gpt-3.5-turbo' , max_tokens = 500 , ** openai_kwargs )
gpt_4 = OpenAIModel ( model = 'gpt-4o' , max_tokens = 3000 , ** openai_kwargs )
lm_configs . set_conv_simulator_lm ( gpt_35 )
lm_configs . set_question_asker_lm ( gpt_35 )
lm_configs . set_outline_gen_lm ( gpt_4 )
lm_configs . set_article_gen_lm ( gpt_4 )
lm_configs . set_article_polish_lm ( gpt_4 )
# Check out the STORMWikiRunnerArguments class for more configurations.
engine_args = STORMWikiRunnerArguments (...)
rm = YouRM ( ydc_api_key = os . getenv ( 'YDC_API_KEY' ), k = engine_args . search_top_k )
runner = STORMWikiRunner ( engine_args , lm_configs , rm )

STORMWikiRunner實例可以透過簡單的run方法來呼叫：

 topic = input ( 'Topic: ' )
runner . run (
    topic = topic ,
    do_research = True ,
    do_generate_outline = True ,
    do_generate_article = True ,
    do_polish_article = True ,
)
runner . post_run ()
runner . summary ()

do_research ：如果為 True，則模擬不同視角的對話以收集有關該主題的資訊；否則，加載結果。
do_generate_outline ：如果為 True，則產生主題的大綱；否則，加載結果。
do_generate_article ：如果為 True，則根據大綱和收集到的資訊產生該主題的文章；否則，加載結果。
do_polish_article ：如果為 True，則透過新增摘要部分並（可選）刪除重複內容來完善文章；否則，加載結果。

共同風暴

Co-STORM 知識管理引擎被定義為一個簡單的 Python CoStormRunner類別。這是使用 Bing 搜尋引擎和 OpenAI 模型的範例。

 from knowledge_storm . collaborative_storm . engine import CollaborativeStormLMConfigs , RunnerArgument , CoStormRunner
from knowledge_storm . lm import OpenAIModel
from knowledge_storm . logging_wrapper import LoggingWrapper
from knowledge_storm . rm import BingSearch

# Co-STORM adopts the same multi LM system paradigm as STORM 
lm_config : CollaborativeStormLMConfigs = CollaborativeStormLMConfigs ()
openai_kwargs = {
    "api_key" : os . getenv ( "OPENAI_API_KEY" ),
    "api_provider" : "openai" ,
    "temperature" : 1.0 ,
    "top_p" : 0.9 ,
    "api_base" : None ,
} 
question_answering_lm = OpenAIModel ( model = gpt_4o_model_name , max_tokens = 1000 , ** openai_kwargs )
discourse_manage_lm = OpenAIModel ( model = gpt_4o_model_name , max_tokens = 500 , ** openai_kwargs )
utterance_polishing_lm = OpenAIModel ( model = gpt_4o_model_name , max_tokens = 2000 , ** openai_kwargs )
warmstart_outline_gen_lm = OpenAIModel ( model = gpt_4o_model_name , max_tokens = 500 , ** openai_kwargs )
question_asking_lm = OpenAIModel ( model = gpt_4o_model_name , max_tokens = 300 , ** openai_kwargs )
knowledge_base_lm = OpenAIModel ( model = gpt_4o_model_name , max_tokens = 1000 , ** openai_kwargs )

lm_config . set_question_answering_lm ( question_answering_lm )
lm_config . set_discourse_manage_lm ( discourse_manage_lm )
lm_config . set_utterance_polishing_lm ( utterance_polishing_lm )
lm_config . set_warmstart_outline_gen_lm ( warmstart_outline_gen_lm )
lm_config . set_question_asking_lm ( question_asking_lm )
lm_config . set_knowledge_base_lm ( knowledge_base_lm )

# Check out the Co-STORM's RunnerArguments class for more configurations.
topic = input ( 'Topic: ' )
runner_argument = RunnerArgument ( topic = topic , ...)
logging_wrapper = LoggingWrapper ( lm_config )
bing_rm = BingSearch ( bing_search_api_key = os . environ . get ( "BING_SEARCH_API_KEY" ),
                     k = runner_argument . retrieve_top_k )
costorm_runner = CoStormRunner ( lm_config = lm_config ,
                               runner_argument = runner_argument ,
                               logging_wrapper = logging_wrapper ,
                               rm = bing_rm )

CoStormRunner實例可以透過warmstart()和step(...)方法來呼叫。

 # Warm start the system to build shared conceptual space between Co-STORM and users
costorm_runner . warm_start ()

# Step through the collaborative discourse 
# Run either of the code snippets below in any order, as many times as you'd like
# To observe the conversation:
conv_turn = costorm_runner . step ()
# To inject your utterance to actively steer the conversation:
costorm_runner . step ( user_utterance = "YOUR UTTERANCE HERE" )

# Generate report based on the collaborative discourse
costorm_runner . knowledge_base . reorganize ()
article = costorm_runner . generate_report ()
print ( article )

使用範例腳本快速入門

我們在範例資料夾中提供了腳本，作為使用不同配置運行 STORM 和 Co-STORM 的快速入門。

我們建議使用secrets.toml來設定 API 金鑰。在根目錄下建立檔案secrets.toml ，新增以下內容：

 # Set up OpenAI API key.
OPENAI_API_KEY= " your_openai_api_key "
# If you are using the API service provided by OpenAI, include the following line:
OPENAI_API_TYPE= " openai "
# If you are using the API service provided by Microsoft Azure, include the following lines:
OPENAI_API_TYPE= " azure "
AZURE_API_BASE= " your_azure_api_base_url "
AZURE_API_VERSION= " your_azure_api_version "
# Set up You.com search API key.
YDC_API_KEY= " your_youcom_api_key "

風暴範例

要使用具有預設配置的gpt系列型號來運行 STORM：

運行以下命令。

python examples/storm_examples/run_storm_wiki_gpt.py 
    --output-dir $OUTPUT_DIR 
    --retriever you 
    --do-research 
    --do-generate-outline 
    --do-generate-article 
    --do-polish-article

要使用您喜歡的語言模型或基於您自己的語料庫運行 STORM：請查看 Examples/storm_examples/README.md。

協同風暴範例

要使用預設配置的gpt系列型號運行 Co-STORM，

將BING_SEARCH_API_KEY="xxx"和ENCODER_API_TYPE="xxx"加入secrets.toml
運行以下命令

python examples/costorm_examples/run_costorm_gpt.py 
    --output-dir $OUTPUT_DIR 
    --retriever bing

管道的定制

風暴

如果您已經安裝了原始程式碼，您可以根據自己的用例自訂STORM。 STORM引擎由4個模組組成：

知識管理模組：收集有關給定主題的廣泛資訊。
大綱產生模組：透過為策劃的知識產生分層大綱來組織收集的資訊。
文章生成模組：用收集到的信息填充生成的大綱。
文章潤飾模組：完善和增強書面文章，以更好地呈現。

每個模組的介面在knowledge_storm/interface.py中定義，而它們的實作在knowledge_storm/storm_wiki/modules/*中實例化。這些模組可以根據您的特定要求進行自訂（例如，以項目符號格式產生部分而不是完整段落）。

共同風暴

如果您已經安裝了原始程式碼，您可以根據自己的用例自訂Co-STORM

Co-STORM引入了多種LLM代理類型（即Co-STORM專家和主持人）。 LLM代理介面在knowledge_storm/interface.py中定義，而其實作在knowledge_storm/collaborative_storm/modules/co_storm_agents.py中實例化。可以客製化不同的LLM代理政策。
Co-STORM引入了協作話語協議，其核心功能以回合策略管理為中心。我們在knowledge_storm/collaborative_storm/engine.py中提供了透過DiscourseManager實現輪流策略管理的範例。它可以定制並進一步改進。

數據集

為了促進自動知識管理和複雜資訊搜尋的研究，我們的專案發布了以下資料集：

新鮮維基

FreshWiki 資料集是 100 篇高品質維基百科文章的集合，重點關注 2022 年 2 月至 2023 年 9 月編輯次數最多的頁面。

您可以直接從huggingface下載資料集。為了緩解資料污染問題，我們將資料建置管道的源代碼存檔，以便將來可以重複使用。

狂野搜尋

為了研究使用者對野外複雜資訊搜尋任務的興趣，我們利用從網路研究預覽中收集的資料來建立 WildSeek 資料集。我們對數據進行了下採樣，以確保主題的多樣性和數據的品質。每個數據點都是一對，包含一個主題和用戶對該主題進行深度搜尋的目標。更多詳細信息，請參閱Co-STORM論文的2.2節和附錄A。

WildSeek 資料集可在此處取得。

複製 STORM 和 Co-STORM 論文結果

對於STORM論文實驗，請切換到此處的分支NAACL-2024-code-backup 。

對於Co-STORM論文實驗，請切換到分支EMNLP-2024-code-backup （暫時佔位，稍後更新）。

路線圖和貢獻

我們的團隊正在積極致力於：

人機互動功能：支援使用者參與知識管理過程。
資訊抽象：開發精選資訊的抽象，以支援維基百科風格報告以外的演示格式。

如果您有任何問題或建議，請隨時提出問題或拉取請求。我們歡迎為改進系統和程式碼庫做出貢獻！

聯絡人：邵益佳、蔣玉成

致謝

我們要感謝維基百科提供的優秀開源內容。 FreshWiki 資料集源自維基百科，並根據 Creative Commons Attribution-ShareAlike (CC BY-SA) 授權。

我們非常感謝 Michelle Lam 為該專案設計了徽標，並感謝 Dekun Ma 領導了 UI 開發。

引文

如果您在工作中使用此程式碼或其中的一部分，請引用我們的論文：

 @misc { jiang2024unknownunknowns ,
      title = { Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations } , 
      author = { Yucheng Jiang and Yijia Shao and Dekun Ma and Sina J. Semnani and Monica S. Lam } ,
      year = { 2024 } ,
      eprint = { 2408.15232 } ,
      archivePrefix = { arXiv } ,
      primaryClass = { cs.CL } ,
      url = { https://arxiv.org/abs/2408.15232 } , 
}

@inproceedings { shao2024assisting ,
      title = { {Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models} } , 
      author = { Yijia Shao and Yucheng Jiang and Theodore A. Kanell and Peter Xu and Omar Khattab and Monica S. Lam } ,
      year = { 2024 } ,
      booktitle = { Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) }
}