KeyBERT下载 - KeyBERT源码下载

密钥伯特

KeyBERT 是一种最小且易于使用的关键字提取技术，它利用 BERT 嵌入来创建与文档最相似的关键字和关键短语。

相应的媒体帖子可以在这里找到。

一、项目简介

返回目录

尽管已经有很多可用于关键字生成的方法（例如，Rake、YAKE！、TF-IDF 等），但我想创建一个非常基本但功能强大的方法来提取关键字和关键短语。这就是KeyBERT发挥作用的地方！它使用 BERT 嵌入和简单的余弦相似度来查找文档中与文档本身最相似的子短语。

首先，使用 BERT 提取文档嵌入以获得文档级表示。然后，提取 N 元词/短语的词嵌入。最后，我们使用余弦相似度来查找与文档最相似的单词/短语。然后，最相似的单词可以被识别为最能描述整个文档的单词。

KeyBERT 绝不是独一无二的，它是作为一种快速、简单的创建关键字和关键短语的方法而创建的。尽管有很多使用 BERT 嵌入的优秀论文和解决方案（例如 1、2、3），但我找不到不需要从头开始训练并且可以供初学者使用的基于 BERT 的解决方案（如果我错了请纠正我！ ）。因此，目标是pip install keybert和最多 3 行代码的使用。

2. 入门

返回目录

2.1.安装

可以使用 pypi 完成安装：

 pip install keybert

您可能需要安装更多，具体取决于您将使用的转换器和语言后端。可能的安装有：

 pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]

2.2.用法

下面是提取关键字的最小示例：

 from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """
kw_model = KeyBERT ()
keywords = kw_model . extract_keywords ( doc )

您可以设置keyphrase_ngram_range来设置生成的关键字/关键短语的长度：

 >> > kw_model . extract_keywords ( doc , keyphrase_ngram_range = ( 1 , 1 ), stop_words = None )
[( 'learning' , 0.4604 ),
 ( 'algorithm' , 0.4556 ),
 ( 'training' , 0.4487 ),
 ( 'class' , 0.4086 ),
 ( 'mapping' , 0.3700 )]

要提取关键短语，只需将keyphrase_ngram_range设置为 (1, 2) 或更高，具体取决于您想要在生成的关键短语中包含的单词数量：

 >> > kw_model . extract_keywords ( doc , keyphrase_ngram_range = ( 1 , 2 ), stop_words = None )
[( 'learning algorithm' , 0.6978 ),
 ( 'machine learning' , 0.6305 ),
 ( 'supervised learning' , 0.5985 ),
 ( 'algorithm analyzes' , 0.5860 ),
 ( 'learning function' , 0.5850 )]

我们可以通过简单地设置highlight来突出显示文档中的关键字：

 keywords = kw_model . extract_keywords ( doc , highlight = True )

注意：有关所有可能的变压器模型的完整概述，请参阅句子变压器。我建议对于英语文档使用"all-MiniLM-L6-v2" ，对于多语言文档或任何其他语言使用"paraphrase-multilingual-MiniLM-L12-v2" 。

2.3.最大总和距离

为了使结果多样化，我们采用与文档最相似的 2 x top_n 单词/短语。然后，我们从 2 x top_n 个单词中取出所有 top_n 组合，并通过余弦相似度提取彼此最不相似的组合。

 >> > kw_model . extract_keywords ( doc , keyphrase_ngram_range = ( 3 , 3 ), stop_words = 'english' ,
                              use_maxsum = True , nr_candidates = 20 , top_n = 5 )
[( 'set training examples' , 0.7504 ),
 ( 'generalize training data' , 0.7727 ),
 ( 'requires learning algorithm' , 0.5050 ),
 ( 'supervised learning algorithm' , 0.3779 ),
 ( 'learning machine learning' , 0.2891 )]

2.4.最大边际相关性

为了使结果多样化，我们可以使用最大边距相关性（MMR）来创建同样基于余弦相似度的关键字/关键短语。具有高度多样性的结果：

 >> > kw_model . extract_keywords ( doc , keyphrase_ngram_range = ( 3 , 3 ), stop_words = 'english' ,
                              use_mmr = True , diversity = 0.7 )
[( 'algorithm generalize training' , 0.7727 ),
 ( 'labels unseen instances' , 0.1649 ),
 ( 'new examples optimal' , 0.4185 ),
 ( 'determine class labels' , 0.4774 ),
 ( 'supervised learning algorithm' , 0.7502 )]

多样性低的结果：

 >> > kw_model . extract_keywords ( doc , keyphrase_ngram_range = ( 3 , 3 ), stop_words = 'english' ,
                              use_mmr = True , diversity = 0.2 )
[( 'algorithm generalize training' , 0.7727 ),
 ( 'supervised learning algorithm' , 0.7502 ),
 ( 'learning machine learning' , 0.7577 ),
 ( 'learning algorithm analyzes' , 0.7587 ),
 ( 'learning algorithm generalize' , 0.7514 )]

2.5.嵌入模型

KeyBERT 支持许多可用于嵌入文档和单词的嵌入模型：

句子变形金刚
天赋
斯帕西
根森
使用

单击此处查看所有支持的嵌入模型的完整概述。

句子变形金刚
您可以在此处从sentence-transformers中选择任何模型，并将其通过带有model的 KeyBERT 传递：

 from keybert import KeyBERT
kw_model = KeyBERT ( model = 'all-MiniLM-L6-v2' )

或者选择带有您自己的参数的 SentenceTransformer 模型：

 from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer ( "all-MiniLM-L6-v2" )
kw_model = KeyBERT ( model = sentence_model )

天赋
Flair 允许您选择几乎任何公开可用的嵌入模型。 Flair 可以如下使用：

 from keybert import KeyBERT
from flair . embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings ( 'roberta-base' )
kw_model = KeyBERT ( model = roberta )

您可以选择任何一个？变形金刚模型在这里。

3. 大型语言模型

返回目录

借助KeyLLM您可以使用大型语言模型 (LLM) 执行关键字提取。您可以在此处找到完整的文档，但有两个与此新方法常见的示例。在开始之前，请确保通过pip install openai安装 OpenAI 软件包。

首先，我们可以直接要求OpenAI提取关键词：

 import openai
from keybert . llm import OpenAI
from keybert import KeyLLM

# Create your LLM
client = openai . OpenAI ( api_key = MY_API_KEY )
llm = OpenAI ( client )

# Load it in KeyLLM
kw_model = KeyLLM ( llm )

这将查询任何 ChatGPT 模型并要求它从文本中提取关键字。

其次，我们可以找到可能具有相同关键字的文档，并仅提取这些文档的关键字。这比询问每个文档的关键字要有效得多。可能有一些文档具有完全相同的关键字。这样做很简单：

 import openai
from keybert . llm import OpenAI
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer ( 'all-MiniLM-L6-v2' )
embeddings = model . encode ( MY_DOCUMENTS , convert_to_tensor = True )

# Create your LLM
client = openai . OpenAI ( api_key = MY_API_KEY )
llm = OpenAI ( client )

# Load it in KeyLLM
kw_model = KeyLLM ( llm )

# Extract keywords
keywords = kw_model . extract_keywords ( MY_DOCUMENTS , embeddings = embeddings , threshold = .75 )

您可以使用threshold参数来确定文档需要有多相似才能接收相同的关键字。

引文

要在您的工作中引用 KeyBERT，请使用以下 bibtex 参考文献：

 @misc { grootendorst2020keybert ,
  author       = { Maarten Grootendorst } ,
  title        = { KeyBERT: Minimal keyword extraction with BERT. } ,
  year         = 2020 ,
  publisher    = { Zenodo } ,
  version      = { v0.3.0 } ,
  doi          = { 10.5281/zenodo.4461265 } ,
  url          = { https://doi.org/10.5281/zenodo.4461265 }
}