This package wraps sentence-transformers (also known as sentence-BERT) directly in spaCy. You can replace the vectors provided in any spaCy model with vectors that have been tuned specifically for semantic similarity.
The models below are suggested for analysing sentence similarity, as indicated by the STS benchmark. Keep in mind that sentence-transformers models are configured with a maximum sequence length of 128, so longer texts may be better served by other models (e.g. Universal Sentence Encoder).
Compatibility:
To install this package, you can run one of the following:
pip install spacy-sentence-bert
pip install git+https://github.com/MartinoMensio/spacy-sentence-bert.git
You can install standalone spaCy packages from GitHub with pip. If you install a standalone package, you can load the language model directly with the spacy.load API, without adding a pipeline stage. The table below takes the models listed in the Sentence Transformers documentation and shows some statistics along with the instruction to install each standalone model. If you don't want to install the standalone models, you can still use them by adding a pipeline stage (see below).
sentence-BERT name | spacy model name | dimensions | language | STS benchmark | standalone install |
---|---|---|---|---|---|
paraphrase-distilroberta-base-v1 | en_paraphrase_distilroberta_base_v1 | 768 | en | 81.81 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_paraphrase_distilroberta_base_v1-0.1.2.tar.gz#en_paraphrase_distilroberta_base_v1-0.1.2 |
paraphrase-xlm-r-multilingual-v1 | xx_paraphrase_xlm_r_multilingual_v1 | 768 | 50+ | 83.50 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_paraphrase_xlm_r_multilingual_v1-0.1.2.tar.gz#xx_paraphrase_xlm_r_multilingual_v1-0.1.2 |
stsb-roberta-large | en_stsb_roberta_large | 1024 | en | 86.39 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_roberta_large-0.1.2.tar.gz#en_stsb_roberta_large-0.1.2 |
stsb-roberta-base | en_stsb_roberta_base | 768 | en | 85.44 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_roberta_base-0.1.2.tar.gz#en_stsb_roberta_base-0.1.2 |
stsb-bert-large | en_stsb_bert_large | 1024 | en | 85.29 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_bert_large-0.1.2.tar.gz#en_stsb_bert_large-0.1.2 |
stsb-distilbert-base | en_stsb_distilbert_base | 768 | en | 85.16 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_distilbert_base-0.1.2.tar.gz#en_stsb_distilbert_base-0.1.2 |
stsb-bert-base | en_stsb_bert_base | 768 | en | 85.14 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_bert_base-0.1.2.tar.gz#en_stsb_bert_base-0.1.2 |
nli-bert-large | en_nli_bert_large | 1024 | en | 79.19 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_large-0.1.2.tar.gz#en_nli_bert_large-0.1.2 |
nli-distilbert-base | en_nli_distilbert_base | 768 | en | 78.69 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_distilbert_base-0.1.2.tar.gz#en_nli_distilbert_base-0.1.2 |
nli-roberta-large | en_nli_roberta_large | 1024 | en | 78.69 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_roberta_large-0.1.2.tar.gz#en_nli_roberta_large-0.1.2 |
nli-bert-large-max-pooling | en_nli_bert_large_max_pooling | 1024 | en | 78.41 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_large_max_pooling-0.1.2.tar.gz#en_nli_bert_large_max_pooling-0.1.2 |
nli-bert-large-cls-pooling | en_nli_bert_large_cls_pooling | 1024 | en | 78.29 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_large_cls_pooling-0.1.2.tar.gz#en_nli_bert_large_cls_pooling-0.1.2 |
nli-distilbert-base-max-pooling | en_nli_distilbert_base_max_pooling | 768 | en | 77.61 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_distilbert_base_max_pooling-0.1.2.tar.gz#en_nli_distilbert_base_max_pooling-0.1.2 |
nli-roberta-base | en_nli_roberta_base | 768 | en | 77.49 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_roberta_base-0.1.2.tar.gz#en_nli_roberta_base-0.1.2 |
nli-bert-base-max-pooling | en_nli_bert_base_max_pooling | 768 | en | 77.21 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_base_max_pooling-0.1.2.tar.gz#en_nli_bert_base_max_pooling-0.1.2 |
nli-bert-base | en_nli_bert_base | 768 | en | 77.12 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_base-0.1.2.tar.gz#en_nli_bert_base-0.1.2 |
nli-bert-base-cls-pooling | en_nli_bert_base_cls_pooling | 768 | en | 76.30 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nli_bert_base_cls_pooling-0.1.2.tar.gz#en_nli_bert_base_cls_pooling-0.1.2 |
average_word_embeddings_glove.6B.300d | en_average_word_embeddings_glove.6B.300d | 768 | en | 61.77 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_glove.6B.300d-0.1.2.tar.gz#en_average_word_embeddings_glove.6B.300d-0.1.2 |
average_word_embeddings_komninos | en_average_word_embeddings_komninos | 768 | en | 61.56 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_komninos-0.1.2.tar.gz#en_average_word_embeddings_komninos-0.1.2 |
average_word_embeddings_levy_dependency | en_average_word_embeddings_levy_dependency | 768 | en | 59.22 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_levy_dependency-0.1.2.tar.gz#en_average_word_embeddings_levy_dependency-0.1.2 |
average_word_embeddings_glove.840B.300d | en_average_word_embeddings_glove.840B.300d | 768 | en | 52.54 | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_average_word_embeddings_glove.840B.300d-0.1.2.tar.gz#en_average_word_embeddings_glove.840B.300d-0.1.2 |
quora-distilbert-base | en_quora_distilbert_base | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_quora_distilbert_base-0.1.2.tar.gz#en_quora_distilbert_base-0.1.2 |
quora-distilbert-multilingual | xx_quora_distilbert_multilingual | 768 | 50+ | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_quora_distilbert_multilingual-0.1.2.tar.gz#xx_quora_distilbert_multilingual-0.1.2 |
msmarco-distilroberta-base-v2 | en_msmarco_distilroberta_base_v2 | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_msmarco_distilroberta_base_v2-0.1.2.tar.gz#en_msmarco_distilroberta_base_v2-0.1.2 |
msmarco-roberta-base-v2 | en_msmarco_roberta_base_v2 | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_msmarco_roberta_base_v2-0.1.2.tar.gz#en_msmarco_roberta_base_v2-0.1.2 |
msmarco-distilbert-base-v2 | en_msmarco_distilbert_base_v2 | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_msmarco_distilbert_base_v2-0.1.2.tar.gz#en_msmarco_distilbert_base_v2-0.1.2 |
nq-distilbert-base-v1 | en_nq_distilbert_base_v1 | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_nq_distilbert_base_v1-0.1.2.tar.gz#en_nq_distilbert_base_v1-0.1.2 |
distiluse-base-multilingual-cased-v2 | xx_distiluse_base_multilingual_cased_v2 | 512 | 50+ | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_distiluse_base_multilingual_cased_v2-0.1.2.tar.gz#xx_distiluse_base_multilingual_cased_v2-0.1.2 |
stsb-xlm-r-multilingual | xx_stsb_xlm_r_multilingual | 768 | 50+ | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_stsb_xlm_r_multilingual-0.1.2.tar.gz#xx_stsb_xlm_r_multilingual-0.1.2 |
T-Systems-onsite/cross-en-de-roberta-sentence-transformer | xx_cross_en_de_roberta_sentence_transformer | 768 | en, de | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_cross_en_de_roberta_sentence_transformer-0.1.2.tar.gz#xx_cross_en_de_roberta_sentence_transformer-0.1.2 |
LaBSE | xx_LaBSE | 768 | 109 | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/xx_LaBSE-0.1.2.tar.gz#xx_LaBSE-0.1.2 |
allenai-specter | en_allenai_specter | 768 | en | N/A | pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_allenai_specter-0.1.2.tar.gz#en_allenai_specter-0.1.2 |
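
The install commands in the table all follow one visible pattern: the release tag, then the spaCy model name and the version, repeated after the `#`. As a minimal sketch, the command for any row can be rebuilt from the model name. The helper `standalone_install_command` is hypothetical and not part of this package, and the `0.1.2` default is simply the version that appears in the table.

```python
BASE_URL = "https://github.com/MartinoMensio/spacy-sentence-bert/releases/download"


def standalone_install_command(spacy_model_name: str, version: str = "0.1.2") -> str:
    """Rebuild the pip command shown in the table for one standalone model.

    Hypothetical helper: it only mirrors the URL pattern visible in the table
    and does not check that the release archive actually exists.
    """
    archive = f"{spacy_model_name}-{version}"
    return f"pip install {BASE_URL}/v{version}/{archive}.tar.gz#{archive}"


# Print the command for the en_stsb_roberta_large row
print(standalone_install_command("en_stsb_roberta_large"))
```

This only reproduces the naming pattern; whether an archive exists for a given name and version still has to be checked against the releases page.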
If your model is not in this list (e.g. xlm-r-base-en-ko-nli-ststb), you can still use it with this library, but not as a standalone language model. You will need to add a properly configured pipeline stage (see the nlp.add_pipe API below).
There are different ways to load the sentence-bert models:
- spacy.load API: you need to have installed one of the models from the table above
- spacy_sentence_bert.load_model: you can load one of the models from the table above without installing the standalone packages
- nlp.add_pipe API: you can load any sentence-bert model on top of your nlp object

spacy.load API
Once a standalone model is installed from GitHub (e.g. from the table above, pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_roberta_large-0.1.2.tar.gz#en_stsb_roberta_large-0.1.2), you can load it directly with the spaCy API:
import spacy
nlp = spacy.load('en_stsb_roberta_large')

spacy_sentence_bert.load_model API
You can obtain the same result without installing the standalone model by using this method:
import spacy_sentence_bert
nlp = spacy_sentence_bert.load_model('en_stsb_roberta_large')
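
The two loaders above can be combined: try the installed standalone package first and fall back to `spacy_sentence_bert.load_model` when it is missing. The sketch below is one way to express that. `load_with_fallback` is a hypothetical helper, not part of either library, and it relies on `spacy.load` raising `OSError` when the requested package is not installed. The loaders are passed in as arguments so the helper itself has no dependencies.

```python
from typing import Any, Callable


def load_with_fallback(
    model_name: str,
    primary: Callable[[str], Any],
    fallback: Callable[[str], Any],
) -> Any:
    """Return primary(model_name); if that raises OSError, use fallback instead."""
    try:
        return primary(model_name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed
        return fallback(model_name)


# Intended usage (requires spacy and spacy_sentence_bert to be installed):
# import spacy
# import spacy_sentence_bert
# nlp = load_with_fallback(
#     'en_stsb_roberta_large', spacy.load, spacy_sentence_bert.load_model
# )
```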

nlp.add_pipe API
If you want to use one of the sentence embeddings on top of an existing Language object, you can use the nlp.add_pipe method. This also works for a model that is not listed in the table above. Just make sure that sentence-transformers supports it.
import spacy
nlp = spacy.blank('en')
nlp.add_pipe('sentence_bert', config={'model_name': 'allenai-specter'})
nlp.pipe_names
The first time a model is used, it downloads sentence-BERT to the folder defined by the TORCH_HOME environment variable (default ~/.cache/torch).
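
To keep the downloaded weights somewhere else, the environment variable can be set before the first model is loaded. The sketch below only mirrors the rule stated above (TORCH_HOME if set, otherwise ~/.cache/torch). It does not query torch or sentence-transformers, whose actual lookup may differ between versions. Both the helper name `sentence_bert_cache_dir` and the example path are hypothetical.

```python
import os


def sentence_bert_cache_dir() -> str:
    """Return TORCH_HOME if it is set, otherwise the default ~/.cache/torch."""
    default_dir = os.path.join(os.path.expanduser("~"), ".cache", "torch")
    return os.environ.get("TORCH_HOME", default_dir)


# Assumption: this must run before the first model is loaded.
# The target folder below is a hypothetical example path.
os.environ["TORCH_HOME"] = os.path.join(os.path.expanduser("~"), "models", "torch")
print(sentence_bert_cache_dir())
```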
Once you have loaded a model, use it through spaCy's vector property and similarity method:
# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# get the vector of the Doc, Span or Token
print(doc_1.vector.shape)
print(doc_1[3].vector.shape)
print(doc_1[2:4].vector.shape)
# or use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))
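
spaCy's default similarity is the cosine similarity of the two vectors. The sketch below is a plain-Python illustration of that formula for intuition, not the code spaCy or this package runs. The function name `cosine_similarity` and the zero-vector handling are choices made for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros (a choice made for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two toy vectors that are 45 degrees apart
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

With a loaded model, `cosine_similarity(doc_1.vector, doc_2.vector)` should be close to `doc_1.similarity(doc_2)`, assuming the default spaCy behaviour is not overridden by the pipeline.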

To build and upload
VERSION=0.1.2
# build the standalone models (17)
./build_models.sh
# build the archive at dist/spacy_sentence_bert-${VERSION}.tar.gz
python setup.py sdist
# upload to pypi
twine upload dist/spacy_sentence_bert-${VERSION}.tar.gz