Below is a job search example:
I use pretrained models from huggingface transformers.
Manually download pretrained tokenizer and t5/bert model into local directories.You can check models here.
I use 't5-small' model, check here and click List all files in model
to download files.
Note the manually downloaded files directory struture.
You could use other T5 or Bert models.
If you download other models, check hugaface transformers pretraied model list to check the model name.
$ export TOKEN_DIR=path_to_your_tokenizer_directory/tokenizer
$ export MODEL_DIR=path_to_your_model_directory/model
$ export MODEL_NAME=t5-small #or other model you downloaded
$ export INDEX_NAME=docsearch
$ docker-compose up --build
I also use docker system prune
to remove all unused containers, networks and images to get more memory. Increase your docker memory (I use 8GB
) if you encounter Container exits with non-zero exit code 137
error.
We use dense vector datatype to save the extracted features from pretrained NLP models (t5 or bert here, but you can add your interested pretrained models by yourself)
{
...
"text_vector": {
"type": "dense_vector",
"dims": 512
}
...
}
Dimensions dims:512
is for T5 models. Change dims
to 768 if you use Bert models.
Read doc from mysql and convert document into correct json format to bulk into elasticsearch.
$ cd index_files
$ pip install -r requirements.txt
$ python indexing_files.py
# or you can customize your parameters
# $ python indexing_files.py --index_file='index.json' --index_name='docsearch' --data='documents.jsonl'
Go to http://127.0.0.1:5000.
The key code for use pretrained model to extract features is get_emb
function in ./index_files/indexing_files.py
and ./web/app.py
files.
def get_emb(inputs_list,model_name,max_length=512):
if 't5' in model_name: #T5 models, written in pytorch
tokenizer = T5Tokenizer.from_pretrained(TOKEN_DIR)
model = T5Model.from_pretrained(MODEL_DIR)
inputs = tokenizer.batch_encode_plus(inputs_list, max_length=max_length, pad_to_max_length=True,return_tensors="pt")
outputs = model(input_ids=inputs['input_ids'], decoder_input_ids=inputs['input_ids'])
last_hidden_states = torch.mean(outputs[0], dim=1)
return last_hidden_states.tolist()
elif 'bert' in model_name: #Bert models, written in tensorlow
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = TFBertModel.from_pretrained('bert-base-multilingual-cased')
batch_encoding = tokenizer.batch_encode_plus(["this is","the second","the thrid"], max_length=max_length, pad_to_max_length=True)
outputs = model(tf.convert_to_tensor(batch_encoding['input_ids']))
embeddings = tf.reduce_mean(outputs[0],1)
return embeddings.numpy().tolist()
You can change the code and use your favorite pretrained model. For example, you can use GPT2 model.
you can also customize your elasticsearch by using your own score function instead of cosineSimilarity
in .webapp.py
.
This rep is modified based on Hironsan/bertsearch, which use bert-serving
packages to extract bert features. It limits to TF1.x