Text embedder is a machine learning task that is used to create a vector representation of a piece of text. This vector can then be used as input to a machine learning algorithm. The goal of text embedding is to capture the meaning of the text in a way that is suitable for machine learning.
There are many different ways to create text embeddings, but the most common is to use a neural network. A neural network is a machine learning algorithm that is very good at learning complex relationships. The input to a neural network is a vector, and the output is a vector of the same size. The neural network learns to map the input vectors to the output vectors in a way that captures the relationships between the inputs and outputs.
To create text embedding, the neural network is first trained on a large corpus of text. The training data is a set of sentences, and each sentence is represented as a vector. The vectors are created by taking the word vectors of the words in the sentence and summing them together. The neural network is then trained to map the sentence vectors to a fixed vector size.
Once the neural network has been trained, it can then be used to create text embeddings for new pieces of text. The new text is first represented as a vector, and then the neural network is used to map the vector to the fixed vector size. The result is a text embedding that captures the meaning of the text.
Text embeddings can be used for a variety of machine learning tasks. For example, they can be used to improve the performance of a machine learning algorithm that is used to classify texts. Text embeddings can also be used to find similar pieces of text, or to cluster texts together.
There are many different ways to create text embeddings, and the choice of method will depend on the application. However, neural networks are a powerful and widely used method for creating text embeddings.
Co:here is a powerful neural network, which can generate, embed, and classify text. In this tutorial, we will use Co:here to embed descriptions. To use Co:here you need to create account on Co:here and get API key.
We will be programming in Python, so we need to install cohere
library by pip
pip install cohere
Firstly, we have to implement cohere.Client
. In arguments of Client should be API key, which you have generated before, and version 2021-11-08
. I will create the class CoHere
, it will be useful in the next steps.
class CoHere:
def __init__(self, api_key):
self.co = cohere.Client(f'{api_key}', '2021-11-08')
self.examples = []
The main part of each neural network is a dataset. In this tutorial, I will use a dataset that includes 1000 descriptions of 10 classes. If you want to use the same, you can download it here.
The downloaded dataset has 10 folders in each folder is 100 files.txt
with descriptions. The name of files is a label of description, e.g.sport_3.txt
.
We will compare Random Forest
with Co:here Classifier
, so we have to prepare data in two ways. For Random Forest
we will use Co:here Embedder
, we will focus on this in this tutorial. Cohere classifier requires samples, in which each sample should be designed as a list [description, label]
and I did it in my previous tutorial (here)
In the beginning, we need to load all data, to do that. We create the function load_examples
. In this function we will use three external libraries:
os.path
to go into the folder with data. The code is executed in a path where is a python's file.py
. This is an internal library, so we do not need to install it.
numpy
this library is useful to work with arrays. In this tutorial, we will use it to generate random numbers. You have to install this library by pip pip install numpy
.
glob
helps us to read all files and folder names. This is an external library, so the installation is needed - pip install glob
.
The downloaded dataset should be extracted in the folder data
. By os.path.join
we can get universal paths of folders.
folders_path = os.path.join('data', '*')
In windows, a return is equal to data*
.
Then we can use glob
method to get all names of folders.
folders_name = glob(folders_path)
folders_name
is a list, which contains window paths of folders. In this tutorial, these are the names of labels.
['data\business', 'data\entertainment', 'data\food', 'data\graphics', 'data\historical', 'data\medical', 'data\politics', 'data\space', 'data\sport', 'data\technologie']
Size of Co:here
training dataset can not be bigger than 50 examples and each class has to have at least 5 examples, but for Random Forest
we can use 1000 examples. With loop for
we can get the names of each file. The entire function looks like that:
import os.path
from glob import glob
import numpy as np
def load_examples(no_of_ex):
examples_path = []
folders_path = os.path.join('data', '*')
folders_name = glob(folders_path)
for folder in folders_name:
files_path = os.path.join(folder, '*')
files_name = glob(files_path)
for i in range(no_of_ex // len(folders_name)):
random_example = np.random.randint(0, len(files_name))
examples_path.append(files_name[random_example])
return examples_path
The last loop is taking randomly N-paths of each label and appending them into a new list examples_path
.
Now, we have to create a training set. To make it we will load examples with load_examples()
. In each path is the name of a class, we will use it to create samples. Descriptions need to be read from files, a length can not be long, so in this tutorial, the length will be equal to 100. To list texts
is appended list of [descroption, class_name]
. Thus, a return is that list.
def examples(no_of_ex):
texts = []
examples_path = load_examples(no_of_ex)
for path in examples_path:
class_name = path.split(os.sep)[1]
with open(path, 'r', encoding="utf8") as file:
text = file.read()[:100]
texts.append([text, class_name])
return texts
We back to CoHere
class. We have to add one method - to embed examples.
The second cohere
method is to embed text. The method has serval arguments, such as:
model
size of a model.
texts
list of texts to embed.
truncate
if the text is longer than the available token, which part of the text should be taken LEFT
, RIGHT
or NONE
.
All of them you can find here.
In this tutorial, the cohere
method will be implemented as a method of our CoHere
class.
def embed(self, no_of_ex):
# as a good developer we should split the dataset.
data = pd.DataFrame(examples(no_of_ex))
self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
list(data[0]), list(data[1]), test_size=0.2, random_state=0)
# in the next two lines we create a numeric form of X_train data
self.X_train_embeded = self.co.embed(texts=X_train,
model="large",
truncate="LEFT").embeddings
self.X_test_embeded = self.co.embed(texts=X_test,
model="large",
truncate="LEFT").embeddings
X_train_embeded
will be an array of numbers, which looks like that:
[ 386, 0.39653537, -0.409076, 0.5956299, -0.06624506, 2.0539167, 0.7133603,...
To create an application, which will be comparison of two likelihood displays, we will use Stramlit
. This is an easy and very useful library.
Installation
pip install streamlit
We will need text inputs for co:here
API key.
In docs of streamlit we can find methods:
st.header()
to make a header on our app
st.test_input()
to send a text request
st.button()
to create button
st.write()
to display the results of cohere model.
st.progress()
to display a progrss bar
st.column()
to split an app
st.header("Co:here Text Classifier vs Random Forest")
api_key = st.text_input("API Key:", type="password")
cohere = CoHere(api_key)
cohere.list_of_examples(50) # number of examples for Cohere classifier
# showed in the previous tutorial
cohere.embed(1000) # number of examples for random forest
# initialization of random forest with sklearn library
forest = RandomForestClassifier(max_depth=10, random_state=0)
col1, col2 = st.columns(2)
if col1.button("Classify"):
# training process of random forest, to do it we use embedded text.
forest.fit(cohere.X_train_embeded, cohere.y_train)
# prediction process of random forest
predict = forest.predict_proba(np.array(cohere.X_test_embeded[0]).reshape(1, -1))[0]
here = cohere.classify([cohere.X_test[0]])[0] # prediction process of cohere classifier
col2.success(f"Correct prediction: {cohere.y_test[0]}") # display original label
col1, col2 = st.columns(2)
col1.header("Co:here classify") # predictions for cohere
for con in here.confidence:
col1.write(f"{con.label}: {np.round(con.confidence*100, 2)}%")
col1.progress(con.confidence)
col2.header("Random Forest") # predictions for random forest
for con, pred in zip(here.confidence, predict):
col2.write(f"{con.label}: {np.round(pred*100, 2)}%")
col2.progress(pred)
To run the streamlit app use command
streamlit run name_of_your_file.py
The created app looks like this
Text embedding is a powerful tool that can be used to improve the performance of machine learning algorithms. Neural networks are a widely used and effective method for creating text embeddings. Text embeddings can be used for tasks such as text classification, text similarity, and text clustering.
In this tutorial, we compare the Random Forest
with Co:here Classifier
, but the possibilities of Co:here Embedder
are huge. You can build a lot of stuff with it.
Stay tuned for future tutorials! The repository of this code can check here.
Thank you! - Adrian Banachowicz, Data Science Intern in New Native