Techniques To Convert Words To Vectors
Different techniques to convert words to vectors
In Machine Learning, we want to convert words into vectors so that models can apply mathematical operations to text.
There are various techniques for converting words into vectors. We will look at those techniques in this article.
Below are a few techniques that convert words to vectors:
Input Text
Let's say we have 4 sentences which we want to convert to vectors:
1. This headphone is amazing
2. Noise Cancelling in this headphone is really awesome
3. This headphone is not good
4. This headphone is TrulyWireless and works as described in the description
Bag of Words (BoW)
This is the simplest way to convert text to a vector. The basic idea of this technique is to store all the unique words in a list and, for each sentence, create a vector whose length equals the number of unique words. This produces a sparse vector in which most of the values are zeros.
From our above example, our Bag of unique words contains:
{This, headphone, is, amazing, Noise, Cancelling, in, really, awesome, not, good, TrulyWireless, and, works, as, described, the, description}
Here we have 18 unique words in the bag, so we will represent each of the sentences as a vector of length 18.
Representation of sentence 1 -> This headphone is amazing -> [1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
Similarly other sentences can be represented as,
Noise Cancelling in this headphone is really awesome -> [1,1,1,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0]
This headphone is not good -> [1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0]
The same process is repeated for the remaining sentence.
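As a minimal sketch, the Bag of Words idea can be implemented in pure Python. The example below uses only the three sentences quoted above and lowercases the text so that 'This' and 'this' count as the same word:

```python
# Minimal Bag of Words sketch: build a vocabulary in order of first
# appearance, then represent each sentence by word counts over that vocabulary.
sentences = [
    "This headphone is amazing",
    "Noise Cancelling in this headphone is really awesome",
    "This headphone is not good",
]

def build_vocab(sentences):
    vocab = []
    for s in sentences:
        for word in s.lower().split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def bow_vector(sentence, vocab):
    # Count how many times each vocabulary word occurs in the sentence.
    words = sentence.lower().split()
    return [words.count(word) for word in vocab]

vocab = build_vocab(sentences)          # 11 unique words for these 3 sentences
print(bow_vector(sentences[0], vocab))  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

In practice a library such as scikit-learn's `CountVectorizer` does this for you; note that it orders the vocabulary alphabetically, so its vectors won't match the hand-worked ordering above.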
TF-IDF Vectorizer
Bag of Words simply converts each word into a vector without considering the importance of the word. TF-IDF takes the frequency of a word into consideration.
TF (Term Frequency):
How often a word occurs in a document (here, a sentence): the count of the word divided by the total number of words in the sentence. If a word occurs multiple times in the sentence, its TF will be high.
IDF (Inverse Document Frequency):
How rare a word is across the corpus: the log of the total number of documents divided by the number of documents containing the word. If a word is rare, its IDF will be high.
By combining TF and IDF, we give importance to words that are rare in the corpus but occur frequently in the current sentence.
From our above example, our unique words are:
{This, headphone, is, amazing, Noise, Cancelling, in, really, awesome, not, good, TrulyWireless, and, works, as, described, the, description}
To compute the TF-IDF of a word in Sentence 1, consider the word 'This':
TF(‘This’) : 1/4
IDF(‘This’): log(4/4)
TF-IDF(‘This’)=TF(‘This’)*IDF(‘This’)=(1/4)*(log(1))=0
Here the word 'This' occurs in every sentence of the corpus, so it is not an important word, and hence the TF-IDF of 'This' is 0.
So, we will update the TF-IDF representation of Sentence 1 -> This headphone is amazing ->
[TF-IDF(‘This’),TF-IDF(‘headphone’),TF-IDF(‘is’),TF-IDF(‘amazing’),0,0,0,0,0,0,0,0,0,0,0,0,0,0]
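The same computation can be sketched in plain Python using the formulas above (TF = count / sentence length, IDF = log(number of sentences / number of sentences containing the word)). Note that library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant, so their numbers will differ slightly. This sketch uses only the three sentences quoted verbatim above:

```python
import math

sentences = [
    "This headphone is amazing",
    "Noise Cancelling in this headphone is really awesome",
    "This headphone is not good",
]

def tf(word, sentence):
    # Term frequency: occurrences of the word divided by the sentence length.
    words = sentence.lower().split()
    return words.count(word) / len(words)

def idf(word, sentences):
    # Inverse document frequency: log of (total sentences / sentences containing the word).
    df = sum(1 for s in sentences if word in s.lower().split())
    return math.log(len(sentences) / df)

def tf_idf(word, sentence, sentences):
    return tf(word, sentence) * idf(word, sentences)

print(tf_idf("this", sentences[0], sentences))     # 0.0 -- 'this' appears in every sentence
print(tf_idf("amazing", sentences[0], sentences))  # (1/4) * log(3) ≈ 0.2747
```

As in the worked example, 'this' gets a TF-IDF of 0 because it appears in every sentence, while 'amazing' gets a positive weight because it is rare in the corpus.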
Word2Vec
Word2Vec considers the semantic meaning of words and generates a dense vector for each word, taking into account the context in which words appear and the relations between them. The distance between vectors of similar words is small. For example, the offset between the vectors Word2Vec generates for King and Queen will be similar to the offset between the vectors for Man and Woman. Word2Vec learns these vectors from a large document corpus.
Conclusion:
Using the above techniques, we can convert words into vectors, on which we can apply transformations that are useful for machine learning.
I will try to write a more elaborate blog post on each of these techniques.
Originally published at https://cskuracha.github.io on April 3, 2021.