Text Processing / Data Processing, Bag of Words, TFIDF-
We all know that the language of computers is numerical, so the very first step that comes to our mind is to convert our language into numbers. This conversion takes a few steps to happen.
The first step is Text Normalisation. Since human languages are complex, we first need to simplify them in order to make sure that understanding becomes possible. Text Normalization helps in cleaning up the textual data in such a way that its complexity comes down to a level lower than that of the actual data.
Text Normalization-
The original form of text is not suitable for a machine to process because of the many unnecessary words and symbols in it. The process of simplifying the text to make it suitable for machine processing is called text normalization. In this process we remove unnecessary text and break the remaining text down into smaller tokens. The entire text that comes from all the documents to be processed by a machine is called the corpus. Text Normalization is a process to reduce the variations in a text's word forms to a common form. It simplifies the text for further processing.
Raw text from the user → Text Normalization process → Normalized output
1. Sentence Segmentation----
In this first stage, all the text in the corpus is broken down into sentences. Each sentence is treated as a separate string of characters to be processed.
Example- "Aman went to a therapist. Anil went to download a health chatbot." is segmented into two sentences: "Aman went to a therapist" and "Anil went to download a health chatbot".
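A minimal sketch of sentence segmentation in Python, using only a simple rule (split wherever a full stop, question mark or exclamation mark is followed by a space). The helper name segment_sentences is just illustrative; real NLP toolkits handle abbreviations and other edge cases far better.

```python
import re

def segment_sentences(text):
    # Split at ., ? or ! followed by whitespace, then drop empty pieces.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p.strip(" .?!") for p in parts if p.strip(" .?!")]

text = "Aman went to a therapist. Anil went to download a health chatbot."
print(segment_sentences(text))
# ['Aman went to a therapist', 'Anil went to download a health chatbot']
```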
2. Tokenisation---
In this step, each sentence is further broken into individual pieces of text called tokens. A token can be a word, a number or a symbol (punctuation, special character, etc.).
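A small tokenisation sketch: a regular expression keeps words and numbers together and turns every other non-space character (punctuation, special characters) into its own token. The function name tokenize is illustrative.

```python
import re

def tokenize(sentence):
    # \w+ matches a run of word characters; [^\w\s] matches any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("Aman and Anil are stressed!"))
# ['Aman', 'and', 'Anil', 'are', 'stressed', '!']
```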
3. Eliminating Stop Words, Special Characters and Numbers—
There are certain tokens which occur many times in the corpus. Mostly these are auxiliary verbs (is, are, was), punctuation, prepositions (on, at, in), articles (a, an, the) and other such words like such, there, them, or, and, etc. All these words are called stop words because they add unnecessary processing effort without adding much meaning.
Note- if the corpus includes documents of financial transactions, then numbers are not stop words there.
Some examples of stop words are: a, an, the, is, are, was, and, or, to, on, at, in, such, there, them.
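A toy stop-word filter as a sketch. The stop-word set here is only illustrative (libraries such as NLTK ship much larger, language-specific lists), and numbers and special characters are dropped too; for a corpus of financial transactions you would keep the numbers.

```python
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "on", "at", "in",
              "and", "or", "to", "such", "there", "them"}

def remove_stop_words(tokens):
    # Keep a token only if it is purely alphabetic (drops numbers,
    # punctuation and special characters) and is not a stop word.
    return [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]

print(remove_stop_words(["Aman", "and", "Anil", "are", "stressed", "!"]))
# ['Aman', 'Anil', 'stressed']
```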
4.) Converting Text to a Common Case-:
Convert all the tokens into a common text case, preferably lower case. This eliminates the differences in the interpretation of the same token, such as Truth, truth and TRUTH.
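In Python this step is a one-line transformation over the token list:

```python
# Convert every token to lower case so "Truth", "truth" and "TRUTH"
# are all treated as the same token.
tokens = ["Truth", "truth", "TRUTH"]
tokens = [t.lower() for t in tokens]
print(tokens)   # ['truth', 'truth', 'truth']
```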
5.) Stemming-:
Certain words have affixes (letters that appear at the end of a word). In stemming these affixes are removed to keep only the root word. Some examples are-
Hours > Hour, Eating > Eat
Stemming does not take into account whether the stemmed word is meaningful or not. It just removes the affixes, hence it is faster.
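A toy suffix-stripping stemmer as a sketch, assuming a small fixed list of suffixes. Real stemmers (for example NLTK's PorterStemmer) apply many more rules, but the idea is the same: chop the affix off without checking that the result is a meaningful word.

```python
SUFFIXES = ["ing", "ly", "es", "ed", "s"]

def stem(word):
    # Remove the first matching suffix, leaving at least a 3-letter stem.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("hours"), stem("eating"), stem("studies"))
# hour eat studi   <- "studi" is not a meaningful word, which stemming allows
```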
6. Lemmatization-:
Lemmatization is the process of stemming as well as converting the stemmed word to its proper form to keep it meaningful. A word which is stemmed and converted to its meaningful form is called a lemma.
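A toy lemmatiser sketch: look the inflected form up in a small hand-made dictionary and return the proper root (the lemma). The dictionary here is purely illustrative; real lemmatisers (for example NLTK's WordNetLemmatizer) use a full vocabulary and part-of-speech information.

```python
LEMMA_DICT = {"went": "go", "studies": "study", "better": "good",
              "eating": "eat", "hours": "hour"}

def lemmatize(word):
    # Fall back to the lower-cased word itself if it is not in the dictionary.
    return LEMMA_DICT.get(word.lower(), word.lower())

print(lemmatize("went"), lemmatize("studies"))
# go study
```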
We have now normalised our text into tokens, which are the simplest form of words. It is time to convert the tokens into numbers. For extracting features of the text, we need to convert it into a suitable numeric form. This is done with the help of various algorithms:
1. Bag of Words
2. Term Frequency-Inverse Document Frequency (TFIDF)
Bag of Words (BoW)---
The Bag of Words algorithm is used to extract two features of the text in the corpus: vocabulary and frequency. Vocabulary refers to the unique words identified in the corpus, and frequency is the number of occurrences of each word in the corpus.
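As a small sketch, both features can be read off with Python's collections.Counter. The token list below is just an illustrative, already-normalised sample, not a full corpus.

```python
from collections import Counter

corpus_tokens = ["aman", "and", "anil", "are", "stressed",
                 "aman", "went", "to", "a", "therapist"]

frequency = Counter(corpus_tokens)   # word -> number of occurrences
vocabulary = sorted(frequency)       # the unique words in the corpus

print(vocabulary)
print(frequency["aman"])             # 2
```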
Here is the step-by-step approach to implement the Bag of Words algorithm:
1. Text Normalisation: Collect the data and pre-process it.
2. Create Dictionary: Make a list of all the unique words occurring in the corpus (the vocabulary).
3. Create document vectors: For each document in the corpus, find out how many times each word from the unique list of words has occurred.
4. Create document vectors for all the documents.
Let us go through all the steps with an example:
Step 1: Collecting data and pre-processing it.
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot
Step 2: Create Dictionary
List down all the unique words which occur in the three documents:
Dictionary: aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
Step 3: Create document vectors—
In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches a word in the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. And if the word does not occur in that document, put a 0 under it.
Step 4: Repeat for all documents
The same exercise has to be done for all the documents. Hence, the table becomes:

              aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1:    1     1    1     1      1        0    0  0      0          0        0       0
Document 2:    1     0    0     0      0        1    1  1      1          0        0       0
Document 3:    0     0    1     0      0        1    1  1      0          1        1       1

In this table, the header row contains the vocabulary of the corpus and the three rows correspond to the three different documents. Take a look at this table and analyse the positioning of 0s and 1s in it.
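A minimal sketch of how the document vector table above can be built in Python for the three example documents. Names like tokenized and vectors are only illustrative; stop words are kept so the result matches the dictionary listed above.

```python
documents = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# Normalise: lower-case each document and split it into tokens.
tokenized = [doc.lower().split() for doc in documents]

# Dictionary / vocabulary: every unique word, in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Document vectors: count how often each vocabulary word occurs per document.
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vec in vectors:
    print(vec)
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
```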
Finally, this gives us the document vector table for our corpus. However, these raw counts do not tell us how valuable each word is to a document. This leads us to the final step of our algorithm: TFIDF.
TFIDF: Term Frequency & Inverse Document Frequency---
In this graph we plot the occurrence of words versus their value. As we can see, the words with the highest occurrence across all the documents of the corpus have negligible value; hence they are termed stop words. These words are removed at the pre-processing stage. Moving ahead from the stop words, the occurrence level drops drastically, and the words which occur a sufficient number of times in the corpus and carry some amount of value are termed frequent words. As the occurrence of words drops further, the value of such words rises. These words occur the least but add the most value to the corpus.
Term Frequency—
Term frequency is the frequency of a word in one document. Term frequency can easily be found from the document vector table, as that table mentions the frequency of each word of the vocabulary in each document.
Document Frequency---
Document Frequency is the number of documents in which the word occurs irrespective of how many
times it has occurred in those documents.
Inverse Document Frequency---
To get the inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator. Here, the total number of documents is 3, hence for any word W the inverse document frequency becomes:
IDF(W) = 3 / DF(W)
Finally, the formula of TFIDF for any word W becomes: TFIDF(W) = TF(W) * log( IDF(W) ).
Here, log is to the base of 10.
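Using the three example documents above (stop words kept, as in the document vector table), a few worked values:
- "aman" occurs in 2 of the 3 documents, so DF(aman) = 2, IDF(aman) = 3/2, and TFIDF(aman) in Document 1 = 1 * log(1.5) ≈ 0.18.
- "chatbot" occurs in only 1 document, so DF(chatbot) = 1, IDF(chatbot) = 3, and TFIDF(chatbot) in Document 3 = 1 * log(3) ≈ 0.48.
- If a word occurred in all 3 documents, its IDF would be 3/3 = 1 and its TFIDF would be TF * log(1) = 0, which is why the most common words carry the least value.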
Now, let's multiply the IDF values by the TF values. Note that the TF values are for each document, while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values by each row of the document vector table.
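A minimal sketch that computes the full TFIDF table for the example corpus directly from the formula above (log to base 10). The helper name tfidf is illustrative.

```python
import math

documents = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary: every unique word, in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

n_docs = len(documents)

def tfidf(word, tokens):
    tf = tokens.count(word)                          # term frequency in this document
    df = sum(1 for doc in tokenized if word in doc)  # document frequency in the corpus
    return tf * math.log10(n_docs / df)              # TFIDF(W) = TF(W) * log(IDF(W))

for doc, tokens in zip(documents, tokenized):
    print(doc)
    print([round(tfidf(word, tokens), 2) for word in vocabulary])
```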
Finally, the words have been converted to numbers. These numbers are the value of each word for each document.
Summarising the concept, we can say that:
1. Words that occur in all the documents with high term frequencies have the least values and are considered to be the stop words.
2. For a word to have a high TFIDF value, the word needs to have a high term frequency but a low document frequency, which shows that the word is important for one document but is not common across all documents.
3. These values help the computer understand which words are to be considered while processing the
natural language. The higher the value, the more important the word is for a given corpus.