NLP

Before applying NLP, we need to perform the following operations.

Pre-processing of the text
Raw input text cannot be operated on directly; pre-processing helps to capture the context of the input. Pre-processing involves the following seven steps. Once the text has been pre-processed, we can proceed to numerical feature extraction.

1. Sentence tokenisation.
Sentence tokenisation splits the text into sentences at the punctuation. In nltk, the PunktSentenceTokenizer class does this splitting.

from nltk.tokenize import PunktSentenceTokenizer   # nltk.download('punkt') may be needed first

line = "Hello world, I love you. I am waiting for you"
tokenizer = PunktSentenceTokenizer()
tokenizer.tokenize(line)   # ['Hello world, I love you.', 'I am waiting for you']

2. Word tokenisation.
Word tokenisation splits the text into individual words (tokens).

from nltk.tokenize import WordPunctTokenizer

line = "Hello world, I love you. I am waiting for you"
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(line)   # ['Hello', 'world', ',', 'I', 'love', 'you', '.', ...]

3. POS (part of speech) tagging
POS tagging labels each token with its grammatical category (noun, verb, adjective, and so on).

from nltk import pos_tag
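
A minimal sketch of tagging the tokens of the example line; the printed tags are indicative (nltk.download('averaged_perceptron_tagger') may be needed first):

from nltk import pos_tag, word_tokenize

line = "Hello world, I love you. I am waiting for you"
tokens = word_tokenize(line)
print(pos_tag(tokens))
# e.g. [('Hello', 'NNP'), ('world', 'NN'), (',', ','), ('I', 'PRP'), ('love', 'VBP'), ...]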

4. Stemming
Stemming finds the root (stem) of a word by stripping suffixes, for example removing plural endings, and the resulting stems help when building a word cloud.
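
A minimal sketch using nltk's PorterStemmer; the example words are illustrative:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cars", "loves", "waiting", "running"]:
    print(word, "->", stemmer.stem(word))
# cars -> car, loves -> love, waiting -> wait, running -> run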

5. Lemmatization.
Lemmatization also finds the root word (the lemma), for example removing the tense form of a verb, and likewise helps when building a word cloud.
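
A minimal sketch using nltk's WordNetLemmatizer (nltk.download('wordnet') may be needed first); the pos argument tells it the word class:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("waiting", pos="v"))   # wait
print(lemmatizer.lemmatize("loved", pos="v"))     # love
print(lemmatizer.lemmatize("cars"))               # car (default pos is noun)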

6. Removing stop words.
Stop words (a, an, the, ...) do not help in identifying the sentiment, so we need to remove them.
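
A minimal sketch using nltk's English stop word list (nltk.download('stopwords') may be needed first); the sample sentence is illustrative:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
line = "I am waiting for a reply from the customer"
filtered = [w for w in word_tokenize(line) if w.lower() not in stop_words]
print(filtered)   # ['waiting', 'reply', 'customer']
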
7. Word cloud building.
A word cloud is built from the most frequently used words in the input, so we can quickly see the sentiment of the inputs. For example, in a review system, if customer reviews frequently contain "excellent", the reviews are good, because the cloud is generated from word counts. The stop words must be removed first, otherwise they will dominate the word cloud.
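
A minimal sketch assuming the third-party wordcloud and matplotlib packages; the review text is made up for illustration:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

reviews = "excellent food excellent service good staff excellent ambience"
cloud = WordCloud(stopwords={"a", "an", "the"}).generate(reviews)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()   # "excellent" appears largest because it occurs most often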

Numerical feature extraction.
Raw text cannot be fed to mathematical models for training; the models expect numerical inputs. For training we need features and labels expressed as linear-algebra matrices, built using the following techniques.

  1. Word existence feature.
  2. Word frequency feature.
  3. Word proportion feature.
  4. Lexical feature
  5. Lexical diversification feature.
  6. tf-idf (term frequency-inverse document frequency) feature, using machine learning.
  7. t-SNE feature, using deep learning.

Word existence feature
This records whether each word exists in the input. We need to convert the text into numerical format, so the input is converted into a feature matrix and a label vector, as in the worked example and the code sketch below.

I love you, pos
I hate you, neg
I kill you, neg
I love to kill you, neg
I admire you, pos

I    love    you    hate    kill    to    admire
------------------------------------------------
1    1       1      0       0       0     0
1    0       1      1       0       0     0
1    0       1      0       1       0     0
1    1       0      0       1       1     0
1    0       1      0       0       0     1

x = 5x7 matrix, the input (feature) matrix. Its size is m x n.
m -> number of documents.
n -> number of unique words.
each cell -> either 1 or 0:
if the word exists in the document -> 1
if not -> 0

The labels are the sentiments, here pos and neg (encoded as 1 and 0).
y = [1 0 0 0 1]
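
A minimal sketch of building this existence matrix with scikit-learn's CountVectorizer (binary=True keeps only 1/0 existence); note that the column order will be alphabetical rather than the order in the hand-worked table:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love you", "I hate you", "I kill you",
        "I love to kill you", "I admire you"]
y = [1, 0, 0, 0, 1]   # 1 = pos, 0 = neg

# token_pattern is widened so one-letter words such as "I" are kept
vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the n unique words
print(X.toarray())                          # the m x n existence matrix of 1s and 0s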

Word frequency feature
Count how many times each word occurs in the document.
I love you, pos
I hate you, neg
I kill you, neg
I love to kill you, neg
I admire you, pos
I love to love people

I    love    you    hate    kill    to    admire    people
-----------------------------------------------------------
1    1       1      0       0       0     0         0
1    0       1      1       0       0     0         0
1    0       1      0       1       0     0         0
1    1       0      0       1       1     0         0
1    0       1      0       0       0     1         0
1    2       0      0       0       1     0         1

x = 6x8 matrix, the input matrix. Its size is m x n.
m -> number of documents.
n -> number of unique words.
each cell -> the number of times the word occurs in that document:
if the word exists -> its number of occurrences
if not -> 0
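
A minimal sketch of the count matrix with scikit-learn's CountVectorizer (without binary=True each cell is the raw count); column order is alphabetical:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love you", "I hate you", "I kill you",
        "I love to kill you", "I admire you", "I love to love people"]

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")   # keep one-letter tokens
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())   # each cell is the count of the word in that document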

Word proportion feature
Divide each word's count by the total number of words in that document.

I love love love you
I hate you

I      love    hate    you
----------------------------
1/5    3/5     0       1/5
1/3    0       1/3     1/3

each cell -> (count of the word in the document) / (total words in the document)
if the word exists -> its proportion
if not -> 0
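
A minimal sketch: build the count matrix with CountVectorizer, then divide each row by the document's word total; column order is alphabetical (hate, i, love, you):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love love love you", "I hate you"]

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs).toarray()

# divide each row by the total number of words in that document
proportions = counts / counts.sum(axis=1, keepdims=True)
print(vectorizer.get_feature_names_out())
print(proportions)   # row 1: 0, 0.2, 0.6, 0.2   row 2: 1/3, 1/3, 0, 1/3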

Lexical feature
The lexical feature expresses the word proportion as a percentage of the document length (proportion x 100).

I love love love you
I hate you

I            love         hate         you
--------------------------------------------
(1/5)*100    (3/5)*100    0            (1/5)*100
(1/3)*100    0            (1/3)*100    (1/3)*100

I     love    hate    you
---------------------------
20    60      0       20
33    0       33      33
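
A minimal sketch, reusing the proportion computation and scaling it to percentages; column order is alphabetical (hate, i, love, you):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love love love you", "I hate you"]
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs).toarray()

# lexical feature: word proportion expressed as a percentage of the document length
lexical = 100 * counts / counts.sum(axis=1, keepdims=True)
print(lexical.round())   # rows roughly [0, 20, 60, 20] and [33, 33, 0, 33]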

Lexical diversification feature
Here each cell is the word's count in the document divided by its total count across all documents, expressed as a percentage.

I love love love you
I love you hate you

I            love         hate         you
--------------------------------------------
(1/2)*100    (3/4)*100    0            (1/3)*100
(1/2)*100    (1/4)*100    (1/1)*100    (2/3)*100

I     love    hate    you
---------------------------
50    75      0       33
50    25      100     67
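
The notes do not state the formula explicitly; a minimal sketch assuming each cell is the word's count in the document divided by its total count across all documents, times 100 (column order alphabetical: hate, i, love, you):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love love love you", "I love you hate you"]
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs).toarray()

# assumed formula: share of each word's total corpus occurrences that falls in this document
diversification = 100 * counts / counts.sum(axis=0, keepdims=True)
print(vectorizer.get_feature_names_out())
print(diversification.round())   # rows roughly [0, 50, 75, 33] and [100, 50, 25, 67]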

Machine Learning – Simple problem

ML+NLP

  1. Supervised + NLP = sentiment analysis, spam detection (a minimal sketch follows the list below)
  2. Unsupervised + NLP = topic categorisation
  3. Reinforcement + NLP = chatbots
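
As an illustration of item 1, a minimal sketch of supervised sentiment analysis on the toy reviews from earlier, assuming scikit-learn's CountVectorizer and a MultinomialNB classifier (the classifier choice is an assumption, not something these notes specify):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["I love you", "I hate you", "I kill you",
        "I love to kill you", "I admire you"]
y = [1, 0, 0, 0, 1]   # 1 = pos, 0 = neg

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

model = MultinomialNB().fit(X, y)
print(model.predict(vectorizer.transform(["I admire you", "I hate you"])))
# likely [1 0] on this toy data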

Chatbots have two types:

  1. Rule based – Fixed questions
  2. Dynamic based – AI

Deep Learning – Complex problem