Parts of Speech Tagger
First project of the ML internship (Xcelerator)
Submitted by- Akshay Bhoju Kothari
Dhanush Shetty
H.K. Nakul
Machine Learning
Definition-
Machine learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being
explicitly programmed.
Applications
1. Virtual Personal Assistants. Siri, Alexa, Google Now are some of the popular examples of virtual personal assistants.
2. Social Media Services (e.g. Facebook).
3. Email Spam and Malware Filtering.
4. Online Customer Support.
5. Search Engine Result Refining.
6. Product Recommendations.
Challenges faced-
1. Most of the challenges we faced were in extracting the right features.
2. During the training phase we initially got low accuracy.
3. As we are beginners in Python coding, some parts felt difficult.
4. Understanding the parts of speech themselves.
Feature Extraction
def Feature_Extraction(sentence, i):  # feature extraction for the word at position i
    features = {
        'Token': sentence[i],                                     # the word itself
        'first_word': i == 0,                                     # is it the first word?
        'capitalized': sentence[i][0].upper() == sentence[i][0],  # starts with a capital letter
        'All_capitalized': sentence[i].upper() == sentence[i],    # entirely in capitals
        'numeric': sentence[i].isdigit(),                         # is it a number?
        'prev-word': '' if i == 0 else sentence[i - 1],           # previous word, if any
        'suffix(1)': sentence[i][-1],                             # last character
        'suffix(2)': '' if len(sentence[i]) < 2 else sentence[i][-2:],  # last two characters
        'suffix(3)': '' if len(sentence[i]) < 3 else sentence[i][-3:],  # last three characters
        'prefix(1)': sentence[i][0]}                              # first character
    return features
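For example, here is a minimal sketch of what this function returns (the three-word sentence is only an illustrative assumption, not taken from the corpus):

sentence = ['The', 'Dog', 'barked']  # illustrative example sentence
print(Feature_Extraction(sentence, 1))
# -> {'Token': 'Dog', 'first_word': False, 'capitalized': True,
#     'All_capitalized': False, 'numeric': False, 'prev-word': 'The',
#     'suffix(1)': 'g', 'suffix(2)': 'og', 'suffix(3)': 'Dog', 'prefix(1)': 'D'}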
How have we solved our problem-
1. We did a lot of research on identifying the proper features.
2. We read materials and referred to resources on the Xcelerator portal about machine learning and Python coding.
3. We referred to many websites and online learning platforms like Coursera and NPTEL.
4. We chose a proper algorithm to improve efficiency.
5. We discussed within our group to enhance our knowledge.
Importing and downloading necessary libraries and dataset.
import nltk  # importing and downloading necessary libraries and dataset
nltk.download('brown')
nltk.download('tagsets')
nltk.download('universal_tagset')
from nltk.corpus import brown

lines = brown.sents(categories='news')
feature = []
for sentence in lines:
    for i, word in enumerate(sentence):
        feature.append(Feature_Extraction(sentence, i))
# untag the tagged sentences, then pair every word's features with its tag;
# featureset is the labelled data we will use for training and testing
tagged_sents = brown.tagged_sents(categories='news', tagset='universal')
featureset = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featureset.append((Feature_Extraction(untagged_sent, i), tag))

size = int(len(featureset) * 0.1)  # hold out 10% of the labelled examples
train_set, test_set = featureset[size:], featureset[:size]  # 90% to train, 10% to test
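As a quick sanity check (our own addition, not part of the original steps), printing the sizes confirms that roughly 90% of the labelled words go to training and 10% to testing:

print(len(featureset), len(train_set), len(test_set))
# exact counts depend on the corpus version, but train:test is always about 9:1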
Classifier-
classifier = nltk.NaiveBayesClassifier.train(train_set)
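To see which features the trained classifier relies on most, NLTK's Naive Bayes classifier can list its most discriminative features; this inspection step is an extra suggestion of ours, not part of the original pipeline:

classifier.show_most_informative_features(10)  # top 10 features by likelihood ratio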
Evaluation using accuracy-
print(classifier.classify(Feature_Extraction(brown.sents()[0], 9)))  # predicted tag for the word 'of'
print(Feature_Extraction(brown.sents()[0], 9))
accuracy = nltk.classify.accuracy(classifier, test_set)
print(accuracy)  # we get roughly 85% accuracy
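To tag a whole unseen sentence, the classifier can be wrapped in a small helper; tag_sentence and the example sentence below are our own illustrative assumptions, not part of the original code:

def tag_sentence(sentence):
    # classify each word with the same features used during training
    return [(word, classifier.classify(Feature_Extraction(sentence, i)))
            for i, word in enumerate(sentence)]

print(tag_sentence(['The', 'jury', 'praised', 'the', 'city', 'officials']))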
Naive Bayes Classifier-
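For each word, a Naive Bayes classifier picks the tag t that maximises P(t) * P(f1|t) * P(f2|t) * ... * P(fn|t), where f1, ..., fn are the extracted features. The "naive" part is the assumption that the features are conditionally independent given the tag; the probabilities are estimated simply by counting over the training set, which is why training is fast even on tens of thousands of examples.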
Future Enhancements-
1. We will be able to correct grammatical errors in a sentence.
2. We will be able to do chunking and parsing of text (see the sketch after this list).
3. This can also be used in chatbots as part of a larger model.
4. By adding some extra features we can turn this model into a sentiment analyser.
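As a sketch of enhancement 2, NLTK's RegexpParser can chunk noun phrases out of the tagger's output; the grammar and the hand-tagged example sentence here are illustrative assumptions, not part of our model:

tagged = [('The', 'DET'), ('quick', 'ADJ'), ('fox', 'NOUN'),
          ('jumped', 'VERB'), ('over', 'ADP'), ('the', 'DET'),
          ('fence', 'NOUN')]
grammar = 'NP: {<DET>?<ADJ>*<NOUN>}'  # noun phrase = optional determiner, adjectives, noun
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))  # prints a tree with the two NP chunks bracketed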
References-
1. http://www.nltk.org/book/ch06.html#ref-document-classify-all-words (NLTK Book, Chapter 6)
2. Resources available on the Xcelerator portal.
3. https://docs.python.org/3/library/stdtypes.html (Python documentation)
THANK YOU