Big Data Analytics [18CS72]

MODULE 5
Chapter 1: Text Mining
Text mining is the art and science of discovering knowledge, insights and patterns from an
organized collection of textual databases. Text mining can help with frequency analysis of
important terms and their semantic relationships.

Text is an important part of the growing data in the world. Social media technologies have
enabled users to become producers of text and images and other kinds of information. Text
mining can be applied to large-scale social media data for gathering preferences, and
measuring emotional sentiments. It can also be applied at societal, organizational and
individual scales.

1.1 Text Mining Applications


Text mining is a useful tool in the hands of chief knowledge officers to extract knowledge
relevant to an organization. Text mining can be used across industry sectors and application
areas, including decision support, sentiment analysis, fraud detection, survey analysis, and
many more.

1. Marketing: The voice of the customer can be captured in its native and raw format and
then analyzed for customer preferences and complaints.
1. Social personas are a clustering technique to develop customer segments of
interest. Consumer input from social media sources, such as reviews, blogs, and
tweets, contains numerous leading indicators that can be used for anticipating
and predicting consumer behavior.
2. A ‘listening platform’ is a text mining application that, in real time, gathers social
media, blogs, and other textual feedback, and filters out the chatter to extract true
consumer sentiment. The insights can lead to more effective product marketing
and better customer service.

2. Customer service: Customer call center conversations and records can be analyzed for
patterns of customer complaints. Decision trees can organize this data to create decision
choices that could help with product management activities and make the organization
proactive in avoiding those complaints.

3. Business operations: Many aspects of business functioning can be accurately gauged
from analyzing text.
1. Social network analysis and text mining can be applied to emails, blogs, social
media and other data to measure the emotional states and the mood of employee
populations. Sentiment analysis can reveal early signs of employee dissatisfaction,
which can then be proactively managed.


2. Studying people as emotional investors and using text analysis of the social
Internet to measure mass psychology can help in obtaining superior investment
returns.

4. Legal: In legal applications, lawyers and paralegals can more easily search case
histories and laws for relevant documents in a particular case to improve their chances
of winning.
1. Text mining is also embedded in e-discovery platforms that help in minimizing
risk in the process of sharing legally mandated documents.
2. Case histories, testimonies, and client meeting notes can reveal additional
information, such as morbidities in a healthcare situation, that can help better
predict high-cost injuries and prevent costs.

5. Governance and Politics: Governments can be overturned based on a tweet originating
from a self-immolating fruit-vendor in Tunisia.
1. Social network analysis and text mining of large-scale social media data can be
used for measuring the emotional states and the mood of constituent populations.
Micro-targeting constituents with specific messages gleaned from social media
analysis can be a more efficient use of resources when fighting democratic
elections.
2. In geopolitical security, internet chatter can be processed for real-time
information and to connect the dots on any emerging threats.
3. In academia, research streams could be meta-analyzed for underlying research
trends.

1.2 Text Mining Process


Text mining is a rapidly evolving area of research. As the amount of social media and other
text data grows, there is a need for efficient abstraction and categorization of meaningful
information from the text.

The five phases for processing text are as follows:


Phase 1: Text pre-processing enables syntactic/semantic text analysis and involves the
following steps:


1. Text cleanup is the process of removing unnecessary or unwanted information. It
converts the raw data by filling in missing values, identifying and removing outliers,
and resolving inconsistencies. Examples include removing comments, removing or
escaping "%20" from URLs of web pages (%20 encodes a space in a URL), and
correcting typing errors such as teh (the) and do n't (do not).
2. Tokenization is the process of splitting the cleaned-up text into tokens (words) using
white spaces and punctuation marks as delimiters.
3. Part of Speech (POS) tagging is a method that attempts to label each token (word)
with an appropriate POS. Tagging helps in recognizing names of people, places,
organizations and titles. The English POS set includes nouns, verbs, adverbs,
adjectives, prepositions and conjunctions. The annotation system of the Penn
Treebank Project encodes 36 POS tags.
4. Word sense disambiguation is a method that identifies the sense in which a word is
used in a sentence, which gives the meaning when the word has multiple meanings.
The methods that resolve the ambiguity of words can be context based or proximity
based. Some examples of such words are bear, bank, cell and bass.
5. Parsing is a method that generates a parse tree for each sentence. Parsing attempts to
infer the precise grammatical relationships between the different words in a given
sentence.
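
The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the sample sentence and the small correction dictionary are invented, and a real pipeline would use an NLP library for POS tagging, disambiguation and parsing.

```python
import re
from urllib.parse import unquote

raw = "Teh analysts do n't like messy%20text  data."

# Step 1: text cleanup - decode "%20", fix known typing errors, normalize spaces
corrections = {"teh": "the", "do n't": "do not"}   # assumed sample corrections
text = unquote(raw).lower()
for wrong, right in corrections.items():
    text = text.replace(wrong, right)
text = re.sub(r"\s+", " ", text).strip()

# Step 2: tokenization - split on white space and punctuation marks
tokens = re.findall(r"[a-z0-9']+", text)
print(tokens)
# ['the', 'analysts', 'do', 'not', 'like', 'messy', 'text', 'data']
# POS tagging, word sense disambiguation and parsing would typically be done
# on these tokens with an NLP library such as NLTK or spaCy.
```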
Phase 2: Feature generation is a process which first defines the features (variables,
predictors). Some of the ways of generating features are:
1. Bag of words - the order of words is not that important for certain applications. A
text document is represented by the words it contains (and their occurrences).
Document classification methods commonly use the bag-of-words model. The pre-
processing of a document first provides the document as a bag of words. Document
classification methods then use the occurrence (frequency) of each word as a feature
for training a classifier. Algorithms do not apply directly to the bag of words, but use
the frequencies.
2. Stemming - identifies a word by its root.
(i) It normalizes or unifies variations of the same concept; for example, the three
variations speaking, speaks and speaker are denoted by speak [speaking, speaks,
speaker → speak].
(ii) It removes plurals, normalizes verb tenses and removes affixes.
Stemming reduces the word to its most basic element. For example, impurification → pure.
3. Removing stop words from the feature space - stop words are common words that are
unlikely to help text mining, so the search program ignores them. For example, it
ignores a, at, for, it, in and are.
4. Vector Space Model (VSM) - an algebraic model for representing text documents as vectors
of identifiers, word frequencies or terms in the document index. VSM uses the method of
term frequency-inverse document frequency (TF-IDF) and evaluates how important a
word is in a document.
When used in document classification, VSM also refers to the bag-of-words model. This bag
of words is required to be converted into a term vector in VSM. The term vector provides
the numeric values corresponding to each term appearing in a document. The term vector is
very helpful in feature generation and selection.
Term frequency (TF) and inverse document frequency (IDF) are important metrics in text
analysis. TF-IDF weighting is the most common scheme; instead of the simple TF alone, the
IDF is used to weight the importance of a word in the document.
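
A minimal sketch of building the TF-IDF weighted vector space model, assuming scikit-learn is available; the three sample documents are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "profit rose on strong customer orders",
    "customer orders fell and profit dropped",
    "the marketing campaign improved customer sentiment",
]

# Each document becomes a vector of TF-IDF weights over the term index
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)        # sparse (documents x terms) matrix

print(vectorizer.get_feature_names_out())     # the terms (features)
print(tfidf.toarray().round(2))               # TF-IDF weight of each term per document
```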
Phase 3: Feature selection is the process of selecting a subset of features by rejecting
irrelevant and/or redundant features (variables, predictors or dimensions) according to defined
criteria. The feature selection process does the following:
1. Dimensionality reduction - feature selection is one of the methods of dimension
reduction. The basic objective is to eliminate irrelevant and redundant data.
Redundant features are those which provide no extra information. Irrelevant features
provide no useful or relevant information in any context.
Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are dimension
reduction methods. The discrimination ability of a feature measures its relevancy.
Correlation helps in finding the redundancy of a feature. Two features are redundant to
each other if their values correlate with each other.
2. N-gram evaluation - finding sequences of consecutive words of interest and extracting them.
For example, a 2-gram is a two-word sequence, such as ["tasty food", "good one"]; a 3-gram is
a three-word sequence, such as ["Crime Investigation Department"] (see the sketch below).
3. Noise detection and evaluation of outliers - these methods identify unusual or
suspicious items, events or observations in the data set. This step helps in cleaning the
data.
The feature selection algorithm reduces dimensionality, which not only improves the
performance of the learning algorithm but also reduces the storage requirement for the dataset.
The process enhances data understanding and its visualization.
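
A minimal, pure-Python sketch of the n-gram generation step; the example sentence is invented for illustration.

```python
def ngrams(tokens, n):
    """Return the consecutive n-word sequences in the token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "crime investigation department opened a new case".split()
print(ngrams(tokens, 2))   # 2-grams: ['crime investigation', 'investigation department', ...]
print(ngrams(tokens, 3))   # 3-grams: ['crime investigation department', ...]
```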
Phase 4: Data mining techniques enable insights about the structured database that resulted
from the previous phases. Examples of techniques are:
1. Unsupervised learning (for example, clustering)
(i) The class labels (categories) of the training data are unknown
(ii) Establishes the existence of groups or clusters in the data
Good clustering methods produce high intra-cluster similarity and low inter-cluster similarity.
Examples of uses: blogs, patterns and trends.
2. Supervised learning (for example, classification)
(i) The training data is labeled indicating the class
(ii) New data is classified based on the training set
Classification is correct when the known label of the test sample is identical to the
class computed from the classification model.


Examples of uses are news filtering applications, where incoming documents must be
automatically assigned to pre-defined categories, and email spam filtering, where incoming
email messages are identified as spam or not (see the sketch below).
Examples of text classification methods are the Naive Bayes classifier and SVMs.
3. Identifying evolutionary patterns in temporal text streams - this method is useful in a wide
range of applications, such as summarizing events in news articles and extracting
research trends from the scientific literature.
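
A hedged sketch of the supervised case, in the spirit of the spam-filtering use above: a Naive Bayes classifier trained on bag-of-words counts. It assumes scikit-learn is available, and the tiny labelled corpus is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "win a free prize claim now",         # spam
    "limited offer free money",           # spam
    "meeting agenda for project review",  # not spam
    "please review the project budget",   # not spam
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # bag-of-words frequency features

model = MultinomialNB()
model.fit(X_train, train_labels)

X_new = vectorizer.transform(["free prize for the project team"])
print(model.predict(X_new))   # predicted category for the new document
```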
Phase 5: Analysing results

(i) Evaluate the outcome of the complete process.


(ii) Interpretation of results - if acceptable, the results obtained can be used as input for the
next set of activities; else, the results can be discarded, and the team should try to understand
what failed and why.
(iii) Visualization - prepare visuals from the data, and build a prototype.
(iv) Use the results for further improvement of activities at the enterprise, industry or institution.

Text Mining Challenges


The challenges in the area of text mining can be classified on the basis of document and
application-area characteristics. Some of the categories are as follows:
1. NLP issues:

(i) POS Tagging

(ii) Ambiguity

(iii) Tokenization

(iv) Parsing

(v) Stemming

(vi) Synonymy and polysemy

2. Mining techniques:

(i) Identification of the suitable algorithm(s)

(ii) Massive amount of data and annotated corpora

(iii) Concepts and semantic relations extraction

(iv) When no training data is available

3. Variety of data:

(i) Different data sources require different approaches and different areas of expertise


(ii) Unstructured data and language independence

4. Information visualization

5. Efficiency when processing real-time text streams

6. Scalability

1.3 Term Document Matrix


This is the heart of the structuring process. Free flowing text can be transformed into numeric
data in a TDM, which can then be mined using regular data mining techniques.

1. There are several efficient techniques for identifying key terms from a text. There are
less efficient techniques available for creating topics out of them. For the purpose of
this discussion, one could call key words, phrases or topics a term of interest. This
approach measures the frequencies of selected important terms occurring in each
document. This creates a t x d term-by-document matrix (TDM), where t is the
number of terms and d is the number of documents (Table 1.1).
2. Creating a TDM requires making choices about which terms to include. The terms chosen
should reflect the stated purpose of the text mining exercise. The list of terms should be
as extensive as needed, but should not include unnecessary terms that would only serve to
confuse the analysis or slow the computation.

Table 1.1: Term-Document Matrix


Here are some considerations in creating a TDM.

1. A large collection of documents mapped to a large bag of words will likely lead to a
very sparse matrix if they have few common words. Reducing the dimensionality of the data
will help improve the speed of analysis and the meaningfulness of the results. Synonyms,
or terms with similar meaning, should be combined and counted together as a
common term. This would help reduce the number of distinct terms or
‘tokens’.
2. Data should be cleaned for spelling errors. Common spelling errors should be ignored
and the terms should be combined. Uppercase and lowercase terms should also be
combined.
3. When many variants of the same term are used, just the stem of the word would be used
to reduce the number of terms. For instance, terms like customer order, ordering and order
data should be combined into a single token word, called ‘Order’.
4. On the other hand, homonyms (terms with the same spelling but different meanings)
should be counted separately. This would enhance the quality of analysis. For example,
the term order can mean a customer order, or the ranking of certain choices. These two
should be treated separately. ‘The boss ordered that the customer orders data analysis
be presented in chronological order.’ This statement shows three different meanings for
the word ‘order’. Thus, there will be a need for a manual review of the TD matrix.
5. Terms with very few occurrences in very few documents should be eliminated from the
matrix. This would help increase the density of the matrix and the quality of analysis.
6. The measure in each cell of the matrix could be one of several possibilities. It could be
a simple count of the number of occurrences of each term in a document. It could also
be the log of that number. It could be a fraction computed by dividing the
frequency count by the total number of words in the document. Or there may be binary
values in the matrix to represent whether a term is mentioned or not. The choice of value
in the cells will depend upon the purpose of the text analysis (see the sketch below).

At the end of this analysis and cleansing, a well-formed, densely populated, rectangular
TDM will be ready for analysis. The TDM could be mined using all the available data mining
techniques.
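
A small sketch of building a term-document matrix with different choices of cell values (raw counts or binary presence), as discussed in point 6 above. The documents, the chosen term list and the crude suffix-stripping "stemmer" are assumptions for illustration.

```python
from collections import Counter

docs = [
    "profit rose as customer orders rose",
    "the customer placed a large order",
    "marketing costs reduced profit",
]
terms = ["profit", "customer", "order", "marketing"]   # assumed terms of interest

tdm = []
for doc in docs:
    words = doc.split()
    counts = Counter(w.rstrip("s") for w in words)     # crude stemming: orders -> order
    tdm.append([counts[t] for t in terms])             # raw occurrence counts per term

print(terms)
for row in tdm:
    print(row)

# Alternative cell values: binary presence instead of counts
binary = [[1 if c > 0 else 0 for c in row] for row in tdm]
```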

1.4 Mining the TDM


The TDM can be mined to extract patterns/knowledge. A variety of techniques could be
applied to the TDM to extract new knowledge.

Predictors of desirable terms could be discovered through predictive techniques, such as
regression analysis. Suppose the word profit is a desirable word in a document. The number
of occurrences of the word profit in a document could be regressed against many other terms
in the TDM. The relative strengths of the coefficients of various predictor variables would
show the relative impact of those terms on creating a profit discussion.

Predicting the chances of a document being liked is another form of analysis. For example,
important speeches made by the CEO or the CFO to investors could be evaluated for quality.
If the classification of those documents (such as good or poor speeches) was available, then
the terms of the TDM could be used to predict the speech class. A decision tree could be
constructed, with a few decision points, that predicts the success of a speech 80 percent of
the time. This tree could be trained with more data to become better over time.

Clustering techniques can help categorize documents by common profile. For example,
documents containing the words investment and profit more often could be bundled together.
Similarly, documents containing the words, customer orders and marketing, more often could
be bundled together. Thus, a few strongly demarcated bundles could capture the essence of
the entire TDM. These bundles could thus help with further processing, such as handing over
select documents to others for legal discovery.

Association rule analysis could show relationships of co-occurrence. Thus, one could say that
the words tasty and sweet occur together often (say, 5 percent of the time); and further, when
these two words are present, the word happy is also present in the document 70 percent of
the time.
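
The co-occurrence statistics described above can be computed directly from the documents (or from a binary TDM). A minimal sketch with invented documents; the 5 percent and 70 percent figures in the text are illustrative and are not reproduced by this toy data.

```python
docs = [
    "tasty and sweet dessert made me happy",
    "the soup was tasty",
    "sweet tasty cake happy customers",
    "service was slow",
]

def has(doc, word):
    return word in doc.split()

both = [d for d in docs if has(d, "tasty") and has(d, "sweet")]
support = len(both) / len(docs)                       # how often the pair co-occurs
confidence = (sum(has(d, "happy") for d in both) / len(both)) if both else 0.0

print(f"support(tasty, sweet) = {support:.0%}")       # 50% in this toy corpus
print(f"confidence -> happy   = {confidence:.0%}")    # 100% in this toy corpus
```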

1.5 Comparing Text Mining and Data Mining


Text mining is a form of data mining. There are many common elements between text and
data mining. However, there are some key differences (Table 1.2). The key difference is that
text mining requires the conversion of text data into frequency data before data mining
techniques can be applied.


Dimension | Text Mining | Data Mining
Nature of data | Unstructured data: words, phrases, sentences | Numbers; alphabetical and logical values
Language used | Many languages and dialects used in the world; many languages are extinct, new documents are discovered | Similar numerical systems across the world
Clarity and precision | Sentences can be ambiguous; sentiment may contradict the words | Numbers are precise
Consistency | Different parts of the text can contradict each other | Different parts of data can be inconsistent, thus requiring statistical significance analysis
Sentiment | Text may present a clear and consistent or mixed sentiment, across a continuum; spoken words add further sentiment | Not applicable
Quality | Spelling errors; differing values of proper nouns such as names; varying quality of language translation | Issues with missing values, outliers, etc.
Nature of analysis | Keyword-based search; co-existence of themes; sentiment mining | A full, wide range of statistical and machine learning analysis for relationships and differences

Table 1.2: Comparing Text Mining and Data Mining


1.6 Text Mining Best Practices


Many of the best practices that apply to the use of data mining techniques will also apply to text
mining.

1. The first and most important practice is to ask the right question. A good question is
one which gives an answer and would lead to large payoffs for the organization. The
purpose and the key question will define how and at what levels of granularity the
TDM would be made. For example, TDM defined for simpler searches would be
different from those used for complex semantic analysis or network analysis.
2. A second important practice is to be creative and open in proposing imaginative
hypotheses for the solution. Thinking outside the box is important, both in the quality
of the proposed solution as well as in finding the high quality data sets required to test
the hypothesized solution. For example, a TDM of consumer sentiment data should be
combined with customer order data in order to develop a comprehensive view of
customer behavior. It’s important to assemble a team that has a healthy mix of technical
and business skills.
3. Another important element is to pursue the problem iteratively. Too much data can
overwhelm the infrastructure and also befuddle the mind. It is better to divide and
conquer the problem with a simpler TDM, with fewer terms and fewer documents and
data sources. Expand as needed, in an iterative sequence of steps. In the future, add new
terms to help improve predictive accuracy.
4. A variety of data mining tools should be used to test the relationships in the TDM.
Different decision tree algorithms could be run alongside cluster analysis and other
techniques. Triangulating the findings with multiple techniques, and many what-if
scenarios, helps build confidence in the solution. Test the solution in many ways before
committing to deploy it.


Chapter 2: Web Mining

Web mining is the art and science of discovering patterns and insights from the World-wide
web so as to improve it. The world-wide web is at the heart of the digital revolution. More
data is posted on the web every day than was there on the whole web just 20 years ago.
Billions of users are using it every day for a variety of purposes. The web is used for
electronic commerce, business communication, and many other applications. Web mining
analyzes data from the web and helps find insights that could optimize the web content and
improve the user experience. Data for web mining is collected via Web crawlers, web logs,
and other means.

Here are some characteristics of optimized websites:

1. Appearance: Aesthetic design. Well-formatted content, easy to scan and navigate. Good
color contrasts.
2. Content: Well-planned information architecture with useful content. Fresh content.
Search-engine optimized. Links to other good sites.
3. Functionality: Accessible to all authorized users. Fast loading times.
Usable forms. Mobile enabled.

This type of content and its structure is of interest to ensure the web is easy to use. The
analysis of web usage provides feedback on the web content, and also the consumer’s
browsing habits. This data can be of immense use for commercial advertising, and even for
social engineering.

The web could be analyzed for its structure as well as content. The usage pattern of web
pages could also be analyzed. Depending upon objectives, web mining can be divided into
three different types: Web usage mining, Web content mining and Web structure mining
(Figure 2.1).

Figure 2.1: Web Mining structure


2.1 Web content mining


A website is designed in the form of pages, each with a distinct URL (uniform resource locator).
A large website may contain thousands of pages. These pages and their content are managed
using specialized software systems called content management systems. Every page can
have text, graphics, audio, video, forms, applications, and more kinds of content, including
user-generated content.

A website keeps a record of all requests received for its pages/URLs, including the requester
information, using ‘cookies’. The log of these requests could be analyzed to gauge the
popularity of those pages among different segments of the population. The text and
application content on the pages could be analyzed for its usage by visit counts. The pages on
a website themselves could be analyzed for the quality of content that attracts the most users.
Thus the unwanted or unpopular pages could be weeded out, or they could be transformed with
different content and style. Similarly, more resources could be assigned to keep the more
popular pages fresh and inviting.

2.2 Web structure mining


The Web works through a system of hyperlinks using the hypertext transfer protocol (http). Any page
can create a hyperlink to any other page, and it can be linked to by other pages. The intertwined
or self-referral nature of the web lends itself to some unique network analytical algorithms. The
structure of web pages could also be analyzed to examine the pattern of hyperlinks among
pages. There are two basic strategic models for successful websites: hubs and authorities.

1. Hubs: These are pages with a large number of interesting links. They serve as a hub, or
a gathering point, where people visit to access a variety of information. Media sites like
Yahoo.com, or government sites, would serve that purpose. More focused sites like
Traveladvisor.com and yelp.com could aspire to become hubs for new emerging areas.
2. Authorities: Ultimately, people would gravitate towards pages that provide the most
complete and authoritative information on a particular subject. This could be factual
information, news, advice, user reviews etc. These websites would have the most
number of inbound links from other websites. Thus Mayoclinic.com would serve as an
authoritative page for expert medical opinion. NYtimes.com would serve as an
authoritative page for daily news.

2.3 Web usage mining


As a user clicks anywhere on a webpage or application, the action is recorded by many
entities in many locations. The browser at the client machine will record the click, and the
web server providing the content would also make a record of the pages served and the user
activity on those pages. The entities between the client and the server, such as the router,
proxy server, or ad server, too would record that click.

The goal of web usage mining is to extract useful information and patterns from data
generated through Web page visits and transactions. The activity data comes from data stored


in server access logs, referrer logs, agent logs, and client-side cookies. The user
characteristics and usage profiles are also gathered directly, or indirectly, through syndicated
data. Further, metadata, such as page attributes, content attributes, and usage data are also
gathered.

The web content could be analyzed at multiple levels (Figure 2.2).

1. The server-side analysis would show the relative popularity of the web pages accessed.
Those websites could be hubs and authorities.
2. The client side analysis could focus on the usage pattern or the actual content
consumed and created by users.
1. Usage pattern could be analyzed using ‘clickstream’ analysis, i.e. analyzing web
activity for patterns of sequence of clicks, and the location and duration of visits
on websites. Clickstream analysis can be useful for web activity analysis,
software testing, market research, and analyzing employee productivity.
2. Textual information accessed on the pages retrieved by users could be analyzed
using text mining techniques. The text would be gathered and structured using the
bag-of-words technique to build a Term-document matrix. This matrix could then
be mined using cluster analysis and association rules for patterns such as popular
topics, user segmentation, and sentiment analysis.

Figure 2.2: Web Usage Mining architecture

Web usage mining has many business applications. It can help predict user behavior based on
previously learned rules and users' profiles, and can help determine the lifetime value of clients.
It can also help design cross-marketing strategies across products by observing association
rules among the pages on the website. Web usage mining can help evaluate promotional
campaigns and see whether users were attracted to the website and used the pages relevant to
the campaign. Web usage mining could be used to present dynamic information to users based
on their interests and profiles. This includes targeting online ads and coupons at user groups
based on their access patterns.

2.4 Web Mining Algorithms


Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates web pages as
being hubs or authorities. Many other HITS-based algorithms have also been published. The
most famous and powerful link analysis algorithm, however, is the PageRank algorithm. Invented by
Google co-founder Larry Page, this algorithm is used by Google to organize the results of its
search function. The algorithm helps determine the relative importance of any particular web
page by counting the number and quality of links to that page. Websites with a greater number
of links, and/or more links from higher-quality websites, will be ranked higher. It works in a
similar way to determining the status of a person in a society of people. Those with relations
to more people and/or relations to people of higher status will be accorded a higher status.

PageRank is the algorithm that helps determine the order of pages listed in response to a Google
search query. The original PageRank algorithm formulation has been updated in many ways,
and the latest algorithm is kept secret so that other websites cannot take advantage of it
and manipulate their websites accordingly. However, there are many standard
elements that remain unchanged. These elements lead to the principles for a good website.
This process is also called Search Engine Optimization (SEO).
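
A minimal sketch of the original PageRank idea, using power iteration on a tiny invented link graph. It only illustrates the principle that pages with more, and better, inbound links rank higher; it is not Google's current algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks) if outlinks else 0.0
            for target in outlinks:
                new_rank[target] += damping * share   # pass rank along each outbound link
        rank = new_rank
    return rank

# Invented four-page web: the "authority" page receives the most inbound links
links = {
    "hub": ["authority", "news", "blog"],
    "news": ["authority"],
    "blog": ["authority", "news"],
    "authority": ["hub"],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```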


Chapter 3
Naïve Bayes Analysis

The Naïve Bayes technique is a supervised machine learning technique that uses
probability-theory-based analysis.
It is a machine learning technique that computes the probability of an instance belonging to
each of many target classes, given the prior probabilities of classification using individual
factors.

Bayes' rule gives P(c|x) = P(x|c) P(c) / P(x), where:
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
3.1 Probability
• The Bayes Rule provides the formula for the probability of Y given X. But, in real-
world problems, you typically have multiple X variables.
• When the features are independent, we can extend the Bayes Rule to what is called
Naive Bayes.
• It is called ‘Naive’ because of the naive assumption that the X’s are independent of
each other. Regardless of its name, it’s a powerful formula.


• In technical jargon, the left-hand side (LHS) of the equation is understood as the
posterior probability, or simply the posterior.
• The RHS has two terms in the numerator.
• The first term is called the ‘likelihood of evidence’. It is the conditional
probability of each X given that Y is of a particular class ‘c’.
• Since all the X’s are assumed to be independent of each other, you can just multiply
the ‘likelihoods’ of all the X’s and call the product the ‘probability of likelihood of evidence’.
This is computed from the training dataset by filtering records where Y=c.
• The second term is called the prior, which is the overall probability of Y=c, where c is
a class of Y. In simpler terms, Prior = count(Y=c) / n_Records.


• An example is better than an hour of theory, so let’s see one.

Naive Bayes Example


• Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’.
• These are the 3 possible classes of the Y variable.
• We have data for the following X variables, all of which are binary (1 or 0).
• Long
• Sweet
• Yellow
• The first few rows of the training dataset look like this:

• For the sake of computing the probabilities, let’s aggregate the training data to form a
counts table like this.


• So the objective of the classifier is to predict if a given fruit is a ‘Banana’, an ‘Orange’
or ‘Other’ when only the 3 features (long, sweet and yellow) are known.
• Let’s say you are given a fruit that is long, sweet and yellow; can you predict what
fruit it is?
• This is the same as predicting Y when only the X variables in the testing data are
known. Let’s solve it by hand using Naive Bayes.
• The idea is to compute the 3 probabilities, that is, the probability of the fruit being a
banana, an orange or other. Whichever fruit type gets the highest probability wins.
• All the information needed to calculate these probabilities is present in the above tabulation.
• Step 1: Compute the ‘Prior’ probabilities for each of the class of fruits.
o P(Y=Banana) = 500 / 1000 = 0.50
o P(Y=Orange) = 300 / 1000 = 0.30
o P(Y=Other) = 200 / 1000 = 0.20
• Step 2: Compute the probability of evidence that goes
in the denominator.
o P(x1=Long) = 500 / 1000 = 0.50
o P(x2=Sweet) = 650 / 1000 = 0.65
o P(x3=Yellow) = 800 / 1000 = 0.80
• Step 3: Compute the probability of likelihood of evidences that goes in the numerator.
o Here, I have done it for Banana alone.
o Probability of Likelihood for Banana
o P(x1=Long | Y=Banana) = 400 / 500 = 0.80
o P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70


o P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90


• Step 4: Substitute all of the above values into the Naive Bayes formula to get the
probability that it is a banana (a worked sketch follows).
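
A quick sketch of the Step 4 arithmetic for the banana case, using the numbers from Steps 1 to 3; the orange and other cases follow the same pattern.

```python
# Likelihoods for Banana (Step 3), prior (Step 1) and evidence (Step 2)
p_long_given_banana   = 0.80
p_sweet_given_banana  = 0.70
p_yellow_given_banana = 0.90
p_banana              = 0.50
p_evidence            = 0.50 * 0.65 * 0.80       # P(Long) * P(Sweet) * P(Yellow)

numerator = p_long_given_banana * p_sweet_given_banana * p_yellow_given_banana * p_banana
p_banana_given_features = numerator / p_evidence
print(round(p_banana_given_features, 3))          # roughly 0.97, so Banana wins
```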

Advantages
• When the assumption of independent predictors holds true, a Naive Bayes
classifier performs well compared to other models.
• Naive Bayes requires only a small amount of training data to estimate the parameters
needed for classification, so the training period is short.
• Naive Bayes is also easy to implement.
Disadvantages

• The main limitation of Naive Bayes is the assumption of independent predictors. Naive
Bayes implicitly assumes that all the attributes are mutually independent. In real life,
it is almost impossible to get a set of predictors that are completely
independent.
• If a categorical variable has a category in the test data set that was not observed in the
training data set, then the model will assign it a zero probability and will be unable to
make a prediction. This is often known as the zero-frequency problem. To solve it, we can
use a smoothing technique. One of the simplest smoothing techniques is called Laplace
estimation.


Chapter 5
Support Vector Machine
• “Support Vector Machine” (SVM) is a supervised machine learning algorithm which
can be used for both classification and regression challenges.
• However, it is mostly used in classification problems.
• In this algorithm, we plot each data item as a point in n-dimensional space (where n is
the number of features you have), with the value of each feature being the value of a
particular coordinate.
• Then, we perform classification by finding the hyper-plane that differentiates the two
classes well.


How does it work?


• Thumb rules to identify the right hyper-plane:
• Select the hyper-plane which segregates the two classes best.
• Maximize the distance between the nearest data points (of either class) and the hyper-plane.
This distance is called the margin.


SVM Model

• f(x) = W.X + b
• W is the normal to the separating line (hyper-plane), X is the input vector and b is the bias.
• W is known as the weight vector.
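
A minimal sketch of fitting the linear model f(x) = W.X + b with scikit-learn's SVC on a tiny, invented two-feature dataset; W and b are read back from the fitted classifier.

```python
from sklearn.svm import SVC

# Invented, linearly separable two-feature data: class 0 (lower-left), class 1 (upper-right)
X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)

W = clf.coef_[0]            # normal to the separating hyper-plane (weight vector)
b = clf.intercept_[0]       # bias
print("W =", W, "b =", b)
print(clf.predict([[2, 2], [6, 7]]))   # classify new points; [0, 1] expected
```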

Advantages of SVM
• The main strength of SVMs is that they work well even when the number of
features is much larger than the number of instances.
• They can work on datasets with a huge feature space, as is the case in spam filtering,
where a large number of words are the potential signifiers of a message being spam.


• Even when the optimal decision boundary is a nonlinear curve, the SVM transforms
the variables to create new dimensions such that the representation of the classifier is
a linear function of those transformed dimensions of the data.
• SVMs are conceptually easy to understand. They create an easy-to-understand linear
classifier. By working on only a subset of the relevant data, they are computationally
efficient. SVMs are now available with almost all data analytics toolsets.

Disadvantages of SVM
The SVM technique has some major constraints:
• It works well only with real numbers, i.e., all the data points in all the dimensions
must be defined by numeric values only.
• It works only with binary classification problems. One can make a series of cascaded
SVMs to get around this constraint.
• Training SVMs is an inefficient and time-consuming process when the data is
large.
• It does not work well when there is much noise in the data, and thus it has to compute
soft margins.
• SVMs also do not provide a probability estimate of the classification, i.e., the
confidence level for classifying an instance.

