Text Mining Using Python

The document describes the steps for effective text data cleaning using Python. It outlines 10 steps for cleaning a sample tweet related to consumer opinions on the iPhone. The steps include escaping HTML characters, decoding data, apostrophe lookup, removing stop words, removing punctuation, removing expressions, splitting attached words, slang lookup, standardizing words, and removing URLs. The document also briefly discusses advanced cleaning techniques like grammar checking and spelling correction.

Uploaded by

Pablo Rivera


Effective Text Data Cleaning using Python

Benefits of mining for a brand?
You can do sentiment analysis to discover customers' sentiment for a brand.
You can measure brand popularity using the actively engaged tweeters.
It is used to identify the pain points of customers, i.e. customer relationship management.
It is widely used for predictions and forecasting.

The Business Problem

Let's say we want to find the features of the Apple iPhone that are most
popular among fans on Twitter.

What to do next?
We've extracted all the tweets related to consumer opinions of the iPhone.
Here's a sample tweet on which we'll perform data cleaning:

TWEET
“I luv my &lt;3 iphone &amp; you're awsm apple. DisplayIsAwesome, sooo
happppppy :) http://www.apple.com”

Steps for Data Cleaning

STEP 01: Escaping HTML characters

Code

import html
# html.unescape converts entities such as &lt; and &amp; back to < and &
tweet = html.unescape(original_tweet)

Output
>> “I luv my <3 iphone & you're awsm apple. DisplayIsAwesome, sooo
happppppy :) http://www.apple.com”

STEP 02: Decoding data

Code

# Assumes original_tweet holds UTF-8 bytes; non-ASCII characters are dropped
tweet = original_tweet.decode("utf8").encode("ascii", "ignore").decode("ascii")

Output
>> “I luv my <3 iphone & you're awsm apple. DisplayIsAwesome,
sooo happppppy :) http://www.apple.com”

STEP 03: Apostrophe Lookup

Code

APOSTROPHES = {"'s": " is", "'re": " are"}  # ... needs a huge dictionary
reformed = tweet
for suffix, expansion in APOSTROPHES.items():
    # Expand each contraction suffix, e.g. "you're" becomes "you are"
    reformed = reformed.replace(suffix, expansion)

Outcome

>> “I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo
happppppy :) http://www.apple.com”

STEP 04: Removal of Stop-Words

When the analysis is driven at the word level, the commonly
occurring words (stop-words) should be removed.
One can either create a long list of stop-words or use
predefined language-specific libraries.
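
A minimal sketch of the hand-made-list approach is shown here. The `STOP_WORDS` set and the `remove_stop_words` helper are illustrative assumptions, not part of the original guide; a real list would be far longer or come from a language library.

```python
# Hypothetical, deliberately tiny stop-word list; real lists hold hundreds of words.
STOP_WORDS = {"i", "my", "you", "are", "is", "the", "a", "an", "and"}


def remove_stop_words(text):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("I love my iphone and you are awesome"))
# love iphone awesome
```

Matching on the lowercase form keeps the check case-insensitive while preserving the original casing of the words that remain.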

STEP 05: Removal of Punctuation

Punctuation marks should be dealt with according to their
priority. For example, ".", "," and "?" are important punctuation
marks that should be retained, while the others need to be removed.
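
One possible way to do this with the standard library is sketched below. The names `KEEP`, `DROP_TABLE` and `remove_punctuation` are assumptions made for illustration, and only ASCII punctuation is covered.

```python
import string

KEEP = {".", ",", "?"}  # the marks the text above says to retain
# Translation table that deletes every ASCII punctuation mark not in KEEP.
DROP_TABLE = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in KEEP)
)


def remove_punctuation(text):
    """Delete all ASCII punctuation except the marks listed in KEEP."""
    return text.translate(DROP_TABLE)


print(remove_punctuation("Wow!! Is this (really) the new display? Yes, it is."))
# Wow Is this really the new display? Yes, it is.
```

Because ":" and "/" are also punctuation, this step would damage URLs and emoticons, so it is best applied after those have been handled.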

STEP 06: Removal of Expressions

Textual data (usually speech transcripts) may contain human
expressions like [laughing], [Crying], [Audience paused]. These
expressions are usually not relevant to the content of the speech
and hence need to be removed.
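
A short sketch, assuming the expressions always appear in square brackets as in the examples above. The `BRACKETED` pattern and `remove_expressions` helper are illustrative names.

```python
import re

# Matches a square-bracketed annotation plus any whitespace before it.
BRACKETED = re.compile(r"\s*\[[^\[\]]*\]")


def remove_expressions(text):
    """Strip bracketed annotations such as [laughing] from a transcript."""
    return BRACKETED.sub("", text).strip()


print(remove_expressions("Thank you all [laughing] for coming [Audience paused] tonight."))
# Thank you all for coming tonight.
```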

STEP 07: Split Attached Words

Code
import re
# Insert a space wherever a lowercase letter is directly followed by an uppercase one
cleaned = re.sub(r"([a-z])([A-Z])", r"\1 \2", tweet)

Outcome
>> “I luv my <3 iphone & you are awsm apple. Display Is Awesome, sooo
happppppy :) http://www.apple.com”

STEP 08: Slang lookup

Code
tweet = _slang_lookup(tweet)  # user-defined helper backed by a slang dictionary

Outcome

>> “I love my <3 iphone & you are awesome apple. Display Is
Awesome, sooo happppppy :) http://www.apple.com”
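
The guide does not show the lookup helper itself. A minimal sketch of what such a helper might look like follows; the `SLANG` dictionary and the `slang_lookup` name are assumptions, with the two entries taken from the sample tweet.

```python
# Hypothetical slang dictionary; the two entries come from the sample tweet.
SLANG = {"luv": "love", "awsm": "awesome"}


def slang_lookup(text):
    """Replace each whitespace-separated token found in SLANG (case-insensitive)."""
    return " ".join(SLANG.get(word.lower(), word) for word in text.split())


print(slang_lookup("I luv my <3 iphone & you are awsm apple."))
# I love my <3 iphone & you are awesome apple.
```

Tokens with punctuation attached (for example "awsm.") are not matched by this simple version, so it works best after punctuation has been separated or removed.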

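STEP 09: Standardizing words

Code

import itertools
# Cap every run of repeated characters at two, e.g. "happppppy" becomes "happy"
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))

Outcome

>> “I love my <3 iphone & you are awesome apple. Display Is
Awesome, so happy :) http://www.apple.com”

Note: on its own, this line turns "sooo" into "soo" and also shortens
"www" to "ww". The outcome shown assumes that URLs are left untouched
and that a spelling lookup reduces "soo" to "so".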
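
A sketch of the same idea applied token by token, so that URLs can be skipped. The helper names `squeeze_repeats` and `standardize`, and the rule used to recognise a URL, are assumptions for illustration.

```python
import itertools


def squeeze_repeats(word, max_run=2):
    """Cap every run of identical characters in word at max_run characters."""
    return "".join("".join(run)[:max_run] for _, run in itertools.groupby(word))


def standardize(text):
    """Squeeze repeated characters in every token except ones that look like URLs."""
    url_prefixes = ("http://", "https://", "www.")
    return " ".join(
        word if word.startswith(url_prefixes) else squeeze_repeats(word)
        for word in text.split()
    )


print(standardize("sooo happppppy :) http://www.apple.com"))
# soo happy :) http://www.apple.com
```

Capping runs at two keeps legitimate double letters such as the "pp" in "happy"; words like "soo" still need a dictionary or spelling step afterwards.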

STEP 10: Removal of URLs

URLs and hyperlinks in text data like comments, reviews, and tweets
should be removed.
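
A minimal sketch using a regular expression. The pattern is a simple assumption that treats anything starting with "http://", "https://" or "www." as a URL; it is not a full URL grammar.

```python
import re

# Simple pattern: "http://", "https://" or "www." followed by non-space characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text):
    """Delete URL-like tokens and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


print(remove_urls("so happy :) http://www.apple.com"))
# so happy :)
```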

Final cleaned tweet:

>> “I love my iphone & you are awesome apple. Display Is Awesome, so
happy!” , <3 , :)

Advanced Data Cleaning

Grammar checking
Grammar checking is mostly learning-based: models are trained on
large amounts of well-formed text.
Many online tools are available for grammar correction.

Spelling correction
Natural language text often contains spelling errors. Algorithms
such as Levenshtein distance and dictionary lookup, along with other
modules and packages, can be used to fix these errors.
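
To illustrate how Levenshtein distance and dictionary lookup can work together, here is a small sketch. The `VOCABULARY` list and the `correct` helper are hypothetical; a real corrector would use a much larger dictionary and usually word frequencies as well.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Hypothetical vocabulary used only for this example.
VOCABULARY = ["so", "happy", "awesome", "display", "love"]


def correct(word, vocabulary=VOCABULARY):
    """Return the vocabulary word with the smallest edit distance to word."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))


print(correct("hapy"))  # happy
print(correct("soo"))   # so
```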

Your Next Steps…

Now that the data (tweet) is cleaned, you are ready to practice and learn the
following Text Mining techniques (in no particular order):

1. Framework to build a niche dictionary for text mining

http://bit.ly/1eetMw6

2. Step by Step guide to extract insights from free text

http://bit.ly/1JjslYe

3. 2014 FIFA World Cup Prediction using Twitter Mining

http://bit.ly/1kLeYSk

4. Text Mining Hack using Google API

http://bit.ly/1LDPF6c

For more resources on analytics/data science, visit

www.analyticsvidhya.com
