
Text Data Cleaning with Python

Uploaded by

yiliawanghk


Steps for effective text data cleaning (with case study using Python)

BEGINNER  BIG DATA  DATA SCIENCE  NLP  PYTHON  TECHNIQUE  TEXT  UNSTRUCTURED DATA

Introduction

The days when data arrived in neat, tabulated spreadsheets are truly behind us. A moment of silence for
the data residing in the spreadsheet pockets. Today, more than 80% of data is unstructured: it either
sits in data silos or is scattered around digital archives. Data is being produced as we speak, from
every conversation on social media to every piece of content generated by news sources. To produce any
meaningful, actionable insight from data, it is important to know how to work with it in its
unstructured form. As a Data Scientist at one of the fastest growing Decision Sciences firms, my bread
and butter comes from deriving meaningful insights from unstructured text information.

One of the first steps in working with text data is to pre-process it. It is an essential step before the data is
ready for analysis. The majority of available text data is highly unstructured and noisy; to achieve
better insights or to build better algorithms, it is necessary to work with clean data. For example, social
media data is highly unstructured because it is informal communication: typos, bad grammar, slang, and
unwanted content such as URLs, stop-words, and expressions are the usual suspects.

In this blog, therefore, I discuss these possible noise elements and how you could clean them step
by step. I provide ways to clean the data using Python.

As a typical business problem, assume you are interested in finding which features of the iPhone
are most popular among its fans. You have extracted consumer opinions related to the iPhone, and here
is one tweet you extracted:

[stextbox id = “grey”]
“I luv my &lt;3 iphone &amp; you’re awsm apple. DisplayIsAwesome, sooo happppppy
http://www.apple.com”

[/stextbox]

Steps for data cleaning:

Here is what you do:

1. Escaping HTML characters: Data obtained from the web usually contains a lot of HTML entities like
&lt; &gt; &amp; which get embedded in the original data. It is thus necessary to get rid of these
entities. One approach is to remove them directly with specific regular expressions. Another approach is to
use appropriate packages and modules (for example, HTMLParser in Python, or html.unescape in Python 3),
which can convert these entities back to the characters they represent. For example: &lt; is converted
to “<” and &amp; is converted to “&”.
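
As a minimal sketch of this step in Python 3, the standard library's html.unescape can convert entities back to plain characters. The raw_tweet string below is an assumed raw form of the sample tweet, shortened for illustration.

```python
import html

# Assumed raw form of the sample tweet, with HTML entities still in place.
raw_tweet = "I luv my &lt;3 iphone &amp; you're awsm apple."

# html.unescape converts named and numeric entities to plain characters.
tweet = html.unescape(raw_tweet)
print(tweet)
```

Running the conversion before any other step keeps later regular expressions from tripping over stray "&" and ";" characters.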

2. Decoding data: This is the process of transforming information from complex symbols to simple and
easier to understand characters. Text data may arrive in different encodings such as “Latin”,
“UTF8” etc. Therefore, for better analysis, it is necessary to keep the complete data in one standard
encoding format. UTF-8 is widely accepted and is the recommended choice.

[stextbox id = “grey”]

Snippet:

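# Assumes original_tweet is a UTF-8 byte string; characters outside ASCII are dropped.
tweet = original_tweet.decode("utf8").encode("ascii", "ignore").decode("ascii")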

Output:

>> “I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome, sooo happppppy
http://www.apple.com”
[/stextbox]

3. Apostrophe Lookup: To avoid any word sense ambiguity in text, it is recommended to maintain
proper structure in it and to abide by the rules of context free grammar. When apostrophes are used,
the chance of ambiguity increases.

For example, “it’s” is a contraction for “it is” or “it has”.

All the apostrophes should be converted into standard lexicons. One can use a lookup table of all
possible keys to get rid of the ambiguities.

[stextbox id = “grey”]

Snippet:

APOSTROPHES = {"'s": " is", "'re": " are"}  ## Extend this; a full table needs a huge dictionary

reformed = tweet

# The keys are suffixes, so replace them inside words instead of matching whole tokens.
for contraction, expansion in APOSTROPHES.items():
    reformed = reformed.replace(contraction, expansion)

Outcome:

>> “I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo happppppy
http://www.apple.com”

[/stextbox]

4. Removal of Stop-words: When data analysis needs to be data driven at the word level, the commonly
occurring words (stop-words) should be removed. One can either create a long list of stop-words or
use predefined language specific libraries.
5. Removal of Punctuations: Punctuation marks should be dealt with according to their priority.
For example: “.”, “,”, “?” are important punctuation marks that should be retained, while others need to be
removed.
6. Removal of Expressions: Textual data (usually speech transcripts) may contain human expressions
like [laughing], [Crying], [Audience paused]. These expressions are usually not relevant to the content of
the speech and hence need to be removed. A simple regular expression can be useful in this case.
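
Steps 4 to 6 can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the STOP_WORDS set is a tiny made-up sample, the helper names are hypothetical, and the choice of which punctuation marks to keep follows the example in step 5.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "my", "and", "of", "to"}


def remove_stop_words(text):
    """Drop whitespace-separated tokens that appear in STOP_WORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def remove_punctuation(text, keep=".,?"):
    """Remove ASCII punctuation except the marks listed in `keep`."""
    to_remove = "".join(c for c in string.punctuation if c not in keep)
    return text.translate(str.maketrans("", "", to_remove))


def remove_expressions(text):
    """Drop bracketed annotations such as [laughing] or [Audience paused]."""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Collapse the extra whitespace left behind by the removal.
    return re.sub(r"\s+", " ", text).strip()


print(remove_stop_words("The display of my iphone is awesome"))
print(remove_punctuation("Wow!! Is this real? Yes, it is; truly."))
print(remove_expressions("That was great [laughing] and the crowd [Audience paused] left."))
```

Note that the stop-word check compares whole tokens, so a word with punctuation attached (such as "is,") would slip through; removing punctuation first avoids that.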

7. Split Attached Words: We humans generate text data on social forums that is completely informal
in nature. Many tweets contain multiple attached words like RainyDay,
PlayingInTheCold etc. These entities can be split into their normal forms using simple rules and regex.

[stextbox id = “grey”]

Snippet:

import re

# Insert a space wherever a lowercase letter is directly followed by an uppercase one.
cleaned = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tweet)

Outcome:

>> “I luv my <3 iphone & you are awsm apple. Display Is Awesome, sooo happppppy
http://www.apple.com”

[/stextbox]

8. Slang lookup: Again, social media comprises a majority of slang words. These words should be
transformed into standard words to make free text. Words like luv will be converted to love, Helo to
Hello. The same approach as the apostrophe lookup can be used to convert slang to standard words. A
number of sources on the web provide lists of all possible slang terms; these would be
your holy grail, and you could use them as lookup dictionaries for conversion purposes.

[stextbox id = “grey”]

Snippet:

tweet = _slang_lookup(tweet)  # _slang_lookup: a user-defined helper backed by a slang dictionary

Outcome:

>> “I love my <3 iphone & you are awesome apple. Display Is Awesome, sooo happppppy
http://www.apple.com”

[/stextbox]
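
One possible shape for such a slang helper is sketched below. The SLANG table and the name slang_lookup are assumptions made for illustration; a real table would be loaded from one of the slang lists mentioned above.

```python
import re

# Tiny illustrative table (an assumption); real slang lists are far larger.
SLANG = {"luv": "love", "awsm": "awesome", "helo": "hello"}


def slang_lookup(text, table=SLANG):
    """Replace whole alphabetic words found in `table`, ignoring case."""
    def swap(match):
        word = match.group(0)
        # Fall back to the original word when it is not in the table.
        return table.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)


print(slang_lookup("I luv my iphone & you are awsm apple."))
```

Matching whole words with a regular expression, rather than splitting on spaces, means a word followed by punctuation is still found. The replacement is always the lowercase form stored in the table, so capitalisation would need separate handling.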

9. Standardizing words: Sometimes words are not in proper formats. For example: “I looooveee you”
should be “I love you”. Simple rules and regular expressions can help solve these cases.

[stextbox id = “grey”]

Snippet:

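import itertools

# Cap every run of a repeated character at two, e.g. "happppppy" -> "happy".
# Note that "sooo" becomes "soo", and "www" in a URL becomes "ww".
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))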

Outcome:

>> “I love my <3 iphone & you are awesome apple. Display Is Awesome, so happy
http://www.apple.com”

[/stextbox]

10. Removal of URLs: URLs and hyperlinks in text data like comments, reviews, and tweets should be
removed.
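
A simple regular expression is usually enough for this step. The sketch below is one way to do it, assuming that URLs start with http://, https:// or www. and run until the next whitespace; the name remove_urls is hypothetical.

```python
import re

# Assumed URL shape: starts with http://, https:// or www. and runs to the next space.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text):
    """Delete URLs and tidy up the whitespace they leave behind."""
    text = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(remove_urls("Display Is Awesome, so happy http://www.apple.com"))
```

Because the pattern consumes everything up to the next space, punctuation glued to the end of a URL is removed along with it.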

[stextbox id = “grey”]

Final cleaned tweet:

>> “I love my iphone & you are awesome apple. Display Is Awesome, so happy!” , <3 ,

[/stextbox]
Advanced data cleaning:

1. Grammar checking: Grammar checking is mostly learning based: a huge amount of proper text data is
learned, and models are created for the purpose of grammar correction. Many online tools
are available for grammar correction purposes.
2. Spelling correction: In natural language, spelling errors are encountered. Companies like Google
and Microsoft have achieved a decent accuracy level in automated spell correction. One can use
algorithms like the Levenshtein Distance, Dictionary Lookup etc. or other modules and packages to fix
these errors.
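
To make the Levenshtein Distance and Dictionary Lookup idea concrete, here is a minimal sketch. The VOCABULARY list, the correct helper, and the max_distance threshold are all assumptions for illustration; a practical corrector would use a large dictionary and word frequencies.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


# Tiny illustrative dictionary (an assumption).
VOCABULARY = ["hello", "love", "happy", "awesome", "display"]


def correct(word, vocabulary=VOCABULARY, max_distance=2):
    """Return the closest dictionary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else word


print(correct("helo"), correct("awsome"))
```

The max_distance cut-off keeps names and rare words from being forced onto an unrelated dictionary entry.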

End Notes:

Hope you found this article helpful. These were some tips and tricks I have learnt while working with a lot
of text data. If you follow the above steps to clean the data, you can drastically improve the accuracy of
your results and draw better insights. Do share your views/doubts in the comments section and I would be
happy to participate.

Go Hack


Article Url - https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/

Shivam Bansal
Shivam Bansal is a data scientist with exhaustive experience in Natural Language Processing and
Machine Learning in several domains. He is passionate about learning and always looks forward to
solving challenging analytical problems.
