The document discusses data cleaning, highlighting common errors such as outliers, missing data, and erroneous data, along with methods to handle them. It also covers data visualization techniques, including area graphs, bar charts, histograms, line graphs, scatterplots, flow charts, and pie charts, emphasizing their importance in understanding and communicating data insights. These techniques aid in defining strategies for model selection and presenting data trends effectively.
Data Visualization Cleaning and Errors
1) Data Cleaning:
Data cleaning helps remove commonly found errors and
mistakes in a data set. These are the 3 most common errors in data:
1) Outliers: Data points lying outside the expected range.
2) Missing data: Data points missing at certain places.
3) Erroneous data: Incorrect data points.
1. Outliers:
An outlier is a data point in a dataset
that is distant from all other
observations.
An outlier is something that behaves
differently from the combination/
collection of the data.
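One common way to flag points that are "distant from all other observations" is the interquartile range (IQR) rule. The sketch below is only an illustration: the function name find_outliers_iqr, the multiplier k = 1.5, and the sample readings are assumptions, not part of the original notes.

```python
import statistics


def find_outliers_iqr(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Assumption: the IQR rule with k = 1.5 is used as the outlier test.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: six similar readings and one distant one.
readings = [10, 12, 11, 13, 12, 11, 95]
print(find_outliers_iqr(readings))
```

Whether a flagged point should be removed or kept depends on the data set; the rule only marks candidates for inspection.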
2. Missing Data:
Missing data are data points that are missing at certain places in the
data set.
We can handle missing values in two ways:
1. By eliminating the rows with missing
values. (Generally not recommended,
as it shrinks the data set and leaves
less data for training.)
2. By using an imputer to find the best possible substitute to replace missing values.
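The two options above can be sketched in plain Python. This is a minimal illustration: the helper names drop_missing and impute_mean, the sample rows, and the choice of the column mean as the substitute are all assumptions. Other substitutes (median, most frequent value) may suit a given data set better.

```python
import statistics

# Hypothetical data set; None marks a missing value.
rows = [
    {"name": "A", "score": 70.0},
    {"name": "B", "score": None},
    {"name": "C", "score": 90.0},
]


def drop_missing(rows, column):
    """Option 1: keep only the rows where the column has a value."""
    return [r for r in rows if r[column] is not None]


def impute_mean(rows, column):
    """Option 2: replace missing values with the mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = statistics.mean(known)
    # Build new dicts so the original rows are left unchanged.
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


print(len(drop_missing(rows, "score")))       # rows left after dropping
print(impute_mean(rows, "score")[1]["score"])  # substitute used for row "B"
```

Dropping loses one of three rows here, while imputation keeps all three, which is why the notes prefer the imputer.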
3. Erroneous Data:
Erroneous data is test data that falls outside of what is
acceptable and should be rejected by the system.
[Table: sample "Student Name" column with example entries; cell values not recoverable from the source.]
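A system can reject erroneous data with a simple validation check. The sketch below assumes, purely for illustration, that each record holds a student label and a mark that must be a number from 0 to 100; the function name split_valid, the range, and the sample records are hypothetical.

```python
def split_valid(records, low=0, high=100):
    """Separate records into accepted and rejected lists.

    Assumption: a mark is acceptable only if it is a number in [low, high].
    """
    accepted, rejected = [], []
    for name, mark in records:
        is_number = isinstance(mark, (int, float)) and not isinstance(mark, bool)
        if is_number and low <= mark <= high:
            accepted.append((name, mark))
        else:
            rejected.append((name, mark))
    return accepted, rejected


# Hypothetical records: one valid mark and three erroneous ones.
records = [("Student 1", 78), ("Student 2", -5), ("Student 3", "x"), ("Student 4", 140)]
accepted, rejected = split_valid(records)
print(accepted)
print(rejected)
```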
2) Data Visualization
1) It helps us understand the patterns contained within the data.
2) It helps us define a strategy for which model to use at a later stage.
Visual representation is easier to understand and communicate to others. Example:
[Figure: "Yearly Employee Wage Cost" shown as a table and as a graph; values not recoverable from the source.]
3) Data Visualization Techniques
1. Area Graphs
Area Graphs are Line Graphs but with the area
below the line filled in with a certain colour or
texture. Like Line Graphs, Area Graphs are used to
display the development of quantitative
values over an interval or time period. They
are most commonly used to show trends,
rather than convey specific values.
2. Bar Charts
The classic Bar Chart uses either horizontal or
vertical bars (column chart) to show discrete,
numerical comparisons across categories. Bar
Charts are distinguished from Histograms, as
they do not display continuous developments
over an interval. A Bar Chart's discrete data is
categorical data and therefore answers the
question of "how many?" in each category.
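The "how many?" question can be illustrated by counting items per category and drawing one bar per category. The sketch below prints a text bar chart; the function name text_bar_chart and the fruit data are assumptions made for the example.

```python
from collections import Counter


def text_bar_chart(categories):
    """Return one text bar per category; bar length equals the count."""
    counts = Counter(categories)
    if not counts:
        return ""
    width = max(len(label) for label in counts)  # align the bars
    lines = []
    for label, count in sorted(counts.items()):
        lines.append(f"{label.ljust(width)} | {'#' * count} {count}")
    return "\n".join(lines)


# Hypothetical categorical data.
fruits = ["apple", "pear", "apple", "plum", "apple", "pear"]
print(text_bar_chart(fruits))
```

Each bar stands for a separate category, so the order of the bars carries no meaning, unlike the bins of a histogram.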
3. Histogram
A Histogram visualizes the distribution of
data over a continuous interval or certain time
period. Each bar in a histogram represents the
tabulated frequency at each interval/bin.
Histograms help give an estimate as to where
values are concentrated, what the extremes are
and whether there are any gaps or unusual
values.
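The tabulated frequency per bin can be computed directly. The sketch below is one possible binning scheme, assuming equal-width, half-open bins [left, left + bin_width); the function name histogram_counts and the sample ages are hypothetical.

```python
def histogram_counts(values, bin_width, start):
    """Map each bin's left edge to the number of values in that bin.

    Empty bins between the lowest and highest value are kept with a
    count of 0 so that gaps in the data stay visible.
    """
    if not values:
        return {}
    indices = [int((v - start) // bin_width) for v in values]
    counts = {}
    for i in range(min(indices), max(indices) + 1):
        counts[start + i * bin_width] = indices.count(i)
    return counts


# Hypothetical sample: ages grouped into 10-year bins starting at 20.
ages = [21, 23, 25, 31, 34, 38, 39, 62]
print(histogram_counts(ages, bin_width=10, start=20))
```

In this sample the values are concentrated in the 30 to 39 bin, the bins for 40 to 59 are empty (a gap), and the single value in the 60 to 69 bin stands out as unusual.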
4. Line Graphs
Line Graphs are used to display
quantitative values over a continuous
interval or time period. A Line Graph is
most frequently used to show trends and
analyze how the data has changed over
time. Line Graphs are drawn by first
plotting data points on a Cartesian
coordinate grid, then connecting a line
between all of these points.
Typically, the y-axis has a quantitative
value, while the x-axis is a timescale or a
sequence of intervals. Negative values can
be displayed below the x-axis.
5. Scatterplots
A scatterplot is a type of data display that
shows the relationship between two
numerical variables. Each member of the dataset gets plotted as
a point whose (x, y)
coordinates relate to its
values for the two
variables. Scatterplots help show whether a relationship or
correlation between the two variables exists.
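A scatterplot shows a relationship visually; one common way to put a number on a linear relationship is Pearson's correlation coefficient. The sketch below builds the (x, y) points and computes that coefficient by hand. The function name pearson_r and the hours/scores data are assumptions for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: hours studied (x) and test score (y) per student.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

points = list(zip(hours, scores))  # one (x, y) point per member of the dataset
print(points)
print(round(pearson_r(hours, scores), 3))
```

A value close to +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none. The coefficient does not capture non-linear patterns, which is one reason to look at the plot itself.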
6. Flow Charts
This type of diagram is used to show the sequential
steps of a process. Flow
Charts map out a process
using a series of connected
symbols, which makes the
process easy to understand
and aids in its
communication to other
people. Flow Charts are
useful for explaining how a
complex and/or abstract
procedure, system, concept
or algorithm works. Drawing
a Flow Chart can also help in
planning and developing a process
or improving an existing one.
7. Pie Charts
Pie Charts help show proportions and percentages
between categories, by dividing a circle into
proportional segments. Each arc length represents a
proportion of each category, while the full circle
represents the total sum of all the data, equal to 100%.
Pie Charts are ideal for giving the reader a quick idea