
Understanding Data

Chapter 7 of the document discusses the fundamentals of data, including its importance, types (structured and unstructured), collection, storage, processing, and statistical techniques for analysis. It highlights various examples of data usage and explains key statistical measures such as mean, median, mode, range, and standard deviation. The chapter emphasizes the necessity of processing data to derive useful information for decision-making and mentions programming tools like Python for data analysis.

Department of Computer Science, PPUC, Udupi

CHAPTER-7
UNDERSTANDING DATA
• Introduction to Data
• Data Collection
• Data Storage
• Data Processing
• Statistical Techniques for Data Processing

7.1 Introduction to data


• Data is a collection of characters, numbers, and other symbols.
• E.g.: census data, bank data, placement data, shopping data, online posts, medical data,
images/videos, satellite data, documents and web pages, signals generated by sensors.
• "Data" is the plural form; the singular is "datum".
• Why is data important?
Important decisions are based on data because large amounts of data can uncover
patterns and useful information.
  • Large amounts of data are generated through digital devices.
  • The speed of data generation is increasing.
  • Computers make processing and analyzing data easier.
  • Data needs to be processed and analyzed before decisions can be made.
  • Data visualization and summarization help in understanding the data.
  • Examples:
- Cab price comparison
- Happy hours (discounted prices in restaurants)
- Withdrawing money from an ATM
- Vaccine effectiveness analysis
- Satellite data monitoring (cyclones)
- Market behavior tracking
- Voting results
- Pharmaceutical data while trying out a new medicine to see its effectiveness.
- Search engine results data

Types of data
Data can come from different sources, so it can be in different formats.
E.g.
• An image is a collection of pixels
• A video is made up of frames
• A fee slip is made up of numeric and non-numeric data
• A chat is made up of text, icons, images, and videos

Structured data
• Data which is organized and can be recorded in a well-defined format is called
structured data.
• Stored in tabular form (rows and columns)

• Each row represents an observation or record
• Each column represents an attribute or property of the observation
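
As a small illustration of the row and column layout, the sketch below reads a tiny table with Python's standard csv module. The column names and student records are made up for this example and are not taken from any real dataset.

```python
import csv
import io

# Hypothetical structured data: the first line names the attributes (columns),
# and every following line is one observation (row).
raw = """roll_no,name,height_cm
1,Asha,102
2,Ravi,110
3,Meena,90
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Each row is one record; each key is one attribute of that record.
for row in rows:
    print(row["roll_no"], row["name"], row["height_cm"])

# A column is the same attribute taken across all rows.
heights = [int(row["height_cm"]) for row in rows]
print(heights)
```

Because every record has the same attributes, a whole column can be pulled out and summarized, which is what the statistical techniques later in the chapter rely on.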

Unstructured data
• No fixed format or pattern in which the data is stored
• Data that is not in a row and column structure is called unstructured data.
• Examples: news, emails, websites, text documents, business reports, books, audio,
video, social media messages.
• Unstructured data is described using metadata (data about data)
• Metadata for an image: type, size, resolution, etc.
• Metadata for an email: sender, receiver, subject, attachments, etc.
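
To make the idea of metadata concrete, here is a minimal sketch that separates an email's metadata (its headers) from its unstructured body using Python's standard email package. The addresses and message text are invented for illustration.

```python
from email import message_from_string

# Hypothetical email: the headers are metadata, the body is unstructured text.
raw_email = """From: teacher@example.com
To: student@example.com
Subject: Chapter 7 notes

Please read the section on measures of central tendency before Monday.
"""

msg = message_from_string(raw_email)

# Metadata: data about the message, not the message content itself.
metadata = {
    "sender": msg["From"],
    "receiver": msg["To"],
    "subject": msg["Subject"],
}
print(metadata)

# The body has no fixed format; it is free text.
print(msg.get_payload())
```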

Data Collection
• Data collection means identifying and collecting data from various sources
• The data may be on paper (in files or registers), may already be in a digital
format such as a CSV file, or may be collected through software.
• Data needs to be gathered before it can be analysed
• Examples:
• Collecting patient data to improve services
• Collecting online posts to analyze public opinion before an election
• Collecting shopping data to find the items customers buy most frequently

Data Storage
• Data can be stored for future use.
• Large amounts of data are being generated, so storing data has become a challenging task.
• Storage devices are used to store data for later use
• Data is stored as files or in the form of databases
• Examples of storage devices:
• Hard Disk Drive
• Solid State Drive
• Pen drives
• Memory cards

Data Processing
• It is not possible to make decisions just by looking at a vast amount of raw data.
• Data must be processed to derive useful information
• The information derived from data is then analysed to make decisions

Statistical Techniques for Data Processing


• Statistical techniques are used for data summarization
• Summarization methods are applied to tabular data.
• Commonly used statistical techniques for data summarization are given below:
1. Measures of Central tendency
• Mean, Median, Mode
2. Measures of variability
• Range and standard deviation

Measures of Central Tendency


• A measure of central tendency is a single value that represents the centre or typical value of the data.

Mean (average)
• Average of numeric values present in an attribute
• Formula: Sum of all the values / Number of values

• Not suitable for data containing outliers
• An outlier is a value that is very large or very small compared to the other values in the data
• Outliers are usually treated as errors because they can change the result drastically
• Remove outliers before calculating the mean
• Definition: Given n values x1, x2, x3, ..., xn, the mean is computed as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Example: Assume that the heights (in cm) of students in a class are as follows:
[90,102,110,115,85,90,100,110,110]. The mean or average height of the class is
(90 + 102 + 110 + 115 + 85 + 90 + 100 + 110 + 110) / 9 = 912 / 9 = 101.33 cm
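
The calculation above can be checked with a few lines of Python. This is a minimal sketch: the helper name `mean` is our own, and the value 1100 is a made-up data-entry error added only to show how a single outlier shifts the result.

```python
heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]


def mean(values):
    """Sum of all the values divided by the number of values."""
    return sum(values) / len(values)


print(round(mean(heights), 2))  # 101.33

# Hypothetical outlier: the last height typed as 1100 instead of 110.
with_outlier = heights[:-1] + [1100]
print(round(mean(with_outlier), 2))  # 211.33
```

One wrong value more than doubles the mean, which is why outliers are handled before the mean is calculated.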

Median
• When all the values are sorted in ascending or descending order, the middle value is
called the median.
• Computed on the values of a single attribute
• If there is an odd number of values, the median is the middle value.
• If there is an even number of values, the median is the average of the two middle
values.
• The median divides the sorted list into two equal parts.
• E.g.: the sorted heights are [85,90,90,100,102,110,110,110,115].
• As there are 9 values in total (an odd number), the median is the value at position 5,
that is, 102 cm.
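
The odd and even cases can be sketched in Python as follows. The function name `median` is our own, and the even-sized list is a hypothetical variation made by dropping the last height.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]
print(median(heights))  # 102

# Hypothetical even-sized case: the same heights without the last value.
print(median(heights[:-1]))  # 101.0
```

Note that position 5 in the text counts from 1, while Python list indexes count from 0, so the middle of 9 values is index 4.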

Mode
• Value that appears the maximum number of times in an attribute
• Computed using the frequency of occurrence of distinct values
• If each value occurs only once, there is no mode
• If more than one value has the maximum frequency, there are
multiple modes
• Applicable to both numeric and non-numeric data
• E.g.: [85,90,90,100,102,110,110,110,115]. Mode = 110, which occurs three times.
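
A minimal sketch of these rules, using `collections.Counter` to count frequencies. The function name `modes` and the list of travel choices are hypothetical, and the function assumes a non-empty list.

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency, or [] if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # each value occurs once, so there is no mode
    return [value for value, count in counts.items() if count == highest]


heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]
print(modes(heights))  # [110]

# Non-numeric data with two modes (made-up survey answers).
print(modes(["bus", "car", "bus", "car", "walk"]))  # ['bus', 'car']

print(modes([1, 2, 3]))  # []
```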

Measures of variability
• Describe the variation of values around the mean
• Also called measures of dispersion; they indicate the degree of diversity in
the data
• Two datasets with the same measures of central tendency can have very different
measures of variability, and vice versa
• Examples: range and standard deviation
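
The point about two datasets sharing a centre but differing in spread can be seen with a small sketch. The two lists of marks are invented for illustration, and `statistics.pstdev` is used on the assumption that the standard deviation divides by n.

```python
import statistics

# Hypothetical marks for two classes with the same mean and median.
class_a = [48, 49, 50, 51, 52]
class_b = [10, 30, 50, 70, 90]

for name, marks in (("A", class_a), ("B", class_b)):
    print(
        name,
        statistics.mean(marks),              # central tendency: 50 for both
        statistics.median(marks),            # central tendency: 50 for both
        max(marks) - min(marks),             # range: 4 vs 80
        round(statistics.pstdev(marks), 2),  # standard deviation: 1.41 vs 28.28
    )
```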

Range
• Difference between the largest value and the smallest value
• Computed only for numerical data
• Easily affected by the presence of outliers
• Example: heights of students

4
Department of Computer Science, PPUC, Udupi

• [90,102,110,115,85,90,100,110,110]
• Range: 115-85 = 30 cm
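
A short sketch of the same calculation in Python. The helper name `value_range` is our own, and 1100 is a made-up outlier used only to show how strongly the range reacts to one extreme value.

```python
heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]


def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


print(value_range(heights))  # 30

# Hypothetical outlier appended to the data.
print(value_range(heights + [1100]))  # 1015
```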

Standard deviation
• Uses all the values for calculating the spread of data
• Computed only for numerical data
• Small value of standard deviation indicates less variation in data
• Large value of standard deviation indicates large variation in data
• Given n values x1, x2, x3, ..., xn and their mean $\bar{x}$, the standard deviation,
represented as σ (Greek letter sigma), is computed as
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

• It is important to understand statistical techniques so that one can decide which
statistical technique to use to arrive at a decision.
• Different programming tools are available for efficient analysis of large volumes of
summary data. These tools make use of statistical techniques for data analysis.
• One such programming tool is Python and it has libraries specially built for data
processing and analysis.
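
As one possible illustration, the standard deviation of the student heights used earlier can be computed in plain Python and compared with the standard library. This sketch assumes the population form of the formula, which divides by n; the function name `std_dev` is our own.

```python
import math
import statistics

heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]


def std_dev(values):
    """Square root of the average squared deviation from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / n)


sigma = std_dev(heights)
print(round(sigma, 2))  # 10.21

# statistics.pstdev is the standard library's population standard deviation.
print(math.isclose(sigma, statistics.pstdev(heights)))  # True
```

If the data were a sample from a larger group, `statistics.stdev`, which divides by n - 1, would be the usual choice instead.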
