Data Science Unit 1 Notes
Syllabus
Data Science: Benefits and uses - Facets of data - Defining research goals - Retrieving data - Data preparation - Exploratory data analysis - Build the model - Presenting findings and building applications - Warehousing - Basic statistical descriptions of data.
Data Science
• Data science uses advanced analytical theory and various methods, such as time series analysis, for predicting the future from historical data. Instead of just knowing how many products were sold in the previous quarter, data science helps in forecasting future product sales and revenue more accurately.
• Data science is devoted to the extraction of clean information from raw data to
form actionable insights. Data science practitioners apply machine learning
algorithms to numbers, text, images, video, audio and more to produce artificial
intelligence systems to perform tasks that ordinarily require human intelligence.
• As a general rule, data scientists are skilled in detecting patterns hidden within
large volumes of data and they often use advanced algorithms and implement
machine learning models to help businesses and organizations make accurate
assessments and predictions. Data science and big data evolved from statistics
and traditional data management but are now considered to be distinct disciplines.
• Life cycle of data science:
1. Capture: Data acquisition, data entry, signal reception and data extraction.
3. Process: Data mining, clustering and classification, data modeling and data summarization.
Big Data
• Big data can be defined as very large volumes of data available at various sources, in varying degrees of complexity, generated at different speeds, i.e. velocities, and with varying degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms or any commercial off-the-shelf solutions.
• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools are able to store or process it efficiently.
• The characteristics of big data are volume, velocity and variety. These three dimensions are often referred to as the three V's of big data.
• Veracity refers to the trustworthiness of the data. Can the manager rely on the
fact that the data is representative? Every good manager knows that there are
inherent discrepancies in all the data collected.
• Spatial veracity: For vector data (imagery based on points, lines and polygons), the quality varies. It depends on whether the points have been determined by GPS, by unknown origins or manually. Resolution and projection issues can also alter veracity.
• For geo-coded points, there may be errors in the address tables and in the point
location algorithms associated with addresses.
b) Value :
• The ultimate objective of any big data project should be to generate some sort of value for the company doing the analysis. Otherwise, the user is just performing a technological task for technology's sake.
• For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in spatial phenomena such as climate, traffic, social-media-based attitudes and massive inventory locations.
• Once spatial big data are structured, formal spatial analytics can be applied,
such as spatial autocorrelation, overlays, buffering, spatial cluster techniques
and location quotients.
g) Regression: Predicting food delivery times, predicting home prices based on
amenities
4. Re-develop our products : Big Data can also help us understand how others
perceive our products so that we can adapt them or our marketing, if need be.
1. Social media : Social media is one of the biggest contributors to the flood of data we have today. Facebook generates around 500+ terabytes of data every day in the form of content generated by users, such as status messages, photos and video uploads, messages and comments.
5. Compliance data : Many organizations in healthcare, hospitals, life sciences, finance etc. have to file compliance reports.
Facets of Data
• Big data and data science generate very large amounts of data. This data comes in various types; the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
Structured Data
• Structured data is arranged in a row and column format. This makes it easy for applications to retrieve and process the data. A database management system is used for storing structured data.
• The term structured data refers to data that is identifiable because it is organized
in a structure. The most common form of structured data or records is a database
where specific information is stored based on a methodology of columns and
rows.
• Structured data is also searchable by data type within content. Structured data
is understood by computers and is also efficiently organized for human readers.
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is difficult to retrieve the required information. Unstructured data has no identifiable structure.
• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video or images. Email is an example of unstructured data.
• Even today, in most organizations more than 80 % of the data is in unstructured form. It carries a lot of information, but extracting information from these various sources is a very big challenge.
Natural Language
• Natural language processing is used in applications such as machine translation. It is an iterative process comprising several layers of text analysis.
Machine-Generated Data
• Machine data contains a definitive record of all activity and behavior of our
customers, users, transactions, applications, servers, networks, factory machinery
and so on.
• It's configuration data, data from APIs and message queues, change events, the
output of diagnostic commands and call detail records, sensor data from remote
equipment and more.
• Examples of machine data are web server logs, call detail records, network
event logs and telemetry.
Graph-Based Data
• Nodes represent entities, which can be of any problem domain. By connecting nodes with edges, a graph (network) of nodes is formed.
• Graph theory has proved to be very effective on large-scale datasets such as
social network data. This is because it is capable of by-passing the building of an
actual visual representation of the data to run directly on data matrices.
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Audio and video digital recordings, also referred to as audio and video codecs, can be uncompressed, lossless compressed or lossy compressed depending on the desired quality and use case.
• Data Science is playing an important role to address these challenges in
multimedia data. Multimedia data usually contains various forms of media, such
as text, image, video, geographic coordinates and even pulse waveforms, which
come from multiple sources. Data Science can be a key instrument covering big
data, machine learning and data mining solutions to store, handle and analyze
such heterogeneous data.
Streaming Data
• Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-
game player activity, information from social networks, financial trading floors
or geospatial services and telemetry from connected devices or instrumentation
in data centers.
Data Science Process
• The data science process consists of the following steps:
1. Defining research goals
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presenting findings and building applications
• Step 1: Discovery or defining the research goal
This step involves acquiring data from all the identified internal and external sources, which helps to answer the business question.
It is the collection of the data required for the project. This is the process of gaining a business understanding of the data the user has and deciphering what each piece of data means. This could entail determining exactly what data is required and the best methods for obtaining it. It also entails determining what each of the data points means in terms of the company. If we are given a data set from a client, for example, we will need to know what each column and row represents.
Data can have many inconsistencies, like missing values, blank columns and incorrect data formats, which need to be cleaned. We need to process, explore and condition data before modeling. Clean data gives better predictions.
• The data exploration step is used to gain a deeper understanding of the data: how variables interact with each other, what the distribution of the data is and whether there are outliers. To achieve this, use descriptive statistics, visual techniques and simple modeling. This step is also called Exploratory Data Analysis.
Deliver the final baselined model with reports, code and technical documents in this stage. The model is deployed into a real-time production environment after thorough testing. In this stage, the key findings are communicated to all stakeholders. This helps decide if the project results are a success or a failure based on the inputs from the model.
• To understand the project, three concepts must be understood: what, why and how.
• In this phase, the data science team must learn and investigate the problem,
develop context and understanding and learn about the data sources needed and
available for the project.
• Understanding the domain area of the problem is essential. In many cases, data
scientists will have deep computational and quantitative knowledge that can be
broadly applied across many disciplines.
• Data scientists have deep knowledge of the methods, techniques and ways for
applying heuristics to a variety of business and conceptual problems.
2. Resources :
• As part of the discovery phase, the team needs to assess the resources available
to support the project. In this context, resources include technology, tools,
systems, data and people.
3. Frame the problem :
• Each team member may hear slightly different things related to the needs and
the problem and have somewhat different ideas of possible solutions.
• The team can identify the success criteria, key risks and stakeholders, which
should include anyone who will benefit from the project or will be significantly
impacted by the project.
• When interviewing stakeholders, learn about the domain area and any relevant
history from similar analytics projects.
• The team should plan to collaborate with the stakeholders to clarify and frame
the analytics problem.
• At the outset, project sponsors may have a predetermined solution that may not
necessarily realize the desired outcome.
• In these cases, the team must use its knowledge and expertise to identify the
true underlying problem and appropriate solution.
• When interviewing the main stakeholders, the team needs to take time to
thoroughly interview the project sponsor, who tends to be the one funding the
project or providing the high-level requirements.
• This person understands the problem and usually has an idea of a potential
working solution.
• This step involves forming ideas that the team can test with data. Generally, it
is best to come up with a few primary hypotheses to test and then be creative
about developing several more.
• These initial hypotheses form the basis of the analytical tests the team will use in later phases and serve as the foundation for the findings.
Retrieving Data
• Much high-quality data is freely available for public and commercial use. Data can be stored in various formats, such as text files and tables in a database. Data may be internal or external.
1. Start working on internal data, i.e. data stored within the company
• The first step for data scientists is to verify the internal data. Assess the relevance and quality of the data that is readily available in the company. Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories such as databases, data marts, data warehouses and data lakes maintained by a team of IT professionals.
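As a minimal sketch of this retrieval step (the file name, database and table names are hypothetical, not from the notes), internal data stored as a text file or as a database table could be loaded with pandas:

```python
import pandas as pd
from sqlalchemy import create_engine

# Retrieve data stored as a text (CSV) file.
sales = pd.read_csv("sales_2023.csv")            # hypothetical file name

# Retrieve data stored as a table in a relational database.
engine = create_engine("sqlite:///company.db")   # hypothetical connection string
customers = pd.read_sql("SELECT * FROM customers", engine)  # hypothetical table

print(sales.head())       # quick look at the first rows
print(customers.dtypes)   # verify that columns have the expected data types
```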
• A data repository is a large database infrastructure - several databases - that collects, manages and stores data sets for data analysis, sharing and reporting.
• Data repository can be used to describe several ways to collect and store data:
a) Data warehouse is a large data repository that aggregates data usually from
multiple sources or segments of a business, without the data being necessarily
related.
b) Data lake is a large data repository that stores unstructured data that is
classified and tagged with metadata.
c) Data marts are subsets of the data repository. These data marts are more
targeted to what the data user needs and easier to use.
d) Metadata repositories store data about data and databases. The metadata explains where the data comes from, how it was captured and what it represents.
e) Data cubes are lists of data with three or more dimensions stored as a table.
ii. Data isolation allows for easier and faster data reporting.
iii. Unauthorized users can access all sensitive data more easily than if it was
distributed across several locations.
• If the required data is not available within the company, take the help of other companies that provide such databases. For example, Nielsen and GfK provide data for the retail industry. Data scientists can also take help from Twitter, LinkedIn and Facebook.
• Government organizations share their data for free with the world. This data can be of excellent quality, depending on the institution that creates and manages it. The information they share covers a broad range of topics, such as the number of accidents or the amount of drug abuse in a certain region and its demographics.
• Allocate some time for data correction and data cleaning. Collecting suitable, error-free data is key to the success of a data science project.
• Most of the errors encountered during the data gathering phase are easy to spot, but being too careless will make data scientists spend many hours solving data issues that could have been prevented during data import.
• Data scientists must investigate the data during the import, data preparation
and exploratory phases. The difference is in the goal and the depth of the
investigation.
• In the data retrieval process, verify whether the data is of the right data type and is the same as in the source document.
• With the data preparation process, more elaborate checks are performed: check whether any shortcut encodings have been used and, for example, whether time and date formats are consistent.
• During the exploratory phase, the data scientist's focus shifts to what can be learned from the data. At this point the data is assumed to be clean, and the statistical properties such as distributions, correlations and outliers are examined.
Data Preparation
• Missing values: These dirty data will affect the mining procedure and lead to unreliable and poor output. Therefore it is important to apply data cleaning routines. For example, suppose that the average salary of staff is Rs. 65,000; use this value to replace a missing value for salary.
• Data entry errors: Data collection and data entry are error-prone processes.
They often require human intervention and because humans are only human, they
make typos or lose their concentration for a second and introduce an error into the
chain. But data collected by machines or computers isn't free from errors either.
Errors can arise from human sloppiness, whereas others are due to machine or
hardware failure. Examples of errors originating from machines are transmission
errors or bugs in the extract, transform and load phase (ETL).
• Whitespace errors: Whitespaces tend to be hard to detect but cause errors just as other redundant characters would. To remove the spaces present at the start and end of a string, we can use the strip() function on the string in Python.
• Python also provides string conversions, such as converting a string to lowercase or uppercase using lower() and upper().
• The lower() function in Python converts the input string to lowercase. The upper() function in Python converts the input string to uppercase.
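A minimal sketch of these string-cleaning operations (the example value is made up):

```python
# Remove leading/trailing whitespace and normalize case in a raw text field.
raw_name = "  Data Science  "

clean_name = raw_name.strip()    # "Data Science" - surrounding spaces removed
lower_name = clean_name.lower()  # "data science"
upper_name = clean_name.upper()  # "DATA SCIENCE"

print(repr(clean_name), lower_name, upper_name)
```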
Outlier
• Fig. 1.6.1 shows outlier detection. Here O1 and O2 appear to be outliers from the rest of the data.
• Outlier analysis and detection have various applications in numerous fields, such as fraud detection, credit card misuse, discovering computer intrusions and criminal behaviours, medical and public health outlier detection and industrial damage detection.
• The general idea in these applications is to find data that deviates from the normal behaviour of the data set.
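As a small illustrative sketch of this idea (the sample values are made up, and the 1.5 x IQR rule is just one common convention, not a method prescribed by the notes):

```python
import numpy as np

values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])  # 95 deviates strongly

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Rule of thumb: points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(outliers)  # -> [95]
```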
Dealing with Missing Value
• Dirty data affects the mining procedure and leads to unreliable and poor output. Therefore it is important to apply data cleaning routines. Common approaches for dealing with missing values are:
1. Ignore the tuple: This is usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values.
3. Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant.
4. Use the attribute mean to fill in the missing value: For example, suppose
that the average salary of staff is Rs 65000/-. Use this value to replace the
missing value for salary.
5. Use the attribute mean for all samples belonging to the same class as the
given tuple.
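A minimal sketch of method 4 (attribute-mean imputation) using pandas; the column names and salary values are hypothetical:

```python
import pandas as pd

staff = pd.DataFrame({
    "name":   ["Asha", "Ravi", "Meena", "John"],
    "salary": [60000, None, 70000, None],   # two missing salary values
})

# Replace missing salaries with the attribute (column) mean.
mean_salary = staff["salary"].mean()        # 65000.0 for this data
staff["salary"] = staff["salary"].fillna(mean_salary)

print(staff)
```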
• If an error is not corrected in an early stage of the project, it creates problems in later stages. Most of the time is spent on finding and correcting errors. Retrieving data is a difficult task and organizations spend millions of dollars on it in the hope of making better decisions. The data collection process is error-prone and, in a big organization, it involves many steps and teams.
b) If errors are not corrected early on in the process, the cleansing will have to
be done for every project that uses that data.
c) Data errors may point to a business process that isn't working as designed.
1. Joining table
• Joining tables allows the user to combine the information of one observation found in one table with the information found in another table. The focus is on enriching a single observation.
• A primary key is a value that cannot be duplicated within a table. This means
that one value can only be seen once within the primary key column. That same
key can exist as a foreign key in another table which creates the relationship. A
foreign key can have duplicate instances within a table.
• Fig. 1.6.2 shows Joining two tables on the CountryID and CountryName
keys.
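A minimal sketch of joining two tables on a key column with pandas, loosely following the idea of Fig. 1.6.2 (the CountryID values and column names are made up):

```python
import pandas as pd

countries = pd.DataFrame({
    "CountryID":   [1, 2, 3],
    "CountryName": ["India", "Germany", "Japan"],
})
sales = pd.DataFrame({
    "CountryID": [1, 1, 3],
    "Amount":    [250, 400, 150],
})

# CountryID is the primary key in `countries` and a foreign key in `sales`.
enriched = sales.merge(countries, on="CountryID", how="left")
print(enriched)  # each sale now carries its CountryName
```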
2. Appending tables
• Appending table is called stacking table. It effectively adding observations
from one table to another table. Fig. 1.6.3 shows Appending table. (See Fig.
1.6.3 on next page)
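A small sketch of appending (stacking) two tables with pandas, assuming hypothetical monthly sales tables:

```python
import pandas as pd

january = pd.DataFrame({"product": ["A", "B"], "units": [10, 7]})
february = pd.DataFrame({"product": ["A", "C"], "units": [12, 5]})

# Stacking adds the rows of one table underneath the other.
jan_feb = pd.concat([january, february], ignore_index=True)
print(jan_feb)  # 4 rows: January observations followed by February observations
```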
• Using a view avoids the duplication of data that appending creates. An appended table requires more storage space; if the table size is in terabytes, duplicating the data becomes problematic. For this reason, the concept of a view was invented.
• Fig. 1.6.4 shows how the sales data from the different months is combined
virtually into a yearly sales table instead of duplicating the data.
Transforming Data
• Reducing the number of variables: Having too many variables in the model makes the model difficult to handle, and certain techniques don't perform well when they are overloaded with too many input variables.
Euclidean distance :
• The Euclidean distance between two points (x1, y1) and (x2, y2) is d = √((x1 − x2)² + (y1 − y2)²).
• Variables can be turned into dummy variables. Dummy variables can only take two values: true (1) or false (0). They're used to indicate the absence or presence of a categorical effect that may explain the observation.
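A minimal sketch of creating dummy variables with pandas (the weekday data is made up):

```python
import pandas as pd

observations = pd.DataFrame({
    "weekday": ["Mon", "Tue", "Mon", "Sat"],
    "sales":   [120, 135, 128, 90],
})

# Each weekday value becomes its own true (1) / false (0) column.
dummies = pd.get_dummies(observations, columns=["weekday"], dtype=int)
print(dummies)
```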
Exploratory Data Analysis
• EDA is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers users need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis or check assumptions.
1. Maximize insight into a data set;
• Box plots are an excellent tool for conveying location and variation information
in data sets, particularly for detecting and illustrating location and variation
changes between different groups of data.
1. Univariate analysis: Provides summary statistics for each field in the raw data set (or a summary of only one variable). Ex : CDF, PDF, box plot.
2. Bivariate analysis: Performed to find the relationship between each variable in the data set and a second variable of interest (or interactions between two variables).
3. Multivariate analysis: Performed to understand interactions between different fields in the data set (or interactions among more than two variables).
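A short sketch of univariate and multivariate exploration with pandas (the data set and column names are hypothetical):

```python
import pandas as pd

houses = pd.DataFrame({
    "price": [250000, 310000, 275000, 420000, 390000],
    "sqft":  [1400, 1800, 1550, 2400, 2200],
    "rooms": [3, 4, 3, 5, 4],
})

# Univariate analysis: summary statistics for each field.
print(houses.describe())

# Multivariate analysis: pairwise correlations between the variables.
print(houses.corr())
```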
• A box plot is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and its skewness by displaying the data quartiles (or percentiles) and averages.
2. Lower quartile : 25% of scores fall below the lower quartile value.
3. Median: The median marks the mid-point of the data and is shown by the
line that divides the box into two parts.
6. Whiskers: The upper and lower whiskers represent scores outside the
middle 50%.
7. The interquartile range: This is the box of the box plot, showing the middle 50% of scores.
• Box plots are also extremely useful for visually checking group differences. Suppose we have four groups of scores and we want to compare them by teaching method. Teaching method is our categorical grouping variable and score is the continuous outcome variable that the researchers measured.
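A minimal sketch of such a grouped box plot with pandas and matplotlib (the scores and method labels are made up):

```python
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.DataFrame({
    "method": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "score":  [65, 70, 72, 80, 85, 78, 55, 60, 58, 90, 88, 93],
})

# One box per teaching method; group differences are easy to compare visually.
scores.boxplot(column="score", by="method")
plt.show()
```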
Build the Models
• To build the model, the data should be clean and its content properly understood. The components of model building are as follows:
b) Execution of model
• For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors:
2. How difficult is the maintenance of the model: how long will it remain relevant if left untouched?
Model Execution
• Various programming languages are used for implementing models. For model execution, Python provides libraries like StatsModels and scikit-learn. These packages implement several of the most popular techniques.
1. SAS enterprise miner: This tool allows users to run predictive and
descriptive models based on large volumes of data from across the enterprise.
4. Alpine miner: This tool provides a GUI front end for users to develop
analytic workflows and interact with Big Data tools and platforms on the back
end.
Try to build multiple models and then select the best one based on multiple criteria. Working with a holdout sample helps the user pick the best-performing model.
• In the holdout method, the data is split into two different datasets, labeled as a training and a testing dataset. This can be a 60/40, 70/30 or 80/20 split. This technique is called the hold-out validation technique.
Suppose we have a database with house prices as the dependent variable and two
independent variables showing the square footage of the house and the number of
rooms. Now, imagine this dataset has 30 rows. The whole idea is that you build a
model that can predict house prices accurately.
• To 'train' our model, or see how well it performs, we randomly subset 20 of those rows and fit the model. The second step is to predict the values of those 10 rows that we excluded and measure how good our predictions were.
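A hedged sketch of this holdout idea with scikit-learn, using a synthetic 30-row data set (the numbers are made up, not from the notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data set of 30 houses: square footage and number of rooms -> price.
sqft = rng.uniform(800, 3000, size=30)
rooms = rng.integers(2, 6, size=30)
price = 150 * sqft + 10000 * rooms + rng.normal(0, 20000, size=30)
X = np.column_stack([sqft, rooms])

# Hold out 10 of the 30 rows for testing; fit the model on the other 20.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=10, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
print("Mean absolute error on held-out rows:", mean_absolute_error(y_test, predictions))
```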
• As a rule of thumb, experts suggest randomly sampling 80% of the data into the training set and 20% into the test set.
• The holdout method has two basic drawbacks:
• The team delivers final reports, briefings, code and technical documents.
• The last stage of the data science process is where the user's soft skills will be most useful.
Data Mining
Functions of Data Mining
5. Data evolution analysis describes and models regularities for objects whose behaviour changes over time. It may include characterization, discrimination, association, classification or clustering of time-related data.
Data mining tasks can be classified into two categories: descriptive and
predictive.
• Predictive mining involves the supervised learning functions used for the prediction of the target value. The methods that fall under this category are classification, time series analysis and regression.
• Data modeling is a necessity for predictive analysis, which works by utilizing some variables to anticipate unknown future data values for other variables.
• Historical and transactional data are used to identify patterns and statistical
models and algorithms are used to capture relationships in various datasets.
• Predictive analytics has taken off in the big data era and there are many tools
available for organisations to predict future outcomes.
• Two primary techniques are used for reporting past events : data aggregation
and data mining.
• It presents past data in an easily digestible format for the benefit of a wide
business audience.
• A set of techniques for reviewing and examining the data set to understand
the data and analyze business performance.
• It also helps to describe and present data in such format, which can be easily
understood by a wide variety of business readers.
• Fig. 1.10.1 (See on next page) shows typical architecture of data mining
system.
• Components of data mining system are data source, data warehouse server,
data mining engine, pattern evaluation module, graphical user interface and
knowledge base.
• Knowledge base is helpful in the whole data mining process. It might be useful
for guiding the search or evaluating the interestingness of the result patterns. The
knowledge base might even contain user beliefs and data from user experiences
that can be useful in the process of data mining.
• The data mining engine is the core component of any data mining system. It
consists of a number of modules for performing data mining tasks including
association, classification, characterization, clustering, prediction, time-series
analysis etc.
• The graphical user interface module communicates between the user and the
data mining system. This module helps the user use the system easily and
efficiently without knowing the real complexity behind the process.
• When the user specifies a query or a task, this module interacts with the data
mining system and displays the result in an easily understandable manner.
Classification of DM System
Data Warehousing
• Databases and data warehouses are related but not the same.
"How are organizations using the information from data warehouses ?"
• Most of the organizations makes use of this information for taking business
decision like :
1. Subject-oriented : Data are organized based on how the users refer to them. A data warehouse can be used to analyse a particular subject area. For example, "sales" can be a particular subject.
3. Non-volatile: Data are stored in read-only format and do not change over time.
Typical activities such as deletes, inserts and changes that are performed in an
operational application environment are completely non-existent in a DW
environment.
4. Time variant : Data are not current but normally time series. Historical
information is kept in a data warehouse. For example, one can retrieve files from
3 months, 6 months, 12 months or even previous data from a data warehouse.
2. End users are time-sensitive and desire speed-of-thought response
a) Single-tier architecture.
b) Two-tier architecture.
c) Three-tier architecture.
• Single-tier warehouse architecture focuses on creating a compact data set and minimizing the amount of data stored. While it is useful for removing redundancies, it is not effective for organizations with large data needs and multiple streams.
• Three-tier architecture creates a more structured flow for data from raw sets to actionable insights. It is the most widely used architecture for data warehouse systems.
• Fig. 1.11.1 shows the three-tier architecture, which is sometimes called multi-tier architecture.
• The bottom tier is the database of the warehouse, where the cleansed and
transformed data is loaded. The bottom tier is a warehouse database server.
• The middle tier is the application layer giving an abstracted view of the database.
It arranges the data to make it more suitable for analysis. This is done with an
OLAP server, implemented using the ROLAP or MOLAP model.
• The top tier represents the front-end client layer. This layer includes the tools and Application Programming Interfaces (APIs) used for high-level data analysis, querying and reporting. Users can use reporting, query, analysis or data mining tools.
2) Store historical data: Data warehouse is required to store the time variable
data from the past. This input is made to be used for various purposes.
3) Make strategic decisions: Some strategies may be depending upon the data in
the data warehouse. So, data warehouse contributes to making strategic
decisions.
4) For data consistency and quality: By bringing the data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
5) High response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.
f) Data warehousing provides the capability to analyze a large amount of historical data.
• Metadata is simply defined as data about data. The data that is used to represent
other data is known as metadata. In data warehousing, metadata is one of the
essential aspects.
c) Metadata acts as a directory. This directory helps the decision support system
to locate the contents of a data warehouse.
• In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data.
a) First, it acts as the glue that links all parts of the data warehouses.
b) Next, it provides information about the contents and structures to the
developers.
c) Finally, it opens the doors to the end-users and makes the contents
recognizable in their terms.
Basic Statistical Descriptions of Data
• Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
1. Mean :
• The mean of a data set is the average of all the data values. The sample mean x̄ is the point estimator of the population mean μ:
x̄ = (sum of the values of the n observations) / (number of observations in the sample) = (Σ xi) / n
2. Median :
• The median of a data set is the value in the middle when the data items are
arranged in ascending order. Whenever a data set has extreme values, the median
is the preferred measure of central location.
• The median is the measure of location most often reported for annual income and property value data. A few extremely large incomes or property values can inflate the mean.
Example: 8 observations = 26, 18, 29, 12, 14, 27, 30, 19
Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29, 30
With an even number of observations, the median is the average of the two middle values: (19 + 26) / 2 = 22.5
3. Mode:
• The mode of a data set is the value that occurs with the greatest frequency. The greatest frequency can occur at two or more different values. If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are multimodal.
• Trimmed mean: A major problem with the mean is its sensitivity to extreme
(e.g., outlier) values. Even a small number of extreme values can corrupt the
mean. The trimmed mean is the mean obtained after cutting off values at the
high and low extremes.
• For example, we can sort the values and remove the top and bottom 2 % before
computing the mean. We should avoid trimming too large a portion (such as 20
%) at both ends as this can result in the loss of valuable information.
• Holistic measure is a measure that must be computed on the entire data set as
a whole. It cannot be computed by partitioning the given data into subsets and
merging the values obtained for the measure in each subset.
• Third quartile (Q3): The third quartile is the value, where 75 % of the
values are smaller than Q3 and 25% are larger.
• The box plot is a useful graphical display for describing the behaviour of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles. If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range or IQR.
Variance :
• The variance is a measure of variability that utilizes all the data. It is based on the difference between the value of each observation (xi) and the mean (x̄ for a sample, μ for a population).
• The variance is the average of the squared differences between each data value and the mean.
Standard Deviation :
• The standard deviation of a data set is the positive square root of the variance. It is measured in the same units as the data, making it more easily interpreted than the variance.
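A small sketch computing these descriptive statistics in Python (the salary values are made up):

```python
import statistics
from scipy import stats

salaries = [52000, 58000, 61000, 65000, 65000, 70000, 74000, 120000]

print("Mean:", statistics.mean(salaries))
print("Median:", statistics.median(salaries))
print("Mode:", statistics.mode(salaries))                     # most frequent value
print("Trimmed mean (12.5% cut at each end):", stats.trim_mean(salaries, 0.125))
print("Sample variance:", statistics.variance(salaries))      # divides by n - 1
print("Sample standard deviation:", statistics.stdev(salaries))
```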
1. Scatter diagram
• Scatter diagrams can be used to find out whether or not two sets of data are connected. For example, a scatter diagram can show the relationship between children's age and height.
• While a scatter diagram shows relationships, it does not by itself prove that one variable causes the other. In addition to showing possible cause-and-effect relationships, a scatter diagram can show that two variables stem from a common cause that is unknown or that one variable can be used as a surrogate for the other.
2. Histogram
• To construct a histogram from a continuous variable, you first need to split the data into intervals, called bins. Each bin contains the number of occurrences of scores in the data set that fall within that bin.
• The width of each bar is proportional to the width of each category and the
height is proportional to the frequency or percentage of that category.
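A minimal sketch of such a histogram with matplotlib (the ages are synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
ages = rng.normal(loc=35, scale=10, size=200)   # synthetic continuous variable

# Split the continuous variable into 10 bins and count occurrences per bin.
plt.hist(ages, bins=10, edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Histogram of ages")
plt.show()
```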
3. Line graphs
• Line graphs are usually used to show time series data, that is, how one or more variables vary over a continuous period of time. They can also be used to compare two different variables over time.
• Typical examples of the types of data that can be presented using line graphs
are monthly rainfall and annual unemployment rates.
• Line graphs are particularly useful for identifying patterns and trends in the data, such as seasonal effects, large changes and turning points. Fig. 1.12.1 shows a line graph. (See Fig. 1.12.1 on next page.)
• As well as time series data, line graphs can also be appropriate for displaying
data that are measured over other continuous variables such as distance.
• For example, a line graph could be used to show how pollution levels vary with
increasing distance from a source or how the level of a chemical varies with depth
of soil.
• In a line graph, the x-axis represents the continuous variable (for example, year or distance from the initial measurement) whilst the y-axis has a scale and indicates the measurement.
• Several data series can be plotted on the same line chart and this is particularly
useful for analysing and comparing the trends in different datasets.
• A line graph is often used to visualize the rate of change of a quantity. It is more useful when the given data has peaks and valleys. Line graphs are very simple to draw and quite convenient to interpret.
4. Pie charts
• A pie chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole. Each sector shows the relative size of each value.
• A pie chart displays data, information and statistics in an easy to read "pie
slice" format with varying slice sizes telling how much of one data element
exists.
• A pie chart is also known as a circle graph. The bigger the slice, the more of that particular data was gathered. The main use of a pie chart is to show comparisons. Fig. 1.12.2 shows a pie chart. (See Fig. 1.12.2 on next page.)
• Various applications of pie charts can be found in business, at school and at home. In business, pie charts can be used to show the success or failure of certain products or services.
• At school, pie chart applications include showing how much time is allotted to each subject. At home, pie charts can be useful for seeing how monthly income is spent on different needs.
• Reading a pie chart is as easy as figuring out which slice of an actual pie is the biggest.
• Pie charts are simple to understand and read.