Data Analytics Unit-I
Tech I- Semester
Data Analytics
UNIT-I
Data Management: Design Data Architecture and manage the data for analysis, understand various
sources of Data like Sensors/Signals/GPS etc. Data Management, Data Quality (noise, outliers, missing
values, duplicate data) and Data Preprocessing.
Data Management:
Data Management is an administrative process that includes acquiring, validating, storing, protecting and
processing required data to ensure the accessibility, reliability and timeliness of the data for its users.
Design Data Architecture and manage the Data for analysis
Data architecture describes how data
o is collected,
o stored,
o arranged,
o integrated, and
o put to use
in data systems and in organizations.
• Data is usually one of several architecture domains that form the pillars of an enterprise architecture
or solution architecture.
Various constraints and influences that will have an effect on data architecture design are
• enterprise requirements
• technology drivers
• economics
• business policies
• Data processing needs.
Enterprise requirements
• These are also important factors that must be considered during the data architecture phase.
• It is possible that some solutions, while optimal in principle, may not be potential candidates
due to their cost.
• External factors such as
o the business cycle,
o interest rates,
o market conditions, and
o legal considerations
could all have an effect on decisions relevant to data architecture.
Business policies
• These include
o accurate and reproducible transactions performed in high volumes,
o data warehousing for the support of management information systems (and
potential data mining),
o repetitive periodic reporting,
o ad hoc reporting,
o support of various organizational initiatives as required (e.g., annual budgets, new
product development).
General Approach
The general approach is based on designing the architecture at three levels of specification.
Various Sources of Data
• Sensor Data: Sensor data is the output of a device that detects and responds to some
type of input from the physical environment. The output may be used to provide
information or input to another system or to guide a process.
• The Global Positioning System (GPS) has been developed in order to allow accurate
determination of geographical locations by military and civil users. It is based on the use
of satellites in Earth orbit that transmit information which allows the distance between
the satellites and the user to be measured.
• Social networking sites: Facebook, Google, LinkedIn - all these sites generate huge
amounts of data on a day-to-day basis, as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs
from which users' buying trends can be traced.
• Weather Stations: Weather stations and satellites give huge amounts of data, which are
stored and manipulated to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish
their plans accordingly; for this they store the data of their millions of users.
• Share Market: Stock exchanges across the world generate huge amounts of data through
their daily transactions.
Observation method
▪ The observation method involves human or mechanical observation of what people
actually do or what events take place during a buying or consumption situation.
▪ “Information is collected by observing a process at work.”
▪ The following are a few situations:
o Service Stations-
▪ Pose as a customer,
▪ go to a service station and observe.
o To evaluate the effectiveness of the display of Dunlop Pillow Cushions-
o In a departmental store, the observer notes:
▪ how many pass by;
▪ how many stop to look at the display;
▪ how many decide to buy.
o Super Market-
▪ Which is the best location on the shelf? Hidden cameras are used.
▪ To determine the typical sales arrangement and find out the sales enthusiasm
shown by various salesmen, normally this is done by an investigator using a
concealed tape-recorder.
▪ Advantages of Observation Method
o If the researcher observes and records events, it is not necessary to rely on the
willingness and ability of respondents to report accurately.
o The biasing effect of interviewers is either eliminated or reduced. Data
collected by observation are, thus, more objective and generally more
accurate.
▪ Disadvantages of Observation Method
o The most limiting factor in the use of the observation method is the inability to
observe such things as attitudes, motivations, customers'/consumers' state of mind,
their buying motives and their images.
o It also takes time for the investigator to wait for a particular action to take
place.
o Personal and intimate activities, such as watching television late at night, are
more easily discussed with questionnaires than they are observed.
o Cost is the final disadvantage of the observation method.
o Under most circumstances, observational data are more expensive to obtain
than other survey data.
o The observer has to wait, doing nothing, between the events to be observed.
o This unproductive time is an increased cost.
Survey Method
There are mainly 4 methods by which we can collect data through the Survey Method
• Telephonic Interview
• Personal Interview
• Mail Interview
• Electronic Interview
Telephonic Interview
• Best method for quickly gathering needed information.
• Responses are collected from the respondents by the researcher on telephone.
• Advantages of Telephonic Interview
o It is a very fast method of data collection.
o It has the advantage over the “Mail Questionnaire” of permitting the interviewer
to talk to one or more persons and to clarify his questions if they are not
understood.
o The response rate of telephone interviewing seems to be a little better than that
of mail questionnaires.
o The quality of information is better.
o It is a less costly method and there are fewer administration problems.
• Disadvantages of Telephonic Interview
o It cannot handle interviews which need props.
o It cannot handle unstructured interviews.
o It cannot be used for questions which require long descriptive answers.
o Respondents cannot be observed.
o People are reluctant to disclose personal information on the telephone.
o People who do not have a telephone facility cannot be approached.
Personal Interviewing
• It is the most versatile of all the methods. It is used when props are required
along with the verbal response; non-verbal responses can also be observed.
Mail Survey
• Questionnaires are sent to the respondents; they fill it up and send it back.
• Advantages of Mail Survey
o It can reach all types of people.
o Response rate can be improved by offering certain incentives.
• Disadvantages of Mail Survey
o It cannot be used for unstructured study.
o It is costly.
o It requires an established mailing list.
o It is time consuming.
o There are problems in the case of complex questions.
Electronic Interview
• Electronic interviewing is a process of recognizing and noting people, objects, and
occurrences rather than asking for information.
• For example, when you go to a store, you notice which products people like to use.
• The Universal Product Code (UPC) is also a method of observing what people are
buying.
• Advantages of Electronic Interview
o There is no reliance on the willingness or ability of the respondent.
o The data is more accurate and objective.
• Disadvantages of Electronic Interview
o Attitudes cannot be observed.
o Those events which are of long duration cannot be observed.
o There is observer bias. It is not purely objective.
o If the respondents know that they are being observed, their response can be
biased.
o It is a costly method.
Experimental Method
• There are a number of experimental designs that are used in carrying out an experiment.
• However, market researchers have used 4 experimental designs most frequently.
• These are
o CRD - Completely Randomized Design
o RBD - Randomized Block Design
o LSD - Latin Square Design
o FD - Factorial Designs
LSD - Latin Square Design
• An example of a 4 × 4 Latin Square, in which each treatment (A, B, C, D) occurs exactly
once in each row and once in each column:
A B C D
B C D A
C D A B
D A B C
• The balance arrangement achieved in a Latin Square is its main strength.
• In this design, the comparisons among treatments will be free from both row and
column differences.
• Thus the magnitude of error will be smaller than in any other design.
FD - Factorial Designs
• This design allows the experimenter to test two or more variables simultaneously.
• It also measures interaction effects of the variables and analyzes the impacts of each of
the variables.
• In a true experiment, randomization is essential so that the experimenter can infer cause
and effect without any bias.
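The notes do not include a worked factorial layout, so the following is a minimal sketch, assuming two hypothetical factors (price and packaging) purely for illustration, of how the full set of treatment combinations in a factorial design can be enumerated in Python.

```python
from itertools import product

# Hypothetical factors and levels, not taken from the notes.
price = ["low", "high"]
packaging = ["plain", "premium"]

# A full factorial design tests every combination of factor levels,
# which is what makes it possible to study interaction effects.
treatments = list(product(price, packaging))

for i, (p, pack) in enumerate(treatments, start=1):
    print(f"Treatment {i}: price={p}, packaging={pack}")
```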
Internal Sources of Secondary Data
▪ Sales Force Report- It gives information about the sale of a product. The information
provided is from outside the organization.
▪ Internal Experts- These are people who are heading the various departments. They
can give an idea of how a particular thing is working
▪ Miscellaneous Reports- These are the information obtained from operational
reports. If the data available within the organization are unsuitable or inadequate, the
marketer should extend the search to external secondary data sources.
DATA QUALITY
Data Quality is a Perception or an assessment of data's fitness to serve its purpose in a given
context.
✓ Improved data quality leads to better decision-making across an organization. The more
high-quality data you have, the more confidence you can have in your decisions. Good data
decreases risk and can result in consistent improvements in results.
Outliers
• Outliers are either
1. data objects that have characteristics that are different from most of the other data
objects in the data set, or
2. Values of an attribute that are unusual with respect to the typical values for that
attribute.
• Outliers can be legitimate data objects or values.
• Unlike noise, outliers may sometimes be of interest.
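The notes do not fix a particular detection rule; a common convention (an assumption here, not part of the original text) is to flag values that fall outside 1.5 times the interquartile range. A minimal Python sketch on made-up data:

```python
import numpy as np

values = np.array([12.0, 13.5, 11.8, 12.2, 90.0, 13.1, 12.7])  # 90.0 looks unusual

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as potential outliers; whether they are
# noise or genuinely interesting objects still requires inspection.
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # -> [90.]
```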
Missing Values
• It is not unusual for an object to be missing one or more attribute values.
• In some cases, the information was not collected; e.g., some people decline to give their
age or weight.
• In other cases, some attributes are not applicable to all objects; e.g., often, forms have
conditional parts that are filled out only when a person answers a previous question in a
certain way, but for simplicity, all fields are stored.
• Missing values should be taken into account during the data analysis.
• Strategies for dealing with missing data, each of which may be appropriate in certain
circumstances:
Eliminate Data Objects or Attributes
o A simple and effective strategy is to eliminate objects with missing values.
o However, even a partially specified data object contains some information, and if
many objects have missing values, then a reliable analysis can be difficult or
impossible.
o However, if a data set has only a few objects that have missing values, then it may
be convenient to omit them.
o A related strategy is to eliminate attributes that have missing values.
o This should be done with caution, however, since the eliminated attributes may be
the ones that are critical to the analysis.
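A minimal pandas sketch of the elimination strategy described above, using a small made-up table; dropping rows (objects) or columns (attributes) with missing values is a one-line operation, but as noted it should be applied with caution.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31],
    "weight": [68, 72, np.nan, 80],
    "city":   ["Pune", "Delhi", "Mumbai", "Chennai"],
})

rows_dropped = df.dropna(axis=0)   # eliminate data objects (rows) with any missing value
cols_dropped = df.dropna(axis=1)   # eliminate attributes (columns) that contain missing values
print(rows_dropped)
print(cols_dropped)
```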
Estimate Missing Values
o Sometimes missing data can be reliably estimated.
o For example, consider a time series that changes in a reasonably smooth fashion,
but has a few, widely scattered missing values.
o In such cases, the missing values can be estimated using the remaining values.
o As another example, consider a data set that has many similar data points.
o In this situation, the attribute values of the points closest to the point with the
missing value are often used to estimate the missing value.
o If the attribute is continuous, then the average attribute value of the nearest
neighbors is used
o if the attribute is categorical, then the most commonly occurring attribute value
can be taken.
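A minimal sketch of the estimation strategy on made-up data: a continuous attribute is filled with the average of the observed values, and a categorical attribute with the most commonly occurring value. (Nearest-neighbour imputation, also mentioned above, works the same way but averages only over the closest points.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Continuous attribute: replace missing values with the average of the known values.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: replace missing values with the most commonly occurring value.
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```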
Inconsistent Values
• Data can contain inconsistent values.
• Consider an address field, where both a zip code and city are listed, but the specified zip
code area is not contained in that city.
• It may be that the individual entering this information transposed two digits, or perhaps a
digit was misread when the information was scanned from a handwritten form.
• It is important to detect and, if possible, correct such problems.
• Some types of inconsistencies are easy to detect. For instance, a person's height should
not be negative.
• In some cases, it can be necessary to consult an external source of information.
• For example, when an insurance company processes claims for reimbursement, it checks
the names and addresses on the reimbursement forms against a database of its customers.
• Once an inconsistency has been detected, it is sometimes possible to correct the data.
• A product code may have "check" digits, or it may be possible to double-check a product
code against a list of known product codes, and then correct the code if it is incorrect, but
close to a known code.
• The correction of an inconsistency requires additional or redundant information.
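A minimal sketch of detecting the kinds of inconsistencies described above: a negative height, and a zip code that does not belong to the stated city. The zip-to-city lookup table is a made-up stand-in for the external source of information mentioned in the notes.

```python
import pandas as pd

# Hypothetical external reference: which city each zip code belongs to.
zip_to_city = {"500001": "Hyderabad", "110001": "New Delhi"}

df = pd.DataFrame({
    "name":   ["Asha", "Ravi"],
    "height": [162.0, -170.0],          # a height should never be negative
    "zip":    ["500001", "110001"],
    "city":   ["Hyderabad", "Mumbai"],  # zip 110001 is not in Mumbai
})

bad_height = df[df["height"] < 0]
bad_zip = df[df.apply(lambda r: zip_to_city.get(r["zip"]) != r["city"], axis=1)]
print(bad_height)
print(bad_zip)
```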
Duplicate Data
• A data set may include data objects that are duplicates, or almost duplicates, of one
another.
• Many people receive duplicate mailings because they appear in a database multiple times
under slightly different names.
• To detect and eliminate such duplicates, two main issues must be addressed.
o First, if there are two objects that actually represent a single object, then the
values of corresponding attributes may differ, and these inconsistent values must
be resolved.
o Second, care needs to be taken to avoid accidentally combining data objects that
are similar, but not duplicates, such as two distinct people with identical names.
• The term deduplication is often used to refer to the process of dealing with these issues.
• In some cases, two or more objects are identical with respect to the attributes measured
by the database, but they still represent different objects.
• Here, the duplicates are legitimate, but may still cause problems for some algorithms if
the possibility of identical objects is not specifically accounted for in their design.
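A minimal pandas sketch of deduplication on a made-up mailing list. Names are normalised (lower-cased, punctuation and spaces removed) before comparison, since the same person often appears under slightly different spellings; exact matches on the normalised key are then dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["A. Kumar", "a kumar", "S. Rao"],
    "email": ["ak@example.com", "ak@example.com", "srao@example.com"],
})

# Normalise the name so that near-identical entries compare as equal.
df["name_key"] = (df["name"].str.lower()
                            .str.replace(r"[^a-z]", "", regex=True))

# Keep the first occurrence of each (normalised name, email) pair.
deduplicated = df.drop_duplicates(subset=["name_key", "email"], keep="first")
print(deduplicated)
```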
Issues Related to Applications
o Relevance
▪ The available data must contain the information necessary for the
application.
▪ Consider the task of building a model that predicts the accident rate for
drivers.
▪ If information about the age and gender of the driver is omitted, then it is
likely that the model will have limited accuracy unless this information is
indirectly available through other attributes.
▪ Making sure that the objects in a data set are relevant is also challenging.
o Sampling bias
▪ Sampling bias occurs when a sample does not contain the different types of
objects in proportion to their actual occurrence in the population.
▪ For example, survey data describes only those who respond to the survey.
▪ Because the results of a data analysis can reflect only the data that is
present, sampling bias will typically result in an erroneous analysis.
o Knowledge about the Data
▪ Ideally, data sets are accompanied by documentation that describes
different aspects of the data.
▪ The quality of this documentation can either aid or hinder the subsequent
analysis.
▪ For example, if the documentation identifies several attributes as being
strongly related, these attributes are likely to provide highly redundant
information, and we may decide to keep just one.
▪ If the documentation is poor, however, and fails to tell us, for example,
that the missing values for a particular field are indicated with a -9999,
then our analysis of the data may be faulty.
▪ Other important characteristics are the precision of the data, the type of
features (nominal, ordinal, interval, ratio), the scale of measurement (e.g.,
meters or feet for length), and the origin of the data.
Data Preprocessing
Data Preprocessing is a Data Mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends and is likely to contain many errors. Data Preprocessing is a
proven method of resolving such issues.
Data Preprocessing is one of the most important Data Mining steps; it deals with the preparation
and transformation of the data set and seeks at the same time to make knowledge discovery more
efficient.
The main Data Preprocessing tasks are Data Cleaning/Cleansing, Data Integration, Data
Transformation, and Data Reduction.
1. Data Cleaning/Cleansing
Data can be noisy, having incorrect attribute values. This can happen for several reasons: the
data collection instruments used may be faulty, human or computer errors may have occurred at
data entry, or errors may have occurred in data transmission.
Cleaning “dirty” data
“Dirty” data can cause confusion for the mining procedure. Although most mining routines
have some procedures for dealing with incomplete or noisy data, they are not always robust.
Therefore, a useful Data Preprocessing step is to run the data through some Data
Cleaning/Cleansing routines.
2. Data Integration
Data Integration combines data from multiple sources into a coherent data store, as in data
warehousing. These sources may include multiple databases, data cubes, or flat files. The main
issue to be considered in Data Integration is schema integration, which is tricky.
How can real-world entities from multiple data sources be ‘matched up’? This is referred to as
the entity identification problem. For example, how can a data analyst be sure that customer_id
in one database and cust_number in another refer to the same entity? The answer is
metadata. Databases and data warehouses typically have metadata; simply put, metadata is data
about data.
Metadata is used to help avoid errors in schema integration. Another important issue is
redundancy. An attribute may be redundant if it can be derived from another table.
Inconsistencies in attribute or dimension naming can also cause redundancies in the
resulting data set.
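A minimal sketch of the entity identification problem described above: two made-up tables use different column names (customer_id versus cust_number) for the same entity, and metadata telling us that they match allows the two sources to be merged into one coherent store.

```python
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [101, 102],
    "amount":      [2500, 1800],
})
support = pd.DataFrame({
    "cust_number": [101, 103],
    "tickets":     [2, 5],
})

# Metadata tells us customer_id and cust_number refer to the same entity,
# so the two sources can be joined on those columns.
integrated = sales.merge(support, left_on="customer_id",
                         right_on="cust_number", how="outer")
print(integrated)
```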
3. Data Transformation
Data are transformed into forms appropriate for mining. Data Transformation involves the
following:
1. In Normalisation, the attribute data are scaled to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0.
2. Smoothing works to remove the noise from the data. Such techniques include
binning, clustering, and regression.
3. In Aggregation, summary or aggregation operations are applied to the data. For
example, daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of
the data at multiple granularities.
4. In Generalisation of the Data, low-level or primitive/raw data are replaced by higher-
level concepts through the use of concept hierarchies. For example, categorical
attributes such as street are generalised to higher-level concepts like city or country.
Similarly, the values of numeric attributes may be mapped to higher-level concepts;
for example, age may be mapped to young, middle-aged, or senior.
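A minimal sketch of two of the transformations listed above on made-up daily sales data: min-max normalisation of an attribute into the range 0 to 1.0, and aggregation of daily figures into monthly totals.

```python
import pandas as pd

daily = pd.DataFrame({
    "date":  pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-15"]),
    "sales": [200.0, 350.0, 150.0, 400.0],
})

# Normalisation: scale the attribute to fall within [0, 1].
s = daily["sales"]
daily["sales_norm"] = (s - s.min()) / (s.max() - s.min())

# Aggregation: daily sales rolled up into monthly totals (a data-cube style summary).
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(daily)
print(monthly)
```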
4. Data Reduction
Complex data analysis and mining on huge amounts of data may take a very long time,
making such analysis impractical or infeasible. Data Reduction techniques are helpful in
analysing a reduced representation of the data set without compromising the integrity of
the original data, while still producing quality knowledge. Strategies for data reduction
include the following:
1. In Data Cube Aggregation, aggregation operations are applied to the data in the
construction of a data cube.
2. In Dimension Reduction, irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
3. In Data Compression, encoding mechanisms are used to reduce the data set size. The
methods used for Data Compression include the Wavelet Transform and Principal
Component Analysis.
4. In Numerosity Reduction, data is replaced or estimated by alternative and smaller
data representations such as parametric models (which store only the model
parameters instead of the actual data, e.g. Regression and Log-Linear Models) or
non-parametric methods (e.g. Clustering, Sampling, and the use of histograms).
5. In Discretisation and Concept Hierarchy Generation, raw data values for attributes
are replaced by ranges or higher conceptual levels. Concept hierarchies allow the
mining of data at multiple levels of abstraction and are powerful tools for data
mining.
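A minimal sketch of the discretisation and concept hierarchy idea from item 5, mapping a numeric age attribute into the higher-level concepts young, middle-aged, and senior; the cut points are illustrative assumptions, not part of the notes.

```python
import pandas as pd

ages = pd.Series([19, 27, 36, 45, 58, 67, 72])

# Replace raw numeric values with higher-level concepts using assumed cut points.
age_group = pd.cut(ages,
                   bins=[0, 30, 60, 120],
                   labels=["young", "middle-aged", "senior"])
print(pd.concat({"age": ages, "group": age_group}, axis=1))
```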
Data Preprocessing
• Steps that should be applied to make the data more suitable for data mining.
• Consists of a number of different strategies and techniques that are interrelated in
complex ways.
Goal:
• To improve the data mining analysis with respect to time, cost, and quality.
Aggregation
• Quantitative attributes are typically aggregated by taking a sum or an average.
• A qualitative attribute can either be omitted or summarized.
Disadvantage of aggregation
• Potential loss of interesting details.
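A minimal pandas sketch of the points above on made-up store data: quantitative attributes are aggregated by sum or average, while a qualitative attribute is summarised (here by its number of distinct values) rather than carried through.

```python
import pandas as pd

df = pd.DataFrame({
    "store":    ["S1", "S1", "S2", "S2"],
    "city":     ["Pune", "Pune", "Delhi", "Delhi"],
    "sales":    [100.0, 150.0, 90.0, 110.0],
    "footfall": [40, 55, 30, 45],
})

# Quantitative attributes: sum or average. Qualitative attribute: distinct count.
summary = df.groupby("store").agg(
    total_sales=("sales", "sum"),
    avg_footfall=("footfall", "mean"),
    n_cities=("city", "nunique"),
)
print(summary)
```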
Sampling
• An approach for selecting a subset of the data objects to be analyzed.
Sampling Approaches
• Random sampling.
• Progressive or Adaptive Sampling
Random sampling
• Sampling without replacement: as each item is selected, it is removed from the set
of all objects that together constitute the population.
• Sampling with replacement: objects are not removed from the population as they
are selected for the sample. Same object can be picked more than once.
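A minimal sketch of simple random sampling without and with replacement using Python's standard library; the population here is just a small made-up list of object ids.

```python
import random

population = list(range(1, 101))   # ids of 100 data objects
random.seed(42)                    # for reproducibility

# Without replacement: each selected object is removed from the pool,
# so an object can appear at most once in the sample.
without_replacement = random.sample(population, k=10)

# With replacement: objects stay in the pool, so the same object
# can be picked more than once.
with_replacement = random.choices(population, k=10)

print(without_replacement)
print(with_replacement)
```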
Dimensionality reduction
• Data mining algorithms work better if the dimensionality - the number of
attributes in the data - is lower.
• Eliminate irrelevant features and reduce noise.
• Lead to a more understandable model due to fewer attributes.
• Allow the data to be more easily visualized.
• Amount of time and memory required by the data mining algorithm is reduced.
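The notes do not name a specific technique here; Principal Component Analysis (mentioned earlier under data compression) is one common choice. A minimal NumPy sketch, on made-up data, that projects 4-dimensional objects onto their first 2 principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))           # 50 objects, 4 attributes (made-up data)

# Centre the data, then use the SVD of the centred matrix to get principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                   # keep the 2 directions with the most variance
X_reduced = Xc @ Vt[:k].T               # 50 objects, now described by 2 attributes
print(X_reduced.shape)                  # -> (50, 2)
```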
Binarization
• Transform both continuous and discrete attributes into one or more binary
attributes.
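A minimal sketch of binarization on made-up data: a continuous attribute is thresholded into a 0/1 flag, and a discrete attribute is expanded into one binary attribute per category (one-hot encoding). The threshold and categories are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [25000, 82000, 47000],
    "city":   ["Pune", "Delhi", "Pune"],
})

# Continuous -> binary: 1 if income is above an assumed threshold, else 0.
df["high_income"] = (df["income"] > 50000).astype(int)

# Discrete -> several binary attributes, one per category.
binarized = pd.get_dummies(df, columns=["city"], dtype=int)
print(binarized)
```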
Variable transformation
• A transformation that is applied to all the values of a variable.
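A minimal sketch of a variable transformation applied to every value of a variable; a log transform is a common choice for heavily skewed quantities (an assumption here, since the notes do not name a particular function).

```python
import numpy as np

revenue = np.array([120.0, 900.0, 15000.0, 350000.0])   # heavily skewed values

# The same function is applied to every value of the variable.
log_revenue = np.log10(revenue)
print(log_revenue)
```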