Chapter 9
Development of Data Processing
The history of DP can be divided into three phases as a result of technological advancements:
Manual DP: Manual DP involves processing data without much assistance from machines. Prior to the phase of mechanical DP, only small-scale data processing efforts were possible. Manual processing is still carried out in some special cases today, typically for data that cannot easily be read by machines, as in the case of retrieving data from outdated records.
Mechanical DP: This phase began in 1890 (Bohme et al., 1991), when a system made up of intricate punch card machines was installed in order to assist in processing a national population census, making it easier to search and compute the data than by hand.
Electronic DP: Finally, electronic DP replaced the other two phases, resulting in a fall in mistakes and rising efficiency. Most processing today is done by computers, and electronic DP is widely used in industry and research institutions.
The relevance of data processing and data science in the area of finance is increasing every day.
There are eleven significant areas where data science plays an important role.
Low Risk: This category includes data that is public and easily
recoverable. Such data poses minimal risk if accessed by
unauthorized individuals because it is intended for wide
distribution or has limited sensitivity.
Moderate Risk: Data in this category is not public but also not
critical to operations. It includes internal data such as
proprietary operating processes, product costs, and some
corporate documents. While this data is not intended for public
access, its exposure poses a lesser threat than high-risk data.
1. Determine Classification Criteria and Categories: Organizations need to establish clear criteria
and categories for classifying data. This involves understanding and defining the organization's
objectives for the data, and the implications of each category in terms of security, privacy, and
compliance requirements.
2. Set Up Operational Framework: Once categories are defined, organizations must outline the
roles and responsibilities of employees and third parties involved in data management. This
includes detailing how data should be stored, transferred, and retrieved within these roles.
3. Develop and Implement Policies and Procedures: Policies should clearly articulate the security
needs, confidentiality requirements, and handling procedures for each data type. These policies
need to be simple enough for all staff members to understand and follow, ensuring compliance
and mitigating security risks.
4. Understand the Current Setup: Before classifying data, it's essential to have a comprehensive
understanding of where all organizational data is stored and any relevant legislation that may
affect its handling. This step ensures that all data is accounted for and that the classification
aligns with legal requirements.
5. Create a Data Classification Policy: Developing a formal data classification policy is critical. This
policy serves as the backbone for all data classification efforts, providing guidelines that help
maintain compliance with data protection standards.
6. Prioritize and Organize Data: With a policy in place, data can be systematically categorized
according to its sensitivity and privacy requirements. This involves tagging data accurately and
prioritizing security measures based on the classification level of the data.
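As a rough illustration of step 6, the following Python sketch tags records with a sensitivity level. The category names, rule table, and record types are hypothetical, and a real policy engine would be considerably richer.

```python
from enum import Enum

# Hypothetical sensitivity levels mirroring the categories described above.
class Sensitivity(Enum):
    LOW = "low"            # public, easily recoverable data
    MODERATE = "moderate"  # internal, non-critical data
    HIGH = "high"          # data whose exposure would be damaging

# Illustrative classification rules; a real policy would be far more detailed.
RULES = {
    "press_release": Sensitivity.LOW,
    "product_costs": Sensitivity.MODERATE,
    "customer_pan": Sensitivity.HIGH,
}

def classify(record_type: str) -> Sensitivity:
    """Tag a record with a sensitivity level, defaulting to HIGH when unknown."""
    return RULES.get(record_type, Sensitivity.HIGH)

if __name__ == "__main__":
    for rec in ["press_release", "product_costs", "customer_pan", "unknown_export"]:
        print(rec, "->", classify(rec).value)
```

Defaulting unknown record types to the highest sensitivity is a conservative design choice; unclassified data is then protected until someone explicitly reviews it.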
Chapter – 9.3
Data Organization and Distribution
Data organization sorts unstructured data into clear categories and groups to facilitate easier access, analysis, and manipulation. As data volumes grow, organizing data becomes critical to reduce search times and enhance usability. In business contexts, both semi-structured and unstructured data are analyzed and integrated into comprehensive data stores that serve businesses across industries, turning data into valuable assets that drive decision-making and the overall enhancement of business models.
Data distribution organizes and quantifies the possible values of a variable and their probabilities of occurrence, which is essential for analysis. In practice, data distributions are often visualized using graphs such as histograms, box plots, and pie charts, which help to estimate the likelihood of specific observations within a data set. Probability distributions provide a framework for predicting the outcomes of the random variables involved, whether discrete or continuous, and support decision-making through statistical measures.
Types of distribution
(i) Discrete distributions: A discrete distribution is one that results from countable data and has a finite
number of potential values. Discrete distributions may be displayed in tables, and the values of the
random variable can be counted. Examples: rolling dice, obtaining a specific number of heads in a
series of coin tosses, etc.
Following are the discrete distributions of various types:
(a) Binomial distributions: The binomial distribution quantifies the chance of obtaining a specific
number of successes or failures in a fixed number of independent trials. The binomial distribution
applies to attributes that are categorised into two mutually exclusive and exhaustive classes, such as
success/failure or acceptance/rejection.
Example: When tossing a fair coin, the likelihood of the coin landing on heads is one-half and the
likelihood of it landing on tails is one-half.
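A minimal sketch of the coin-toss example, assuming SciPy is available; the number of tosses and the probabilities printed are illustrative only.

```python
from scipy.stats import binom

# Probability of k heads in 10 fair coin tosses (n = 10, p = 0.5).
n, p = 10, 0.5
for k in (0, 5, 10):
    print(f"P(exactly {k} heads) = {binom.pmf(k, n, p):.4f}")

# Probability of at most 3 heads, via the cumulative distribution function.
print(f"P(at most 3 heads) = {binom.cdf(3, n, p):.4f}")
```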
(b) Poisson distribution: The Poisson distribution is the discrete probability distribution that
quantifies the chance of a certain number of events occurring in a given time period, where the
events occur independently and at a constant average rate.
The Poisson distribution applies to attributes that can potentially take on large values but in practice
take on small ones.
Example: Number of flaws, mistakes, accidents, absentees etc.
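A short sketch of the flaw-count example, assuming SciPy is available and an illustrative average rate of 2 flaws per unit.

```python
from scipy.stats import poisson

# Suppose flaws occur at an average rate of 2 per unit (lambda = 2).
lam = 2
for k in range(5):
    print(f"P({k} flaws) = {poisson.pmf(k, lam):.4f}")

# Probability of observing more than 4 flaws in a unit.
print(f"P(more than 4 flaws) = {poisson.sf(4, lam):.4f}")
```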
(c) Hypergeometric distribution: The hypergeometric distribution is a discrete distribution that
assesses the chance of a certain number of successes in (n) trials drawn without replacement from a
sufficiently large population (N). The hypergeometric distribution is comparable to the binomial
distribution; the primary distinction between the two is that in the binomial distribution the chance
of success is the same for every trial, whereas in the hypergeometric distribution it changes from
trial to trial because sampling is done without replacement.
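A short sketch of sampling without replacement, assuming SciPy is available; the population size, number of defectives, and sample size are illustrative.

```python
from scipy.stats import hypergeom

# Population of N = 50 items containing K = 10 defectives; draw n = 5 without replacement.
N, K, n = 50, 10, 5
for k in range(3):
    # scipy's hypergeom.pmf(k, M, n, N) uses M = population size,
    # n = number of successes in the population, N = sample size.
    print(f"P({k} defectives in sample) = {hypergeom.pmf(k, N, K, n):.4f}")
```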
(d) Geometric distribution: The geometric distribution is a discrete distribution that assesses the
probability of the occurrence of the first success. A possible extension is the negative binomial
distribution.
Example: A marketing representative from an advertising firm chooses hockey players from
several institutions at random till he discovers an Olympic participant.
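A sketch of this example, assuming SciPy is available and an assumed qualification probability of 0.05 per player interviewed.

```python
from scipy.stats import geom

# Probability that the first Olympic participant is found on the k-th player
# interviewed, assuming each player independently qualifies with p = 0.05.
p = 0.05
for k in (1, 5, 10):
    print(f"P(first success on trial {k}) = {geom.pmf(k, p):.4f}")

# Expected number of players interviewed until the first success: 1/p.
print(f"Expected trials until first success = {geom.mean(p):.1f}")
```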
(ii) Continuous distributions: A distribution with an unlimited number of (variable) data points
that may be represented on a continuous measuring scale. A continuous random variable is a
random variable with an unlimited and uncountable set of potential values. It is more than a
simple count and is often described using a probability density function (pdf). The probability
density function describes the characteristics of the random variable: the frequency distribution
typically clusters around central values, and the probability density function can be viewed as the
distribution's "shape."
Following are the continuous distributions of various types:
(i) Normal distribution: Gaussian distribution is another name for the normal distribution. It is a bell-
shaped curve with a greater frequency (probability density) around the central point. As values move
away from the centre on either side, the frequency drops off sharply. In other words, features whose
measurements are expected to fall on either side of the target value with equal likelihood follow a
normal distribution.
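A brief sketch of these properties, assuming SciPy is available; the mean and standard deviation used are the standard normal values and are illustrative only.

```python
from scipy.stats import norm

# Standard normal: mean 0, standard deviation 1 (illustrative values).
mu, sigma = 0.0, 1.0

# About 68% of observations fall within one standard deviation of the mean.
within_one_sigma = norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma)
print(f"P(|X - mu| <= sigma) = {within_one_sigma:.4f}")  # ~0.6827

# Density is highest at the centre and falls off symmetrically on both sides.
for x in (-2, -1, 0, 1, 2):
    print(f"pdf({x}) = {norm.pdf(x, mu, sigma):.4f}")
```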
(ii) Lognormal distribution: A continuous random variable x follows a lognormal distribution if the
distribution of its natural logarithm, ln(x), is normal. Such variables typically arise as the product of
many independent positive factors: as the number of factors rises, the distribution of the sum of
their logarithms approaches a normal distribution, regardless of the distributions of the individual
factors, so the product itself is approximately lognormal.
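A small simulation sketch, assuming NumPy is available; the mean and sigma of the underlying normal are illustrative, and the check simply confirms that ln(x) behaves like a normal variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw lognormal samples: exp of a normal variable with the given mean and sigma.
samples = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

# The natural log of the samples should itself look normal.
logs = np.log(samples)
print(f"mean of ln(x) ~ {logs.mean():.3f}, std of ln(x) ~ {logs.std():.3f}")
```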
(iii) F distribution: The F distribution is often employed to examine the equality of variances
between two normal populations. The F distribution is an asymmetric distribution with no
maximum value and a minimum value of 0. The curve approaches 0 but never reaches the
horizontal axis.
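A rough sketch of using the F distribution to compare two sample variances, assuming NumPy and SciPy are available; the sample sizes and the common standard deviation are illustrative.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)

# Two samples from normal populations with equal variance.
a = rng.normal(loc=0, scale=2, size=30)
b = rng.normal(loc=0, scale=2, size=25)

# The ratio of sample variances follows an F distribution when the variances are equal.
F_stat = a.var(ddof=1) / b.var(ddof=1)
p_value = f.sf(F_stat, len(a) - 1, len(b) - 1)
print(f"F = {F_stat:.3f}, one-sided p-value = {p_value:.3f}")
```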
(iv) Chi square distributions: When independent variables with a standard normal distribution are
squared and added, the chi square distribution results. Example: y = Z₁² + Z₂² + Z₃² + Z₄² + ... + Zₙ²,
where each Zᵢ is a standard normal random variable. The chi square distribution is positively skewed,
is bounded below at zero, and approaches the shape of the normal distribution as the number of
degrees of freedom grows.
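A short simulation sketch of this construction, assuming NumPy and SciPy are available; the number of degrees of freedom and the sample count are illustrative.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)

# The sum of squares of k independent standard normals follows a chi-square
# distribution with k degrees of freedom.
k = 4
z = rng.standard_normal(size=(100_000, k))
y = (z ** 2).sum(axis=1)

print(f"simulated mean = {y.mean():.2f} (theory: {k})")
print(f"simulated variance = {y.var():.2f} (theory: {2 * k})")
print(f"P(Y <= {k}) from scipy = {chi2.cdf(k, df=k):.4f}")
```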
(v) Exponential distribution: The exponential distribution is one of the most frequently employed
continuous probability distributions. It is often used to represent products with a constant failure
rate and is closely connected to the Poisson distribution. Its failure rate is constant because the
distribution is memoryless: its shape characteristics do not change over time.
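A minimal sketch, assuming SciPy is available and an illustrative failure rate of 0.5 failures per hour; note that SciPy parameterises the exponential by its scale, the reciprocal of the rate.

```python
from scipy.stats import expon

# Constant failure rate lambda = 0.5 failures per hour (illustrative value);
# scipy parameterises the exponential by scale = 1 / lambda.
lam = 0.5
scale = 1 / lam

# Probability a unit survives beyond 3 hours, and its expected lifetime.
print(f"P(lifetime > 3 hours) = {expon.sf(3, scale=scale):.4f}")
print(f"expected lifetime = {expon.mean(scale=scale):.1f} hours")
```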
Data cleaning is a fundamental process in data management, consisting of several key steps to
ensure the accuracy and utility of a dataset:
1. Removal of Duplicate and Irrelevant Information: Identify and eliminate any duplicate entries
or data points that are not relevant to the study's focus, such as data from unrelated demographic
groups, to streamline the dataset and align it more closely with research objectives.
2. Fix Structural Errors: Correct inconsistencies in the data, such as typos, unusual naming
conventions, or inconsistent capitalization, which can lead to mislabeled categories or erroneous
classifications.
3. Filter Unwanted Outliers: Evaluate outliers to determine whether they represent errors or are
valid data points that could potentially confirm a hypothesis. Remove outliers only if they are
clearly erroneous or irrelevant to the analysis.
4. Handle Missing Data: Address missing values, which are problematic for many analytical
algorithms. Options include removing observations with missing data, which risks losing valuable
information, or imputing missing values based on other observations, which could introduce bias.
5. Validation and QA: After cleaning, validate the data to ensure it makes sense, adheres to
applicable standards, and supports or refutes the working hypothesis. Check if the data patterns
can generate further hypotheses, and establish a culture of data quality within the organization
to prevent the future generation of flawed data.
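A compact pandas sketch of steps 1 to 4 on a hypothetical dataset; the column names, outlier threshold, and fill value are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw dataset exhibiting the problems described above.
raw = pd.DataFrame({
    "customer": ["Alice", "alice", "Bob", "Carol", None],
    "region":   ["North", "north", "South", "East", "East"],
    "spend":    [120.0, 120.0, 90.0, 15000.0, 60.0],   # 15000.0 is a suspect outlier
})

df = raw.copy()
df["customer"] = df["customer"].str.strip().str.title()   # fix structural errors
df["region"] = df["region"].str.title()
df = df.drop_duplicates()                                  # remove duplicate rows
df = df[df["spend"] < 5000]                                # filter an implausible outlier
df["customer"] = df["customer"].fillna("Unknown")          # handle missing data
print(df)
```

Whether an outlier like the 15000.0 value is removed or kept would depend on the hypothesis being tested, as noted in step 3; here it is dropped purely for illustration.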
Benefits of quality data
Determining the quality of data requires an analysis of its properties and a weighting of those
attributes based on what is most essential to the company and the application(s) for which the
data will be utilised.
Ultimately, having clean data boosts overall productivity and provides the highest-quality
information for decision-making. Benefits include:
(i) Error correction when numerous data sources are involved.
(ii) Fewer mistakes result in happier customers and less irritated workers.
(iii) Capability to map the many functions and planned uses of your data.
(iv) Monitoring mistakes and improving reporting to determine where errors are originating can
make it easier to repair inaccurate or damaged data in future applications.
(v) Using data cleaning technologies will result in more effective corporate procedures and speedier
decision-making.
Data validation is a critical yet often overlooked step in data management that ensures the
accuracy, clarity, and relevance of data before its use. It involves verifying the precision and
appropriateness of both the data inputs and the data model itself. Modern data integration
systems can automate and integrate validation into the workflow, streamlining the process and
preventing it from being a bottleneck. Validating data helps avoid "garbage in, garbage out" issues
and ensures that decisions are based on reliable and current information, ultimately supporting
the validity of analytical conclusions and mitigating the risk of project failures.
1. Data type check: A data type check verifies that the entered data has the appropriate data
type. For instance, a field may only take numeric values. If this is the case, the system should
reject any data containing other characters, such as letters or special symbols.
2. Code check: A code check verifies that a field’s value is picked from a legitimate set of options
or that it adheres to specific formatting requirements. For instance, it is easy to verify the validity
of a postal code by comparing it to a list of valid codes. The same principle may be extended to
other things, including nation codes and NIC industry codes.
3. Range check: A range check determines whether or not input data falls inside a specified range.
Latitude and longitude, for instance, are frequently employed in geographic data. A latitude value
must fall between -90 and 90 degrees, whereas a longitude value must fall between -180 and
180 degrees. Outside of this range, values are invalid.
4. Format check: Numerous data kinds adhere to a set format. Date columns that are kept in a
fixed format, such as “YYYY-MM-DD” or “DD-MM-YYYY,” are a popular use case. A data
validation technique that ensures dates are in the correct format contributes to data and
temporal consistency.
5. Consistency check: A consistency check is a form of logical check that verifies that the data has
been input in a consistent manner. Checking whether a package’s delivery date is later than its
shipment date is one example.
6. Uniqueness check: Some data, such as PAN numbers or e-mail IDs, are unique by nature. These fields
should typically contain unique values in a database. A uniqueness check guarantees that an item is
not inserted into a database multiple times.
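The sketch below, written in plain Python with hypothetical field names and an illustrative country-code list, shows how several of these checks (data type, code, range, format, and consistency) can be combined into one validation routine; a uniqueness check would normally be enforced by a database constraint instead.

```python
from datetime import datetime

VALID_COUNTRY_CODES = {"IN", "US", "GB"}   # illustrative code list

def validate(record: dict) -> list:
    """Return a list of validation failures for one record (a minimal sketch)."""
    errors = []
    # Data type check: quantity must be numeric.
    if not isinstance(record.get("quantity"), (int, float)):
        errors.append("quantity must be numeric")
    # Code check: country must come from an approved list.
    if record.get("country") not in VALID_COUNTRY_CODES:
        errors.append("unknown country code")
    # Range check: latitude must lie between -90 and 90 degrees.
    if not -90 <= record.get("latitude", 0) <= 90:
        errors.append("latitude out of range")
    # Format check: dates must be in YYYY-MM-DD format.
    try:
        ship = datetime.strptime(record["ship_date"], "%Y-%m-%d")
        delivery = datetime.strptime(record["delivery_date"], "%Y-%m-%d")
        # Consistency check: delivery cannot precede shipment.
        if delivery < ship:
            errors.append("delivery date earlier than ship date")
    except (KeyError, ValueError):
        errors.append("dates missing or not in YYYY-MM-DD format")
    return errors

record = {"quantity": 3, "country": "IN", "latitude": 12.97,
          "ship_date": "2024-01-05", "delivery_date": "2024-01-03"}
print(validate(record))   # ['delivery date earlier than ship date']
```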