
Chapter – 9.1
Development of Data Processing

Data processing (DP) involves organizing, categorizing, and manipulating data to extract useful
information, such as trends and connections that can help solve important problems.
Recently, advancements in technology have greatly enhanced the
capacity and efficiency of DP, reducing the need for extensive
human labor. Modern techniques and algorithms in DP have
improved, particularly in areas like facial data classification for
recognition and time series analysis for stock market data. The
success of DP in extracting valuable information largely depends
on the quality of the data, which can be compromised by issues
such as missing data, duplications, incorrect equipment design,
and biased data collection. These factors present significant
challenges in ensuring the effectiveness of data processing.

The history of DP can be divided into three phases as a result of technological advancements:
Manual DP: Manual DP involves processing data without much assistance from machines. Prior to the
phase of mechanical DP, only small-scale data processing was possible using manual effort. However,
in some special cases manual DP is still in use today, typically because the data is difficult to
digitize or cannot be read by machines, as in the case of retrieving data from outdated texts or
documents.

Mechanical DP: This phase began in 1890 (Bohme et al., 1991), when a system made up of intricate
punch-card machines was installed by the US Bureau of the Census to assist in compiling the findings
of a recent national population census. The use of mechanical DP made it quicker and easier to search
and compute the data than the manual process.

Electronic DP: Finally, electronic DP replaced the other two phases, resulting in fewer mistakes and
rising productivity. Data processing is now done electronically using computers and other
cutting-edge electronics, and it is widely used in industry, research institutions and academia.


How are data processing and data science relevant to finance?

The relevance of data processing and data science in the area of finance is increasing every day.
The eleven significant areas where data science plays an important role are:

Risk analytics: In the financial sector, the integration of data science and analytics is crucial for managing risks and
improving customer service. Techniques like risk analytics,
powered by machine learning, allow for the analysis of both
structured and unstructured data to identify and prioritize
potential risks. This helps in preventing fraud, enhancing
customer segmentation, and providing personalized services.
Additionally, real-time analytics enable dynamic risk
assessment models that adapt to new transactions or
changes in customer data, thus optimizing decision-making
and reducing the likelihood of human error. Overall,
leveraging data science in finance helps firms navigate risks
more effectively and make informed decisions.

Real-time analytics have revolutionized data processing, enabled by advancements in Data Engineering, such as
Airflow, Spark, and cloud technologies. Previously, data was
only available in historical batches, delaying analysis and
potentially leading to outdated conclusions. Now, with real-
time analytics, businesses can immediately analyze data as it
is generated, allowing for instant assessments of customer
value, precise credit ratings, and accurate transactions. This
integration of Data Engineering, Data Science, Machine
Learning, and Business Intelligence significantly enhances
decision-making and user experience.

Data science has transformed customer data management,
especially in the face of challenges posed by big data and
unstructured data sources like social media and IoT devices.
Traditional methods used in Business Intelligence are no
longer sufficient for analyzing this complex data landscape.
Data science employs advanced techniques such as text
analytics, data mining, and natural language processing to
handle large volumes of unstructured data. This not only
improves data accessibility but also enhances a company’s
analytical capabilities, providing deeper insights into market
trends and customer behavior.

Consumer analytics, empowered by machine learning and real-time analytics, allows insurance companies to efficiently
process vast amounts of customer data. This capability
supports personalized customer service by swiftly analyzing
transaction histories and prior data patterns. It enables firms
to identify less profitable customers, enhance cross-selling
opportunities, and estimate the lifetime value of consumers.
As a result, financial institutions can maintain security and
provide tailored assessments for each client application.

Customer segmentation is a critical strategy where consumers are categorized based on attributes like geography, age, and
purchasing patterns. This segmentation allows organizations to
evaluate the current and potential long-term value of different
customer groups. By leveraging machine learning algorithms,
data scientists can automate the segmentation process, assigning
relevance scores to various attributes. This approach helps
businesses identify and focus on high-value customers while
minimizing resources spent on less promising ones. Comparing
these segments with historical data further aids in predicting the
future value of relationships with each customer segment.
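
For illustration, the sketch below shows one common way such segmentation is automated: clustering customers on a few attributes with k-means. The library choice (scikit-learn), the column names, and the number of clusters are assumptions for demonstration, not a prescribed method.

```python
# Minimal customer-segmentation sketch using k-means clustering (scikit-learn).
# The column names (age, annual_spend, tenure_years) are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

customers = pd.DataFrame({
    "age":          [23, 45, 31, 52, 36, 29, 61, 41],
    "annual_spend": [1200, 8700, 3400, 15200, 4100, 2600, 9800, 7300],
    "tenure_years": [1, 7, 3, 12, 4, 2, 15, 6],
})

# Scale attributes so no single feature dominates the distance metric.
features = StandardScaler().fit_transform(customers)

# Group customers into three segments; the number of clusters is a modelling choice.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(features)

# Average spend per segment hints at which groups are high value.
print(customers.groupby("segment")["annual_spend"].mean())
```

Comparing the resulting segment profiles against historical value data would then be a separate, follow-on step.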
Personalized services in the finance sector are increasingly
important for enhancing customer satisfaction and increasing
lifetime value. By analyzing customer interactions through
natural language processing (NLP) and voice recognition
technologies, companies can tailor services to individual needs,
facilitating effective cross-selling and improving overall customer
service. As NLP technology continues to advance, the potential
for even more refined and effective personalization is significant,
promising further improvements in how businesses engage with
their customers.

Advanced customer service in data science leverages real-time analytics and natural language processing to enhance
interactions. This allows customer service agents to provide
more effective recommendations, offer practical financial
advice, and opportunistically cross-sell or up-sell based on
the customer's immediate needs and conversation cues. Over
time, insights gathered from each interaction improve the
system's overall effectiveness, continually enhancing
customer service delivery.

Predictive analytics in the financial sector uses machine learning to analyze historical data and forecast future
trends and patterns. This technology helps in making
informed investment decisions and developing trading
strategies. Deep learning techniques, which do not require
manual data preparation, often outperform shallower
learning methods by autonomously adjusting data,
leading to more accurate predictions.
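
As a rough, illustrative sketch of this idea, the snippet below fits a linear model on lagged values of a synthetic price series to produce a one-step-ahead forecast. The data, the two-lag window, and the choice of a simple linear model (rather than the deep learning methods mentioned above) are all assumptions.

```python
# Illustrative forecasting sketch: predict the next value of a price series
# from its two previous values. The prices below are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

prices = np.array([100.0, 101.5, 101.2, 102.8, 103.5,
                   103.1, 104.6, 105.2, 104.9, 106.3])

# Build a supervised dataset: predict price[t] from price[t-2] and price[t-1].
X = np.column_stack([prices[:-2], prices[1:-1]])
y = prices[2:]

model = LinearRegression().fit(X, y)

# One-step-ahead forecast from the two most recent observations.
next_price = model.predict([[prices[-2], prices[-1]]])[0]
print(f"forecast for next period: {next_price:.2f}")
```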
Fraud detection in financial institutions has become more
effective with the advancement of artificial intelligence and
machine learning technologies. These tools analyze vast
amounts of data to identify patterns and respond in real-
time to suspicious activities, such as uncharacteristic large
purchases on a credit card. This capability allows immediate
actions like card blocking and notifications to prevent further
misuse, protecting the customer, the bank, and associated
insurers. Additionally, in trading, machine learning detects
irregularities, prompting swift investigations to mitigate risks.

Anomaly detection in financial services utilizes advanced algorithms like Recurrent Neural Networks, Long Short-Term
Memory models, and Transformers to identify unusual activities,
such as illegal insider trading, before significant damage occurs.
These technologies analyze trading behaviors and patterns to
detect illegal advantages taken in stock market forecasts, helping
to protect investors and maintain market integrity.
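
A minimal sketch of the underlying idea is shown below. It swaps the deep sequence models named above for a much simpler technique, an Isolation Forest from scikit-learn, purely to illustrate how unusual trades can be flagged; the trade sizes are made-up data.

```python
# Simplified anomaly-detection sketch on a set of trade sizes.
import numpy as np
from sklearn.ensemble import IsolationForest

trade_sizes = np.array([120, 135, 110, 128, 140, 132, 125, 5000, 118, 130]).reshape(-1, 1)

detector = IsolationForest(contamination=0.1, random_state=0)
labels = detector.fit_predict(trade_sizes)   # -1 marks an anomaly, 1 marks normal

for size, label in zip(trade_sizes.ravel(), labels):
    if label == -1:
        print(f"suspicious trade size: {size}")
```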

Algorithmic trading utilizes data science to automate stock market trades, minimizing losses due to human
indecision and error. This method employs
Reinforcement Learning, where trading algorithms are
developed and refined through a system of penalties and
rewards, learning from each transaction. Key benefits
include high-frequency trading and precision, as the
computer rapidly executes trades based on learned
behaviors and predefined rules, engaging only when
there's a perceived profit opportunity.
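
The toy sketch below illustrates the reward-and-penalty loop in its simplest tabular Q-learning form, on a synthetic price series. The state and action definitions, the reward rule, and the learning parameters are all illustrative assumptions; a real trading system would be far more elaborate.

```python
# Tiny tabular Q-learning loop on a synthetic price series (illustration only).
import numpy as np

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0.05, 1.0, 500)) + 100   # synthetic drifting prices

# State: did the last price move go up (1) or down (0)?  Actions: 0 = stay out, 1 = hold.
q_table = np.zeros((2, 2))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for t in range(1, len(prices) - 1):
    state = int(prices[t] > prices[t - 1])
    # Epsilon-greedy choice between exploring and exploiting the learned values.
    if rng.random() < epsilon:
        action = int(rng.integers(2))
    else:
        action = int(np.argmax(q_table[state]))
    # Reward: the next price change if we hold, zero if we stay out.
    reward = (prices[t + 1] - prices[t]) if action == 1 else 0.0
    next_state = int(prices[t + 1] > prices[t])
    # Standard Q-learning update from the observed reward or penalty.
    q_table[state, action] += alpha * (
        reward + gamma * q_table[next_state].max() - q_table[state, action]
    )

print("learned Q-values (rows: last move down/up, cols: stay out/hold):")
print(q_table)
```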
Chapter – 9.2
Functions of Data Processing
Data processing generally involves the following processes:

(i) Validation: Data validation is a crucial process that ensures data meets specific quality standards and rule
compliance before being accepted for use. It verifies that
data values fall within an acceptable range and
maintains the integrity of the final data set. In the
context of official statistics, data validation contributes
to several quality dimensions such as relevance,
accuracy, timeliness, accessibility, comparability,
coherence, and comprehensiveness, helping to ensure
that the data is fit for its intended use.

(ii) Sorting: Data sorting is a process that organizes data into meaningful order to facilitate easier
understanding, analysis, and visualization. It can be
applied to raw data or aggregated information and
is often used in various applications such as data
cleaning, ranking records, and optimizing data
presentation in visualizations like tables and charts.
Sorting can be done by numerical values, labels, or
other factors, and is essential for accurate data
interpretation. Different software has various default
sorting behaviors and capabilities, which can impact
the sorting process, especially when dealing with
non-unique data. Sorting is a fundamental feature in
most analytical and statistical software, crucial across
all stages of data processing.
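
A minimal sorting sketch in pandas is shown below; the records and column names are invented for illustration.

```python
# Sorting sketch: order records by label and then by value.
import pandas as pd

records = pd.DataFrame({
    "branch":  ["Pune", "Delhi", "Mumbai", "Delhi", "Pune"],
    "revenue": [450, 1200, 980, 760, 330],
})

# Sort by branch label alphabetically, then by revenue in descending order.
ordered = records.sort_values(by=["branch", "revenue"], ascending=[True, False])
print(ordered)
```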
(iii) Aggregation: Data aggregation is the process of
collecting and summarizing data, often transforming
individual data rows into statistical summaries. This is
commonly used in data warehouses to enhance the
efficiency of querying large datasets by reducing the
time required. Aggregated data can represent large
quantities of individual entries, making it quicker to
query and analyze. As businesses handle increasing
volumes of data, aggregation helps in efficiently
accessing the most significant and frequently requested
information.
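
For illustration, the sketch below aggregates individual transaction rows into per-branch summaries with pandas; the column names and figures are assumptions.

```python
# Aggregation sketch: collapse individual transaction rows into statistical summaries.
import pandas as pd

transactions = pd.DataFrame({
    "branch": ["Pune", "Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
    "amount": [450, 1200, 980, 760, 330, 615],
})

# Count, total, and average amount per branch.
summary = transactions.groupby("branch")["amount"].agg(["count", "sum", "mean"])
print(summary)
```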

(iv) Analysis: Data analysis involves cleaning, transforming, and modeling data to extract insights and
support decision-making. This process helps businesses
understand past performance and predict future
outcomes to make informed decisions. Whether
addressing stagnation or fostering further growth, data
analysis is crucial for evaluating and improving business
strategies and operations.

(v) Reporting: Data reporting involves gathering, organizing, and presenting raw data in a consumable format to assess an
organization's ongoing performance. It helps answer fundamental
questions about business status through tools like Excel or data
visualization platforms. Typically, static in format and sourcing from
consistent data points, these reports provide key insights into areas
such as financial health or sales performance, highlighting metrics like
revenue, KPIs, and net profits. Effective data reporting is crucial across
industries, aiding decisions in healthcare, education, and business, by
translating vast data into actionable insights. Despite its benefits,
traditional static reporting can lack real-time updates, which may
limit its applicability in dynamic decision-making scenarios.
(vi) Classification: Data classification is a process that organizes
data into specific categories to enhance its use and ensure
effective protection. This process makes data easier to locate,
reduces duplication, saves storage costs, and accelerates
retrieval. It's crucial for risk management, compliance, and
data security, often being a regulatory requirement. Data
classification involves tagging data with labels to denote its
type, sensitivity, and integrity, which aids in applying suitable
security measures. The three primary methods of data
classification are content-based, which analyzes file contents
for sensitive data; context-based, which uses metadata like
location or creator; and user-based, which depends on human
judgment during document handling. Overall, data
classification helps manage data security and compliance by
defining how data is accessed and protected.
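
The sketch below illustrates the content-based approach in its simplest form: scanning text for patterns that look like sensitive identifiers and tagging the document accordingly. The two patterns (a PAN-style code and an e-mail address) are simplified assumptions rather than a complete rule set.

```python
# Rough content-based classification sketch: flag documents containing
# patterns that resemble sensitive identifiers.
import re

SENSITIVE_PATTERNS = {
    "PAN":   re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify(document: str) -> str:
    """Return 'high risk' if any sensitive pattern is found, else 'low risk'."""
    for name, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(document):
            return f"high risk (contains {name})"
    return "low risk"

print(classify("Customer PAN ABCDE1234F, contact dar@example.com"))
print(classify("Quarterly newsletter draft for public release"))
```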

In managing data, organizations classify their data and systems into three risk categories to ensure appropriate
handling and security measures:

Low Risk: This category includes data that is public and easily
recoverable. Such data poses minimal risk if accessed by
unauthorized individuals because it is intended for wide
distribution or has limited sensitivity.

Moderate Risk: Data in this category is not public but also not
critical to operations. It includes internal data such as
proprietary operating processes, product costs, and some
corporate documents. While this data is not intended for public
access, its exposure poses a lesser threat than high-risk data.

High Risk: This category encompasses data that is highly sensitive or critical to operational security. High-risk data
includes any information that, if compromised, could lead to
significant financial loss, legal repercussions, or reputational
damage. This also covers data that is difficult to recreate or
retrieve if lost.
The data classification process is crucial for managing and protecting an organization's data
effectively. It involves several key steps:

1. Determine Classification Criteria and Categories: Organizations need to establish clear criteria
and categories for classifying data. This involves understanding and defining the organization's
objectives for the data, and the implications of each category in terms of security, privacy, and
compliance requirements.

2. Set Up Operational Framework: Once categories are defined, organizations must outline the
roles and responsibilities of employees and third parties involved in data management. This
includes detailing how data should be stored, transferred, and retrieved within these roles.

3. Develop and Implement Policies and Procedures: Policies should clearly articulate the security
needs, confidentiality requirements, and handling procedures for each data type. These policies
need to be simple enough for all staff members to understand and follow, ensuring compliance
and mitigating security risks.

4. Understanding the Current Setup: Before classifying data, it's essential to have a comprehensive
understanding of where all organizational data is stored and any relevant legislation that may
affect its handling. This step ensures that all data is accounted for and that the classification
aligns with legal requirements.

5. Create a Data Classification Policy: Developing a formal data classification policy is critical. This
policy serves as the backbone for all data classification efforts, providing guidelines that help
maintain compliance with data protection standards.

6. Prioritize and Organize Data: With a policy in place, data can be systematically categorized
according to its sensitivity and privacy requirements. This involves tagging data accurately and
prioritizing security measures based on the classification level of the data.
Chapter – 9.3
Data Organization and Distribution

Data organization involves structuring unstructured data into clear categories and groups to
facilitate easier access, analysis, and manipulation. This process, essential for efficient data
management, includes techniques such as classification, frequency distribution tables, and various
graphical representations. As data volumes grow, organizing data becomes critical to reduce search
times and enhance usability. In business contexts, both semi-structured and unstructured data are
analyzed and integrated into comprehensive data systems using advanced technological tools.
Effective data organization is vital for businesses across industries, enabling improved business
intelligence, streamlined operations, and overall enhancement of business models. It transforms raw,
unstructured data into valuable assets that drive decision-making and strategic planning.

Data distribution is a statistical function that organizes and quantifies the possible values of a
variable and their probabilities of occurrence. This process is essential for determining the type
of distribution a population follows, allowing for the appropriate statistical methods to be applied
for analysis. In practice, data distributions are often visualized using graphs such as histograms,
box plots, and pie charts, which help to estimate the likelihood of specific observations within a
data set. Probability distributions, a key aspect of data distributions, provide a mathematical
framework for predicting the outcomes of various scenarios based on the types of random variables
involved, whether discrete or continuous. This facilitates decision-making through statistical
measures like mean, mode, range, and probability.

Types of distribution

Distributions are basically classified based on the type of data:

(i) Discrete distributions: A discrete distribution results from countable data and has a finite
number of potential values. In addition, discrete distributions may be displayed in tables, and the
values of the random variable can be counted. Examples: rolling dice, obtaining a specific number
of heads, etc.
Following are the discrete distributions of various types:

(a) Binomial distributions: The binomial distribution quantifies the chance of obtaining a specific
number of successes or failures in each experiment. The binomial distribution applies to attributes that
are categorised into two mutually exclusive and exhaustive classes, such as number of
successes/failures and number of acceptances/rejections.

Example: When tossing a coin: The likelihood of a coin falling on its head is one-half and the
probability of a coin landing on its tail is one-half.
(b) Poisson distribution: The Poisson distribution is the discrete probability distribution that
quantifies the chance of a certain number of events occurring in a given time period, where the
events occur independently and at a known average rate.

The Poisson distribution applies to attributes that can potentially take on huge values, but in
practice take on small ones.
Example: Number of flaws, mistakes, accidents, absentees etc.
(c) Hypergeometric distribution: The hypergeometric distribution is a discrete distribution that
assesses the chance of a certain number of successes in (n) trials, without replacement, from a
sufficiently large population (N); that is, sampling without replacement. The hypergeometric
distribution is comparable to the binomial distribution; the primary distinction between the two
is that the chance of success is the same for all trials in the binomial distribution, whereas in
the hypergeometric distribution it changes from trial to trial because items are not replaced.

(d) Geometric distribution: The geometric distribution is a discrete distribution that assesses the
probability of the occurrence of the first success. A possible extension is the negative binomial
distribution.

Example: A marketing representative from an advertising firm chooses hockey players from
several institutions at random till he discovers an Olympic participant.
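
The probabilities described above can be computed directly; the sketch below uses scipy.stats with arbitrary example parameters.

```python
# Quick numerical illustrations of the discrete distributions above (scipy.stats).
from scipy import stats

# Binomial: probability of exactly 6 heads in 10 fair coin tosses.
print(stats.binom.pmf(k=6, n=10, p=0.5))

# Poisson: probability of observing 2 defects when the average is 1.5 per batch.
print(stats.poisson.pmf(k=2, mu=1.5))

# Hypergeometric: probability of drawing 3 defectives in a sample of 10,
# from a lot of 50 items containing 5 defectives (sampling without replacement).
print(stats.hypergeom.pmf(k=3, M=50, n=5, N=10))

# Geometric: probability that the first success occurs on the 4th trial, p = 0.2.
print(stats.geom.pmf(k=4, p=0.2))
```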

(ii) Continuous distributions: A continuous distribution has an unlimited number of (variable) data
points that may be represented on a continuous measuring scale. A continuous random variable is a
random variable with an unlimited and uncountable set of potential values. It is more than a simple
count and is often described using a probability density function (pdf). The probability density
function describes the characteristics of the random variable; frequencies are normally clustered
around a central value, so the probability density function can be viewed as the distribution's
"shape."
Following are the continuous distributions of various types:

(i) Normal distribution: Gaussian distribution is another name for the normal distribution. It is a
bell-shaped curve with a greater frequency (probability density) around the central point. As values
move away from the centre value on either side, the frequency drops dramatically. In other words,
features whose measurements are expected to fall on either side of the target value with equal
likelihood follow a normal distribution.

(ii) Lognormal distribution: A continuous random variable x follows a lognormal distribution if the
distribution of its natural logarithm, ln(x), is normal. Just as the sum of many independent random
variables approaches a normal distribution as the sample size rises, the product of many independent
positive random variables approaches a lognormal distribution, independent of the distribution of
the individual variables, because its logarithm is such a sum.

(iii) F distribution: The F distribution is often employed to examine the equality of variances
between two normal populations. The F distribution is an asymmetric distribution with no maximum
value and a minimum value of 0. The curve approaches the horizontal axis but never touches it.

(iv) Chi-square distribution: When independent variables with a standard normal distribution are
squared and added, the chi-square distribution occurs. Example: y = Z1^2 + Z2^2 + Z3^2 + Z4^2 + ... + Zn^2,
where each Zi is a standard normal random variable. The distribution of chi-square values is
asymmetric (skewed to the right), is bounded below by zero, and approaches the shape of the normal
distribution as the number of degrees of freedom grows.

(v) Exponential distribution: The exponential distribution is a probability distribution and one of
the most often employed continuous distributions. It is used frequently to represent products with a
consistent failure rate, and it is closely connected to the Poisson distribution. It has a constant
failure rate because its shape characteristics remain constant (it is memoryless).

(vi) Student's t distribution: The t distribution, or Student's t distribution, is a probability
distribution with a bell shape that is symmetrical about its mean. It is used frequently for testing
hypotheses and building confidence intervals for means, and it is substituted for the normal
distribution when the population standard deviation cannot be determined and must be estimated from
the sample. When random variables are averages, the distribution of the average tends toward the
normal distribution, independent of the distribution of the individuals.
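
Corresponding numerical sketches for these continuous distributions, again using scipy.stats with arbitrary example parameters, are shown below.

```python
# Numerical sketches for the continuous distributions above (scipy.stats).
from scipy import stats

# Normal: probability that a value falls within one standard deviation of the mean.
print(stats.norm.cdf(1) - stats.norm.cdf(-1))        # about 0.6827

# Lognormal: density at x = 1.0 for a variable whose log is standard normal.
print(stats.lognorm.pdf(1.0, s=1.0))

# F: critical value at the 95th percentile with 5 and 10 degrees of freedom.
print(stats.f.ppf(0.95, dfn=5, dfd=10))

# Chi-square: probability of exceeding 7.81 with 3 degrees of freedom (about 0.05).
print(stats.chi2.sf(7.81, df=3))

# Exponential: probability a component with mean life 1000 hours fails before 500 hours.
print(stats.expon.cdf(500, scale=1000))

# Student's t: two-sided 95% critical value with 9 degrees of freedom.
print(stats.t.ppf(0.975, df=9))
```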
Chapter – 9.4
Data Cleaning and Validation
Data cleaning is the crucial process of identifying and correcting or removing inaccurate,
corrupted, or irrelevant records from a dataset. This process becomes particularly important
when datasets from multiple sources are merged, leading to potential duplications and
mislabeling. Proper data cleaning ensures the reliability and accuracy of analytics and algorithms.
Unlike data transformation, which involves changing the format or structure of data, data
cleaning focuses solely on purifying the dataset by removing flawed, duplicate, or irrelevant data.
Establishing a standardized data cleaning procedure, tailored to specific datasets, is essential for
maintaining data integrity during analysis and decision-making processes.

Data cleaning is a fundamental process in data management, consisting of several key steps to
ensure the accuracy and utility of a dataset:

1. Removal of Duplicate and Irrelevant Information: Identify and eliminate any duplicate entries
or data points that are not relevant to the study's focus, such as data from unrelated demographic
groups, to streamline the dataset and align it more closely with research objectives.

2. Fix Structural Errors: Correct inconsistencies in the data, such as typos, unusual naming
conventions, or inconsistent capitalization, which can lead to mislabeled categories or erroneous
classifications.

3. Filter Unwanted Outliers: Evaluate outliers to determine whether they represent errors or are
valid data points that could potentially confirm a hypothesis. Remove outliers only if they are
clearly erroneous or irrelevant to the analysis.

4. Handle Missing Data: Address missing values, which are problematic for many analytical
algorithms. Options include removing observations with missing data, which risks losing valuable
information, or imputing missing values based on other observations, which could introduce bias.

5. Validation and QA: After cleaning, validate the data to ensure it makes sense, adheres to
applicable standards, and supports or refutes the working hypothesis. Check if the data patterns
can generate further hypotheses, and establish a culture of data quality within the organization
to prevent the future generation of flawed data.
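
A minimal pandas sketch of these cleaning steps follows; the column names, the outlier rule, and the median-imputation choice are illustrative assumptions.

```python
# Minimal data-cleaning sketch following the steps above.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Anu", "anu ", "Ravi", "Meera", "Dev"],
    "age":      [34, 34, 41, np.nan, 290],      # a missing value and an impossible outlier
    "city":     ["pune", "Pune", "DELHI", "Delhi", "Mumbai"],
})

# 1. Fix structural errors: trim whitespace and normalise capitalisation.
raw["customer"] = raw["customer"].str.strip().str.title()
raw["city"] = raw["city"].str.strip().str.title()

# 2. Remove duplicate records created by the inconsistent formatting.
clean = raw.drop_duplicates(subset=["customer", "age"])

# 3. Filter clearly erroneous outliers (no customer is 290 years old).
clean = clean[clean["age"].between(0, 120) | clean["age"].isna()].copy()

# 4. Handle missing data by imputing the median age (one option among several).
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```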
Benefits of quality data

Determining the quality of data needs an analysis of its properties and a weighting of those
attributes based on what is most essential to the company and the application(s) for which the
data will be utilised.

Main characteristics of quality data are:


(i) Validity
(ii) Accuracy
(iii) Completeness
(iv) Consistency
Benefits of data cleaning

Ultimately, having clean data boosts overall productivity and provides the highest-quality
information for decision-making. Benefits include:
(i) Error correction when numerous data sources are involved.
(ii) Fewer mistakes result in happier customers and less irritated workers.
(iii) Capability to map the many functions and planned uses of your data.
(iv) Monitoring mistakes and improving reporting to determine where errors are originating can
make it easier to repair inaccurate or damaged data in future applications.
(v) Using data cleaning technologies will result in more effective corporate procedures and speedier
decision-making.

Data validation is a critical yet often overlooked step in data management that ensures the
accuracy, clarity, and relevance of data before its use. It involves verifying the precision and
appropriateness of both the data inputs and the data model itself. Modern data integration
systems can automate and integrate validation into the workflow, streamlining the process and
preventing it from being a bottleneck. Validating data helps avoid "garbage in, garbage out" issues
and ensures that decisions are based on reliable and current information, ultimately supporting
the validity of analytical conclusions and mitigating the risk of project failures.

Types of data validation

1. Data type check: A data type check verifies that the entered data has the appropriate data
type. For instance, a field may only take numeric values. If this is the case, the system should
reject any data containing other characters, such as letters or special symbols.
2. Code check: A code check verifies that a field’s value is picked from a legitimate set of options
or that it adheres to specific formatting requirements. For instance, it is easy to verify the validity
of a postal code by comparing it to a list of valid codes. The same principle may be extended to
other things, including nation codes and NIC industry codes.

3. Range check: A range check determines whether or not input data falls inside a specified range.
Latitude and longitude, for instance, are frequently employed in geographic data. A latitude value
must fall between -90 and 90 degrees, whereas a longitude value must fall between -180 and
180 degrees. Outside of this range, values are invalid.

4. Format check: Numerous data kinds adhere to a set format. Date columns that are kept in a
fixed format, such as “YYYY-MM-DD” or “DD-MM-YYYY,” are a popular use case. A data
validation technique that ensures dates are in the correct format contributes to data and
temporal consistency.

5. Consistency check: A consistency check is a form of logical check that verifies that the data has
been input in a consistent manner. Checking whether a package’s delivery date is later than its
shipment date is one example.

6. Uniqueness check: Some data, like PAN numbers or e-mail IDs, are unique by nature. These fields should
typically contain unique items in a database. A uniqueness check guarantees that an item is not
entered into a database multiple times.
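
The sketch below combines several of the checks described above into a single validation routine; the record layout, the postal-code list, and the date format are simplified assumptions.

```python
# Combined sketch of the validation checks described above.
import re

VALID_PIN_CODES = {"110001", "400001", "560001"}   # illustrative subset only

def validate(record: dict) -> list:
    errors = []
    # 1. Data type check: amount must be numeric.
    if not isinstance(record.get("amount"), (int, float)):
        errors.append("amount must be numeric")
    # 2. Code check: postal code must come from the approved list.
    if record.get("pin_code") not in VALID_PIN_CODES:
        errors.append("unknown postal code")
    # 3. Range check: latitude must lie between -90 and 90 degrees.
    if not -90 <= record.get("latitude", 0) <= 90:
        errors.append("latitude out of range")
    # 4. Format check: dates must follow YYYY-MM-DD.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", record.get("ship_date", "")):
        errors.append("ship_date not in YYYY-MM-DD format")
    # 5. Consistency check: delivery cannot precede shipment
    #    (lexicographic comparison works for YYYY-MM-DD strings).
    if record.get("delivery_date", "") < record.get("ship_date", ""):
        errors.append("delivery date earlier than shipment date")
    # A uniqueness check would require the whole dataset and is omitted here.
    return errors

record = {"amount": 2500.0, "pin_code": "110001", "latitude": 28.6,
          "ship_date": "2024-03-01", "delivery_date": "2024-02-27"}
print(validate(record))   # -> ['delivery date earlier than shipment date']
```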
