
Data Generalization


Guide: Data Masking

Data Generalization: The Specifics of Generalizing Data
Data mining is not a new concept that emerged with the digital revolution. The underlying idea has been around for about a century and gained popularity in the 1930s. In 1936, Alan Turing proposed a universal machine that could perform computations comparable to those of today's computers, an idea that helped lay the groundwork for data mining.

Data mining, also called Knowledge Discovery in Databases (KDD), is a technique for extracting patterns and other useful information from large data sets. Thanks to advances in data warehousing technologies and the rise of big data, the use of data mining techniques has grown rapidly in recent decades, helping businesses turn raw data into valuable knowledge.

In this article, you will get an in-depth view of a concept closely tied to data mining: data generalization. Specifically:

What is Data Generalization?

When is Data Generalization Important?

Data Generalization vs. Data Mining

Data Generalization vs. Data Aggregation

Approaches to Data Generalization

Examples of Data Generalization

What is Data Generalization?

Put simply, data generalization in data mining is the process of broadening the classification of data in a database. It lets a user zoom out from individual data points to a broader picture of trends or insights.

Here is an example of generalization in data mining.

If you have a data set with a collection of people’s ages, the data generalization process would replace each exact age with a broader age range (for instance, 27 becomes 20-29).

Data generalization in data mining substitutes a precise value with a less precise one, which may seem counterintuitive. Still, it is a practical and widely used technique in data mining, analysis, and secure storage.
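As a minimal sketch of the age example, the snippet below replaces each exact age with the ten-year range that contains it. The function name `generalize_age`, the bin width of 10, and the sample ages are illustrative assumptions, not part of the guide.

```python
def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with the range (bin) that contains it."""
    low = (age // width) * width          # lower bound of the bin
    return f"{low}-{low + width - 1}"     # e.g. 27 -> "20-29"


# Hypothetical sample data
ages = [23, 27, 31, 38, 45]
generalized = [generalize_age(a) for a in ages]
print(generalized)
```

A wider `width` hides more detail and a narrower one keeps more, which is the privacy versus accuracy trade-off the text describes.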

https://satoricyber.com | contact@satoricyber.com 33

Two Forms of Data Generalization in Data Mining

There are two main forms of data generalization in data mining: Automated and Declarative.

Automated Generalization distorts values until a specified value of k is reached, where k is the k-anonymity threshold: every record must be indistinguishable from at least k-1 others on the generalized attributes. Because an algorithm can apply the least distortion required to reach the stated value of k, this method may offer the best balance between privacy and accuracy. You can select which fields matter most for your use case, and those values can be blurred using one of several approaches to achieve any value of k.
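One way to picture automated generalization is a search for the smallest bin width at which every bin holds at least k records. The sketch below is only an illustration under that assumption. The helper names, the candidate widths, and the sample ages are hypothetical, and real tools use more sophisticated algorithms.

```python
from collections import Counter


def generalize(value: int, width: int) -> int:
    """Map a value to the lower bound of its bin of the given width."""
    return (value // width) * width


def smallest_width_for_k(values, k, widths):
    """Return the smallest candidate width whose bins each hold at least k values."""
    for width in sorted(widths):
        counts = Counter(generalize(v, width) for v in values)
        if min(counts.values()) >= k:
            return width
    raise ValueError("no candidate width reaches the requested k")


# Hypothetical sample data
ages = [21, 24, 26, 33, 35, 38, 41, 44, 47]
width = smallest_width_for_k(ages, k=3, widths=[1, 5, 10, 20, 50])
print(width)
```

Trying the widths in ascending order means the first one that satisfies k is also the one that distorts the data least among the candidates.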

Declarative Generalization, on the other hand, lets you set the bin sizes upfront, such as always rounding dates to whole months. Outliers are sometimes discarded in this procedure, which can skew the data and introduce bias. Keep in mind that declarative generalization does not always lead to k-anonymity.

Although declarative generalization may not help you reach k-anonymity, it is a good idea to use it as a default. That way, the recipient of the de-identified material sees only the level of detail they need.
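The following sketch illustrates a declared rule of "always round to whole months" and shows why such a rule does not by itself guarantee k-anonymity. The function name `to_month` and the sample dates are assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def to_month(d: date) -> str:
    """Declared rule: always coarsen a date to its year and month."""
    return d.strftime("%Y-%m")


# Hypothetical sample data
visits = [date(2023, 1, 5), date(2023, 1, 19), date(2023, 1, 28), date(2023, 2, 3)]
months = [to_month(d) for d in visits]

# Size of the smallest group after generalization
group_sizes = Counter(months)
smallest_group = min(group_sizes.values())
print(months, smallest_group)
```

Here the February record still stands alone after rounding, so the data would not meet k = 2 even though the declared rule was applied consistently.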

Identifiers used in Data Generalization in Data Mining

Identifiers are data points about a subject that can determine their identity and link to other personal
information. There are two main types of identifiers: direct identifiers and quasi-identifiers.

Direct identifiers are data points that can identify an individual on their own and allow other data to be linked to that person. A data point can be a direct identifier even if the same value appears more than once in the data. For example, even if two people are named “Mary,” the name is still a direct identifier.

Quasi-identifiers, on the other hand, do not identify a person on their own, but they can do so in combination with additional information. Quasi-identifiers may be unique within a data collection, but they are also expected to appear in other data sets, now or in the future.

Suppose you have a data set that includes a person’s gender and zip code. There will likely be enough people of that gender living in that zip code that this person cannot be identified from those two data points alone. However, suppose that person also appears in another data collection, including their gender, zip

When is Data Generalization Important?

Data generalization in data mining allows you to abstract personal data by removing identifying
characteristics.


This generalization lets you examine the data you have collected without jeopardizing the privacy of the people in your dataset. Keep in mind that there are several methods for generalizing data, and you should choose the one that makes the most sense for your case. In some circumstances, masking direct identifiers is the best course of action, while in others, you want to preserve the analytical signal in the data.

Remember that there is no one-size-fits-all solution for preserving privacy. For that reason, you should also learn about related approaches like tokenization, redaction, and pseudonymization. Once you understand those concepts, you can apply them as needed to get the most out of your data without jeopardizing privacy.

Data Generalization vs. Data Mining

Telling data generalization apart from data mining need not be difficult.

Data generalization is the process of summarizing data by replacing relatively low-level values with higher-level concepts. In contrast, data mining involves investigating and analyzing large blocks of data to uncover relevant patterns and trends. Put simply, data generalization is a type of descriptive data mining.

Data Generalization vs. Data Aggregation

Data aggregation is a notion linked to, and frequently confused with, data generalization in data mining.

The primary distinction between data generalization and data aggregation is that aggregation builds a composite class out of several component classes. In contrast, generalization forms a single general class out of several more specialized classes.

Put simply: aggregation combines parts into a whole, while generalization groups specific cases under a broader category.

Approaches to Data Generalization

There are two basic approaches to Data Generalization in Data Mining:


Data Cube Approach


In most cases, a data cube makes data easier to understand. It is especially helpful for presenting data along dimensions, with measures that reflect specific business needs. Each cube dimension reflects a different aspect of the database, such as daily, monthly, or yearly sales.

The data in a data cube lets you analyze nearly any figure for any or all customers, sales agents, and products, among other things. As a result, a data cube can help identify trends and analyze performance.

In a nutshell:
It is also known as the OLAP (Online Analytical Processing) approach.

It is a practical strategy because it helps build graphs of past sales.

In this method, the data cube holds the computations and their results.

Roll-up and drill-down operations are applied to the data cube.

Aggregate functions like count(), sum(), average(), and max() are commonly used in these operations.

The resulting materialized views can then be used for decision-making, information discovery, and various other purposes.
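To make roll-up and drill-down concrete, the sketch below aggregates a small set of sales rows at two levels of the time dimension using count, sum, average, and max. The row layout, the `aggregate` helper, and the figures are hypothetical. A real OLAP engine would precompute and store such results in the cube.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, product, amount)
sales = [
    (2023, 1, "bread", 120.0),
    (2023, 1, "butter", 80.0),
    (2023, 2, "bread", 100.0),
    (2024, 1, "bread", 150.0),
]


def aggregate(rows, key):
    """Group rows by key(row) and compute count, sum, average, and max of the amount."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[3])
    return {
        k: {"count": len(v), "sum": sum(v), "average": sum(v) / len(v), "max": max(v)}
        for k, v in groups.items()
    }


# Drill-down view: one cell per (year, month)
monthly = aggregate(sales, key=lambda r: (r[0], r[1]))
# Roll-up view: months are merged into one cell per year
yearly = aggregate(sales, key=lambda r: r[0])

print(monthly[(2023, 1)])
print(yearly[2023])
```

Rolling up from months to years is itself a generalization step: the monthly detail is replaced by a coarser yearly summary.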

Attribute Oriented Induction

Attribute-Oriented Induction is a database mining technique that compresses the original data collection into a generalized relation, yielding concise, high-level information about large datasets.

Moreover, attribute generalization in data mining transforms similar data collections, originally stated at a low (primitive) level in a database, into more abstract conceptual representations.

In a nutshell:
Attribute-oriented induction is a query-oriented, generalization-based approach to online data analysis.

With this method, each attribute is generalized based on the number of distinct values it has within the relevant data set. Identical tuples are then merged, and their counts are accumulated, to perform the aggregation.

Unlike the data cube approach, which performs offline aggregation before an OLAP or data mining query is submitted for processing, it works online at query time.

It is not restricted to categorical data or to particular measures.

Attribute-Oriented Induction uses two methods:

Attribute removal

Attribute generalization
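The two methods can be sketched together on a toy relation: one attribute is removed, the others are generalized, and identical generalized tuples are merged with their counts accumulated. The sample rows, the city-to-country hierarchy, and the decision to drop the name column are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical rows: (name, age, city)
people = [
    ("Ana", 23, "Lyon"),
    ("Ben", 27, "Paris"),
    ("Cal", 34, "Lyon"),
    ("Dee", 29, "Paris"),
]

# Hypothetical concept hierarchy: city -> country
CITY_TO_COUNTRY = {"Lyon": "France", "Paris": "France"}


def generalize_row(row):
    """Apply attribute removal and attribute generalization to one tuple."""
    _name, age, city = row                 # attribute removal: the name is dropped
    decade = (age // 10) * 10
    age_range = f"{decade}-{decade + 9}"   # attribute generalization: age -> range
    country = CITY_TO_COUNTRY[city]        # attribute generalization: city -> country
    return (age_range, country)


# Merge identical generalized tuples and accumulate their counts
generalized_relation = Counter(generalize_row(r) for r in people)
print(generalized_relation)
```

The result is a much smaller relation in which each row carries a count, which is the generalized relation the text refers to.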


Examples of Data Generalization

Market Basket Analysis is one of the best-known examples of data generalization in data mining. It is a method for analyzing the purchases a customer makes in a supermarket.

The idea is to identify the items that a customer buys together. What are the chances that a person who buys bread will also buy butter? This analysis helps companies plan promotions and discounts, and data mining is used to carry it out.
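The bread-and-butter question can be expressed with two standard association-rule measures, support and confidence. The guide does not name them, so treat this as one common way to quantify the idea. The baskets and function names below are hypothetical.

```python
# Hypothetical transactions: each set is one customer's basket
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated chance of buying the consequent given the antecedent was bought."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


bread_to_butter = confidence({"bread"}, {"butter"}, baskets)
print(round(bread_to_butter, 2))
```

In this toy data, two of the three baskets containing bread also contain butter, so the estimated chance is about two in three.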

Moreover, Market Basket Analysis is commonly used in business reporting for sales or marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting, and similar areas. Other sectors, such as agriculture, are also beginning to find new ways to use this analysis.

Conclusion

As businesses realize the significance of data, they are continuously finding ways to use and leverage it to their advantage. As a result, data scientists have become increasingly important to companies worldwide as they strive to achieve more with data science than ever before. With this, however, comes the need to protect individuals’ privacy and stay compliant, which calls for data generalization as well as other data anonymization strategies.
