Data Generalization
Data Mining, also called Knowledge Discovery in Data (KDD), is a technique for extracting patterns and
other useful information from huge data sets. Because of the advancements in data warehousing
technologies and the rise of big data, the use of data mining techniques has exploded in recent decades,
supporting businesses in turning raw data into valuable knowledge.
In this article, you will receive an in-depth view of a concept closely tied to data mining: data
generalization. Specifically:
When asked what generalization means in data mining, one can simply answer that data
generalization is the process of broadening the classification of data in a database. This helps a user zoom
out from the data to get a broader picture of trends or insights.
If you have a data set with a collection of people’s ages, for example, the data generalization process would
replace each exact age with a broader age range.
Data generalization in data mining substitutes a precise value with a less precise one, which may appear
counterintuitive. Still, it is a practical and widely used technique in data mining, analysis, and secure storage.
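As a minimal sketch of this substitution, the snippet below replaces exact ages with ten-year ranges. The function name `generalize_age`, the bin width of 10, and the sample ages are all illustrative assumptions, not values taken from this guide.

```python
def generalize_age(age: int, bin_width: int = 10) -> str:
    """Replace an exact age with the fixed-width range that contains it."""
    lower = (age // bin_width) * bin_width
    upper = lower + bin_width - 1
    return f"{lower}-{upper}"


# Hypothetical sample data.
ages = [23, 27, 34, 38, 41]
generalized = [generalize_age(a) for a in ages]
print(generalized)  # ['20-29', '20-29', '30-39', '30-39', '40-49']
```

Each output value is less precise than its input, yet the overall age distribution is still visible.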
https://satoricyber.com | contact@satoricyber.com 33
Guide: Data Masking
There are two main forms of data generalization in data mining: automated and declarative.
Automated generalization distorts values until a given value of k is reached, where k is the k of
k-anonymity: every record must be indistinguishable from at least k-1 others. Because an algorithm can
apply the least amount of distortion required to reach the stated value of k, this method may offer the best
balance between privacy and accuracy. You can select which fields are most important for your use case,
and those values can be blurred using one of several approaches to achieve any value of k.
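One simple way to picture automated generalization is a search over bin widths that stops at the narrowest width whose bins all contain at least k values. The sketch below assumes a single numeric attribute and a hypothetical list of candidate widths. Real k-anonymity tools work across several quasi-identifiers at once and use more refined strategies.

```python
from collections import Counter


def smallest_group(values, width):
    """Size of the least-populated bin when values are grouped by `width`."""
    bins = Counter(v // width for v in values)
    return min(bins.values())


def auto_generalize(values, k, widths=(1, 5, 10, 20, 50, 100)):
    """Return the narrowest candidate width whose bins all hold >= k values,
    together with the generalized labels for each input value."""
    for width in widths:
        if smallest_group(values, width) >= k:
            labels = []
            for v in values:
                lower = (v // width) * width
                labels.append(f"{lower}-{lower + width - 1}")
            return width, labels
    raise ValueError("no candidate width reaches the requested k")


# Hypothetical ages; widths 1 and 5 leave some bins with fewer than 3 people.
ages = [21, 23, 24, 36, 37, 38, 52, 55, 59]
width, generalized = auto_generalize(ages, k=3)
print(width)        # 10
print(generalized)  # three values each in '20-29', '30-39' and '50-59'
```

Trying the candidate widths from narrowest to widest is what keeps the distortion as small as the stated k allows.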
Declarative generalization, on the other hand, lets you set the bin sizes upfront, such as always
rounding to whole months. Outliers are sometimes discarded in this procedure, which might skew the
data and add bias. Remember, too, that declarative generalization does not always lead to
k-anonymity.
Although declarative generalization may not help you reach k-anonymity, it is a good idea to use it as a
default. That way, the recipient of the de-identified material sees only the level of detail they need.
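The following sketch illustrates declarative generalization with the "round to whole months" rule, and shows why a fixed rule does not guarantee k-anonymity. The dates and the helper name `generalize_to_month` are hypothetical.

```python
from collections import Counter
from datetime import date


def generalize_to_month(d: date) -> str:
    """Declarative rule: always keep only the year and month of a date."""
    return d.strftime("%Y-%m")


# Hypothetical visit dates.
visits = [date(2023, 1, 4), date(2023, 1, 29), date(2023, 2, 11)]
months = [generalize_to_month(d) for d in visits]
print(months)  # ['2023-01', '2023-01', '2023-02']

# The bin size was fixed upfront, so nothing forces each bin to hold k records.
group_sizes = Counter(months)
print(min(group_sizes.values()))  # 1
```

Here the February bin contains a single record, so this output would not satisfy k-anonymity for any k greater than 1.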
Identifiers are data points about a subject that can determine their identity and link to other personal
information. There are two main types of identifiers: direct identifiers and quasi-identifiers.
Direct identifiers are data points that can identify an individual on their own while allowing other data to
be linked to that person. A data point can be a direct identifier even if the same value appears more than
once in the data. For example, even if two people are named “Mary,” the name is still a direct identifier.
Quasi-identifiers, on the other hand, do not allow you to identify a person on their own. Still, you can use
them in conjunction with additional information to do so. Quasi-identifiers can be unique within a data
collection, but they are also expected to appear in other data sets, either now or in the near
future.
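To see how quasi-identifiers become identifying in combination, one rough check is to count how many records share each combination of values. The records, field names, and the helper `smallest_class` below are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical records; none of these fields is a direct identifier.
records = [
    {"gender": "F", "zip": "10001", "birth_year": 1985},
    {"gender": "F", "zip": "10001", "birth_year": 1990},
    {"gender": "M", "zip": "10001", "birth_year": 1985},
    {"gender": "M", "zip": "10001", "birth_year": 1985},
]


def smallest_class(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(combos.values())


print(smallest_class(records, ["gender", "zip"]))                # 2
print(smallest_class(records, ["gender", "zip", "birth_year"]))  # 1
```

With two attributes every person hides in a group of two. Adding a third attribute leaves some people in a group of one, which means they can be singled out.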
Suppose you have a data set that includes a person’s gender and zip code. There will usually be enough
people of that gender who live in that zip code that this person cannot be identified based on those two
data points alone. However, suppose that person also appears in another data collection, including their gender, zip
Data generalization in data mining allows you to abstract personal data by removing identifying
characteristics.
This generalization allows you to examine the data you have collected without jeopardizing the privacy of
the people in your dataset. It is crucial to remember that there are several methods for generalizing data, and
you should choose the one that makes the most sense for your case. In some circumstances, masking
direct identifiers is the best course of action, while in others, you want to preserve the signal for data analytics.
Remember that there is no one-size-fits-all solution for retaining privacy. For that reason, you should learn
about other approaches such as tokenization, redaction, and pseudonymization. Once you understand
those concepts, you can apply them as needed to get the most out of your data without jeopardizing
privacy.
Distinguishing data generalization from data mining need not be difficult.
Data generalization is the process of summarizing data by replacing relatively low-level values with
higher-level concepts. In contrast, data mining involves investigating and analyzing large volumes of data to
uncover relevant patterns and trends. Put simply, data generalization is a type of descriptive data
mining.
Data aggregation is a notion linked to, and frequently confused with, data generalization in data mining.
The primary distinction is that aggregation combines many records into a single summary value, such as a
count or a sum. In contrast, generalization keeps each record but replaces its specific values with a more
general class.
Put simply:
A data cube lets you analyze nearly any figure for any or all customers, sales agents,
products, and other dimensions. As a result, a data cube can assist in identifying trends and analyzing
performance.
In a nutshell:
This is also known as the OLAP (Online Analytical Processing) approach.
The data cube is used to hold the computations and results in this method.
Aggregate functions like count(), sum(), average(), and max() are commonly used in these procedures.
You can then use these materialized views for decision-making, information discovery, and
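As a rough sketch of the data cube idea, the snippet below precomputes count and sum for every combination of dimensions over a few hypothetical sales rows. The row contents, the `build_cube` helper, and the chosen dimensions are assumptions for illustration. Production OLAP engines store and query cubes far more efficiently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales facts.
sales = [
    {"customer": "Acme", "agent": "Lee", "product": "Widget", "amount": 100},
    {"customer": "Acme", "agent": "Kim", "product": "Gadget", "amount": 250},
    {"customer": "Bolt", "agent": "Lee", "product": "Widget", "amount": 75},
]
DIMENSIONS = ("customer", "agent", "product")


def build_cube(rows, dimensions, measure):
    """Materialize count and sum of `measure` for every subset of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cells = defaultdict(lambda: {"count": 0, "sum": 0})
            for row in rows:
                key = tuple(row[d] for d in dims)
                cells[key]["count"] += 1
                cells[key]["sum"] += row[measure]
            cube[dims] = dict(cells)
    return cube


cube = build_cube(sales, DIMENSIONS, "amount")
print(cube[()][()])                    # {'count': 3, 'sum': 425}
print(cube[("agent",)][("Lee",)])      # {'count': 2, 'sum': 175}
```

Because every grouping is computed ahead of time, a later question such as "total sales per agent" becomes a dictionary lookup rather than a new scan of the raw rows.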
Attribute-Oriented Induction is a database mining technique that compresses the original data collection
into a generalized relation, resulting in concise and comprehensive information about huge datasets.
Moreover, attribute generalization in data mining allows similar data collections, originally stated at a
low (primitive) level in a database, to be transformed into more abstract conceptual representations.
In a nutshell:
Attribute generalization in data mining is a query-oriented, generalization-based technique for online
data analysis.
Generalizations are made using this method based on the distinct values of each attribute within the
relevant data set. Then, to perform aggregation, identical tuples are merged and their corresponding counts
are accumulated.
By contrast, the data cube approach performs offline aggregation before an OLAP or data mining query is submitted for processing.
Attribute removal
Attribute generalization
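The sketch below illustrates the two operations on a few hypothetical records: the `name` attribute is removed, `city` and `age` are generalized through assumed concept hierarchies, and identical generalized tuples are merged with their counts accumulated. The hierarchy mapping, the age bands, and all names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Paris": "France", "Lyon": "France", "Berlin": "Germany"}


def age_band(age: int) -> str:
    """Hypothetical two-level hierarchy for age."""
    return "under 30" if age < 30 else "30 and over"


# Hypothetical primitive-level records.
people = [
    {"name": "Ana", "city": "Paris", "age": 22},
    {"name": "Ben", "city": "Lyon", "age": 24},
    {"name": "Cleo", "city": "Berlin", "age": 35},
    {"name": "Dev", "city": "Paris", "age": 41},
]


def induce(rows):
    """Build a generalized relation: generalized tuple -> accumulated count."""
    generalized = []
    for row in rows:
        # Attribute removal: `name` has many distinct values and no hierarchy,
        # so it is dropped. Attribute generalization: climb each hierarchy.
        generalized.append((CITY_TO_COUNTRY[row["city"]], age_band(row["age"])))
    # Identical generalized tuples are merged and their counts accumulated.
    return Counter(generalized)


relation = induce(people)
print(relation[("France", "under 30")])  # 2
```

Four primitive records collapse into three generalized tuples, each carrying a count of how many original records it represents.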
Market Basket Analysis is one of the best-known examples of data generalization in data mining.
Market Basket Analysis is a method for analyzing the purchases made by a customer in a supermarket.
The idea is to identify the items that a customer buys together. What are the chances
that if a person buys bread, they will also buy butter? This analysis helps businesses target offers
and discounts. Data mining is used to carry out this analysis.
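The bread-and-butter question can be expressed with two standard association-rule measures: support (how often items appear together) and confidence (how often the second item appears given the first). The baskets below are hypothetical, and this sketch only scores one rule rather than mining all frequent itemsets.

```python
# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


print(support({"bread", "butter"}, baskets))       # 0.5
print(confidence({"bread"}, {"butter"}, baskets))  # 2 of the 3 bread baskets contain butter
```

A high confidence for the rule bread -> butter is the kind of signal that would justify a joint promotion on the two items.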
Moreover, Market Basket Analysis is commonly used in business reporting for sales or marketing,
management reporting, business process management (BPM), budgeting and forecasting, financial
reporting, and similar areas. Other sectors, such as agriculture, are beginning to find new ways of using
this analysis as well.
Conclusion
As businesses realize the significance of data, they are continuously finding ways to use and leverage
it to their advantage. As a result, data scientists have become increasingly important to companies
worldwide as they strive to achieve more with data science than ever before. With this, however,
comes the need to protect the privacy of individuals and meet compliance requirements, which brings the
need for data generalization, as well as other data anonymization strategies.