
Unit-IV: Frequent Itemsets and Clustering

1. Mining Frequent Itemsets


Frequent itemset mining is a popular data mining task that involves identifying sets of items that
frequently co-occur in a given dataset. In other words, it involves finding the items that occur
together frequently and then grouping them into sets of items. One way to approach this problem is
by using the Apriori algorithm, which is one of the most widely used algorithms for frequent itemset
mining.
The Apriori algorithm works by iteratively generating candidate itemsets and then checking their
frequency against a minimum support threshold. The algorithm starts by generating all possible
itemsets of size 1 and counting their frequencies in the dataset. The itemsets that meet the minimum
support threshold are then selected as frequent itemsets. The algorithm then proceeds to generate
candidate itemsets of size 2 from the frequent itemsets of size 1 and counts their frequencies. This
process is repeated until no more frequent itemsets can be generated.
However, when dealing with large datasets, this approach can become computationally
expensive due to the potentially large number of candidate itemsets that need to be generated and
counted. Point-wise frequent itemset mining is a more efficient alternative that can reduce the
computational complexity of the Apriori algorithm by exploiting the sparsity of the dataset.
Pointwise frequent itemset mining works by iterating over the transactions in the dataset and
identifying the itemsets that occur in each transaction. For each transaction, the algorithm generates a
bitmap vector where each bit corresponds to an item in the dataset, and its value is set to 1 if the item
occurs in the transaction and 0 otherwise. The algorithm then uses a bitwise AND operation between a
candidate itemset's bit pattern and each transaction's bitmap vector to test whether the itemset occurs
in that transaction, accumulating a support count for the itemset across all transactions. The itemsets
that meet the minimum support threshold are then selected as frequent itemsets.
The advantage of point-wise frequent itemset mining is that it avoids generating candidate
itemsets that are not present in the dataset, thereby reducing the number of itemsets that need to be
generated and counted. Additionally, point-wise frequent itemset mining can be parallelized, making
it suitable for mining large datasets on distributed systems.
In summary, point-wise frequent itemset mining is an efficient alternative to the Apriori
algorithm for frequent itemset mining. It works by iterating over the transactions in the dataset and
identifying the itemsets that occur in each transaction, thereby avoiding the generation of candidate
itemsets that are not present in the dataset.
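To make the bitmap idea concrete, here is a minimal Python sketch; the item vocabulary and transactions
are made up for illustration. Each transaction is encoded as an integer bitmask, and a candidate itemset
counts as present in a transaction when a bitwise AND with the transaction's mask leaves all of the
itemset's bits set.

    # Illustrative only: tiny vocabulary and transactions, not taken from any real dataset.
    items = ["milk", "bread", "butter", "tea"]
    bit_of = {item: 1 << i for i, item in enumerate(items)}

    def to_mask(itemset):
        # Encode a set of items as a bit vector (one bit per item in the vocabulary).
        mask = 0
        for item in itemset:
            mask |= bit_of[item]
        return mask

    transactions = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "tea"}]
    masks = [to_mask(t) for t in transactions]

    def support(candidate):
        target = to_mask(candidate)
        # Bitwise AND: the candidate occurs in a transaction iff all its bits survive the AND.
        return sum((m & target) == target for m in masks) / len(masks)

    print(support({"milk", "bread"}))   # occurs in 2 of 3 transactions, i.e. about 0.67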
2. Market Basket Modeling
Market-based modeling is a technique used in economics and business to analyze and simulate
the behavior of markets, particularly in relation to the supply and demand of goods and services. This
modeling technique involves creating mathematical models that can simulate how different market
participants (consumers, producers, and intermediaries) interact with each other in a market setting.
One of the most common market-based models is the supply and demand model, which assumes that
the price of a good or service is determined by the balance between its supply and demand. In this
model, the price of a good or service will rise if the demand for it exceeds its supply, and will fall if
the supply exceeds the demand.
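As a toy illustration of this balance, the snippet below solves a linear supply and demand model for its
equilibrium price; the coefficients are hypothetical and chosen only to show the arithmetic.

    # Hypothetical linear model: demand Qd = a - b*P, supply Qs = c + d*P.
    a, b = 100.0, 2.0   # demand falls as price rises
    c, d = 10.0, 1.0    # supply rises with price

    equilibrium_price = (a - c) / (b + d)            # price at which Qd == Qs
    equilibrium_quantity = a - b * equilibrium_price

    print(equilibrium_price, equilibrium_quantity)   # 30.0 and 40.0

At any price above 30 in this toy model, supply exceeds demand and the price falls; below 30, demand
exceeds supply and the price rises.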
Another popular market-based model is the game theory model, which is used to analyze how
different participants in a market interact with each other. Game theory models assume that market
participants are rational and act in their own self-interest, and seek to identify the strategies that each
participant is likely to adopt in a given situation.
Market-based models can be used to analyze a wide range of economic phenomena, from the
pricing of individual goods and services to the behavior of entire industries and markets. They can
also be used to test the potential impact of various policies and interventions on the behavior of
markets and market participants.
Overall, market-based modeling is a powerful tool for understanding and predicting the behavior
of markets and the economy as a whole. By creating mathematical models that simulate the behavior
of market participants and the interactions between them, economists and business analysts can gain
valuable insights into the workings of markets, and develop strategies for managing and optimizing
their performance.

3. Apriori Algorithm
The Apriori algorithm is a popular algorithm used in data mining and machine learning to discover
frequent itemsets in large transactional datasets. It was proposed by Agrawal and Srikant in 1994 and
is widely used in association rule mining, market basket analysis, and other data mining applications.
The Apriori algorithm uses a bottom-up approach to generate all frequent itemsets by first identifying
frequent individual items and then using those items to generate larger itemsets. The algorithm works
by performing the following steps:
 First, the algorithm scans the entire dataset to identify all individual items and their frequency of
occurrence. This information is used to generate the initial set of frequent itemsets.
 Next, the algorithm uses a level-wise search strategy to generate larger itemsets by combining
frequent itemsets from the previous level. The algorithm starts with two-itemsets and then
progressively generates larger itemsets until no more frequent itemsets can be found.
 At each level, the algorithm prunes the search space by eliminating itemsets that cannot be
frequent based on the minimum support threshold. This is done using the Apriori principle,
which states that any subset of a frequent itemset must also be frequent.
The algorithm terminates when no more frequent itemsets can be generated or when the
maximum itemset size is reached. Once all frequent itemsets have been identified, the Apriori
algorithm can be used to generate association rules that describe the relationships between different
items in the dataset. An association rule is a statement of the form X -> Y, where X and Y are
disjoint itemsets. The rule indicates that transactions containing the items in X tend to also contain
the items in Y.
The strength of an association rule is measured using two metrics: support and confidence.
Support is the percentage of transactions in the dataset that contain both X and Y, while confidence is
the percentage of transactions that contain Y given that they also contain X.
Overall, the Apriori algorithm is a powerful tool for discovering frequent itemsets and association
rules in large datasets. By identifying patterns and relationships between different items in the
dataset, it can be used to gain valuable insights into consumer behavior, market trends, and other
important business and economic phenomena.
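A minimal pure-Python sketch of the steps above is given below; the baskets, the minimum support of 0.5,
and the helper names are illustrative and not part of any particular library.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {frozenset: support} for all frequent itemsets (illustrative sketch)."""
        transactions = [frozenset(t) for t in transactions]
        n = len(transactions)

        def support(itemset):
            return sum(1 for t in transactions if itemset <= t) / n

        # Level 1: frequent individual items
        frequent = {}
        for item in {i for t in transactions for i in t}:
            s = support(frozenset([item]))
            if s >= min_support:
                frequent[frozenset([item])] = s
        all_frequent = dict(frequent)

        k = 2
        while frequent:
            prev = list(frequent)
            # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}
            # Apriori pruning: every (k-1)-subset of a candidate must itself be frequent
            candidates = {c for c in candidates
                          if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))}
            frequent = {c: support(c) for c in candidates if support(c) >= min_support}
            all_frequent.update(frequent)
            k += 1
        return all_frequent

    baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
               {"bread", "butter"}, {"milk", "bread", "butter"}]
    freq = apriori(baskets, min_support=0.5)
    # Confidence of the rule {milk} -> {bread}: support(milk and bread) / support(milk)
    confidence = freq[frozenset({"milk", "bread"})] / freq[frozenset({"milk"})]   # 1.0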
4. Handling Large Datasets in Main Memory
Handling large datasets in main memory can be a challenging task, as the amount of memory
available on most computer systems is often limited.
However, there are several techniques and strategies that can be used to effectively manage and
analyze large datasets in main memory:
 Use data compression: Data compression techniques can be used to reduce the amount of
memory required to store a dataset. Techniques such as gzip or bzip2 can compress text
data, while binary data can be compressed using libraries like LZ4 or Snappy.
 Use data partitioning: Large datasets can be partitioned into smaller, more manageable
subsets, which can be processed and analyzed in main memory. This can be done using
techniques such as horizontal partitioning, vertical partitioning, or hybrid partitioning.
 Use data sampling: Data sampling can be used to select a representative subset of data for
analysis, without requiring the entire dataset to be loaded into memory. Random sampling,
stratified sampling, and cluster sampling are some of the commonly used sampling
techniques.
 Use in-memory databases: In-memory databases can be used to store large datasets in main
memory for faster querying and analysis. Examples of in-memory databases include Apache
Ignite, SAP HANA, and VoltDB.
 Use parallel processing: Parallel processing techniques can be used to distribute the
processing of large datasets across multiple processors or cores. This can be done using
libraries like Apache Spark, which provides distributed data processing capabilities.
 Use data streaming: Data streaming techniques can be used to process large datasets in real-
time by processing data as it is generated, rather than storing it in memory. Apache Kafka,
Apache Flink, and Apache Storm are some of the popular data streaming platforms.
Overall, effective management of large datasets in main memory requires a combination of data
compression, partitioning, sampling, in-memory databases, parallel processing, and data streaming
techniques. By leveraging these techniques, it is possible to effectively analyze and process large
datasets in main memory without requiring expensive hardware upgrades or specialized software
tools; the chunk-based partitioning idea is sketched below.
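For instance, horizontal partitioning can be approximated by processing a large file in chunks so that
only one chunk is resident in memory at a time. The sketch below assumes pandas is available and a
hypothetical transactions.csv file with an "item" column; both names are placeholders.

    import pandas as pd

    item_totals = {}
    # Read the (hypothetical) file 100,000 rows at a time instead of all at once.
    for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
        counts = chunk["item"].value_counts()        # per-chunk partial aggregate
        for item, n in counts.items():
            item_totals[item] = item_totals.get(item, 0) + int(n)
    # item_totals now holds dataset-wide counts without loading the full file into memory.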

5. Limited Pass Algorithm


A limited pass algorithm is a technique used in data processing and analysis to efficiently process
large datasets with limited memory resources. In a limited pass algorithm, the dataset is processed in
a fixed number of passes or iterations, where each pass involves processing a subset of the data. The
algorithm ensures that each pass is designed to capture the relevant information needed for the
analysis, while minimizing the memory required to store the data.
For example, a limited pass algorithm for processing a large text file could involve reading the file
in chunks or sections, processing each section in memory, and then discarding the processed data
before moving on to the next section. This approach enables the algorithm to handle large datasets
that cannot be loaded entirely into memory.
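A minimal Python sketch of this idea is shown below; the file name is hypothetical, and words split
across chunk boundaries are ignored for simplicity.

    from collections import Counter

    word_counts = Counter()
    with open("large_corpus.txt", "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(1 << 20)            # read roughly 1 MB per pass
            if not chunk:
                break
            word_counts.update(chunk.split())  # process the chunk, then discard it
    print(word_counts.most_common(10))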
Limited pass algorithms are often used in situations where the data cannot be stored in main
memory, or when the processing of the data requires significant computational resources. Examples
of applications that use limited pass algorithms include text processing, machine learning, and data
mining.
While limited pass algorithms can be useful for processing large datasets with limited memory
resources, they can also be less efficient than algorithms that can process the entire dataset in a single
pass. Therefore, it is important to carefully design the algorithm to ensure that it can capture the
relevant information needed for the analysis, while minimizing the number of passes required to
process the data.
6. Counting Frequent Itemsets in a Stream
Counting frequent itemsets in a stream is a problem of finding the most frequent itemsets in a
continuous stream of transactions. This problem is commonly known as the Frequent Itemset Mining
problem. Here are the steps involved in counting frequent itemsets in a stream:
a. Initialize a hash table to store the counts of each itemset. The size of the hash table should be
limited to prevent it from becoming too large.
b. Read each transaction in the stream one at a time.
c. Generate all the possible itemsets from the transaction. This can be done using the Apriori
algorithm, which generates candidate itemsets by combining smaller frequent itemsets.
d. Increment the count of each itemset in the hash table.
e. Prune infrequent itemsets from the hash table. An itemset is infrequent if its count is less than a
predefined threshold.
f. Repeat steps b-e for each transaction in the stream.
g. Output the frequent itemsets that remain in the hash table after processing all the transactions.

The main challenge in counting frequent itemsets in a stream is to keep track of the changing
frequencies of the itemsets as new transactions arrive. This can be done efficiently using the hash
table to store the counts of the itemsets. However, the hash table can become too large if the number
of distinct itemsets is too large. To prevent this, the hash table can be limited in size by using a hash
function that maps each itemset to a fixed number of hash buckets. The size of the hash table can be
adjusted dynamically based on the number of items and transactions in the stream.
Another challenge in counting frequent itemsets in a stream is to choose the threshold for the
minimum count of an itemset to be considered frequent. The threshold should be set high enough to
exclude infrequent itemsets, but low enough to include all the important frequent itemsets. The
threshold can be determined using heuristics or by using machine learning techniques to learn the
optimal threshold from the data.
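The following plain-Python sketch follows steps a-g with a small cap on itemset size so the count table
stays manageable; the stream, the threshold, and the size cap are all illustrative choices.

    from itertools import combinations

    counts = {}             # step a: hash table of itemset -> count
    MIN_COUNT = 2           # step e: pruning threshold (illustrative)
    MAX_SIZE = 2            # cap on itemset size to keep the table small

    def process(transaction):
        # steps c and d: generate itemsets from the transaction and increment their counts
        items = sorted(transaction)
        for k in range(1, MAX_SIZE + 1):
            for itemset in combinations(items, k):
                counts[itemset] = counts.get(itemset, 0) + 1

    def prune():
        # step e: drop itemsets whose count is below the threshold
        for itemset in [i for i, c in counts.items() if c < MIN_COUNT]:
            del counts[itemset]

    stream = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
    for transaction in stream:     # steps b and f: one transaction at a time
        process(transaction)
    prune()
    frequent = dict(counts)        # step g: whatever remains is reported as frequent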
7. Clustering Techniques
Clustering techniques are used to group similar data points together in a dataset based on their
similarity or distance measures. Here are some popular clustering techniques:
7.1 K-Means Clustering: This is a popular clustering algorithm that partitions a dataset into K
clusters based on the mean distance of the data points to their assigned cluster centers. It involves an
iterative process of assigning data points to clusters and updating the cluster centers until
convergence. K-Means is commonly used in image segmentation, marketing, and customer
segmentation.
7.1.1 K-means Clustering algorithm
K-Means clustering is a popular unsupervised machine learning algorithm that partitions a dataset
into k clusters, where k is a pre-defined number of clusters. The algorithm works as follows:
 Initialize the k cluster centroids randomly.
 Assign each data point to the nearest cluster centroid based on its distance.
 Calculate the new cluster centroids based on the mean of all data points assigned to each cluster.
 Repeat the assignment and update steps until the cluster centroids no longer change significantly,
or a maximum number of iterations is reached.
 The distance metric used for the assignment step is typically the Euclidean distance, but other
distance metrics can be used as well.
 The K-Means algorithm aims to minimize the sum of squared distances between each data
point and its assigned cluster centroid. This objective function is known as the within-cluster
sum of squares (WCSS) or the sum of squared errors (SSE).
 To determine the optimal number of clusters, a common approach is to use the elbow method.
This involves plotting the WCSS or SSE against the number of clusters and selecting the
number of clusters at the "elbow" point, where the rate of decrease in WCSS or SSE begins to
level off.
K-Means is a computationally efficient algorithm that can scale to large datasets. It is particularly
useful when the data is high-dimensional and traditional clustering algorithms may be too slow.
However, K-Means requires the number of clusters to be pre-defined and may converge to a
suboptimal solution if the initial cluster centroids are not well chosen. It is also sensitive to non-linear
data and may not work well with such data.
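A compact NumPy sketch of the loop described above is given below; the random two-cluster data and the
choice of k = 2 are illustrative, and empty clusters are not handled for brevity.

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialise the k centroids as randomly chosen data points
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: nearest centroid by Euclidean distance
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):   # stop when centroids settle
                break
            centroids = new_centroids
        wcss = ((X - centroids[labels]) ** 2).sum()      # within-cluster sum of squares
        return labels, centroids, wcss

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    labels, centroids, wcss = kmeans(X, k=2)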
Here are some of its advantages and disadvantages:
Advantages:
 Simple and easy to understand: K-Means is easy to understand and implement, making it a
popular choice for clustering tasks.
 Fast and scalable: K-Means is a computationally efficient algorithm that can scale to large
datasets. It is particularly useful when the data is high-dimensional and traditional clustering
algorithms may be too slow.
 Works well with circular or spherical clusters: K-Means works well with circular or spherical
clusters, making it suitable for datasets that exhibit these types of shapes.
 Provides a clear and interpretable result: K-Means provides a clear and interpretable
clustering result, where each data point is assigned to one of the k clusters.
Disadvantages:
 Requires pre-defined number of clusters: K-Means requires the number of clusters to be
pre-defined, which can be a challenge when the number of clusters is unknown or difficult
to determine.
 Sensitive to initial cluster centers: K-Means is sensitive to the initial placement of cluster
centers and can converge to a suboptimal solution if the initial centers are not well chosen.
 Can converge to a local minimum: K-Means can converge to a local minimum rather than
the global minimum, resulting in a suboptimal clustering solution.
 Not suitable for non-linear data: K-Means assumes that the data is linearly separable and
may not work well with non-linear data.
In summary, K-Means is a simple and fast clustering algorithm that works well with
circular or spherical clusters. However, it requires the number of clusters to be pre-defined
and may converge to a suboptimal solution if the initial cluster centers are not well chosen.
It is also sensitive to non-linear data and may not work well with such data.
7.2 Hierarchical Clustering: This technique builds a hierarchy of clusters by recursively dividing or
merging clusters based on their similarity. It can be agglomerative (bottom-up) or divisive (top-
down). In agglomerative clustering, each data point starts in its own cluster, and then pairs of clusters
are successively merged until all data points belong to a single cluster. Divisive clustering starts with
all data points in a single cluster and recursively divides them into smaller clusters. Hierarchical
clustering is useful in gene expression analysis, social network analysis and image analysis.
7.3 Density-based Clustering: This technique identifies clusters based on the density of data points. It
assumes that clusters are areas of higher density separated by areas of lower density. Density-based
clustering algorithms, such as DBSCAN (Density-Based Spatial Clustering of Applications with
Noise), group together data points that are closely packed together and separate outliers. Density-
based clustering is commonly used in image processing, geospatial data analysis, and anomaly
detection.
7.4 Gaussian Mixture Models: This technique models the distribution of data points using a mixture
of Gaussian probability distributions. Each component of the mixture represents a cluster, and the
algorithm estimates the parameters of the mixture using the Expectation-Maximization algorithm.
Gaussian Mixture Models are commonly used in image segmentation, handwriting recognition, and
speech recognition.
7.5 Spectral Clustering: This technique converts the data points into a graph and then partitions the
graph into clusters based on the eigenvalues and eigenvectors of the graph Laplacian matrix. Spectral
clustering is useful in image segmentation, community detection in social networks, and document
clustering.
Each clustering technique has its own strengths and weaknesses, and the choice of clustering
algorithm depends on the nature of the data, the clustering objective, and the computational resources
available.
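For reference, the sketch below runs the four techniques from 7.2-7.5 using scikit-learn on the same
synthetic blob data; the dataset and all parameter values are illustrative, and DBSCAN's eps in
particular would need tuning on real data.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import AgglomerativeClustering, DBSCAN, SpectralClustering
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    hier = AgglomerativeClustering(n_clusters=3).fit_predict(X)              # 7.2 hierarchical
    dens = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)                     # 7.3 density-based (-1 marks noise)
    gmm = GaussianMixture(n_components=3, random_state=0).fit_predict(X)     # 7.4 mixture model
    spec = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)   # 7.5 spectral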
8. Clustering High Dimensional Data
Clustering high-dimensional data is a challenging task because the distance or similarity measures
used in most clustering algorithms become less meaningful in high-dimensional space. Here are some
techniques for clustering high-dimensional data:
8.1 Dimensionality Reduction
High-dimensional data can be transformed into a lower-dimensional space using dimensionality
reduction techniques, such as Principal Component Analysis (PCA) or t-SNE (t-distributed Stochastic
Neighbor Embedding). Dimensionality reduction can help to reduce the curse of dimensionality and
make the clustering algorithms more effective.
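A short scikit-learn sketch of this pipeline is shown below; the random 200-dimensional data, the 10
retained components, and the 5 clusters are arbitrary illustration values.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 200)                      # hypothetical high-dimensional data
    X_low = PCA(n_components=10).fit_transform(X)      # project 200 dimensions down to 10
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_low)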
8.2 Feature Selection:
Not all features in high-dimensional data are equally informative. Feature selection techniques can be
used to identify the most relevant features for clustering and discard the redundant or noisy features.
This can help to improve the clustering accuracy and reduce the computational cost.
8.3 Subspace Clustering:
Subspace clustering is a clustering technique that identifies clusters in subspaces of the high-
dimensional space. This technique assumes that the data points lie in a union of subspaces, each of
which represents a cluster. Subspace clustering algorithms, such as CLIQUE (CLustering In QUEst),
identify the subspaces and clusters simultaneously.
8.4 Density-Based Clustering:
Density-based clustering algorithms, such as DBSCAN, can be used for clustering high-dimensional
data by defining the density of data points in each dimension. The clustering algorithm identifies
regions of high density in the multidimensional space, which correspond to clusters.
8.5 Ensemble Clustering:
Ensemble clustering combines multiple clustering algorithms or different parameter settings of the
same algorithm to improve the clustering performance. Ensemble clustering can help to reduce the
sensitivity of the clustering results to the choice of algorithm or parameter settings.
8.6 Deep Learning-Based Clustering:
Deep learning-based clustering techniques, such as Deep Embedded Clustering (DEC) and
Autoencoder-based Clustering (AE-Clustering), use neural networks to learn a low-dimensional
representation of high-dimensional data and cluster the data in the reduced space. These techniques
have shown promising results in clustering high-dimensional data in various domains, including
image analysis and gene expression analysis.
Clustering high-dimensional data requires careful consideration of the choice of clustering algorithm,
feature selection or dimensionality reduction technique, and parameter settings. A combination of
different techniques may be required to achieve the best clustering performance.
8.7 CLIQUE and ProCLUS
CLIQUE (CLustering In QUEst) and ProCLUS are two popular subspace clustering algorithms for
high-dimensional data. CLIQUE is a density-based algorithm that works by identifying dense
subspaces in the data. It assumes that clusters exist in subspaces of the data that are dense in at least k
dimensions, where k is a user-defined parameter. The algorithm identifies all possible dense
subspaces by enumerating all combinations of k dimensions and checking if the corresponding
subspaces are dense. It then merges the overlapping subspaces to form clusters. CLIQUE is efficient
for high-dimensional data because it only considers a small number of dimensions at a time.
ProCLUS (PROjective CLUSters) is a subspace clustering algorithm that works by identifying
clusters in a low-dimensional projection of the data. It first selects a random projection matrix and
projects the data onto a lower-dimensional space. It then uses K-Means clustering to cluster the
projected data. The algorithm iteratively refines the projection matrix and re-clusters the data until
convergence. The final clusters are projected back to the original high-dimensional space. ProCLUS
is effective for high-dimensional data because it reduces the dimensionality of the data while
preserving the clustering structure.
Both CLIQUE and ProCLUS are designed to handle high-dimensional data by identifying clusters in
subspaces of the data. They are effective for clustering data that have a natural subspace structure.
However, they may not work well for data that do not have a clear subspace structure or when the
data points are widely spread out in the high-dimensional space. It is important to carefully choose
the appropriate algorithm based on the characteristics of the data and the clustering objectives.
9. Frequent Pattern-based Clustering Methods
Frequent pattern-based clustering methods combine frequent pattern mining with clustering
techniques to identify clusters based on frequent patterns in the data. Here are some examples of
frequent pattern-based clustering methods:
1. Frequent Pattern-based Clustering: is a clustering algorithm that uses frequent pattern mining to
identify clusters in transactional data. The algorithm first identifies frequent itemsets in the data using
Apriori or FP-Growth algorithms. It then constructs a graph where each frequent itemset is a node,
and the edges represent the overlap between the itemsets. The graph is partitioned into clusters using
a graph clustering algorithm. The resulting clusters are then used to assign objects to clusters based
on their membership in the frequent itemsets.
2. Frequent Pattern-based Clustering Method: is a clustering algorithm that uses frequent pattern
mining to identify clusters in high-dimensional data. The algorithm first discretizes the continuous
data into categorical data. It then uses Apriori or FP-Growth algorithms to identify frequent itemsets
in the categorical data. The frequent itemsets are used to construct a binary matrix that represents the
membership of objects in the frequent itemsets. The binary matrix is clustered using a standard
clustering algorithm, such as K-Means or Hierarchical clustering. The resulting clusters are then used
to assign objects to clusters based on their membership in the frequent itemsets.
3. Clustering based on Frequent Pattern Combination: is a clustering algorithm that combines
frequent pattern mining with pattern combination techniques to identify clusters in transactional data.
The algorithm first identifies frequent itemsets in the data using Apriori or FP-Growth algorithms. It
then uses pattern combination techniques, such as Minimum Description Length (MDL) or Bayesian
Information Criterion (BIC), to generate composite patterns from the frequent itemsets. The
composite patterns are then used to construct a graph, which is partitioned into clusters using a graph
clustering algorithm.
Frequent pattern-based clustering methods are effective for identifying clusters based on frequent
patterns in the data. They can be applied to a wide range of data types, including transactional data
and high-dimensional data. However, these methods may suffer from the curse of dimensionality
when applied to high-dimensional data. It is important to carefully select the appropriate frequent
pattern mining and clustering techniques based on the characteristics of the data and the clustering
objectives.
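A rough sketch of the second method is shown below, assuming the mlxtend library is available for the
Apriori step; the transactions, the 0.4 support threshold, and the 2 clusters are illustrative.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori
    from sklearn.cluster import KMeans

    transactions = [["milk", "bread"], ["milk", "bread", "butter"],
                    ["bread", "butter"], ["tea", "sugar"], ["tea", "sugar", "milk"]]

    # One-hot encode the transactions, then mine frequent itemsets with Apriori
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
    itemsets = apriori(onehot, min_support=0.4, use_colnames=True)["itemsets"]

    # Binary membership matrix: rows = transactions, columns = frequent itemsets
    membership = pd.DataFrame(
        {str(s): [s.issubset(set(t)) for t in transactions] for s in itemsets}
    ).astype(int)

    # Cluster the membership matrix with a standard algorithm
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(membership)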

10. Clustering in Non-Euclidean Space


Clustering in non-Euclidean space refers to the clustering of data points that are not represented in
the Euclidean space, such as graphs, time series, or text data. Traditional clustering algorithms, such
as K-Means and Hierarchical clustering, assume that the data points are represented in the Euclidean
space and use distance metrics, such as Euclidean distance or cosine similarity, to measure the
similarity between data points. However, in non-Euclidean spaces, the notion of distance is different,
and distance-based clustering methods may not be suitable. Here are some approaches for clustering
in non-Euclidean spaces:
1. Spectral clustering: Spectral clustering is a popular clustering algorithm that can be applied to data
represented in non-Euclidean spaces, such as graphs or time series. It uses the eigenvalues and eigen-
vectors of the Laplacian matrix of the data to identify clusters. Spectral clustering converts the data
points into a graph representation and then computes the Laplacian matrix of the graph. The eigen-
vectors of the Laplacian matrix are used to embed the data points into a lower-dimensional space,
where clustering is performed using a standard clustering algorithm, such as K-Means or Hierarchical
clustering; a small graph-based sketch appears at the end of this section.
2. Density-Based Spatial Clustering of Applications with Noise (DBSCAN): is a density-based clustering
algorithm that can be applied to data represented in non-Euclidean spaces. It does not require a
Euclidean distance metric, since any suitable similarity or neighbourhood measure can be used, and it
clusters data points based on their density. DBSCAN identifies clusters by
defining two parameters: the minimum number of points required to form a cluster and a radius that
determines the neighbourhood of a point. DBSCAN labels each point as either a core point, a border
point, or a noise point, based on its neighborhood. The core points are used to form clusters.
3. Topic modelling: Topic modelling is a clustering method that can be applied to text data, which is
typically represented in a non-Euclidean space. Topic modeling identifies latent topics in the text data
by analyzing the co-occurrence of words. It represents each document as a distribution over topics,
and each topic as a distribution over words. The resulting topic distribution of each document can be
used to cluster the documents based on their similarity.
Clustering in non-Euclidean spaces requires careful consideration of the appropriate algorithms
and techniques that are suitable for the specific data type. Spectral clustering and DBSCAN are
effective for clustering data represented as graphs or time series, while topic modeling is suitable for
text data. Other approaches, such as manifold learning and kernel methods, can also be used for
clustering in non-Euclidean spaces.
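As a small illustration of the first approach above, the sketch below applies scikit-learn's spectral
clustering to a hand-made adjacency matrix of a six-node graph with two obvious communities; the graph
itself is made up for the example.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    # Adjacency matrix of a toy graph: nodes 0-2 form one community, nodes 3-5 another
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)

    # affinity="precomputed" tells scikit-learn to treat A directly as the similarity graph
    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(A)
    print(labels)   # two groups, e.g. [0 0 0 1 1 1] up to label permutation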

11. Clustering for Streams and Parallelism


Clustering for streams and parallelism are two important considerations for clustering large datasets.
Stream data refers to data that arrives continuously and in real-time, while parallelism refers to the
ability to distribute the clustering task across multiple computing resources. Here are some
approaches for clustering streams and parallelism:
1. Online clustering: Online clustering is a technique that can be applied to streaming data. It updates
the clustering model continuously as new data arrives. Online clustering algorithms, such as BIRCH
and CluStream, are designed to handle data streams and can scale to large datasets. These algo-
rithms incrementally update the cluster model as new data arrives and discard outdated data points to
maintain the cluster model’s accuracy and efficiency; an incremental clustering sketch appears at the
end of this section.
2. Parallel clustering: Parallel clustering refers to the use of multiple computing resources, such as
multiple processors or computing clusters, to speed up the clustering process. Parallel clustering
algorithms, such as K-Means Parallel, Hierarchical Parallel, and DBSCAN Parallel, distribute the
clustering task across multiple computing resources. These algorithms partition the data into smaller
subsets and assign each subset to a separate computing resource. The resulting clusters are then
merged to produce the final clustering result.
3. Distributed clustering: Distributed clustering refers to the use of multiple computing resources that
are distributed across different physical locations, such as different data centers or cloud resources.
Distributed clustering approaches, typically built on frameworks such as MapReduce and Hadoop,
distribute the clustering task
across multiple computing resources and handle data that is too large to fit into a single computing
resource’s memory. These algorithms partition the data into smaller subsets and assign each subset to
a separate computing resource. The resulting clusters are then merged to produce the final clustering
result.
Clustering for streams and parallelism requires careful consideration of the appropriate algorithms
and techniques that are suitable for the specific clustering objectives and data types. Online clustering
is effective for clustering streaming data, while parallel clustering and distributed clustering can speed up the
clustering process for large datasets.
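To illustrate the online case described in point 1 above, the sketch below feeds small batches of
synthetic data to scikit-learn's BIRCH implementation through partial_fit; the batch generator and the
three-cluster layout are made up for the example.

    import numpy as np
    from sklearn.cluster import Birch

    model = Birch(n_clusters=3)
    rng = np.random.default_rng(0)

    def batches(n_batches=100, batch_size=50):
        # Stand-in for a real stream: each batch "arrives" near one of three centres
        for _ in range(n_batches):
            centre = rng.choice([0.0, 5.0, 10.0])
            yield rng.normal(centre, 1.0, size=(batch_size, 2))

    for batch in batches():
        model.partial_fit(batch)       # incrementally update the cluster model

    labels = model.predict(np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]]))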
