Decaying Window Algorithm:
This algorithm allows you to identify the most popular elements (trending, in other words) in
an incoming data stream.
The decaying window algorithm not only tracks the most recurring elements in an incoming
data stream, but also discounts any random spikes or spam requests that might have
boosted an element’s frequency.
In a decaying window, you assign a score/weight to every element of the incoming data
stream. Further, you need to calculate the aggregate sum for each distinct element by
adding all the weights assigned to that element. The element with the highest total score is
listed as trending or the most popular.
In a decaying window algorithm, you assign more weight to newer elements. For a new
element, you first reduce the weight of all the existing elements by a constant factor k and
then assign the new element with a specific weight. The aggregate sum of the decaying
exponential weights can be calculated using the following formula:
S_t = Σ_{i=0}^{t−1} a_{t−i} (1 − c)^i
Here, c is usually a small constant of the order of 10^−6 or 10^−9. Whenever a new element, say
a_{t+1}, arrives in the data stream, you perform the following steps to obtain the updated sum:
1. Multiply the current sum/score by the value (1 − c).
2. Add the weight corresponding to the new element.
In a data stream consisting of various elements, you maintain a separate sum for each
distinct element. For every incoming element, you multiply the sum of all the existing
elements by a value of (1 − c). Further, you add the weight of the incoming element to its
corresponding aggregate sum.
A threshold can be kept so that elements whose weight falls below it are ignored. Finally, the
element with the highest aggregate score is listed as the most popular element.
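The update rule above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation; the function name and the large demonstration value of c (real streams use values around 10^−6) are assumptions for readability.

```python
# Decaying-window trend scores: for each arriving element, every existing
# score is multiplied by (1 - c), then the arriving element's own score
# gains a weight of 1.

def decaying_window(stream, c=0.1):
    """Return the aggregate decayed score for each distinct element."""
    scores = {}
    for element in stream:
        # Step 1: multiply every current sum by (1 - c).
        for key in scores:
            scores[key] *= (1 - c)
        # Step 2: add the new element's weight (here, 1) to its own sum.
        scores[element] = scores.get(element, 0.0) + 1.0
    return scores

scores = decaying_window(["a", "b", "a", "a", "c"], c=0.1)
trending = max(scores, key=scores.get)  # "a" has the highest aggregate score
```

With a realistic c of about 10^−6, a per-element loop over all scores would be too slow; practical implementations only keep elements whose score stays above a threshold, as noted above.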
Advantages of Decaying Window Algorithm:
1. Sudden spikes and spam data are discounted rather than mistaken for trends.
2. Newer elements are given more weight, which keeps the trending output current.
E-Stream: Evolution-Based Technique for Stream Clustering
Basic Concepts and Definitions:
1. Fading decreases the weight of data over time. In a data stream with evolving data, older
data should carry less weight. We decrease the weight of every cluster over time to
achieve a fast, adaptive cluster model. Let λ be the decay rate and t the elapsed time; the
fading function is
f(t) = 2^(−λt)
2. Weight of a cluster is the number of data elements in the cluster, determined according
to the fading function. Initially, each data element has a weight of 1. A cluster can
increase its weight by absorbing incoming data points or by merging with other
clusters.
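The fading function above halves a weight every 1/λ time units. A minimal sketch (the function and parameter names are illustrative):

```python
# Fading function f(t) = 2^(-lambda * t): a cluster's weight decays
# exponentially with the time elapsed since it was last updated.

def fade(weight, decay_rate, elapsed):
    """Decay a cluster weight by the fading function 2^(-lambda * t)."""
    return weight * 2 ** (-decay_rate * elapsed)

# A cluster of 8 points with decay_rate 1 loses half its weight per time unit.
w = fade(8.0, 1.0, 1.0)   # one time unit: 8 -> 4
w = fade(w, 1.0, 2.0)     # two more time units: 4 -> 1
```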
The behaviour of data in a data stream can evolve over time. We can classify this evolution
into five categories:
1. Appearance: A new cluster can appear if there is a sufficiently dense group of data
points in one area. Initially, such elements appear as a group of outliers, but (as more
data appears in the neighbourhood), they are recognized as a cluster.
2. Disappearance: Existing clusters can disappear because the existence of data is
diminished over time. Clusters that contain only old data are faded and eventually
disappear because they do not represent the presence of data.
3. Self-Evolution: Data can change their behaviour, which causes the size or position of a
cluster to evolve. Evolution proceeds faster when the data fade.
4. Merge: A pair of clusters can be merged if their characteristics are very similar. Merged
clusters must cover the behaviour of the pair.
5. Split: A cluster can be split into two smaller clusters if the behaviour inside the cluster is
obviously separated.
E-Stream Algorithm:
In line 1, the algorithm starts by retrieving a new data point. In line 2, it fades all clusters and
deletes those having insufficient weight. Line 3 performs a histogram analysis and cluster
split. Then line 4 checks for overlap clusters and merges them. Line 5 checks the number of
clusters and merges the closest pairs if the cluster count exceeds the limit. Line 6 checks all
clusters whether their status is active. Lines 7-10 find the closest cluster to the incoming
data point. If the distance is less than or equal to radius_factor then the point is assigned to
the cluster, otherwise it is an isolated data point. The flow of control then returns to the top
of the algorithm and waits for a new data point.
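The main loop described above can be sketched as follows. This is a much-simplified, illustrative Python sketch under stated assumptions: clusters are one-dimensional centroid/weight pairs, the histogram-based split, merge, and active-flag steps (lines 3–6) are omitted, and all names and parameter values are assumptions rather than the paper's own.

```python
import math

DECAY_RATE = 0.1        # lambda in f(t) = 2^(-lambda * t); illustrative
REMOVE_THRESHOLD = 0.1  # clusters fading below this weight are deleted
RADIUS_FACTOR = 3.0     # stand-in for the cluster-radius assignment test

clusters = []           # each cluster: {"centroid": float, "weight": float}

def fade_all(elapsed=1.0):
    """Line 2: fade every cluster and delete those with insufficient weight."""
    global clusters
    for c in clusters:
        c["weight"] *= 2 ** (-DECAY_RATE * elapsed)
    clusters = [c for c in clusters if c["weight"] >= REMOVE_THRESHOLD]

def find_closest(x):
    """Lines 7-8: return (distance, cluster) of the closest cluster."""
    best, best_c = math.inf, None
    for c in clusters:
        d = abs(x - c["centroid"])
        if d < best:
            best, best_c = d, c
    return best, best_c

def process(x):
    fade_all()                       # line 2: FadingAll
    # lines 3-6 (CheckSplit, CheckMerge, LimitMaximumCluster,
    # FlagActiveCluster) are omitted in this sketch.
    d, c = find_closest(x)           # lines 7-10
    if c is not None and d <= RADIUS_FACTOR:
        # assign the point to the closest cluster (weighted centroid update)
        total = c["weight"] + 1.0
        c["centroid"] = (c["centroid"] * c["weight"] + x) / total
        c["weight"] = total
    else:
        # isolated data point: start a new cluster
        clusters.append({"centroid": x, "weight": 1.0})

for x in [1.0, 1.5, 10.0, 1.2]:
    process(x)
# points near 1 share one cluster; 10.0 forms its own
```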
FadingAll. The algorithm performs fading of all clusters and deletes the clusters whose
weight is less than remove_threshold.
CheckSplit is used to verify the splitting criteria in each cluster using the histogram. If a
splitting point is found in a cluster, the cluster is split and the index pair of the split clusters
is stored in S.
CheckMerge is an algorithm for merging pairs of similar clusters. This algorithm checks every
pair of clusters and computes the cluster-cluster distance. If the distance is less than
merge_threshold and the merged pair is not in S then merge the pair.
LimitMaximumCluster is used to limit the number of clusters. This algorithm checks
whether the number of clusters exceeds maximum_cluster (an input parameter); if it does,
the closest pairs of clusters are merged until the number of remaining clusters is less than
or equal to the limit.
FlagActiveCluster is used to update each cluster's active status. If the weight of a cluster is
greater than or equal to active_threshold, it is flagged as an active cluster; otherwise, the
flag is cleared.
FindClosestCluster is used to find the distance and index of the closest active cluster for an
incoming data point.
Clique (Clustering in Quest) Algorithm:
Clique is a density-based and grid-based subspace clustering algorithm.
o Grid-based: It discretizes the data space through a grid and estimates the density by
counting the number of points in a grid cell.
o Density-based: A cluster is a maximal set of connected dense units in a subspace.
A unit is dense if the fraction of total data points contained in the unit exceeds
the input model parameter.
Subspace Clustering: A subspace cluster is a set of neighbouring dense cells in an
arbitrary subspace. It also discovers some minimal descriptions of the clusters.
It automatically identifies subspaces of a high-dimensional data space that allow better
clustering than the original space, using the Apriori principle.
Apriori Principle: If a collection of points S is a cluster in a k-dimensional space, then S is also
part of a cluster in every (k − 1)-dimensional projection of that space.
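CLIQUE's first pass can be illustrated with a short sketch: each dimension is discretized into equal-width intervals, points are counted per cell, and a unit is dense when its fraction of points exceeds the threshold. The names and parameter values (xi intervals, tau threshold) are illustrative, and only the one-dimensional pass is shown; the full algorithm then combines dense (k − 1)-dimensional units, Apriori-style, into k-dimensional candidates.

```python
from collections import Counter

def dense_units_1d(points, xi=4, tau=0.25):
    """Find dense 1-D units (dimension, interval) for points in [0, 1)."""
    n = len(points)
    dims = len(points[0])
    counts = Counter()
    for p in points:
        for d in range(dims):
            # map the coordinate to one of xi equal-width grid intervals
            counts[(d, int(p[d] * xi))] += 1
    # a unit is dense if its fraction of all points exceeds tau
    return {unit for unit, cnt in counts.items() if cnt / n > tau}

points = [(0.1, 0.1), (0.15, 0.12), (0.2, 0.9), (0.9, 0.85)]
units = dense_units_1d(points)
```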