[go: up one dir, main page]

0% found this document useful (0 votes)
49 views10 pages

Bda Unit II Lecture1

This document discusses data stream mining and processing techniques. It introduces the stream model where data enters rapidly from external sources and the entire stream cannot be stored. It discusses using limited working storage and archival storage to make calculations about streams using sliding windows. Common applications include mining query streams, click streams, and network packet streams. It provides an example of calculating averages over a sliding window in a stream in constant time per input but requiring the entire window in memory.

Uploaded by

Anju2
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
49 views10 pages

Bda Unit II Lecture1

This document discusses data stream mining and processing techniques. It introduces the stream model where data enters rapidly from external sources and the entire stream cannot be stored. It discusses using limited working storage and archival storage to make calculations about streams using sliding windows. Common applications include mining query streams, click streams, and network packet streams. It provides an example of calculating averages over a sliding window in a stream in constant time per input but requiring the entire window in memory.

Uploaded by

Anju2
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 10

UNIT II: Mining Data Streams

The Stream Model


Sliding Windows
Data Management Vs. Stream Management

 In a DBMS, input is under the control of the


programming staff.
 SQL INSERT commands or bulk loaders.
 Stream management is important when the
input rate is controlled externally.
 Example: Google search queries.

2
The Stream Model
 Input tuples enter at a rapid rate, at one or
more input ports.
 The system cannot store the entire stream
accessibly.
 How do you make critical calculations about the
stream using a limited amount of (primary or
secondary) memory?

3
Two Forms of Query
1. Ad-hoc queries: Normal queries asked one
time about streams.
 Example: What is the maximum value seen so
far in stream S?
2. Standing queries: Queries that are, in
principle, asked about the stream at all
times.
 Example: Report each new maximum value ever
seen in stream S.

4
Ad-Hoc
Queries

. . . 1, 5, 2, 7, 0, 9, 3 Standing
Queries
Output
. . . a, r, v, t, y, h, b
Processor
. . . 0, 0, 1, 0, 1, 1, 0
time

Streams Entering

Limited
Working
Storage Archival
Storage

5
Applications
 Mining query streams.
 Google wants to know what queries are more
frequent today than yesterday.
 Mining click streams.
 Yahoo! wants to know which of its pages are getting
an unusual number of hits in the past hour.
 Often caused by annoyed users clicking on a broken page.
 IP packets can be monitored at a switch.
 Gather information for optimal routing.
 Detect denial-of-service attacks.

6
Sliding Windows
 A useful model of stream processing is that
queries are about a window of length N – the N
most recent elements received.
 Alternative: elements received within a time interval
T.
 Interesting case: N is so large it cannot be
stored in main memory.
 Or, there are so many streams that windows for all
do not fit in main memory.

Reference: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,


http://www.mmds.org
7
qwertyuiopasdfghjklzxcvbnm

qwertyuiopasdfghjklzxcvbnm

qwertyuiopasdfghjklzxcvbnm

qwertyuiopasdfghjklzxcvbnm

Past Future

8
Example: Averages
 Stream of integers, window of size N.
 Standing query: what is the average of the
integers in the window?
 For the first N inputs, sum and count to get the
average.
 Afterward, when a new input i arrives, change the
average by adding (i - j)/N, where j is the oldest
integer in the window before i arrived.
 Good: O(1) time per input.
 Bad: Requires the entire window in main memory.

9
References
 Jure Leskovec, Anand Rajaraman, Jeff Ullman,
Mining of Massive Datasets, Cambridge
University Press, Second Edition, 2014.
 http://mmds.org/

10

You might also like