来源:《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT
Chapter 4 Mining Data Streams
💡 Skip this chapter due to its difficulty and for me, it is hard to understand.
Summary of Chapter 4
- The Stream Data Model: This model assumes data arrives at a processing engine at a rate that makes it infeasible to store everything in active storage. One strategy to dealing with streams is to maintain summaries of the streams, sufficient to answer the expected queries about the data. A second approach is to maintain a sliding window of the most recently
arrived data. - Sampling of Streams: To create a sample of a stream that is usable for a class of queries, we identify a set of key attributes for the stream. By hashing the key of any arriving stream element, we can use the hash value to decide consistently whether all or none of the elements with that key will become part of the sample.
- Bloom Filters: This technique allows us to filter streams so elements that belong to a particular set are allowed through, while most nonmembers are deleted. We use a large bit array, and several hash functions. Members of the selected set are hashed to buckets, which are bits in the array, and those bits are set to 1. To test a stream element for membership, we hash the element to a set of bits using each of the hash functions, and only
accept the element if all these bits are 1. - Counting Distinct Elements: To estimate the number of different elements appearing in a stream, we can hash elements to integers, interpreted as binary numbers. 2 raised to the power that is the longest sequence of 0’s seen in the hash value of any stream element is an estimate of the number of different elements. By using many hash functions and combining these estimates, first by taking averages within groups, and then taking the median of the averages, we get a reliable estimate.
- Moments of Streams: The kth moment of a stream is the sum of the k-th powers of the counts of each element that appears at least once in the stream. The 0th moment is the number of distinct elements, and the 1st moment is the length of the stream.
- Estimating Second Moments: A good estimate for the second moment, or surprise number, is obtained by choosing a random position in the stream, taking twice the number of times this element appears in the stream from that position onward, subtracting 1, and multiplying by the length of the stream. Many random variables of this type can be combined like the estimates for counting the number of distinct elements, to produce a reliable estimate of the second moment.
- Estimating Higher Moments: The technique for second moments works for kth moments as well, as long as we replace the formula 2 x − 1 2x − 1 2x−1 (where x is the number of times the element appears at or after the selected position) by x k − ( x − 1 ) k x^k − (x − 1)^k xk−(x−1)k.
- Estimating the Number of 1’s in a Window: We can estimate the number of 1’s in a window of 0’s and 1’s by grouping the 1’s into buckets. Each bucket has a number of 1’s that is a power of 2; there are one or two buckets of each size, and sizes never decrease as we go back in time. If we record only the position and size of the buckets, we can represent the
contents of a window of size N with O ( l o g 2 2 N ) O(log_2^2 N) O(log22N) space. - Answering Queries About Numbers of 1’s: If we want to know the approximate numbers of 1’s in the most recent k elements of a binary stream, we find the earliest bucket B that is at least partially within the last k positions of the window and estimate the number of 1’s to be the sum of the sizes of each of the more recent buckets plus half the size of B. This estimate can never be off by more that 50% of the true count of 1’s.
- Closer Approximations to the Number of 1’s: By changing the rule for how many buckets of a given size can exist in the representation of a binary window, so that either r or r − 1 of a given size may exist, we can assure that the approximation to the true number of 1’s is never off by more than 1/r.
- Exponentially Decaying Windows: Rather than fixing a window size, we can imagine that the window consists of all the elements that ever arrived in the stream, but with the element that arrived t time units ago weighted by e − c t e^{−ct} e−ct for some time-constant c. Doing so allows us to maintain certain summaries of an exponentially decaying window easily. For instance, the weighted sum of elements can be recomputed, when a new element arrives, by multiplying the old sum by 1 − c and then adding the new element.
- Maintaining Frequent Elements in an Exponentially Decaying Window: We can imagine that each item is represented by a binary stream, where 0 means the item was not the element arriving at a given time, and 1 means that it was. We can find the elements whose sum of their binary stream is at least 1/2. When a new element arrives, multiply all recorded sums by 1 minus the time constant, add 1 to the count of the item that just arrived, and delete from the record any item whose sum has fallen below 1/2.