Count-Min Sketch in Python

Count-Min Sketch is a probabilistic data structure which approximates the frequency of items in a stream of data. It uses little memory while handling massive amounts of data and producing approximations of the answers. In this post, we’ll explore the idea behind the Count-Min Sketch, how it’s implemented in Python, and discuss its uses and drawbacks.

What is Count-Min Sketch?

Count-Min is a probabilistic data structure used to count unique items in a large stream of data. It is used to find an approximate frequency of the events on the streaming data.

The idea behind Count-Min Sketch is to use hash functions and a two-dimensional array (or matrix) to efficiently store the frequency of items. The array is made up of several rows and columns, where a bucket is represented by a column and a hash function by a row. The hash functions identify the locations in the array to increment or get counts when updating or querying the frequency of entries.

Key Operations in Count-Min Sketch:

Initialization: Set the number of rows and columns that you want in the Count-Min Sketch.

Update: To increase an element’s count, hash it through each hash function and update the array’s associated buckets.

Query: Find the lowest count across the related buckets after hashing an element with each hash algorithm to determine its estimated frequency.

Implementation of Count-Min Sketch in Python:

Below is the implementation of Count-Min Sketch in Python:

Python
import hashlib

class CountMinSketch:
    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self.count_matrix = [[0] * cols for _ in range(rows)]
        self.hash_functions = [hashlib.md5, hashlib.sha1, hashlib.sha256] 

    def update(self, element):
        for i, hash_func in enumerate(self.hash_functions):
            hash_value = int(hash_func(element.encode()).hexdigest(), 16)
            bucket_index = hash_value % self.cols
            self.count_matrix[i][bucket_index] += 1

    def query(self, element):
        min_count = float('inf')
        for i, hash_func in enumerate(self.hash_functions):
            hash_value = int(hash_func(element.encode()).hexdigest(), 16)
            bucket_index = hash_value % self.cols
            min_count = min(min_count, self.count_matrix[i][bucket_index])
        return min_count

# Example usage
cms = CountMinSketch(rows=3, cols=10)
data_stream = ["apple", "banana", "apple", "orange", "apple", "banana", "banana"]
for element in data_stream:
    cms.update(element)
print("Frequency of 'apple':", cms.query("apple"))

Output
Frequency of 'apple': 3

Count-Min Sketch is a data structure that provides accurate approximations of element frequencies in massive data streams. Python developers may efficiently address a range of frequency estimation issues by using Count-Min Sketch provided they have a thorough knowledge of its concepts, uses, and limits.


Contact Us