Need for Count-Min Sketch
Since Count-Min Sketch is used to find the frequency of an element, one might think if there is actually a need for such data structure! The answer is Yes. Let us see with the help of an example.
Let us try to solve this frequency estimation problem using a MultiSet Data Structure
Trying MultiSet as an alternative to Count-min sketch
Let’s try to implement this data structure using MultiSet with the below source code and try to find out the issues with this approach.
Java
// Java program to try MultiSet as an // alternative to Count-min sketch import com.google.common.collect.HashMultiset; import com.google.common.collect.Multiset; public class MultiSetDemo { public static void main(String[] args) { Multiset<String> blackListedIPs = HashMultiset.create(); blackListedIPs.add( "192.170.0.1" ); blackListedIPs.add( "75.245.10.1" ); blackListedIPs.add( "10.125.22.20" ); blackListedIPs.add( "192.170.0.1" ); System.out.println( blackListedIPs.count( "192.170.0.1" )); System.out.println( blackListedIPs.count( "10.125.22.20" )); } } |
Output:
Understanding the problem of using MultiSet instead of Count-Min Sketch
Now let’s look at the time and space consumed with this type of approach.
Number of UUIDs |
Insertion Time(ms) |
---|---|
10 |
<25 |
100 |
<25 |
1,000 |
30 |
10,000 |
257 |
100,000 |
1200 |
1,000,000 |
4244 |
Let’s have a look at the memory (space) consumed:
Number of UUIDs |
JVM heap used(MB) |
---|---|
10 |
<2 |
100 |
<2 |
1,000 |
3 |
10,000 |
9 |
100,000 |
39 |
1,000,000 |
234 |
We can easily understand that as data grows, the above approach is consuming a lot of memory and time to process the data.
This can be optimised if we use the Count-Min Sketch.
Count-Min Sketch Data Structure with Implementation
The Count-Min Sketch is a probabilistic data structure and is defined as a simple technique to summarize large amounts of frequency data. Count-min sketch algorithm talks about keeping track of the count of things. i.e, How many times an element is present in the set.
Contact Us