PCY Algorithm in Big Data

PCY was developed by Park, Chen, and Yu. It is used for frequent itemset mining when the dataset is very large.

What is the PCY Algorithm?

The PCY algorithm (Park-Chen-Yu algorithm) is a data mining algorithm used to find frequent itemsets in large datasets. It is an improvement over the Apriori algorithm and was first described in 1995 in the paper “An Effective Hash-Based Algorithm for Mining Association Rules” by Jong Soo Park, Ming-Syan Chen, and Philip S. Yu.

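The PCY algorithm uses hashing to cut down the number of candidate pairs that have to be counted. During the first pass over the data, while the individual items are being counted, every pair in a transaction is also hashed to a bucket and that bucket's counter is incremented. A pair can be frequent only if its bucket count reaches the threshold, so the second pass counts only pairs that are made of frequent items and fall into a frequent bucket.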
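
As a rough illustration of where the hashing fits into the counting, here is a minimal Python sketch of the usual two-pass formulation of PCY for pairs. The names pcy_frequent_pairs, hash_pair and num_buckets are made up for this example, and the sketch assumes the whole dataset fits in a Python list; a real big-data implementation would stream the transactions from disk.

```python
from collections import Counter
from itertools import combinations


def pcy_frequent_pairs(transactions, threshold, num_buckets, hash_pair):
    """Return {pair: count} for pairs whose support is at least `threshold`."""
    # Pass 1: count single items and, for every pair, the bucket it hashes to.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            bucket_counts[hash_pair(i, j) % num_buckets] += 1

    # Between passes: keep frequent items, compress buckets into a bit vector.
    frequent_items = {i for i, c in item_counts.items() if c >= threshold}
    bit_vector = [c >= threshold for c in bucket_counts]

    # Pass 2: count only pairs of frequent items that fall in a frequent bucket.
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket) & frequent_items)
        for i, j in combinations(items, 2):
            if bit_vector[hash_pair(i, j) % num_buckets]:
                pair_counts[(i, j)] += 1

    return {p: c for p, c in pair_counts.items() if c >= threshold}


# Small hypothetical dataset, unrelated to the worked example in this article.
baskets = [{1, 2, 3}, {1, 3, 4}, {1, 3}, {2, 4}]
print(pcy_frequent_pairs(baskets, 2, 10, lambda i, j: i * j))
# {(1, 3): 3}
```

In this small run the pairs (1, 2) and (3, 4) both land in bucket 2, so the bucket reaches the threshold even though neither pair does. The exact count in the second pass is what removes such false positives.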

Example problem solved using PCY algorithm


Apply the PCY algorithm to the following transactions to find the candidate sets (frequent sets), with a minimum support threshold of 3 and the hash function (i*j) mod 10.

       T1 = {1, 2, 3}
       T2 = {2, 3, 4}
       T3 = {3, 4, 5}
       T4 = {4, 5, 6}
       T5 = {1, 3, 5}
       T6 = {2, 4, 6}
       T7 = {1, 3, 4}
       T8 = {2, 4, 5} 
       T9 = {3, 4, 6}
       T10 = {1, 2, 4}
       T11 = {2, 3, 5}
       T12 = {2, 4, 6}


There are several steps that you have to follow to get the Candidate table.

Step 1: Find the frequency of each item and discard any item whose frequency is below the threshold.
Step 2: Going through the transactions one by one, list every possible pair and write its frequency next to it. Note: do not repeat a pair; skip any pair that has already been listed under an earlier transaction.
Step 3: List all pairs whose frequency is at least the threshold and apply the hash function to each of them. The result is the bucket number, which tells us which bucket the pair is placed in.
Step 4: This is the last step. Create a table with the following columns:

  • Bit vector: 1 if the frequency of the candidate pair is greater than or equal to the threshold, otherwise 0. (In this walkthrough it is always 1, because only pairs that meet the threshold are carried over from Step 3.)
  • Bucket number: found in the previous step.
  • Highest support count: the frequency of the candidate pair, found in Step 2.
  • Pairs: the pair that was hashed to this bucket.
  • Candidate set: if the bit vector is 1, the pair is written here as well.


Step 1: Find the frequency of each item and discard any item whose frequency is below the threshold.

   Item        1  2  3  4  5  6
   Frequency   4  7  7  9  5  4

All six items have a frequency of at least 3, so none is discarded.
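
The item counts can be reproduced with a few lines of Python. This is only a sketch: the variable names are chosen for illustration, and the transactions are typed in directly from the problem statement.

```python
from collections import Counter

transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6},
    {1, 3, 5}, {2, 4, 6}, {1, 3, 4}, {2, 4, 5},
    {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {2, 4, 6},
]
THRESHOLD = 3

# Count how many transactions contain each item.
item_counts = Counter(item for t in transactions for item in t)

# Keep only the items that meet the threshold.
frequent_items = sorted(i for i, c in item_counts.items() if c >= THRESHOLD)

print(dict(sorted(item_counts.items())))
# {1: 4, 2: 7, 3: 7, 4: 9, 5: 5, 6: 4}
print(frequent_items)
# [1, 2, 3, 4, 5, 6]
```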

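Step 2: Going through the transactions one by one, list every possible pair and write its frequency next to it. A pair is listed only under the first transaction in which it appears.

   Transaction   New pairs                  Frequency
   T1            (1, 2), (1, 3), (2, 3)     2, 3, 3
   T2            (2, 4), (3, 4)             5, 4
   T3            (3, 5), (4, 5)             3, 3
   T4            (4, 6), (5, 6)             4, 1
   T5            (1, 5)                     1
   T6            (2, 6)                     2
   T7            (1, 4)                     2
   T8            (2, 5)                     2
   T9            (3, 6)                     1

T10, T11 and T12 contain no pair that has not already been listed.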
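
Counting pairs by hand is error-prone, so a short script is a useful cross-check. The sketch below assumes the twelve transactions from the problem statement and uses itertools.combinations to generate the pairs; pair_counts is just an illustrative name.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6},
    {1, 3, 5}, {2, 4, 6}, {1, 3, 4}, {2, 4, 5},
    {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {2, 4, 6},
]

# Sorting each transaction makes every pair come out as (smaller, larger),
# so (2, 3) and (3, 2) are never counted as two different pairs.
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

Each transaction of three items contributes three pairs, so the counts should add up to 36 for the twelve transactions.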

Step 3: List all pairs whose frequency is at least the threshold and apply the hash function to each of them. (It gives us the bucket number.)

Hash Function = (i * j) mod 10

   (1, 3) = (1*3) mod 10 = 3
   (2, 3) = (2*3) mod 10 = 6
   (2, 4) = (2*4) mod 10 = 8
   (3, 4) = (3*4) mod 10 = 2
   (3, 5) = (3*5) mod 10 = 5
   (4, 5) = (4*5) mod 10 = 0
   (4, 6) = (4*6) mod 10 = 4

   Bucket no.   Pair
   0            (4, 5)
   2            (3, 4)
   3            (1, 3)
   4            (4, 6)
   5            (3, 5)
   6            (2, 3)
   8            (2, 4)
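
The bucket assignment is plain modular arithmetic and can be checked with a small helper. In this sketch bucket_of is a hypothetical function name, and the list of pairs is copied from the pairs hashed in this step.

```python
def bucket_of(pair, num_buckets=10):
    """Hash a pair (i, j) to a bucket using (i * j) mod num_buckets."""
    i, j = pair
    return (i * j) % num_buckets


frequent_pairs = [(1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (4, 6)]

# Group the pairs by the bucket they hash to.
buckets = {}
for pair in frequent_pairs:
    buckets.setdefault(bucket_of(pair), []).append(pair)

for number in sorted(buckets):
    print(number, buckets[number])
# 0 [(4, 5)]
# 2 [(3, 4)]
# 3 [(1, 3)]
# 4 [(4, 6)]
# 5 [(3, 5)]
# 6 [(2, 3)]
# 8 [(2, 4)]
```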

Step 4: Prepare candidate set

   Bit Vector   Bucket No.   Highest Support Count   Pairs    Candidate Set
   1            0            3                       (4, 5)   (4, 5)
   1            2            4                       (3, 4)   (3, 4)
   1            3            3                       (1, 3)   (1, 3)
   1            4            4                       (4, 6)   (4, 6)
   1            5            3                       (3, 5)   (3, 5)
   1            6            3                       (2, 3)   (2, 3)
   1            8            5                       (2, 4)   (2, 4)
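
For completeness, the rows of the candidate table can be assembled directly from the transactions. The sketch below follows this walkthrough's simplified layout, in which only pairs that meet the threshold are hashed and each of them ends up in its own bucket; the tuple layout (bit, bucket, support, pair) is an assumption made for the example.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6},
    {1, 3, 5}, {2, 4, 6}, {1, 3, 4}, {2, 4, 5},
    {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {2, 4, 6},
]
THRESHOLD = 3
NUM_BUCKETS = 10

pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

rows = []
for (i, j), support in pair_counts.items():
    bit = int(support >= THRESHOLD)  # 1 when the pair meets the threshold
    if bit:
        rows.append((bit, (i * j) % NUM_BUCKETS, support, (i, j)))

rows.sort(key=lambda row: row[1])  # order the rows by bucket number

print("Bit  Bucket  Support  Pair")
for bit, bucket, support, pair in rows:
    print(f"{bit:<4} {bucket:<7} {support:<8} {pair}")
```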
