How to use Hashing and Dictionary In Python

To start, this script will get a single folder or a list of folders, then through traversing the folder it will find duplicate files. Next, this script will compute a hash for every file present in the folder regardless of their name and are stored in a dictionary manner with hash being the key and path to the file as value. 

  • We have to import os, sys, hashlib libraries.
  • Then script iterates over the files and calls FindDuplicate() function to find duplicates. 
Syntax: FindDuplicate(Path)
Parameter: 
Path: Path to folder having files
Return Type: Dictionary
  • The function FindDuplicate() takes path to file and calls Hash_File() function
  • Then Hash_File() function is used to return HEXdigest of that file. For more info on HEXdigest read here
Syntax: Hash_File(path)
Parameters: 
path: Path of file
Return Type: HEXdigest of file
  • This MD5 Hash is then appended to a dictionary as key with file path as its value. After this,the FindDuplicate() function returns a dictionary in which keys has multiple values .i.e. duplicate files.
  • Now Join_Dictionary() function is called which joins the dictionary returned by FindDuplicate() and an empty dictionary. 
Syntax: Join_Dictionary(dict1,dict2)
Parameters: 
dict1, dict2: Two different dictionaries
Return Type: Dictionary
  • After this, we print the list of files having the same content using results.

Example:

We’re assuming here for example purposes that “text_1.txt”, “text_3.txt”, “text_4.txt” are files having the same content, and “text_2.txt”, “text_5.txt” are files having the same content.

Python3




# Importing Libraries
import os
import sys
from pathlib import Path
import hashlib
  
  
def FindDuplicate(SupFolder):
    
    # Duplic is in format {hash:[names]}
    Duplic = {}
    for file_name in files:
        
        # Path to the file
        path = os.path.join(folders, file_name)
          
        # Calculate hash
        file_hash = Hash_File(path)
          
        # Add or append the file path to Duplic
        if file_hash in Duplic:
            Duplic[file_hash].append(file_name)
        else:
            Duplic[file_hash] = [file_name]
    return Duplic
  
# Joins dictionaries
def Join_Dictionary(dict_1, dict_2):
    for key in dict_2.keys():
        
        # Checks for existing key
        if key in dict_1:
            
            # If present Append
            dict_1[key] = dict_1[key] + dict_2[key]
        else:
            
            # Otherwise Stores
            dict_1[key] = dict_2[key]
  
# Calculates MD5 hash of file
# Returns HEX digest of file
def Hash_File(path):
    
    # Opening file in afile
    afile = open(path, 'rb')
    hasher = hashlib.md5()
    blocksize=65536
    buf = afile.read(blocksize)
      
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    afile.close()
    return hasher.hexdigest()
  
Duplic = {}
folders = Path('path/to/directory')
files = sorted(os.listdir(folders))
for i in files:
    
    # Iterate over the files
    # Find the duplicated files
    # Append them to the Duplic
    Join_Dictionary(Duplic, FindDuplicate(i))
      
# Results store a list of Duplic values
results = list(filter(lambda x: len(x) > 1, Duplic.values()))
if len(results) > 0:
    for result in results:
        for sub_result in result:
            print('\t\t%s' % sub_result)
else:
    print('No duplicates found.')


Output:



Finding Duplicate Files with Python

In this article, we will code a python script to find duplicate files in the file system or inside a particular folder. 

Similar Reads

Method 1: Using Filecmp

The python module filecmp offers functions to compare directories and files. The cmp function compares the files and returns True if they appear identical otherwise False....

Method 2: Using Hashing and Dictionary

...

Contact Us