I found myself in a situation where I had to hash a bunch of data and check for the presence of those hashes in another dataset.
I knew the universe of possible tokens, and it was relatively small, so it wasn't quite a password cracking exercise, though it was akin to one.
I didn't have a password cracking utility like hashcat or John the Ripper within arm's reach, and since I didn't need to chew through hundreds of millions of possible combinations, I went ahead and loaded the data into Pandas.
import pandas as pd
import hashlib
tokens = pd.read_csv('tokens.csv')
hashes = pd.read_csv('hashes.csv')
salt = 'jrgOmxTlrf2rHUk'

def hash_token(token):
    # Encode both the token and the salt so the hash input is a byte string
    return hashlib.sha512(token.encode('utf-8') + salt.encode('utf-8')).hexdigest()

tokens['hash'] = tokens.token.apply(hash_token)
matches = hashes.merge(tokens, on='hash', how='inner')
matches.to_csv(r'matching_hashes.csv', index=False)
This is the simplified form of what I did. It assumes you have two CSV files: one called tokens.csv that contains the source tokens in a column titled token, and another, hashes.csv, which contains a column of hashes titled hash.
Encoding the salt and token with UTF-8 ensures that the inputs to the hashing function are byte strings, and the merge simply combines the two dataframes wherever the hashes match.
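As a quick sanity check, the same pipeline can be run on in-memory toy data instead of CSV files. The tokens and salt below are made up for illustration:

import hashlib
import pandas as pd

salt = 'jrgOmxTlrf2rHUk'  # example salt, same as above

def hash_token(token):
    # Encode token and salt to bytes, hash, and return the hex digest
    return hashlib.sha512(token.encode('utf-8') + salt.encode('utf-8')).hexdigest()

# Toy stand-ins for tokens.csv and hashes.csv
tokens = pd.DataFrame({'token': ['alpha', 'bravo', 'charlie']})
hashes = pd.DataFrame({'hash': [hash_token('bravo'), hash_token('delta')]})

tokens['hash'] = tokens.token.apply(hash_token)
matches = hashes.merge(tokens, on='hash', how='inner')
print(matches.token.tolist())  # only 'bravo' appears in both datasets

The inner merge keeps just the rows whose hashes appear in both frames, and pulls the matching token column along for free.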
I often find myself reaching for a different paradigm, even though merge tends to be the more elegant option here:
matches = hashes[hashes.hash.isin(tokens.hash.values)]
This gets the same result, minus the token column that the merge pulls in from the tokens dataframe.
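On the same kind of toy data (again, made-up tokens and salt), the isin variant looks like this:

import hashlib
import pandas as pd

salt = 'jrgOmxTlrf2rHUk'  # example salt

def hash_token(token):
    # Encode token and salt to bytes, hash, and return the hex digest
    return hashlib.sha512(token.encode('utf-8') + salt.encode('utf-8')).hexdigest()

tokens = pd.DataFrame({'token': ['alpha', 'bravo']})
tokens['hash'] = tokens.token.apply(hash_token)
hashes = pd.DataFrame({'hash': [hash_token('alpha'), hash_token('zulu')]})

# Boolean mask: keep only the rows of hashes whose hash appears in tokens
matches = hashes[hashes.hash.isin(tokens.hash.values)]
# matches retains the 'alpha' row only, and no token column is added

This filters hashes in place rather than joining, which is often all you need when you don't care which token produced each match.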
This method will definitely not be as fast as a real password cracking tool, but in a situation where you already know every token you want to hash, Pandas can be a handy tool at your disposal.