I found myself in a situation where I had to hash a bunch of data and check for the presence of those hashes in another dataset.
I knew the universe of possible tokens, and it was relatively small, so it wasn't quite a password cracking exercise, though it was akin to one.
I didn't have a password cracking utility like hashcat or John the Ripper within arm's reach, and since I didn't need to chew through hundreds of millions of possible combinations, I went ahead and loaded the data into Pandas.
import pandas as pd
import hashlib
tokens = pd.read_csv('tokens.csv')
hashes = pd.read_csv('hashes.csv')
salt = 'jrgOmxTlrf2rHUk'

def hash_token(token):
    # Encode both the token and the salt so the hash input is a byte string
    return hashlib.sha512(token.encode('utf-8') + salt.encode('utf-8')).hexdigest()

tokens['hash'] = tokens.token.apply(hash_token)
matches = hashes.merge(tokens, on='hash', how='inner')
matches.to_csv(r'matching_hashes.csv', index=False)
This is the simplified form of what I did. It assumes you have two CSV files: one called tokens.csv that contains the source tokens in a column titled token, and another, hashes.csv, which contains a column of hashes titled hash.
Encoding the salt and token with UTF-8 ensures that the inputs to the hashing function are byte strings, and the merge simply combines the two dataframes wherever the hashes match.
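As a quick sanity check, the same pipeline can be run on in-memory toy data instead of CSV files. The tokens and salt below are made up for illustration:

import hashlib
import pandas as pd

salt = 'jrgOmxTlrf2rHUk'  # example salt, same as above

def hash_token(token):
    # Encode token and salt to bytes, hash, and return the hex digest
    return hashlib.sha512(token.encode('utf-8') + salt.encode('utf-8')).hexdigest()

# Toy stand-ins for tokens.csv and hashes.csv
tokens = pd.DataFrame({'token': ['alpha', 'bravo', 'charlie']})
hashes = pd.DataFrame({'hash': [hash_token('bravo'), hash_token('delta')]})

tokens['hash'] = tokens.token.apply(hash_token)
matches = hashes.merge(tokens, on='hash', how='inner')
print(matches.token.tolist())  # only 'bravo' appears in both datasets

The inner merge keeps just the rows whose hashes appear in both frames, and pulls the matching token column along for free.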
I often find myself reaching for a different paradigm, even though merge tends to be the more elegant option here:
matches = hashes[hashes.hash.isin(tokens.hash.values)]
This gets the same result, minus the token column that the merge pulls in from the tokens dataframe.
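On the same kind of toy data (again, made-up tokens and salt), the isin variant looks like this:

import hashlib
import pandas as pd

salt = 'jrgOmxTlrf2rHUk'  # example salt

def hash_token(token):
    # Encode token and salt to bytes, hash, and return the hex digest
    return hashlib.sha512(token.encode('utf-8') + salt.encode('utf-8')).hexdigest()

tokens = pd.DataFrame({'token': ['alpha', 'bravo']})
tokens['hash'] = tokens.token.apply(hash_token)
hashes = pd.DataFrame({'hash': [hash_token('alpha'), hash_token('zulu')]})

# Boolean mask: keep only the rows of hashes whose hash appears in tokens
matches = hashes[hashes.hash.isin(tokens.hash.values)]
# matches retains the 'alpha' row only, and no token column is added

This filters hashes in place rather than joining, which is often all you need when you don't care which token produced each match.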
This method will definitely not be as fast as a real password cracking tool, but in a situation where you already know every token you want to hash, Pandas can be a handy tool at your disposal.