My First Letter-Frequency Cryptanalysis
Intro
As part of my effort to get better acquainted with cryptography, I’ve started reading “The Mathematics of Secrets”, a book by Joshua Holden that explains the mathematical concepts behind both ancient and modern cryptography.
In its second chapter (page 31), Holden poses a challenge to the reader - deciphering a message:
CIPHERTEXT = "QBVDLWXTEQGXOKTNGZJQGKXSTRQLYRXJYGJNALRXOTQLSLRKJQFJYGJNGXLKQLYUZGJSXQGXSLQXNQXLVXKOJDVJNNBTKJZBKPXULYUNZXLQXUJYQGXNTYQGXKXQJKXULKQJNQNLQBYLOLKKXSJYQGXNGLUXRSBNXOFULYDSXUGJNSXDNVTYRGXUGJNLEESXLYUESLYYXUQGXNSLTDGQXKBAVBKXJYYBRXYQNQGXKXZLNYBSLRPBAVLQXKJLSOBFNGLEEXYXULSBYDXWXKFSJQQSXZGJSXQGXFRLVXQBMXXKOTQKXVLJYXUQBZGJQXZLNG"
We’re given two pieces of information about the message:
- This is a simple substitution cipher: each plaintext letter is mapped onto a single ciphertext letter (e.g. if “a” is mapped onto “q”, then every “a” in the plaintext appears as “q” in the ciphertext).
- The original message is in English and has no spaces.
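To make the idea concrete, here’s a toy example of a substitution cipher (the shift-by-three mapping below is made up for illustration; it is not the mapping used in the challenge, which is what we have to recover):

```python
# A toy substitution cipher: each plaintext letter is replaced by the letter
# three positions later in the alphabet. Any one-to-one letter mapping works
# the same way -- the challenge's mapping is unknown and must be recovered.
import string

toy_rules = {p: string.ascii_lowercase[(i + 3) % 26]
             for i, p in enumerate(string.ascii_lowercase)}

print(''.join(toy_rules[c] for c in "attack"))  # -> "dwwdfn"
```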
If you feel like giving it a shot without reading how I did it, go ahead :)
Take 1 - Letter Frequencies
Letter frequency analysis is based on the simple fact that not all letters are equal: some occur more frequently than others in normal text. So I set out to look up letter frequencies in English, and found the following data – English letters sorted by their frequency, in descending order:
english_letter_frequencies = ['e', 't', 'a', 'o', 'i', 'n', 's', 'r', 'h', 'd', 'l', 'u', 'c', 'm', 'f', 'y', 'w', 'g', 'p', 'b', 'v', 'k', 'x', 'q', 'j', 'z']
The next step was to map each letter in the ciphertext to its corresponding plaintext letter, based on frequency rank (e.g. if “e” is the most frequent letter in English, and “X” is the most frequent letter in our ciphertext, then “X” likely represents “e”).
from collections import Counter

letter_counts = Counter(CIPHERTEXT)
letters_by_frequency = sorted(letter_counts, key=letter_counts.get, reverse=True) # ciphertext letters sorted by frequency, in descending order
rules = {c: p for c, p in zip(letters_by_frequency, english_letter_frequencies)} # the ciphertext-to-plaintext mapping
>> rules
>> {'X': 'e', 'Q': 't', 'L': 'a', 'G': 'o', 'J': 'i', 'Y': 'n', 'N': 's', 'K': 'r', 'S': 'h', 'B': 'd', 'U': 'l', 'T': 'u', 'R': 'c', 'V': 'm', 'Z': 'f', 'O': 'y', 'D': 'w', 'E': 'g', 'F': 'p', 'A': 'b', 'W': 'v', 'P': 'k', 'M': 'x'}
To perform the actual substitution, I wrote this simple function and tried using it to decipher the message:
def substitute(text, rules_dict):
    return ''.join(rules_dict.get(c, '*') for c in text) # letters with no mapping come out as "*"
>> substitute(CIPHERTEXT, rules)
>> tdmwaveugtoeyrusofitorehuctanceinoisbaceyutahacritpinoisoeartanlfoihetoehatesteameryiwmissdurifdrkelanlsfeatelintoesuntoeretirelartistsatdnayarrehintoesoalechdseyplanwheloishewsmuncoeloisaggheanlghanneltoeshauwoterdbmdreinndcentstoerefasndhackdbmateriahydpsoaggenelahdnweverphitthefoihetoepcametdxeeryutremaineltdfoitefaso
The result was too disappointing to even look for words in. So, with some peer pressure from Iddo, I went on to look at bigrams.
Take 2 - Bigrams (Pun not Intended)
Bigrams are simply two-letter blocks in the text. Just like single letters, bigrams can also be counted and serve as the basis for frequency analysis. For example, in the sentence “I think there’s a problem there”, the most frequent bigram is “th”, with 3 occurrences.
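That count is easy to verify with a few lines of Python (a quick sanity check, not part of the original analysis):

```python
from collections import Counter

sentence = "I think there's a problem there"
letters = [c for c in sentence.lower() if c.isalpha()]  # drop spaces and punctuation
bigrams = Counter(a + b for a, b in zip(letters, letters[1:]))  # overlapping two-letter blocks
print(bigrams.most_common(1))  # -> [('th', 3)]
```

Note that stripping the spaces introduces cross-word bigrams such as “kt” (from “think there”); since our ciphertext has no spaces either, that is exactly the situation we are in.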
The most popular bigrams in English are (in descending order):
th, he, in, er, an, re, nd, at, on, nt, ha, es, st, en, ed, to, it, ou, ea, hi, is, or, ti, as, te, et, ng, of, al, de, se, le, sa, si, ar, ve, ra, ld, ur
and in our ciphertext, these are:
bigram_counts = Counter(CIPHERTEXT[i:i + 2] for i in range(len(CIPHERTEXT)-1))
bigrams_by_frequency = sorted(bigram_counts, key=bigram_counts.get, reverse=True)
>> bigrams_by_frequency
>> ['GX', 'QG', 'XU', 'KX', 'GJ', 'XK', 'LY', 'JY', 'JN', 'SX', 'QX', 'NG', 'SL', 'QB', 'JQ', 'XL', 'XQ', 'LQ', 'XN', 'UL', 'YQ', 'XO', 'XS' ...]
Head Scratching & Manual Work
The most popular bigram in our text is “GX”. But according to my letter frequency analysis (take 1), this is translated into “oe” in plaintext, which is highly improbable given that “oe” is quite rare in English.
In addition:
- the two most frequent bigrams in English are “th” and “he”;
- the two most frequent bigrams in the ciphertext are “GX” and “QG”.

The letter frequency analysis mapped “X” to “e” and “Q” to “t”. If this mapping is correct, then “G” should be mapped to “h”.
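This kind of cross-check can be sketched in code: overlay the top English bigrams onto the top ciphertext bigrams and collect the new single-letter mappings they force (a small illustration of the reasoning, not code from the original analysis):

```python
# Given the partial mapping X->e and Q->t, align the two most frequent
# ciphertext bigrams with "he" and "th" (paired in the order that matches the
# already-mapped letters) and record which plaintext letters the remaining
# unmapped ciphertext letters would have to take.
partial = {'X': 'e', 'Q': 't'}
alignments = [('GX', 'he'), ('QG', 'th')]

candidates = {}
for cipher_bigram, english_bigram in alignments:
    for c, p in zip(cipher_bigram, english_bigram):
        if c not in partial:
            candidates.setdefault(c, set()).add(p)

print(candidates)  # -> {'G': {'h'}}: both bigrams agree that G maps to h
```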
With this small fix, I performed substitution again:
rules['G'] = 'h'
del rules['S'] # remove the rule that originally mapped to "h"
>> substitute(CIPHERTEXT, rules)
>> tdmwaveugtheyrushfithre*uctanceinhisbaceyuta*acritpinhisheartanlfhi*ethe*atesteameryiwmissdurifdrkelanlsfeatelinthesuntheretirelartistsatdnayarre*intheshalec*dseyplanw*elhis*ewsmunchelhisagg*eanlg*annelthes*auwhterdbmdreinndcentstherefasnd*ackdbmateria*ydpshaggenela*dnweverp*itt*efhi*ethepcametdxeeryutremaineltdfhitefash
This time I stared longer at the result. Its ending, “fhitefash”, made me think that this “f” should be a “w”, rendering “whitewash” instead.
rules['Z'] = 'w' # "Z" originally went to "f", it should translate to "w"
del rules['D'] # remove the rule that originally went to "w"
>> substitute(CIPHERTEXT, rules)
>> tdm*aveugtheyrushwithre*uctanceinhisbaceyuta*acritpinhisheartanlwhi*ethe*atesteameryi*missduriwdrkelanlsweatelinthesuntheretirelartistsatdnayarre*intheshalec*dseyplan**elhis*e*smunchelhisagg*eanlg*annelthes*au*hterdbmdreinndcentstherewasnd*ackdbmateria*ydpshaggenela*dn*everp*itt*ewhi*ethepcametdxeeryutremaineltdwhitewash
Yet again, a substring from the end of the plaintext gave me more clues. “everp*itt*ewhi*ethepcame” looked to me like “every little while they came”, which meant that:
- whatever maps to “p” should map to “y” instead;
- a missing letter in my rules dictionary (which I found was “S”) should go to “l”.
rules['F'] = 'y' # "F" originally went to "p", should go to "y"
rules['S'] = 'l' # "S" is the ciphertext letter where all the "l"s should appear
del rules['O'] # remove the rule that originally went to "y"
>> substitute(CIPHERTEXT, rules)
>> tdm*aveugthe*rushwithreluctanceinhisbace*utalacrityinhisheartanlwhilethelatesteamer*i*missduriwdrkelanlsweatelinthesuntheretirelartistsatdna*arrelintheshalecldse*ylan*lelhisle*smunchelhisaggleanlglanneltheslau*hterdbmdreinndcentstherewasndlackdbmaterial*dyshaggenelaldn*everylittlewhiletheycametdxeer*utremaineltdwhitewash
Enough is Enough - Spoiler ⚠️
At that point, the deciphered text was not yet perfect but contained enough readable parts to perform a Google search. “with reluctance in his”, “alacrity in his heart” and “while the late steamer” are a few examples of such substrings, which eventually brought me to Mark Twain’s Tom Sawyer:
Tom gave up the brush with reluctance in his face, but alacrity in his heart. And while the late steamer Big Missouri worked and sweated in the sun, the retired artist sat on a barrel in the shade close by, dangled his legs, munched his apple, and planned the slaughter of more innocents. There was no lack of material; boys happened along every little while; they came to jeer, but remained to whitewash.
Final Solution
final_rules = {'X': 'e', 'Q': 't', 'L': 'a', 'G': 'h', 'J': 'i', 'Y': 'n', 'N': 's', 'K': 'r', 'S': 'l', 'B': 'o',
'U': 'd', 'T': 'u', 'R': 'c', 'V': 'm', 'Z': 'w', 'O': 'b', 'D': 'g', 'E': 'p', 'F': 'y', 'A': 'f',
'W': 'v', 'P': 'k', 'M': 'x'}
>> substitute(CIPHERTEXT, final_rules)
>> tomgaveupthebrushwithreluctanceinhisfacebutalacrityinhisheartandwhilethelatesteamerbigmissouriworkedandsweatedinthesuntheretiredartistsatonabarrelintheshadeclosebydangledhislegsmunchedhisappleandplannedtheslaughterofmoreinnocentstherewasnolackofmaterialboyshappenedalongeverylittlewhiletheycametoxeerbutremainedtowhitewash
As an anecdote, the initial letter frequency analysis produced a substitution rule which differed in 10 mappings from the correct one. (One artifact even survives into the final output: “toxeer” should read “to jeer”; the ciphertext letter “M” appears just once, so frequency analysis had no real chance with it.)
>> sum(v != final_rules[k] for k, v in rules.items())
>> 10