Decrypting the Two-time Pad

Problem statement

You have intercepted two encrypted messages, ciphertext-1 and ciphertext-2, encoded in the following 46-character alphabet:

[space]ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.?,-:;'()

The messages were encrypted with a one-time pad: a sequence of randomly drawn characters of the same length and alphabet as the plaintexts. The encryption algorithm is to add the code of a plaintext character with the code of the corresponding pad character, modulo 46. To illustrate, if the plaintext is "ITA SOFTWARE" and the pad is "9KS;UENFN068" then the ciphertext is ")4T;-TTZ.1E-".

plaintext	`ITA SOFTWARE`	`9`	`20`	`1`	`0`	`19`	`15`	`6`	`20`	`23`	`1`	`18`	`5`
pad	`9KS;UENFN068`	`36`	`11`	`19`	`42`	`21`	`5`	`14`	`6`	`14`	`27`	`33`	`35`
ciphertext	`)4T;-TTZ.1E-`	`45`	`31`	`20`	`42`	`40`	`20`	`20`	`26`	`37`	`28`	`5`	`40`

However, the sender made the critical mistake of using the same one-time pad for both messages, compromising their security. Write a program that takes the two ciphertexts and produces a guess at the two plaintexts. It should produce two plaintext messages of the same length as the ciphertexts, hopefully containing many of the correct words in the correct positions. It is unlikely your program will get all of the words correct, since there is not enough information in the ciphertexts to uniquely determine the plaintexts. Strive for a quality of decryption sufficient for a human reader—perhaps with the aid of a web search—to identify the original texts, which happen to be excerpts from classic works of English literature.

Your program may use English text or word lists or dictionaries of your choice to train on, for example to gather tables of letter or word frequencies, but you should not rely on any portion of the messages being in your training material.

Solution

The fundamental observation is this. Given plaintexts P1 and P2, one-time pad K, and respective ciphertexts C1 and C2, so that P1+K = C1 (mod 46) and P2+K = C2 (mod 46), then P2 = P1+(C2−C1) (mod 46). In other words, P1 and P2 have a fixed relationship. Our assumption is that if a guess at a reasonable value for P1 results in a reasonable value for P2 (or vice versa), then it is likely that the guess is correct.

To make "reasonable," high-likelihood guesses we use a probabilistic language model based on N-gram distributions and Witten-Bell smoothing.

An N-gram distribution defines conditional probabilities of letter occurrences relative to prefixes of length N−1. For example, the value of the 2-gram "SN" is the conditional probability that an "N" occurs (out of the universe of possibilities, i.e., out of our 46-letter alphabet) after an "S". Conditional probabilities can then be multiplied to approximate the total and true probability of seeing a word (or any other sequence of letters). For example, if the longest N-grams at our disposal are 3-grams, then P("SNOWY") is approximated using 1-, 2-, and 3-grams as P("S")*P("N"|"S")*P("O"|"SN")*P("W"|"NO")*P("Y"|"OW"). (As will be seen below, our algorithm is such that we need N-gram probabilities for only one value of N, not for a range of lengths.)

To avoid zero probabilities, and on general principle, we employ Witten-Bell smoothing (or "discounting"). Smoothing is applied to each prefix separately. Suppose that S occurrences of a prefix were observed in the training corpus, and out of a vocabulary size of M (46 in our case), T distinct letters were observed to follow the prefix at least once while Z = M−T letters were never observed. If the count for observed letter L is C (so that the sum of such counts equals S), then the conditional probability for L is discounted from C/S to C/(S+T). The count for each unobserved letter is (1/Z)*T/(S+T). If the prefix was never observed, all probabilities are set to 1/Z.

Our overall search strategy is bounded breadth-first search in which we build up solutions to the plaintexts in parallel ("solution pairs") incrementally from beginning to end. Given a set of partial solution pairs, each of which has an associated probability, we append every possible letter to every solution pair, thus increasing the length of each solution pair by one letter (and thus increasing the size of the solution set by a factor of 46). We then retain only the 1,000 highest-probability solution pairs. The output is the single, highest-probability complete solution pair.

By appending a letter to a solution pair we really mean appending letters to each plaintext solution in the pair, where the letters are related as determined by the difference in the ciphertexts at that point. The probability of a solution pair is the product of the probabilities of each plaintext solution. The probability of a plaintext solution (stated more accurately, the probability that results from adding a letter to a plaintext solution) is the product of the solution's prior probability and the N-gram conditional probability of the last N letters. For example, suppose that we are using 3-grams and that a plaintext solution has prior probability V and 2-letter suffix "TH". Then the probability that results from adding letter "E" to the solution is V*P("E"|"TH"). If the other solution in the pair has prior probability W and suffix "AN", and if adding letter "E" to the first solution implies adding letter "Y" to the second solution, then the overall probability of the solution pair is given by V*P("E"|"TH")*W*P("Y"|"AN").

For given N, the search is seeded using every N-gram observed in the training corpus as the first N letters of the first plaintext solution.

Implementation

N-grams of all lengths up to a maximum value for N are stored in a single trie data structure. The root node (which is purely structural) is at level 0; N-grams are stored at level N. The path to a node defines the N-gram's prefix; the node itself stores a letter, a conditional probability that the letter occurs after the prefix, and an "escape probability" that a letter unobserved in the training corpus follows the N-gram. For example:

root, N/A, escape=P(unobserved)
  |
  +---"A", P("A"), escape=P(unobserved|"A")
  |     |
  |     +---"B", P("B"|"A"), escape=P(unobserved|"AB")
  |
  +---"Q", P("Q"), escape=P(unobserved|"Q")

Note that, to avoid underflow, instead of storing probabilities we actually store negative logarithms of probabilities; instead of multiplying probabilities we add them; and instead of maximizing probabilities we minimize the negative logarithms.

We created N-gram conditional probability distributions (technically, maximum likelihood estimators) by training on the following corpus, all downloaded from Project Gutenberg:

Adventures of Huckleberry Finn (#76)
The Adventures of Sherlock Holmes (#1661)
Alice's Adventures in Wonderland (#11)
Moby Dick (#15)
Pride and Prejudice (#1342)
War and Peace (#2600)

Corpus text was uppercased and characters outside the 46-letter problem alphabet were converted to spaces. Sequences of spaces were then collapsed to single spaces.

Results

The decryptions using different N-gram lengths are shown below. The complete and accurate solution was found by Googling for phrases that appear in the 6-gram decryption. It was easily determined that the plaintexts are excerpts from The Mayor of Casterbridge by Thomas Hardy and The Voyage of the Beagle by Charles Darwin, respectively.

Watching the solutions slowly appear was nothing short of magical. Out of all possible messages that might have been sent, which messages were actually sent? Only the most likely!