Dynamic Programming #4 – Edit Distance – Part One

So just how close is s1 = "Vladimir Putin" and s2 = "Donald Trump" really?

My claim is that the distance between them is 12 steps.

What am I talking about?

This post is all about the edit distance between two strings. The edit distance is the minimum number of transformation operations that can turn one string into the other. The edit distance is also called the Levenshtein Distance named after Vladimir Levenshtein who sadly died just a few months ago.

As you can imagine there are many types of transformations you can do to a string but in the edit distance case there are three specific operations that can be applied to single character positions – let’s say at position i in the string:

  1. Insert a character
  2. Delete a character
  3. Substitute a character

So we have three actions at each position we can perform. We also assume that each operation costs exactly the same (one unit) – but it’s easy to change this depending on the specific problem being considered.

My first three posts on dynamic programming have laid out all the basics we need to work out how to compute the edit distance between two strings efficiently. I strongly recommend reading them first because we will be using all the machinery to get to our algorithm! After this post, I’m hoping you will come away with a good appreciation of the real power of applied dynamic programming too.

Let’s recall the four step Dynamic Programming process:

  1. Identify the recursive structure (“optimal substructure”)
  2. Define the States and Actions
  3. Define the memoisation strategy
  4. Optimise the iteration ordering

Steps one and two are the hard ones. And there is usually some trial and error before getting that flowing. Steps three and four are much easier but are the key to getting the algorithm performing efficiently.

Recursive Structure

I think the best way to explain the process is with a small example. Let’s say we want to find the distance between two strings:

  • s1 = "THORN" and
  • s2 = "ROSE"

These two strings are small enough that we can see that if we align the two strings at the letter “O” then by inspection we can see the edit distance is 4. The third row gives an operation to perform at each position – note: “Nop” is a no-operation –
a virtual operation that means don’t do anything. The fourth row is the cost of each operation – so the row sum gives the edit distance.

s1: T H O R N
s2: R O S E
Op: Sub Del Nop Sub Sub
Cost: 1 1 0 1 1

How does this help?

Let’s assume this table is an optimal representation of edit transformations (it is). Consider the prefix (first four columns) and the last column separately. The cost of the optimal transformation between “THORN” and “R*OSE” is the cost of optimal transformations between “THOR” and “R*OS” plus one.

Note: the “*” is just a placeholder for the fact that the two strings are different lengths and some alignment needs to occur.

Like many string algorithms the recursive formulation separates the prefix from the suffix – which in our case is the last column. If you’re a bit lost stick with it – it will be come clearer after you get through the next section.

States and Actions

We’ve already mentioned our set of action operations:

$$A=\{Ins,Del,Sub,Nop\}$$

The State definition is more interesting. The key observation in the previous section is that we separate prefixes of both strings in the recursive formulation. Let’s call both strings \(X\) and \(Y\). So  \(X[1..i]\) denotes the prefix of \(X\) consisting of the first \(i\) characters. Similarly, \(Y[1..j]\) is the prefix of \(Y\) containing the first \(j\) characters.

So state consists of a tuple of the two prefixes:

$$\mathrm{state}: (X[1\ldots i],Y[1\ldots j])$$

Let’s define the length of \(X\) to be \(M\) characters long. Similarly, the length of \(Y\) is \(N\) characters.

Let’s give our optimal edit distance function the following signature:

\(\mathrm{edit}(X[1\ldots M],Y[1\ldots N]) \mapsto \mathbb{N}\)

 

Remember: When we are developing our dynamic programming algorithms they are a mapping from state to value. We need to do this to capture and express mathematically the recursive formulation (“optimal substructure”) in the problem. Also, it allows us to think in terms of a graphical model of the problem: two states (“vertices”) are connected (“edge”) if a single action/operation performed at one state transform it into the second state. You will see this explicitly later.

Finally, let’s denote the null/empty string as \(\emptyset\).

Write It Down

It’s a bit of work but we’re getting there. We want to write down mathematically a recurrence formulae for the edit distance:

First, assume \(X\) and \(Y\) are not null.

Given a specific state \((X[1\ldots i],Y[1\ldots j]) \), there are three possibilities we need to consider:

  1. \(X\) is longer than \(Y\):
    Action: insert a character 
  2. \(X\) is shorter than \(Y\):
    Action: delete a character
  3. \(X\) is the same length as \(Y\):
    Action: either substitute a character or it’s a no-operation.

Each of these possibilities have three different recurrence formulaes for the edit distance:

  1. Insert: \(\mathrm{edit}(X[1\ldots M],Y[1\ldots N]) = 1 +\mathrm{edit}(X[1 \ldots M-1],Y[1\ldots N]) \)
  2. Delete: \(\mathrm{edit}(X[1\ldots M],Y[1 \ldots N]) = 1 +\mathrm{edit}(X[1\ldots M],Y[1\ldots N-1]) \)
  3. Substitute or Nop: \(\mathrm{edit}(X[1\ldots M],Y[1\ldots N]) = \mathbb{I}(X[M] \neq Y[N]) +\mathrm{edit}(X[1\ldots M-1],Y[1\ldots N-1]) \)

Note: \(\mathbb{I}\) is the indicator function – it equals one if the predicate inside is True and zero if the predicate is False. Here it simply tests if a character substitution is required and adds a unit cost if so.

We can now define the edit distance as the following recursion:

\( \mathrm{edit}(X[1\ldots M],Y[1\ldots N])  = \min \begin{cases}
1 + \mathrm{edit}(X[1 \ldots M-1],Y[1\ldots N])\\
1 + \mathrm{edit}(X[1\ldots M],Y[1\ldots N-1])\\
\mathbb{I}(X[M] \neq Y[N]) + \mathrm{edit}(X[1\ldots M-1],Y[1\ldots N-1])
\end{cases} \)

Base Cases

So how do we stop recursing?

There are three separate stopping conditions:

  1. \(X \equiv \emptyset \)
    Action: Insert \(N\) characters to make \(Y\)
  2. \(Y \equiv \emptyset \)
    Action: Insert \(M\) characters to make \(X\)
  3. \(X \equiv \emptyset \mathrm \, {and} \, Y \equiv \emptyset\)
    Action: Define the distance to be zero.

Ok… it’s getting tedious writing out all the indexations for the all strings. Because we are dealing with prefixes let’s just agree to the following convention: it will make our code and maths look simpler!

Define: $$ \mathrm{edit}(X[1 \ldots i], Y[1 \ldots j]) \equiv \mathrm{edit}(i,j)$$

So our recursion with the base cases included looks like this:

\( \mathrm{edit}(i,j)  = \begin{cases}
i & \text{if } j = 0\\
j & \text{if } i = 0 \\
\min \begin{cases}
\mathrm{edit(i-1,j)} + 1\\
\mathrm{edit(i,j-1)} + 1\\
\mathrm{edit(i-1,j-1)} + \mathbb{I}(X[i] \neq Y[j])
\end{cases} & \text{otherwise}
\end{cases} \)

Having got this far I reckon it’s a good place to pause because it’s getting quite long for a single blog post – and then continue in “Edit Distance – Part Two”. We will obviously look at the next two steps in the Dynamic Programming methodology – Memoisation and Iteration Ordering. And of course – some Python code and examples!