1 s2.0 S0306437915001611 Main
1 s2.0 S0306437915001611 Main
Information Systems
journal homepage: www.elsevier.com/locate/infosys
a r t i c l e i n f o abstract
Article history: Hierarchical data are often modelled as trees. An interesting query identifies pairs of
Received 12 August 2014 similar trees. The standard approach to tree similarity is the tree edit distance, which has
Received in revised form successfully been applied in a wide range of applications. In terms of runtime, the state-
25 May 2015
of-the-art algorithm for the tree edit distance is RTED, which is guaranteed to be fast
Accepted 2 August 2015
independent of the tree shape. Unfortunately, this algorithm requires up to twice the
Available online 28 August 2015
memory of its competitors. The memory is quadratic in the tree size and is a bottleneck for
Keywords: the tree edit distance computation.
Tree edit distance In this paper we present a new, memory efficient algorithm for the tree edit distance,
Similarity search
AP-TED (All Path Tree Edit Distance). Our algorithm runs at least as fast as RTED without
Approximate matching
trading in memory efficiency. This is achieved by releasing memory early during the first
step of the algorithm, which computes a decomposition strategy for the actual distance
computation. We show the correctness of our approach and prove an upper bound for the
memory usage. The strategy computed by AP-TED is optimal in the class of all-path
strategies, which subsumes the class of LRH strategies used in RTED. We further present
the AP-TED þ algorithm, which requires less computational effort for very small subtrees
and improves the runtime of the distance computation. Our experimental evaluation
confirms the low memory requirements and the runtime efficiency of our approach.
& 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Data with hierarchical dependencies are often modelled as trees. Tree data appear in many applications, ranging from
hierarchical data formats like JSON or XML to merger trees in astrophysics [33]. An interesting query computes the similarity
between two trees. The standard measure for tree similarity is the tree edit distance, which is defined as the minimum-cost
sequence of node edit operations that transform one tree into another. The tree edit distance has been successfully applied
in bioinformatics (e.g., to find similarities between RNA secondary structures [1,29], neuronal cells [21], or glycan structures
[3]), in image analysis [7], pattern recognition [25], melody recognition [19], natural language processing [28], information
extraction [12,23], and document retrieval [22], and has received considerable attention from the database community
[5,8–11,16–18,26,27].
n
Corresponding author. Tel.: þ 43 66280446348.
E-mail addresses: mateusz.pawlik@sbg.ac.at (M. Pawlik), nikolaus.augsten@sbg.ac.at (N. Augsten).
http://dx.doi.org/10.1016/j.is.2015.08.004
0306-4379/& 2015 Elsevier Ltd. All rights reserved.
158 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
Fig. 1. Strategy computation requires more memory than actual distance computation. (a) Two-step algorithm for tree edit distance and (b) strategy vs.
distance computation.
The fastest algorithms for the tree edit distance (TED) decompose the input trees into smaller subtrees and use dynamic
programming to build the overall solution from the subtree solutions. The key difference between various TED algorithms is
the decomposition strategy, which has a major impact on the runtime. Early attempts to compute TED [13,24,37] use a
hard-coded strategy, which disregards or only partially considers the shape of the input trees. This may lead to very poor
strategies and asymptotic runtime differences of up to a polynomial degree. The most recent development is the Robust Tree
Edit Distance (RTED) algorithm [30], which operates in two steps (cf. Fig. 1(a)). In the first step, a decomposition strategy is
computed. The strategy adapts to the input trees and is shown to be optimal among all previously proposed strategies. The
actual distance computation is done in the second step, which executes the strategy.
In terms of runtime, the overhead for the strategy computation in RTED is small compared to the gain due to the better
strategy. Unfortunately, this does not hold for the main memory consumption. Fig. 1(b) shows the memory usage for two
example trees (perfect binary trees) of 8191 nodes: the strategy computation requires 1.1 GB of RAM, while the execution of
the strategy (i.e., the actual distance computation) requires only 0.55 GB. Thus, for large instances, the strategy computation
is the bottleneck and the fallback is a hard-coded strategy. This is undesirable since the gain of a good strategy grows
with the instance size. Reducing the memory requirements of the strategy computation affects the maximum tree size that
can be processed. This is crucial especially for large trees like abstract syntax trees of source code repositories [15,20]
(Emacs: 4 10k nodes and MythTV: 4 50k nodes) or merger trees in astrophysics1 [33].
In this paper we propose the AP-TED algorithm, which solves the memory problem of the strategy computation. This is
achieved by computing the strategy bottom-up using dynamic programming and releasing part of the memorization tables
early. We prove that our algorithm requires at most 1/3 of the memory that is needed by RTED's strategy computation [30].
As a result, the memory cost of the strategy computation is never above the cost of the distance computation. Our extensive
experimental evaluation on various tree shapes, which require very different strategies, confirms our analytic memory
bound and shows that our algorithm is often much better than its theoretical upper bound. For some tree shapes, it even
runs in linear space, while the RTED strategy algorithm always requires quadratic space.
In addition to reducing the memory usage, AP-TED computes the optimum in a larger class of strategies than RTED.
Strategies are expressed by root-leaf paths that guide the decomposition of the input trees. A path decomposes a tree into
subtrees by deleting nodes and edges on a root-leaf path. Each resulting subtree is recursively decomposed by a new root-
leaf path. RTED computes the optimal LRH strategy. An LRH strategy considers only left, right, and heavy paths. The left
(right) root-leaf path connects each parent with its first (last) child; the heavy path connects the parent with the rightmost
child that roots the largest subtree. AP-TED considers all root-leaf paths and is not limited to left, right, and heavy paths.
Thus, our strategy is at least as good as the strategies used by RTED. To the best of our knowledge, this is the first algorithm
to compute the optimal all-path strategy. The runtime complexity of our strategy algorithm is Oðn2 Þ as for the RTED strategy.
This result is surprising since in each recursive step we need to consider a linear number of paths compared to only three
paths (left, right, and heavy) in the RTED strategy. Our empirical evaluation suggests that in practice our strategy algorithm
is even slightly faster than the RTED strategy algorithm since it allocates less memory.
On the distance computation side, we observe that a large number of subproblems that result from the tree decom-
positions are very small trees with one or two nodes only. We show that a significant boost can be achieved by treating
these cases separately. We introduce the AP-TED þ algorithm, which leverages that fact and achieves runtime improvements
of more than 50% in some cases.
Summarizing, the contributions of this paper are the following:
Memory efficiency. We substantially reduce the memory requirements w.r.t. previous strategy computation algorithms by
traversing the trees bottom-up and systematically releasing memory early. The resulting AP-TED algorithm always
consumes less memory for the strategy computation than for the actual distance computation and thus breaks the
1
Accessible at http://www.mpa-garching.mpg.de/millennium/.
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 159
bottleneck of previous algorithms. (We show the correctness of our approach and prove an upper bound for the memory
usage.)
Optimal all-path strategy. The decomposition strategy used by AP-TED is optimal in the class of all-path strategies. This
class generalizes LRH strategies and contains all strategies of previous TED algorithms. Although our strategy algorithm
must consider more paths, it is as efficient as the strategy algorithm in RTED (quadratic in the input size).
New single-path functions. We develop AP-TED þ , which leverages two new single-path functions to compute the distance
of subtree pairs when one of the subtrees is small. This case occurs frequently during the decomposition process. Our new
single-path functions run in linear time and at most linear space, which substantially improves over the single-path
functions ΔL , ΔR , and ΔI used in RTED [30]. To take full advantage of the new functions, we integrate them into the strategy
computation to obtain better strategies. Our experiments confirm the significant runtime improvement.
The paper is structured as follows. Section 2 sets the stage for our discussion of strategy algorithms. In Section 3 we
define the problem, and we present our AP-TED algorithm in Section 4. The memory efficient implementation of the
strategy computation in AP-TED is discussed in Section 5. The AP-TED þ algorithm is presented in Section 6. We treat related
work in Section 7, experimentally evaluate our solution in Section 8, and conclude in Section 9.
2.1. Notation
We follow the notation of [30] when possible. A tree F is a directed, acyclic, connected graph with nodes N(F) and edges
EðFÞ D NðFÞ NðFÞ, where each node has at most one incoming edge. Each node has a label, which is not necessarily unique
within the tree. The nodes of a tree F are strictly and totally ordered such that (a) v 4w for any edge ðv; wÞ A EðFÞ, and (b) for
any two nodes f ; g, if f o g and f is not a descendant of g, then f og 0 for all descendants g 0 of g. The tree traversal that visits all
nodes in ascending order is the postorder traversal.
In an edge (v,w), node v is the parent and w is the child, pðwÞ ¼ v. A node with no parent is a root node, a node without
children is a leaf.
Fv is the subtree rooted in node v of F iff Fv is a tree, NðF v Þ ¼ fx: x ¼ v or x is a descendant of v in Fg, and EðF v Þ DEðFÞ. A path
γ in F is a subtree of F in which each node has at most one child. By γ ðFÞ we denote the set of all root-leaf paths in F. The left
path γ L ðFÞ A γ ðFÞ recursively connects a parent to its leftmost child. Similarly, the right path γ R ðFÞ A γ ðFÞ connects a parent to
its rightmost child. The heavy path connects a parent to the child which roots the rightmost largest subtree. An inner path is
a path in γ ðFÞ which is neither left nor right.
We use the following short notation: By jFj ¼ jNðFÞj we denote the size of F, we write v A F for v A NðFÞ. F γ is the set of
relevant subtrees of tree F with respect to the path γ obtained by removing path γ from F: F γ ¼ fF v : (xðx A γ 4 v2 = γ 4 pðxÞ ¼
pðvÞÞg.
Example 1. The nodes of tree F in Fig. 2 are NðFÞ ¼ fv1 ; v2 ; v3 ; v4 ; v5 ; v6 ; v7 ; v8 ; v9 ; v10 ; v11 ; v12 ; v13 g, the edges are
EðFÞ ¼ fðv13 ; v4 Þ, ðv13 ; v10 Þ; ðv13 ; v12 Þ; ðv4 ; v1 Þ; ðv4 ; v3 Þ; ðv3 ; v2 Þ; ðv10 ; v5 Þ; ðv10 ; v9 Þ; ðv9 ; v8 Þ; ðv8 ; v6 Þ; ðv8 ; v7 Þ; ðv12 ; v11 Þg, the node labels
are shown in italics in the figure. The root of F is v13 , and jFj ¼ 13. F v8 with nodes NðF v8 Þ ¼ fv6 ; v7 ; v8 g and edges
EðF v8 Þ ¼ fðv8 ; v6 Þ; ðv8 ; v7 Þg is a subtree of F. The path γ that connects v13 and v7 is the heavy path in F, and
F γ ¼ fF v4 ; F v5 ; F v6 ; F v12 g. γ L ðFÞ is the path from v13 to v1 and γ R ðFÞ is the path from v13 to v11. The paths from v13 to v2, v5, v6,
and v7, respectively, are the inner paths. The postorder traversal visits the nodes of F in the following order:
v1 ; v2 ; v3 ; v4 ; v5 ; v6 ; v7 ; v8 ; v9 ; v10 ; v11 ; v12 ; v13 .
F
v13 , a
v4 , b v10 , f v12 , l
v1 , c v3 , d v5 , g v9 , h v11 , m
v2 , e v8 , i
v6 , j v7 , k
Fig. 2. Example tree.
160 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
The tree edit distance is the minimum-cost sequence of node edit operations that transforms tree F into G. The standard
edit operations are: delete a node, insert a node, and rename the label of a node. The state-of-the-art algorithms compute
the tree edit distance by implementing a well-known recursive solution [34] using dynamic programming [13,30,34,37].
They devise so-called decomposition strategies to decompose the input trees into smaller subtree pairs for which the distance
is computed first. The results of smaller subproblems are used to compute the distances of larger problems. Pawlik and
Augsten [30] introduce path strategies as a subset of all possible decomposition strategies and propose a two-step process for
the TED computation: (a) strategy computation and (b) distance computation (cf. Fig. 1(a)).
(a) In the first step, a path strategy is computed, which maps each pair of subtrees ðF v ; Gw Þ of two input trees F and G to a
root-leaf path in either Fv or Gw. RTED considers LRH strategies which allow left, right, and heavy paths only. The resulting
strategy is optimal in the sense that it minimizes the number of subproblems (i.e., the number of distances) that must be
computed. In this paper we consider all-path strategies which, in contrast to LRH strategies, do not have limitation on the
path type.
Definition 1. An all-path strategy S for two trees F and G maps each pair of subtrees ðF v ; Gw Þ, v A F, w A G, to a root-leaf path γ
in one of the subtrees, such that γ A γ ðF v Þ [ γ ðGw Þ [30].
(b) In the second step, the tree edit distance is computed. First, the input trees are decomposed into relevant subtree pairs
according to the strategy. Then, the corresponding distances are computed using so-called single-path functions. The single-
path function updates a distance matrix D, which stores distances between subtree pairs and is filled during the distance
computation [30].
Definition 2. Given are two subtrees Fv, Gw, a path γ A γ ðF v Þ, and the distances between all relevant subtrees of Fv with
respect to γ and all subtrees of Gw, i.e., between each pair ðF v0 ; Gw0 Þ: F v0 A F v γ 4 w0 A Gw . A single-path function ΔðF v ; Gw Þ
computes the distances between all subtrees rooted on path γ in Fv and every subtree of Gw, i.e., between all pairs
ðF v0 ; Gw0 Þ: v0 A γ 4 w0 A Gw .
Existing algorithms for single-path functions differ in the path type which they can process (ΔL : left path, ΔR : right path,
and ΔI : inner path), their cost (number of subproblems that are computed), and memory usage [30,32]. Depending on the
strategy, different single-path functions must be used, resulting in a different overall cost for computing the tree edit
distance. Although ΔI can processes any inner path, RTED considers only heavy paths. In this paper, we always assume the
improved version of ΔI as presented in [32], which requires less memory than the version in [30].
3. Problem definition
As outlined in previous sections, the path strategies introduced by Pawlik and Augsten [30] generalize all state-of-the-art
algorithms for computing the tree edit distance. They consider the class of LRH strategies and show optimality. However,
LRH strategies limit the paths to be left, right, or heavy. We observe that allowing all paths leads to less expensive strategies.
Another drawback of the RTED algorithm is the fact that the computation of the optimal strategy requires more space than
executing the strategy, i.e., computing the actual tree edit distance. This limits the maximum size of the tree instances which
can be processed in given memory. We moreover observe that many subtrees resulting from the optimal strategy are small.
This is true for any input. The single-path functions used in RTED are inefficient in such cases and better solutions should be
devised.
Based on the issues in RTED, our goal is an algorithm with the following properties:
Memory efficient. The strategy computation should not be a main memory bottleneck. The optimal strategy must be
computed and stored within the memory bounds of the distance computation.
Allowing all paths in the strategy. The strategies should not be limited to only left, right, and heavy paths. The algorithm for
the optimal strategy should consider all paths and run in Oðn2 Þ time.
Fast for small subtrees. Executing the expensive single-path functions of RTED for small subtrees is a significant overhead
for the entire algorithm. Whenever possible, the distances of small subtrees should be computed efficiently.
4. AP-TED algorithm
Until now, only LRH strategies have been considered in literature [13,24,30,37]. They are limited to left, right and heavy
paths only. LRH strategies are only a fraction of all possible path strategies. There may exist non-LRH path strategies that
lead to better solutions. In principle, all possible path strategies must be checked for the best result. In this section we
present AP-TED, a new algorithm that computes the tree edit distance with the optimal all-path strategy. The core of AP-TED
is a new algorithm to compute the optimal all-path strategy efficiently in small memory. Then, similar to RTED, the strategy
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 161
is executed using single-path functions. We develop a recursive formula that computes the cost of the optimal all-path
strategy and an algorithm which implements the formula efficiently. In Section 5 we present techniques to improve the
memory requirements of the strategy computation in AP-TED.
The tree edit distance is computed by executing a sequence of so-called single-path functions [30]. The sequence is
defined by the strategy. The cost of computing the distance is the cost sum of executing all single-path functions specified by
the strategy. Depending on the path type (left, right, and inner), different single-path functions must be used. Each single-
path function has a different cost. To compute the distance with the lowest cost, all possible path strategies must be
checked.
The cost formula in Fig. 3 performs an exhaustive search in the space of all possible path strategies. At each recursive
step, either F or G is decomposed with a path from γ ðFÞ or γ ðGÞ, respectively. For each of the choices, the cost of the
resulting relevant subtrees must be computed to find the strategy with the minimal cost.
Given a path, the cost is the sum of the optimal costs for the relevant subtrees with respect to the path plus the cost of
executing the single-path function. For each path type we use the single-path function with the lowest processing cost
(number of subproblems which must be computed). For left and right paths we use the costs of ΔL and ΔR , respectively. Let
γ I ðFÞ ¼ fγ A γ ðFÞ: γ a γ L ðFÞ 4 γ aγ R ðFÞg be the set of all inner paths of tree F. For inner paths we use the cost of ΔI [30,32]. We
denote the cost of a single-path function by jΔL=R=I j.
Theorem 1. Given the single path functions ΔL , ΔR , and ΔI , the cost formula in Fig. 3 computes the cost of the optimal all-path
strategy.
Proof. The proof is by induction. Base case: Both trees are single nodes: jΔðF; GÞj ¼ 1 for all path types [30] and costðF; GÞ ¼ 1.
Inductive hypothesis: costðF 0 ; GÞ: F 0 A F γ, γ A γ ðFÞ and costðG0 ; FÞ: G0 A G γ, γ A γ ðGÞ are optimal. The cost for a path γ in F
consists of the execution cost for the respective single path function (jΔL=R=I ðF; GÞj if γ is a left/right/inner path) and the cost
sum of the relevant subtrees F 0 A F γ. The cost computation for a path in G is analogous. All paths in F and G are considered,
and we pick the path with the minimum cost, thus costðF; GÞ is optimal.□
In this section we present the AP-TED strategy algorithm (Algorithm 1), which computes the optimal all-path strategy.
The algorithm implements the all-path cost formula in Fig. 3 and is as efficient as the RTED strategy algorithm in terms of
runtime, but requires significantly less memory. Our early deallocation technique is implemented in lines 6, 7, and 38, and is
described in detail in Section 5.
For the all-path strategy, similarly to the RTED strategy, we compute the optimal path for each pair of subtrees. Different
from [30], however, at each step we have to evaluate a linear number of paths in both trees instead of only left, right, and
heavy. This makes the computation of the optimal all-path strategy much more challenging. A straightforward extension of
the RTED strategy algorithm, which evaluates a linear number of paths for each pair of subtrees, requires Oðn3 Þ time. This is
prohibitive because the strategy computation must not increase the overall complexity of the tree edit distance compu-
tation, which is Oðn2 Þ in the best case.
AP-TED strategy is similar to RTED strategy in that the sums in the cost formula are computed incrementally in a bottom-
up fashion and are stored in the cost arrays. The cost arrays of size jFjjGj are LF, RF, and IF (jFjjGj is the array size without our
early deallocation technique). LF, RF, and IF store the final cost values for each pair ðF v ; Gw Þ; v A F; w A G, for left, right, and best
inner path in Fv, respectively:
P
LF ½v½w ¼ F 0 A F v γ L ðF v Þ costðF
0
; Gw Þ,
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 163
p
c
Fig. 4. Subtree Fp, its children subtrees – solid line, and path subtrees – dotted line.
P
RF ½v½w ¼ F A F γ ðF Þ costðF 0 ; Gw Þ,
0
P v
R
v
IF ½v½w ¼ minγ A γ ðF Þ f F A F γ costðF 0 ; Gw Þg.
0 I
v
0
v
0
The cost arrays of size jGj are LG, RG, and IG. They store the cost sums between all relevant subtrees of G w.r.t. the corre-
sponding path and a specific subtree Fv:
P
LG ½w ¼ G A G γ ðG Þ costðG0 ; F v Þ,
0
P w
L
w
RG ½w ¼ G A G γ ðG Þ costðG0 ; F v Þ,
0
P w
R
w
IG ½w ¼ minγ A γ ðG Þ f G A G γ costðG0 ; F v Þg.
0 I
w
0
w
0
The algorithm iterates over each pair of subtrees ðF v ; Gw Þ in postorder of the nodes; v and w are postorder IDs of nodes in
F and G, respectively. The cost sums stored in the cost arrays are used to compute the minimal cost cmin and the corre-
sponding path γmin (lines 10–16). The cost sums for the parent nodes, i.e., for the pairs ðF pðvÞ ; Gw Þ and ðF v ; GpðwÞ Þ, are updated
þ
in the lines 17–36 (a’b stands for a’a þ b). Finally, the best path, γmin, is stored in the strategy array STR (line 37) which
encodes the optimal strategy.
The challenge is to implement lines 3 and 6 of the cost formula in Fig. 3, which require to find the best inner path in F and
G, respectively. While there is a single heavy path, there can be many inner paths. Our solution is to rewrite lines 3 and 6 of
the cost formula such that they can be computed incrementally while processing the children of a node with a constant time
operation for each child. To simplify the discussion, we focus on line 3, which computes the best inner path for tree F.
The key to our solution is the following observation. The relevant subtrees for a given path γ 0 A γ ðF p Þ in subtree Fp fall into
two disjoint categories of subtrees shown in Fig. 4:
(1) Children subtrees. All relevant subtrees rooted in the children of p, except the subtree rooted in child c0 , which is the child
on path γ 0 .
(2) Path subtrees. All relevant subtrees in F c0 for path γ″, which is the sub-path of γ 0 , i.e., a root-leaf path of F c0 .
We distinguish children and path subtrees and rewrite the cost sum as follows:
Ap Bc0 ;γ″
C c0
X zfflfflfflfflfflfflfflfflfflfflfflfflffl}|fflfflfflfflfflfflfflfflfflfflfflfflffl{
X zfflfflfflfflfflfflfflffl}|fflfflfflfflfflfflfflffl{ zfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl
X ffl}|fflfflfflfflfflfflfflfflfflfflfflfflfflfflffl ffl{
costðF 0 ; GÞ ¼ costðF c ; GÞ costðF c0 ; GÞ þ costðF ″ ; GÞ ð1Þ
F 0 A F p γ0 c A chðpÞ F ″ A F c0 γ ″
|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}
cost for children subtrees
cost for path subtrees
where p is a node in F, ch(p) denotes the children of p, γ A γ ðF p Þ, c0 is the child of p on path γ 0 (i.e., c0 A chðpÞ and c0 A γ 0 ), γ″ is
0
the sub-path of γ 0 that traverses F c0 (i.e., Nðγ 0 Þ⧹Nðγ″Þ ¼ fpg and Eðγ 0 Þ⧹Eðγ″Þ ¼ fðp; c0 Þg).
Using Eq. (1), we express RF ½pðvÞ½w and I F ½pðvÞ½w by formulas that are useful to compute these values incrementally. Let
P P
cr(p) denote the rightmost child of p, Ap ¼ c A chðpÞ costðF c ; GÞ, Bc0 ;γ″ ¼ F″ A F c0 γ″ costðF″; GÞ, C c0 ¼ costðF c0 ; GÞ:
RF ½pðvÞ½w ¼ BpðvÞ;γ R ðF pðvÞ Þ ¼ ApðvÞ C cr ðpðvÞÞ þ Bcr ðpðvÞÞ;γ R ðF cr ðpðvÞÞ Þ ¼ ApðvÞ C cr ðpðvÞÞ þ RF ½cr ðpðvÞÞ½w ð2Þ
I F ½pðvÞ½w ¼ min fBpðvÞ;γ 0 g ¼ ApðvÞ þ min f C c þ min fBc;γ ″ gg ¼ ApðvÞ þ min f C c þI F ½c½wg ð3Þ
γ 0 A γ I ðF pðvÞ Þ c A chðpðvÞÞ γ ″ A γ I ðF c Þ c A chðpðvÞÞ
Eqs. (2) and (3) are used in Algorithm 1 to incrementally compute the cost sums for the parent node p(v) while traversing
its children. The postorder traversal of the nodes ensures that we process all children of p(v) before node p(v) itself. The
child cr ðpðvÞÞ is traversed last. In Eq. (2), we compute the cost sum for the relevant subtrees w.r.t. the right path in F pðvÞ ,
which is stored in RF ½pðvÞ½w. We use the previously computed sum stored in RF ½cr ðpðvÞÞ½w. C cr ðpðvÞÞ is the value cmin for the
node cr ðpðvÞÞ, i.e., the minimal cost of computing the distance for the subtree pair ðF cr ðpðvÞÞ ; Gw Þ. The sum ApðvÞ is computed
incrementally by summing up the values cmin for each child of p(v) (cf. line 18). ApðvÞ is stored temporarily in RF ½pðvÞ½w until
the last child (rightmost child) is processed.
164 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
In Eq. (3), for each child of node p(v), we compute the value ctmp ¼ C c þ I F ½c½w, which is the sum of (a) a negative cost
cmin for the particular child c and (b) the cost sum of the best inner path in Fc stored in I F ½c½w. While traversing the children
of p(v), we compute the minimal value ctmp and store it in I F ½pðvÞ½w (cf. lines 20–21). The minimum is found when the last
child is traversed, i.e., the node cr ðpðvÞÞ. Then, RF ½pðvÞ½w and I F ½pðvÞ½w are updated with the correct cost values w.r.t. Eqs.
(2) and (3), respectively (cf. lines 24–25).
While traversing the children, the best inner paths in Fv and Gw seen so far are stored in the strategy array STR and in the
array Pathw , respectively, e.g., STR½pðvÞ½w stores the best inner path which corresponds to the minimal cost sum stored in
I F ½pðvÞ½w (line 22), Pathw ½pðwÞ stores the best inner path of the minimal cost sum stored in I G ½pðwÞ (line 32). After
traversing the last child, STR½pðvÞ½w and Pathw ½pðwÞ store the best inner paths corresponding to the minimal cost of
computing the distance between the subtrees ðF pðvÞ ; Gw Þ and ðF v ; GpðwÞ Þ, respectively. While traversing the nodes p(v) and
p(w), the best inner path is compared to the respective left and right paths, and the overall best path is stored in the strategy
array STR (line 37).
With the above analysis we show the correctness of the following theorems.
Theorem 2. AP-TED strategy (Algorithm 1) is correct and computes the optimal path strategy.
Proof. We show by induction that the cost arrays store correct values. Base case: For each pair of leaf nodes (v,w) the cost
arrays are zero (lines 6 and 9). This is correct due to F v γ F ¼ Gw γ G ¼ ∅, where γ F A γ ðFÞ and γ G A γ ðGÞ. Inductive hypothesis:
The values in the cost arrays for all children of node v are correct. We show that after processing all children, the values for v
are correct. We show this for the array IF. The proof for LF ; RF ; LG ; RG ; I G is similar.
With Eq. (3), for node v and its children ch(v),
I F ½v½w ¼ Av þ min f C c þI F ½c½wg:
c A chðvÞ
While processing the children of v we compute the summands as follows: (a) Av is computed by summing up the cost
cmin ¼ C c for every child c A chðvÞ. (b) The minimal value ctmp ¼ C c þ I F ½c½v is computed among all children of v. Due to the
induction hypothesis the value I F ½c½v is correct for every node c A chðvÞ. Thus, the value stored in I F ½v½w is correct.
The strategy array STR maps each pair of subtrees to a root-leaf path and thus is a strategy by Definition 1. Only the
minimum-cost paths are stored in the strategy array, thus STR stores the optimal strategy.□
Theorem 3. Time and space complexity of the AP-TED strategy algorithm are Oðn2 Þ, where n ¼ maxfjFj; jGjg.
Proof. Algorithm 1 iterates over each pair of subtrees ðF v ; Gw Þ, v A F; w A G, hence the innermost loop is executed jFjjGj times.
Inside the innermost loop a constant number of arithmetic operations and array lookups is performed. The costs of the
single-path functions in lines 10–15 are precomputed in OðjFj þ jGjÞ time and space. Four arrays of size jFjjGj and five arrays
of size jGj are used. Thus, the overall time and space complexity is OðjFjjGjÞ.□
The main memory requirement is a bottleneck of the tree edit distance computation. The strategy computation in RTED
exceeds the memory needed for executing the strategy. Our AP-TED strategy algorithm reduces the memory usage by at
least 2/3 and never uses more memory than the execution of the strategy. We achieve that by decreasing the maximum size
of the data structures used for strategy computation.
The asymptotic space complexity is quadratic for both computing the strategy and executing it (cf. Fig. 1(a)). However,
due to different constants and depending on the strategy, the strategy computation may use twice the memory of the
distance computation. This is undesirable since the strategy computation becomes a memory bottleneck of the overall
algorithm.
We analyse the memory usage of the strategy and distance computation steps. For the moment we ignore the lines in
Algorithm 1 that perform early deallocation (lines 6, 7, and 38). Then, the memory requirement of the strategy computation
is dominated by three cost arrays LF ; RF ; I F , and the strategy array STR, which take 4jFjjGj space in total.
The memory of the distance computation depends on the strategy. The tree edit distance is computed by single-path
functions which process a single path from the strategy. Different path types require a different amount of memory.
Independent of the strategy, a distance matrix of size jFjjGj is used.
(a) If only left and/or right paths are used in the strategy, the single path functions require only one array of size jFjjGj. Thus
2jFjjGj space is needed in total.
(b) If an inner path is involved in the strategy, an array of size jFjjGj and an array of size jGj2 are used, amounting to
2jFjjGj þjGj2 space in total.
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 165
We observe that the distance computation always uses less memory than the strategy. Thus, the strategy computation is the
limiting factor for the maximum tree size that can be processed in given memory.
The goal is to use less memory for the strategy computation than for the distance computation. The key observation is
that some of the rows in the cost arrays are no longer needed once they have been read. We develop an early deallocation
technique to minimize the size of the cost arrays.
The AP-TED strategy algorithm uses three cost arrays of quadratic size (LF, RF, IF), where each row corresponds to a node
v A F. The outermost loop (line 5) iterates over the nodes in F in postorder. In each iteration only two rows of the cost arrays are
accessed: the row of node v is read and the row of its parent p(v) is updated. Once the row of node v has been read it will not
be accessed later and thus can be deallocated. In our analysis we count the maximum number of concurrently needed rows.
We observe that a row which corresponds to a leaf node stores only zeros. Thus we store a single read-only row for all
leaf nodes in all cost arrays and call it leafRow (cf. line 6).
We define a set of four operations that we need on the rows of a cost array.
The same operations are synchronously applied to the rows of all cost arrays (LF, RF, IF). To simplify the discussion, we
consider a single cost array and sum up in the end.
AP-TED strategy (Algorithm 1) dynamically allocates and deallocates rows in the cost arrays. Lines 6, 7, and 38 implement
our technique. At each iteration step, i.e., for each node v A F in postorder, the following steps are performed on a cost array:
if v is not root of F
if row for p(v) does not exist: allocate(p(v)) (line 7)
read(v) (lines 10–15)
update(p(v)) (lines 17–36)
if v is not a leaf: deallocate(v) (line 38)
Depending on the position of a node in the tree, a specific sequence of operations must be performed. Fig. 5 shows the
possible sequences and assigns them to the respective node types.
Example 2. We study the number of concurrently need rows for the example tree in Fig. 6. The table in the figure shows the
operations performed for every node of the example tree and the rows stored in memory after each step. The subscript
numbers of the nodes represent their postorder positions. We iterate over the nodes in postorder, i.e., v1 ; v2 ; v3 ; v4 ; v5 ; v6 .
Depending on the node type, different operations are performed that lead to the following operation sequences: v1 S1,
v2 S1, v3 S3, v4 S2, v5 S4, v6 S0.
The cost array for the example tree has six rows (one for each node). From the table in Fig. 6 we see that a maximum of
only three rows (leafRow, rows for v3 and v4) need to be stored at the same time.
We now consider the general case and count the number of rows that must be kept in memory at the same time. Each of
the operation sequences allocates and/or deallocates rows. With sðvÞ A fS0; S1; S2; S3; S4g we denote the operation sequence
required for node v, with d(S), S A fS0; S1; S2; S3; S4g, we denote the difference between the number of allocated and deal-
located rows by a particular operation sequence S, i.e., dðS0Þ ¼ 0, dðS1Þ ¼ 1, dðS2Þ ¼ 0, dðS3Þ ¼ 1, dðS4Þ ¼ 0. We count the
maximum number of concurrently needed rows during the postorder traversal of tree F with the following formula:
( )
X
cnrðFÞ ¼ 1 þ max dðsðwÞÞ
vAF
w A F;w o v
S0 =
leaf non-leaf
S1 = allocate(p(v)), read( leaf Row), update( p(v))
leftmost child S1 S2
S2 = allocate(p(v)), read( v), update( p(v)), deallocate(v) not leftmost child S4 S3
S3 = read(v), update( p(v)), deallocate(v) root S0
S4 = read(leaf Row ), update( p(v))
The leafRow is always kept in memory. In addition, each node contributes with d(S) rows, where S is the operation sequence
required by the node. For the tree in Fig. 6 we have cnrðFÞ ¼ 1 þ maxf1; 1 þ 1; 1 þ1 1; 1 þ 1 1 þ 0; 1 þ 1 1 þ 0 þ 0;
1 þ 1 1 þ 0 þ 0 þ 0g ¼ 3.
We observe that only operation sequence S1 adds more rows than it deallocates, i.e., dðS1Þ 40. An upper bound on the
number of concurrently needed rows is given by the number of nodes that require operation sequence S1. The only type of
nodes that falls into operation sequence S1 is a leaf node which is the leftmost child of its parent.
Lemma 1. The maximum number of concurrently needed rows of each cost array is ⌊jFj=2c.
Proof. The maximum number of the concurrently needed rows for a tree F is the number of its leftmost-child leaf nodes,
which is bound by ⌊jFj=2c: All leftmost-child leaf nodes, must have a different parent. This results in ⌊jFj=2c different
parent nodes for ⌊jFj=2c leftmost-child leaf nodes. If jFj is odd, then ⌊jFj=2c þ⌊jFj=2c ¼ jFj 1 and we can add one more
leftmost-child leaf to the tree. The new node will either become the child or the left sibling of an existing leftmost-child leaf
node. Thus, by adding a new leftmost-child leaf we loose an existing one. If jFj is even, ⌊jFj=2c þ ⌊jFj=2c ¼ jFj. Thus, the
maximum number of the leftmost-child leaf nodes is ⌊jFj=2c.□
An example of a tree with ⌊jFj=2c leftmost-child leaf nodes is the right branch tree (a tree symmetric to the left branch
tree in Fig. 8(a)).
Lemma 2. The memory required for the strategy computation in AP-TED is 2:5jFjjGj, including the output (strategy array STR).
Proof. By expiring rows, with Lemma 1, the sizes of the cost arrays LF, RF, and IF are reduced from jFjjGj to at most 0:5jFjjGj.
The strategy array STR requires jFjjGj memory. The size of STR is not reduced since it stores the final strategy after the
algorithm terminates. Thus, the AP-TED strategy algorithm requires 2:5jFjjGj memory in the worst case.□
Using our early deallocation technique, AP-TED requires only 2:5jFjjGj memory for computing and storing the strategy.
Unfortunately, it is still 0:5jFjjGj above the lower bound for the distance computation. In the following section we show that
we can reduce the memory to 2jFjjGj by traversing the trees in a clever order.
We reduce the number of concurrently needed rows in the cost matrices to ⌈jFj=3⌉. Our solution is based on the fol-
lowing observation. The AP-TED strategy algorithm iterates over the nodes in postorder. One can think of a symmetric
algorithm which iterates over the nodes in right-to-left postorder.2 Then, the maximum number of concurrently needed
rows is equal to the maximum number of the rightmost-child leaves in a tree (instead of leftmost-child leaves). For the right
branch tree, which is the worst case for postorder, the right-to-left postorder algorithm needs only three rows, including the
leafRow. Thus, the direction of the tree traversal matters. Let lðFÞ and rðFÞ be the number of leftmost-child and rightmost-
child leaf nodes in F, respectively. We use the following rule: if lðFÞ orðFÞ, then use (left-to-right) postorder, otherwise use
right-to-left postorder to compute the strategy. We call the order resulting from our rule an adaptive order.
We observe that whenever lðFÞ is high, rðFÞ is low and vice versa. Moreover, in practice minflðFÞ; rðFÞg is typically much
below jFj=2. We show that the number of rows is bounded to ⌈jFj=3⌉.
The first step is to show that for any node v A F at most the rows for v and its ancestors, i.e., the nodes on the path from v
to the root, are concurrently stored in main memory.
Lemma 3. For any node v A F processed by AP-TED strategy (Algorithm 1), at most the rows for v and its ancestors are con-
currently stored in main memory.
Proof. We show the proof for the postorder traversal. The reasoning for the right-to-left postorder is similar.
Given a node v A F, we divide the nodes of F into four disjoint sets visualized in Fig. 7:
2
In right-to-left postorder, children are traversed from right to left before their parents. For example, traversing the tree in Fig. 6 in right-to-left
postorder gives the following node sequence: v5 ; v2 ; v3 ; v1 ; v4 ; v6 .
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 167
L R
v
Fig. 7. Node sets L (left), R (right), A (ancestors), and D (descendants) in the proof of Lemma 3.
We show that while traversing node v during the strategy computation, the only nodes for which a row may be allocated
are v and the nodes in A.
While node v is traversed, its row must be present in memory. All nodes in L and D precede v in postorder and have
already been processed. Thus, their rows have been deallocated and are no longer in memory. The row of a node is first
allocated when its leftmost child is processed. The nodes of R and all their descendants (which are also in R) succeed v in
postorder and no row is allocated. Also the nodes in A succeed v in postorder and have not been processed. However, some
nodes in A may have a leftmost child in L [ fvg, thus the respective row may be allocated.□
Lemma 4. For a given tree F, the strategy computation in AP-TED with adaptive order requires at most ⌈jFj=3⌉ rows in main memory.
Proof. The proof is by contradiction. We assume a node v that has k 4⌈jFj=3⌉ ancestors for which a row must be stored in
both the postorder and right-to-left postorder iteration of the tree.
Postorder. If a row is allocated for an ancestor a of v, then a has a leftmost child which is not an ancestor of v. Node v may
cause the row allocation for its parent p(v). For the remaining k 1 ancestors of v there must be k 1 leftmost children.
Right-to-left postorder. Similarly, for k1 ancestors of v there must be k 1 rightmost children which are not ancestors of v.
Thus, F has at least 3k 1 nodes (node v, k ancestors of v, k 1 leftmost children, k1 rightmost children) and we get the
following contradiction:
Proof. By Lemmas 3 and 4 we know that the maximum number of concurrently needed rows of each cost array is at most
⌈jFj=3⌉. The strategy array STR needs jFjjGj memory. This results in 3ððjFj=3ÞjGjÞ þ jFjjGj ¼ 2jFjjGj memory.□
6. AP-TED þ algorithm
The RTED algorithm computes the tree edit distance by executing the single-path functions for the subtree pairs resulting
from the strategy. We observe that when one of the input trees in a single-path function is small, the distance can be com-
puted more efficiently than with the existing single-path functions. We address two special cases, which are very frequent and
have a high impact on the runtime: one- and two-node trees. We present AP-TED þ , a new algorithm that improves over
previous algorithms, including AP-TED, in two ways. First, AP-TED þ uses a new, efficient single-path functions for the cases of
small subtrees. The new single-path functions significantly reduce memory and runtime compared to previous single-path
functions. Second, AP-TED þ leverages the lower costs of the new single-path functions to compute better strategies. AP-TED þ
first computes the optimal all path strategy considering the lower costs for small subtrees and then executes the strategy using
the new single-path functions for small subtrees and the old single-path functions for all other cases.
168 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
Let F and G be the input trees to the TED algorithm, cd(a), ci(b), cr ða; bÞ the costs of deleting a A F, inserting b A G, and
renaming the label of a to the label of b, respectively. Let F 0 and G0 be the subtrees of F and G, such that G0 is a one-node tree,
NðG0 Þ ¼ fwg, EðG0 Þ ¼ ∅, and jF 0 jZ 1. We assume that cr ðv; wÞ rcd ðvÞ þ ci ðwÞ for any node v A F 0 (otherwise rename will never
contribute to the minimal distance). Then, the distance between F 0 and G0 is computed with the following formula:
X
δ1node ðF 0 ; G0 Þ ¼ cd ðvÞ þmin0 fcr ðv; wÞ cd ðvÞg ð4Þ
vAF
v A F0
The distance between F and G0 is the cost of deleting all nodes in F 0 except the one for which the rename cost to the label
0
of the node in G0 is minimal. The rename cost is added to the final result.
δ1node ðF 0 ; G0 Þ can be evaluated in a single traversal of the bigger tree and does not require any data structures for
storing intermediate results. In order to implement δ1node ðF 0 ; G0 Þ as a single-path function, we must store the
resulting distances of all subtree pairs ðF 0v ; G0w Þ, v A F 0 , w A G0 , in distance matrix D [30]. Algorithm 2 implements Eq. (4) as a
single-path function.
Algorithm 2. Δ1node ðF 0 ; G0 Þ.
Input: trees F 0 ; G0 , jG0 j ¼ 1, NðG0 Þ ¼ fwg
1 costSum’0
2 for v A F 0 in postorder do
6
6 þ
3 6 costSum’cd ðvÞ
6
4 6
4 match’minfmatch; cr ðv; wÞ cd ðvÞg
5 DðF 0v ; G0w Þ’costSum match
Let now G0 be a two-node tree, NðG0 Þ ¼ fw1 ; w2 g, EðG0 Þ ¼ fðw2 ; w1 Þg. anc(v) is the set of ancestors of node v in F 0 without v.
Then, the distance between F 0 and G0 is computed with Eq. (5). We limit our discussion to settings with the following
properties: (a) jF 0 j Z 2 (otherwise δ1node can be used), (b) cr ðv; wÞ rcd ðvÞ þci ðwÞ for all nodes v A F 0 and w A G0 (otherwise
rename will never contribute to the minimal distance).
! 8
X < cr ðv1 ; w2 Þ ðcd ðv1 Þ þci ðw2 ÞÞ
0 0
δ2node ðF ; G Þ ¼ cd ðvÞ þ ci ðw1 Þ þ ci ðw2 Þ þ min0 cr ðv1 ; w1 Þ ðcd ðv1 Þ þci ðw1 ÞÞ þ min fcr ðv2 ; w2 Þ ðcd ðv2 Þ þci ðw2 ÞÞg ð5Þ
v AF : v2 A ancðv1 Þ
v A F0 1
Eq. (5) substitutes one or two pairs of deletion and insertion with a rename in order to find the minimal distance.
Algorithm 3. Δ2node ðF 0 ; G0 Þ.
Input trees F 0 ; G0 , jG0 j ¼ 2, NðG0 Þ ¼ fw1 ; w2 g, EðG0 Þ ¼ fðw2 ; w1 Þg
1 match: ½1::jF 0 j½0::2: boolean array initialized to false
2 for v∈F 0 in postorder do
6
3 6 DðF 0v ; G0w2 Þ←maxfjF 0v j; 2g
6
4 66 if match½v½2 then
6
5 6 ⌊DðF 0v ; G0w2 Þ←jF 0v j−2
6
6 66 else if match½v½0 ∨ match½v½1 then
6 0
7 6 0 0
6 ⌊DðF v ; Gw2 Þ←jF v j−1
6
8 6 else if v is leaf ∧ðlðvÞ ¼ lðw1 Þ∨lðvÞ ¼ lðw2 ÞÞ then
6
9 6 0 0
6 ⌊DðF v ; Gw2 Þ←1
6 0
10 6 DðF v ; Gw Þ←jF 0v j
0
6 1
11 6
6 if match½v½0∨lðvÞ ¼ lðw1 Þthen
12 6
6 ⌊DðF 0v ; G0w Þ←jF 0v j−1
6 1
13 6
6 if v is not root of F then
0
6
14 6 6
6 match½pðvÞ½0←match½v½0∨lðvÞ ¼ lðw1 Þ∨match½pðvÞ½0
66
15 6 6
4 4 match½pðvÞ½1←match½v½1∨lðpðvÞÞ ¼ lðw2 Þ∨match½pðvÞ½1
16 match½pðvÞ½2←match½v½2∨match½pðvÞ½2∨ððmatch½v½0∨lðvÞ ¼ lðw1 ÞÞ∧lðpðvÞÞ ¼ lðw2 ÞÞ
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 169
Algorithm 3 implements Eq. (5) for the unit cost model, where for any node v A F 0 and w A G0 : cd ðvÞ ¼ ci ðwÞ ¼ 1, cr ðv; wÞ ¼ 0
if the labels of v and w are equal and cr ðv; wÞ ¼ 1 otherwise. Algorithm 3 computes the distance in a single postorder
traversal of tree F 0 . It uses a two-dimensional boolean array match with three columns of length jF 0 j to store the intermediate
label matches:
For each node v A F 0 , the subtree distances ðF 0v ; G0w1 Þ and ðF 0v ; G0w2 Þ are stored in the distance matrix D. Finally, the entries in
array match for the parent of v are updated (lines 13–14): a label match in subtree F 0v is also valid for F 0pðvÞ .
In order to take full advantage of one- and two-node single-path functions, they must be considered in the strategy
computation. The reason is the difference in the computation costs (the number of computed distances). We first show that
our new single-path functions, Δ1node and Δ2node , always have a lower cost than the single-path functions used in RTED (ΔL ,
ΔR , and ΔI ). In the worst case, the costs of ΔL=R=I is cubic in the tree size, whereas the cost of our functions is always linear.
The cost of a single-path function, jΔðF; GÞj, for a pair of trees F and G, is the number of subproblems that must be computed.
Theorem 5. For any two trees, F and G, such that jGj r2, the following holds:
jΔ1=2node ðF; GÞj rjΔL=R=I ðF; GÞj
Proof. Both Δ1node ðF; GÞ and Δ2node ðF; GÞ are computed in a single traversal of tree F. For each node in F only a constant
number of arithmetic operations and table lookups is performed, then jΔ1=2node ðF; GÞj ¼ jFj. Depending on the context, for ΔL ,
ΔR , and ΔI the following lower bounds hold [30]:
P
(a) jΔL=R=I ðF; GÞj ZjFjjGj þ F 0 A F γ jΔL=R=I ðF 0 ; GÞj ZjFj,
(b) jΔL=R=I ðG; FÞj ZjGjjFj ZjFj.□
The second step is to include the costs of the new functions in our AP-TED strategy algorithm to obtain better strategies.
This is straightforward: when the minimal cost is computed in line 10 of Algorithm 1, we check the sizes of the subtrees and
use the costs of our new single-path functions if applicable.
7. Related work
Tree edit distance algorithms. The tree edit distance has a recursive solution, which decomposes the input trees into
smaller subtrees and subforests. The best known algorithms are dynamic programming implementations of this recursive
solution, where small subproblems are computed first. The first tree edit distance algorithm was proposed by Tai [34]. It
runs in Oðn6 Þ time and space where n is the number of tree nodes. The runtime complexity is given by the number of
subproblems that must be solved. Zhang and Shasha [37] significantly improve the complexity and achieve Oðn4 Þ time and
Oðn2 Þ space. Klein [24] uses so-called heavy paths, which guide the dynamic programming solution by ordering the sub-
problems resulting with Oðn3 log nÞ runtime and space complexity. Demaine et al. [13] developed an algorithm which
requires Oðn3 Þ time and Oðn2 Þ space. They moreover show that their solution is worst-case optimal. Pawlik and Augsten [30]
observe that the previous algorithms are efficient only for specific tree shapes and run into their worst case otherwise. They
propose the general framework for computing the tree edit distance shown in Fig. 1(a) and present the RTED algorithm.
RTED computes and executes the optimal LRH strategy and runs in Oðn3 Þ time and Oðn2 Þ space. Compared to previous
algorithms, it does not run into the worst case if a better strategy exists. AP-TED, the solution presented in this paper, enjoys
the same features but in addition (a) requires less memory and (b) executes the optimal all-path strategy, a superclass of
LRH strategies. In [32], Pawlik and Augsten further improve RTED and introduce a novel indexing scheme, called root
encoding, that stores and retrieves intermediate distance results in constant time. They develop a new single-path function,
ΔA , that combines the features of the previous solutions for left, right, and inner paths [13,30,37]. Thanks to the root
encoding, ΔA requires less memory for inner paths than ΔI as proposed by Demaine et al. [13]. We leverage these results in
our AP-TED algorithm and build the optimizations of ΔA for inner paths into our single-path function ΔI .
Path strategies. Each of the state-of-the-art algorithms uses a specific set of paths to decompose trees and order the
computation of subproblems in the dynamic programming solution. The algorithms by Zhang and Shasha [37], Klein [24]
and Demaine et al. [13] use hard-coded strategies which do not require the strategy computation step. The resulting
algorithms are efficient only for specific tree shapes. Dulucq and Touzet [14] compute a decomposition strategy in the first
step, then use the strategy to compute the tree edit distance. Since they only consider strategies that decompose a single
tree, the resulting algorithm runs in Oðn3 log nÞ time and space. On-the-fly strategies that decompose both trees were
170 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
introduced by Pawlik and Augsten [30]. The resulting algorithm, RTED, requires Oðn3 Þ time and Oðn2 Þ space, and their
strategy is shown to be optimal in the class of LRH strategies, which covers all previous solutions. Unfortunately, the strategy
computation in RTED requires more memory than the actual distance computation.
In a preliminary version of this paper [31], we tackle the memory problem of the strategy computation in RTED [30] and
prove an ⌈n=2⌉ upper bound for the number of rows in the cost matrices. This, however, is not enough to keep the strategy,
which is of quadratic size, in main memory between the strategy and distance computation steps. In this paper, we improve
over [31] as follows. (a) We decrease the upper bound for the number of rows to ⌈n=3⌉. This allows the strategy to be
computed and stored in main memory without increasing the overall memory requirements for the distance computation.
Thus, we fully solve the memory problem of the strategy computation. (b) We present the AP-TED algorithm to compute
TED with the optimal all-path strategy, which subsumes LRH strategies. Our new strategy algorithm runs in Oðn2 Þ time and
implements an early deallocation technique to reduce the memory requirements. (c) We introduce the AP-TED þ algorithm
with two new single-path functions for small subtrees, which are more efficient in terms of runtime and memory than the
single-path functions used in RTED. By considering the new functions in the strategy computation we further improve the
decomposition strategy.
Approximations. More efficient approximation algorithms for the tree edit distance have been proposed. Akutsu et al. [2]
approximate the tree edit distance with the string edit distance δS between the Euler string representation of the input trees.
They show that the tree edit distance is between δS =2 and ð2h þ1ÞδS , where h is the maximum tree height. By encoding small
subtrees into node labels, they further achieve the lower and upper bounds δS =6 and Oðn3=4 ÞδS , respectively. Aratsu et al. [4]
compute the string edit distance δB between so-called binary tree codes of the input trees and achieve the bounds δB =2 and
ðhþ 1ÞδB þ h. The binary tree code is computed from the binary tree representation by traversing the nodes in postorder and
concatenating the labels. Both solutions run in Oðn2 Þ time. Zhang proposes an upper bound, the constrained tree edit dis-
tance, which runs in Oðn2 Þ time and space [36]. Wang and Zhang [35] improve the space complexity of the constrained tree
edit distance to Oðn log nÞ with a technique similar to our approach for the strategy computation. Guha et al. [18] develop a
lower bound algorithm by computing the string edit distance between the preorder and postorder sequences of node labels
in Oðn2 Þ time and space. Augsten et al. [6] decompose the trees into pq-grams and propose a lower bound algorithm which
requires Oðn log nÞ time and O(n) space. Finis et al. [16] develop the RWS-Diff algorithm (Random Walk Similarity Diff) to
compute an approximation of the tree edit distance and an edit mapping in Oðn log nÞ time.
8. Experiments
In this section we experimentally evaluate AP-TED and AP-TED þ and compare them to RTED [30]. Our empirical eva-
luation on real-world and synthetic data confirms our analytical results: computing the strategy in AP-TED is as efficient as
in RTED, but requires significantly less memory. In particular, the strategy computation requires less memory than the actual
tree edit distance computation.
Set-up. All algorithms are implemented as single-thread applications in Java 1.7. We run the experiments on a 4-core Intel
i7 3.70 GHz desktop computer with 32 GB RAM.
The datasets. We test the algorithms on both synthetic and real world data. We generate synthetic trees of three different
shapes: left branch (LB), zigzag (ZZ) and full binary tree (FB) (Fig. 8). We also generate random trees (random) varying in
depth and fanout (with a maximum depth of 15 and a maximum fanout of 6).
We use three real world datasets with different characteristics. SwissProt3 is an XML protein sequence database with
50 000 medium sized and flat trees (average depth 3.8, maximum depth 4, average fanout 1.8, maximum fanout 346, and
average size 187). TreeBank4 is an XML representation of natural language syntax trees with 56 385 small and deep trees
(average depth 10.4, maximum depth 35, average fanout 1.5, maximum fanout 51, and average size 68). TreeFam5 stores
16 138 phylogenetic trees of animal genes (average depth 14, maximum depth 158, average fanout 2, maximum fanout 3,
and average size 95).
Memory measurement. We first measure the memory requirements for the strategy computation and compare it to the
memory needed for executing the strategy. We measure only the heap memory allocated by the Java Virtual Machine. The
non-heap memory varies between 3.5 MB and 5 MB for all test cases, including the tests on the largest trees.
Fig. 9(a)–(d) shows the memory requirements for different tree shapes and sizes. We use three datasets containing
synthetic trees of the same shape (LB, ZZ, and FB) and a dataset with randomly generated trees (random). Each synthetic
dataset contains trees with up to 40 000 nodes. The lines represent the memory usage in MB for strategy computation in
RTED and AP-TED, and the memory needed for the distance computation. In all cases AP-TED strategy requires significantly
less memory than RTED strategy (note the logarithmic scale). RTED uses much more (up to 200%) memory for the strategy
computation than for executing the strategy. On the contrary, AP-TED always uses less memory to compute the strategy. In
particular, we observe that most of the memory allocated by AP-TED strategy is used to store the strategy.
3
http://www.expasy.ch/sprot/
4
http://www.cis.upenn.edu/ treebank/
5
http://www.treefam.org/
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 171
Fig. 8. Shapes of the synthetic trees. (a) Left branch tree (LB), (b) full binary tree (FB) and (c) zig-zag tree (ZZ).
Fig. 9. Memory of strategy computation compared to executing the strategy on synthetic data and real-world datasets (Sprot and TreeBank). (a) Left branch
tree (LB), (b) full binary tree (FB), (c) zig-zag tree (ZZ), (d) random tree (random), (e) Sprot, and (f) TreeBank.
We performed a similar experiment on the real-world datasets Sprot and TreeBank (Fig. 9(e) and (f)). We pick pairs of similarly sized trees at regular size intervals and measure the memory of the strategy computation. The plotted data points correspond to the average size of a tree pair and the memory used in MB. We observe a similar behaviour as in the case of synthetic trees.
We further compute a similarity self join on a sample subset of similarly sized trees from the real-world datasets. The join is of the form D ⋈θ D, where D is a dataset and θ a tree edit distance predicate. The results are summarized in Table 1. The first three columns show the dataset, the average tree size, and the number of trees in the sample. We report the memory for the strategy computation in RTED, in AP-TED, and for the distance computation step. For each of the datasets, the strategy computation with RTED requires more memory than the distance computation, whereas AP-TED always uses less memory.
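For illustration, such a join can be sketched as a nested-loop self join with a distance threshold; the TreeEditDistance interface below is an assumed placeholder for any tree edit distance implementation (e.g., AP-TED) and is not an interface defined in this paper.

// A minimal sketch of the similarity self join D ⋈θ D.
import java.util.ArrayList;
import java.util.List;

public class SimilaritySelfJoin {

    interface TreeEditDistance {
        double compute(String t1, String t2);
    }

    // Returns the index pairs of all trees whose edit distance is at most tau.
    static List<int[]> selfJoin(List<String> trees, TreeEditDistance ted, double tau) {
        List<int[]> result = new ArrayList<int[]>();
        for (int i = 0; i < trees.size(); i++) {
            for (int j = i + 1; j < trees.size(); j++) {   // each unordered pair once
                if (ted.compute(trees.get(i), trees.get(j)) <= tau) {
                    result.add(new int[]{i, j});
                }
            }
        }
        return result;
    }
}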
Table 1
Average memory usage (in MB) and runtime (in milliseconds) for different datasets and tree sizes. The Rows columns report the number of concurrently needed rows in the cost arrays of the strategy computation.

Dataset         Avg. size   Trees   RTED strategy                 AP-TED strategy               Distance computation
                                    Rows      Memory    Time      Rows     Memory    Time       Memory    Time
treebank-100        98.7      50      98.7      0.57     0.02       5.3      0.53     0.03        0.57      1.43
treebank-200       195.4      50     195.4      1.10     0.07       6.0      0.72     0.09        0.88      6.27
treebank-400       371.3      10     371.3      2.86     2.14       6.3      1.30     2.31        1.77     26.99
sprot-200          211.0      50     211.0      1.21     0.15       3.0      0.70     0.09        0.89      4.92
sprot-400          395.0      50     395.0      3.12     2.25       3.0      1.31     2.16        1.92     15.44
sprot-1000         987.5      20     987.5     16.64    16.95       3.0      5.13    14.94        9.17    123.79
sprot-2000        1960.1      10    1960.1     63.20    64.85       3.0     17.80    55.11       33.10    502.88
treefam-200        197.7      50     197.7      1.16     0.15       9.6      0.74     0.15        0.88      9.86
treefam-400        402.6      50     402.6      3.33     2.52      12.3      1.46     2.37        2.05     55.48
treefam-1000       981.9      20     981.9     16.73    18.33      14.1      5.58    17.33        9.27    453.51

Fig. 10. Runtime difference between RTED strategy, AP-TED strategy, and the overall distance computation. (a) RTED vs. AP-TED strategy and (b) strategy vs. distance computation.

Table 2
Runtime (in seconds) of a similarity self join on different datasets.

Number of rows. We count the number of concurrently needed rows in the cost arrays during the strategy computation in AP-TED. We perform this experiment on the real-world datasets; the results are shown in Table 1. For the RTED strategy the number of rows is the size of the left-hand input tree (column 2, Avg. size). For the AP-TED strategy the number of rows ranges from about 5% of the tree size for the treebank-100 dataset down to 0.15% for the sprot-2000 dataset. Thus, in practice, our memory deallocation technique is much more effective than suggested by the theoretical bound of 33.3%.
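For illustration, these fractions follow directly from the row counts and average tree sizes in Table 1 (e.g., 5.3/98.7 and 3.0/1960.1); the small program below only reproduces this arithmetic.

// Illustrative arithmetic only: relating the AP-TED row counts to the
// average tree sizes reported in Table 1.
public class RowFraction {
    public static void main(String[] args) {
        System.out.printf("treebank-100: %.2f%%%n", 100 * 5.3 / 98.7);    // about 5%
        System.out.printf("sprot-2000:   %.2f%%%n", 100 * 3.0 / 1960.1);  // about 0.15%
    }
}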
Strategy runtime. We measure the time for computing and executing strategies; the results for different datasets are shown in the runtime columns of Table 1. In most cases, the AP-TED strategy computation is slightly faster than that of RTED. Due to its deallocation mechanism (cf. Section 5), AP-TED needs to allocate less memory for the strategy than RTED; the improved runtime stems from reusing memory instead of repeatedly allocating new memory on the heap. Compared to the distance computation, the strategy computation requires only a small fraction of the overall time. Fig. 10 shows how the strategy computation scales with the tree size for the Sprot dataset.
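The effect of reusing memory can be illustrated with a generic two-row dynamic program that swaps its row buffers instead of allocating a fresh row in every iteration; this sketch uses a placeholder recurrence and is not the AP-TED cost-array code.

// A minimal sketch of the memory-reuse idea in a row-based dynamic program.
public class RowReuse {
    static long[] dpWithReuse(int m, int n) {
        long[] prev = new long[n + 1];
        long[] curr = new long[n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                curr[j] = prev[j - 1] + 1;              // placeholder recurrence
            }
            long[] tmp = prev; prev = curr; curr = tmp; // reuse buffers, no new allocation
        }
        return prev;
    }
}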
Efficiency of the new single-path functions. We measure the efficiency of our new single-path functions, Δ1node and Δ2node. We perform a similarity self join on our real-world datasets (as in the memory experiment) and measure the runtime of the entire join on each dataset separately. The results are shown in Table 2. The second column shows the results for RTED, the third column for our AP-TED+ algorithm. AP-TED+ reduces the runtime by 5% for the treefam-400 dataset and by up to 57% for the sprot-200 dataset.
9. Conclusion
In this paper we develop two new algorithms for the tree edit distance: AP-TED and AP-TED+. The strategy computation is a main memory bottleneck of the state-of-the-art solution, RTED [30]: the memory required for the strategy computation can be twice the memory needed for the actual tree edit distance computation. Our AP-TED strategy algorithm reduces the memory by at least 2/3 compared to the strategy computation in RTED and never uses more memory than the distance computation. The experimental evaluation confirms our theoretical results. Our algorithm computes the optimal all-path strategy, which, in contrast to RTED, does not restrict the path type and considers all possible paths. Our AP-TED+ algorithm with its new, efficient single-path functions reduces the overall runtime of the tree edit distance computation by up to 57%.
Acknowledgements
This work is partially supported by the SyRA project of the Free University of Bozen-Bolzano, Italy.
References
[1] T. Akutsu, Tree edit distance problems: algorithms and applications to bioinformatics, IEICE Trans. Inf. Syst. E93-D (2) (2010) 208–218.
[2] T. Akutsu, D. Fukagawa, A. Takasu, Approximating tree edit distance through string edit distance, Algorithmica 57 (2) (2010) 325–348.
[3] K.F. Aoki, A. Yamaguchi, Y. Okuno, T. Akutsu, N. Ueda, M. Kanehisa, H. Mamitsuka, Efficient tree-matching methods for accurate carbohydrate database
queries, Genome Inform. 14 (2003) 134–143.
[4] T. Aratsu, K. Hirata, T. Kuboyama, Approximating tree edit distance through string edit distance for binary tree codes, Fundam. Inform. 101 (3) (2010)
157–171.
[5] N. Augsten, D. Barbosa, M. Böhlen, T. Palpanas, Efficient top-k approximate subtree matching in small memory, IEEE Trans. Knowl. Data Eng. (TKDE) 23
(8) (2011) 1123–1137.
[6] N. Augsten, M. Böhlen, J. Gamper, The pq-gram distance between ordered labeled trees, ACM Trans. Database Syst. (TODS) 35 (1).
[7] J. Bellando, R. Kothari, Region-based modeling and tree edit distance as a basis for gesture recognition, in: International Conference on Image Analysis
and Processing (ICIAP), 1999.
[8] S.S. Chawathe, Comparing hierarchical data in external memory, in: International Conference on Very Large Data Bases (VLDB), 1999.
[9] G. Cobéna, S. Abiteboul, A. Marian, Detecting changes in XML documents, in: International Conference on Data Engineering (ICDE), 2002.
[10] S. Cohen, Indexing for subtree similarity-search using edit distance, in: ACM SIGMOD International Conference on Management of Data (SIGMOD),
2013, pp. 49–60.
[11] T. Dalamagas, T. Cheng, K.-J. Winkel, T. Sellis, A methodology for clustering XML documents by structure, Inf. Syst. 31 (3) (2006) 187–228.
[12] D. de Castro Reis, P.B. Golgher, A.S. da Silva, A.H.F. Laender, Automatic web news extraction using tree edit distance, in: International World Wide Web
Conference (WWW), 2004, pp. 502–511.
[13] E.D. Demaine, S. Mozes, B. Rossman, O. Weimann, An optimal decomposition algorithm for tree edit distance, ACM Trans. Algorithms 6 (1).
[14] S. Dulucq, H. Touzet, Decomposition algorithms for the tree edit distance problem, J. Discret. Algorithms 3 (2–4) (2005) 448–471.
[15] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, M. Montperrus, Fine-grained and accurate source code differencing, in: ACM/IEEE International
Conference on Automated Software Engineering (ASE), 2014, pp. 313–324.
[16] J.P. Finis, M. Raiber, N. Augsten, R. Brunel, A. Kemper, F. Färber, RWS-Diff: flexible and efficient change detection in hierarchical data, in: ACM International Conference on Information and Knowledge Management (CIKM), 2013, pp. 339–348.
[17] M. Garofalakis, A. Kumar, XML stream processing using tree-edit distance embeddings, ACM Trans. Database Syst. (TODS) 30 (1) (2005) 279–332.
[18] S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, T. Yu, Approximate XML joins, in: ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
[19] A. Habrard, J.M.I. Quereda, D. Rizo, M. Sebban, Melody recognition with learned edit distances, in: Joint IAPR International Workshop on Structural,
Syntactic, and Statistical Pattern Recognition (SSPR/SPR), 2008, pp. 86–96.
[20] M. Hashimoto, A. Mori, Diff/ts: a tool for fine-grained structural change analysis, in: Working Conference on Reverse Engineering (WCRE), 2008.
[21] H. Heumann, G. Wittum, The tree-edit-distance, a measure for quantifying neuronal morphology, Neuroinformatics 7 (3) (2009) 179–190.
[22] S. Kamali, F.W. Tompa, Retrieving documents with mathematical content, in: ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR), 2013, pp. 353–362.
[23] Y. Kim, J. Park, T. Kim, J. Choi, Web information extraction by HTML tree edit distance matching, in: International Conference on Convergence
Information Technology (ICCIT), 2007, pp. 2455–2460.
[24] P.N. Klein, Computing the edit distance between unrooted ordered trees, in: European Symposium on Algorithms (ESA), 1998, pp. 91–102.
[25] P.N. Klein, S. Tirthapura, D. Sharvit, B.B. Kimia, A tree-edit-distance algorithm for comparing simple, closed shapes, in: ACM-SIAM Symposium on
Discrete Algorithms (SODA), 2000.
[26] F. Korn, B. Saha, D. Srivastava, S. Ying, On repairing structural problems in semi-structured data, Proc. VLDB Endow. 6 (9).
[27] K.-H. Lee, Y.-C. Choy, S.-B. Cho, An efficient algorithm to compute differences between structured documents, IEEE Trans. Knowl. Data Eng. (TKDE) 16
(8) (2004) 965–979.
[28] Z. Lin, H. Wang, S.I. McClean, Measuring tree similarity for natural language processing based information retrieval, in: International Conference on
Applications of Natural Language to Information Systems (NLDB), 2010.
[29] B. Ma, L. Wang, K. Zhang, Computing similarity between RNA structures, Theor. Comput. Sci. 276 (1–2) (2002) 111–132.
[30] M. Pawlik, N. Augsten, RTED: a robust algorithm for the tree edit distance, Proc. VLDB Endow. (PVLDB) 5 (4) (2011) 334–345.
[31] M. Pawlik, N. Augsten, A memory-efficient tree edit distance algorithm, in: International Conference on Database and Expert Systems Applications
(DEXA), 2014.
[32] M. Pawlik, N. Augsten, Efficient computation of the tree edit distance, ACM Trans. Database Syst. (TODS) 40 (1), Article 3.
[33] V. Springel, S.D.M. White, A. Jenkins, C.S. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J.A. Peacock, S. Cole, P. Thomas,
H. Couchman, A. Evrard, J. Colberg, F. Pearce, Simulations of the formation, evolution and clustering of galaxies and quasars, Nature 435 (2005)
629–636.
[34] K.-C. Tai, The tree-to-tree correction problem, J. ACM (JACM) 26 (3) (1979) 422–433.
[35] L. Wang, K. Zhang, Space efficient algorithms for ordered tree comparison, Algorithmica 51 (3) (2008) 283–297.
[36] K. Zhang, Algorithms for the constrained editing distance between ordered labeled trees and related problems, Pattern Recognit. 28 (3) (1995)
463–474.
[37] K. Zhang, D. Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. Comput. (SICOMP) 18 (6) (1989)
1245–1262.