1 s2.0 S0306437915001611 Main
1 s2.0 S0306437915001611 Main
Information Systems
journal homepage: www.elsevier.com/locate/infosys
a r t i c l e i n f o abstract
Article history: Hierarchical data are often modelled as trees. An interesting query identifies pairs of
Received 12 August 2014 similar trees. The standard approach to tree similarity is the tree edit distance, which has
Received in revised form successfully been applied in a wide range of applications. In terms of runtime, the state-
25 May 2015
of-the-art algorithm for the tree edit distance is RTED, which is guaranteed to be fast
Accepted 2 August 2015
independent of the tree shape. Unfortunately, this algorithm requires up to twice the
Available online 28 August 2015
memory of its competitors. The memory is quadratic in the tree size and is a bottleneck for
Keywords: the tree edit distance computation.
Tree edit distance In this paper we present a new, memory efficient algorithm for the tree edit distance,
Similarity search
AP-TED (All Path Tree Edit Distance). Our algorithm runs at least as fast as RTED without
Approximate matching
trading in memory efficiency. This is achieved by releasing memory early during the first
step of the algorithm, which computes a decomposition strategy for the actual distance
computation. We show the correctness of our approach and prove an upper bound for the
memory usage. The strategy computed by AP-TED is optimal in the class of all-path
strategies, which subsumes the class of LRH strategies used in RTED. We further present
the AP-TED þ algorithm, which requires less computational effort for very small subtrees
and improves the runtime of the distance computation. Our experimental evaluation
confirms the low memory requirements and the runtime efficiency of our approach.
& 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Data with hierarchical dependencies are often modelled as trees. Tree data appear in many applications, ranging from
hierarchical data formats like JSON or XML to merger trees in astrophysics [33]. An interesting query computes the similarity
between two trees. The standard measure for tree similarity is the tree edit distance, which is defined as the minimum-cost
sequence of node edit operations that transform one tree into another. The tree edit distance has been successfully applied
in bioinformatics (e.g., to find similarities between RNA secondary structures [1,29], neuronal cells [21], or glycan structures
[3]), in image analysis [7], pattern recognition [25], melody recognition [19], natural language processing [28], information
extraction [12,23], and document retrieval [22], and has received considerable attention from the database community
[5,8–11,16–18,26,27].
n
Corresponding author. Tel.: þ 43 66280446348.
E-mail addresses: mateusz.pawlik@sbg.ac.at (M. Pawlik), nikolaus.augsten@sbg.ac.at (N. Augsten).
http://dx.doi.org/10.1016/j.is.2015.08.004
0306-4379/& 2015 Elsevier Ltd. All rights reserved.
158 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
Fig. 1. Strategy computation requires more memory than actual distance computation. (a) Two-step algorithm for tree edit distance and (b) strategy vs.
distance computation.
The fastest algorithms for the tree edit distance (TED) decompose the input trees into smaller subtrees and use dynamic
programming to build the overall solution from the subtree solutions. The key difference between various TED algorithms is
the decomposition strategy, which has a major impact on the runtime. Early attempts to compute TED [13,24,37] use a
hard-coded strategy, which disregards or only partially considers the shape of the input trees. This may lead to very poor
strategies and asymptotic runtime differences of up to a polynomial degree. The most recent development is the Robust Tree
Edit Distance (RTED) algorithm [30], which operates in two steps (cf. Fig. 1(a)). In the first step, a decomposition strategy is
computed. The strategy adapts to the input trees and is shown to be optimal among all previously proposed strategies. The
actual distance computation is done in the second step, which executes the strategy.
In terms of runtime, the overhead for the strategy computation in RTED is small compared to the gain due to the better
strategy. Unfortunately, this does not hold for the main memory consumption. Fig. 1(b) shows the memory usage for two
example trees (perfect binary trees) of 8191 nodes: the strategy computation requires 1.1 GB of RAM, while the execution of
the strategy (i.e., the actual distance computation) requires only 0.55 GB. Thus, for large instances, the strategy computation
is the bottleneck and the fallback is a hard-coded strategy. This is undesirable since the gain of a good strategy grows
with the instance size. Reducing the memory requirements of the strategy computation affects the maximum tree size that
can be processed. This is crucial especially for large trees like abstract syntax trees of source code repositories [15,20]
(Emacs: 4 10k nodes and MythTV: 4 50k nodes) or merger trees in astrophysics1 [33].
In this paper we propose the AP-TED algorithm, which solves the memory problem of the strategy computation. This is
achieved by computing the strategy bottom-up using dynamic programming and releasing part of the memorization tables
early. We prove that our algorithm requires at most 1/3 of the memory that is needed by RTED's strategy computation [30].
As a result, the memory cost of the strategy computation is never above the cost of the distance computation. Our extensive
experimental evaluation on various tree shapes, which require very different strategies, confirms our analytic memory
bound and shows that our algorithm is often much better than its theoretical upper bound. For some tree shapes, it even
runs in linear space, while the RTED strategy algorithm always requires quadratic space.
In addition to reducing the memory usage, AP-TED computes the optimum in a larger class of strategies than RTED.
Strategies are expressed by root-leaf paths that guide the decomposition of the input trees. A path decomposes a tree into
subtrees by deleting nodes and edges on a root-leaf path. Each resulting subtree is recursively decomposed by a new root-
leaf path. RTED computes the optimal LRH strategy. An LRH strategy considers only left, right, and heavy paths. The left
(right) root-leaf path connects each parent with its first (last) child; the heavy path connects the parent with the rightmost
child that roots the largest subtree. AP-TED considers all root-leaf paths and is not limited to left, right, and heavy paths.
Thus, our strategy is at least as good as the strategies used by RTED. To the best of our knowledge, this is the first algorithm
to compute the optimal all-path strategy. The runtime complexity of our strategy algorithm is Oðn2 Þ as for the RTED strategy.
This result is surprising since in each recursive step we need to consider a linear number of paths compared to only three
paths (left, right, and heavy) in the RTED strategy. Our empirical evaluation suggests that in practice our strategy algorithm
is even slightly faster than the RTED strategy algorithm since it allocates less memory.
On the distance computation side, we observe that a large number of subproblems that result from the tree decom-
positions are very small trees with one or two nodes only. We show that a significant boost can be achieved by treating
these cases separately. We introduce the AP-TED þ algorithm, which leverages that fact and achieves runtime improvements
of more than 50% in some cases.
Summarizing, the contributions of this paper are the following:
Memory efficiency. We substantially reduce the memory requirements w.r.t. previous strategy computation algorithms by
traversing the trees bottom-up and systematically releasing memory early. The resulting AP-TED algorithm always
consumes less memory for the strategy computation than for the actual distance computation and thus breaks the
1
Accessible at http://www.mpa-garching.mpg.de/millennium/.
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 159
bottleneck of previous algorithms. (We show the correctness of our approach and prove an upper bound for the memory
usage.)
Optimal all-path strategy. The decomposition strategy used by AP-TED is optimal in the class of all-path strategies. This
class generalizes LRH strategies and contains all strategies of previous TED algorithms. Although our strategy algorithm
must consider more paths, it is as efficient as the strategy algorithm in RTED (quadratic in the input size).
New single-path functions. We develop AP-TED þ , which leverages two new single-path functions to compute the distance
of subtree pairs when one of the subtrees is small. This case occurs frequently during the decomposition process. Our new
single-path functions run in linear time and at most linear space, which substantially improves over the single-path
functions ΔL , ΔR , and ΔI used in RTED [30]. To take full advantage of the new functions, we integrate them into the strategy
computation to obtain better strategies. Our experiments confirm the significant runtime improvement.
The paper is structured as follows. Section 2 sets the stage for our discussion of strategy algorithms. In Section 3 we
define the problem, and we present our AP-TED algorithm in Section 4. The memory efficient implementation of the
strategy computation in AP-TED is discussed in Section 5. The AP-TED þ algorithm is presented in Section 6. We treat related
work in Section 7, experimentally evaluate our solution in Section 8, and conclude in Section 9.
2.1. Notation
We follow the notation of [30] when possible. A tree F is a directed, acyclic, connected graph with nodes N(F) and edges
EðFÞ D NðFÞ NðFÞ, where each node has at most one incoming edge. Each node has a label, which is not necessarily unique
within the tree. The nodes of a tree F are strictly and totally ordered such that (a) v 4w for any edge ðv; wÞ A EðFÞ, and (b) for
any two nodes f ; g, if f o g and f is not a descendant of g, then f og 0 for all descendants g 0 of g. The tree traversal that visits all
nodes in ascending order is the postorder traversal.
In an edge (v,w), node v is the parent and w is the child, pðwÞ ¼ v. A node with no parent is a root node, a node without
children is a leaf.
Fv is the subtree rooted in node v of F iff Fv is a tree, NðF v Þ ¼ fx: x ¼ v or x is a descendant of v in Fg, and EðF v Þ DEðFÞ. A path
γ in F is a subtree of F in which each node has at most one child. By γ ðFÞ we denote the set of all root-leaf paths in F. The left
path γ L ðFÞ A γ ðFÞ recursively connects a parent to its leftmost child. Similarly, the right path γ R ðFÞ A γ ðFÞ connects a parent to
its rightmost child. The heavy path connects a parent to the child which roots the rightmost largest subtree. An inner path is
a path in γ ðFÞ which is neither left nor right.
We use the following short notation: By jFj ¼ jNðFÞj we denote the size of F, we write v A F for v A NðFÞ. F γ is the set of
relevant subtrees of tree F with respect to the path γ obtained by removing path γ from F: F γ ¼ fF v : (xðx A γ 4 v2 = γ 4 pðxÞ ¼
pðvÞÞg.
Example 1. The nodes of tree F in Fig. 2 are NðFÞ ¼ fv1 ; v2 ; v3 ; v4 ; v5 ; v6 ; v7 ; v8 ; v9 ; v10 ; v11 ; v12 ; v13 g, the edges are
EðFÞ ¼ fðv13 ; v4 Þ, ðv13 ; v10 Þ; ðv13 ; v12 Þ; ðv4 ; v1 Þ; ðv4 ; v3 Þ; ðv3 ; v2 Þ; ðv10 ; v5 Þ; ðv10 ; v9 Þ; ðv9 ; v8 Þ; ðv8 ; v6 Þ; ðv8 ; v7 Þ; ðv12 ; v11 Þg, the node labels
are shown in italics in the figure. The root of F is v13 , and jFj ¼ 13. F v8 with nodes NðF v8 Þ ¼ fv6 ; v7 ; v8 g and edges
EðF v8 Þ ¼ fðv8 ; v6 Þ; ðv8 ; v7 Þg is a subtree of F. The path γ that connects v13 and v7 is the heavy path in F, and
F γ ¼ fF v4 ; F v5 ; F v6 ; F v12 g. γ L ðFÞ is the path from v13 to v1 and γ R ðFÞ is the path from v13 to v11. The paths from v13 to v2, v5, v6,
and v7, respectively, are the inner paths. The postorder traversal visits the nodes of F in the following order:
v1 ; v2 ; v3 ; v4 ; v5 ; v6 ; v7 ; v8 ; v9 ; v10 ; v11 ; v12 ; v13 .
F
v13 , a
v4 , b v10 , f v12 , l
v1 , c v3 , d v5 , g v9 , h v11 , m
v2 , e v8 , i
v6 , j v7 , k
Fig. 2. Example tree.
160 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
The tree edit distance is the minimum-cost sequence of node edit operations that transforms tree F into G. The standard
edit operations are: delete a node, insert a node, and rename the label of a node. The state-of-the-art algorithms compute
the tree edit distance by implementing a well-known recursive solution [34] using dynamic programming [13,30,34,37].
They devise so-called decomposition strategies to decompose the input trees into smaller subtree pairs for which the distance
is computed first. The results of smaller subproblems are used to compute the distances of larger problems. Pawlik and
Augsten [30] introduce path strategies as a subset of all possible decomposition strategies and propose a two-step process for
the TED computation: (a) strategy computation and (b) distance computation (cf. Fig. 1(a)).
(a) In the first step, a path strategy is computed, which maps each pair of subtrees ðF v ; Gw Þ of two input trees F and G to a
root-leaf path in either Fv or Gw. RTED considers LRH strategies which allow left, right, and heavy paths only. The resulting
strategy is optimal in the sense that it minimizes the number of subproblems (i.e., the number of distances) that must be
computed. In this paper we consider all-path strategies which, in contrast to LRH strategies, do not have limitation on the
path type.
Definition 1. An all-path strategy S for two trees F and G maps each pair of subtrees ðF v ; Gw Þ, v A F, w A G, to a root-leaf path γ
in one of the subtrees, such that γ A γ ðF v Þ [ γ ðGw Þ [30].
(b) In the second step, the tree edit distance is computed. First, the input trees are decomposed into relevant subtree pairs
according to the strategy. Then, the corresponding distances are computed using so-called single-path functions. The single-
path function updates a distance matrix D, which stores distances between subtree pairs and is filled during the distance
computation [30].
Definition 2. Given are two subtrees Fv, Gw, a path γ A γ ðF v Þ, and the distances between all relevant subtrees of Fv with
respect to γ and all subtrees of Gw, i.e., between each pair ðF v0 ; Gw0 Þ: F v0 A F v γ 4 w0 A Gw . A single-path function ΔðF v ; Gw Þ
computes the distances between all subtrees rooted on path γ in Fv and every subtree of Gw, i.e., between all pairs
ðF v0 ; Gw0 Þ: v0 A γ 4 w0 A Gw .
Existing algorithms for single-path functions differ in the path type which they can process (ΔL : left path, ΔR : right path,
and ΔI : inner path), their cost (number of subproblems that are computed), and memory usage [30,32]. Depending on the
strategy, different single-path functions must be used, resulting in a different overall cost for computing the tree edit
distance. Although ΔI can processes any inner path, RTED considers only heavy paths. In this paper, we always assume the
improved version of ΔI as presented in [32], which requires less memory than the version in [30].
3. Problem definition
As outlined in previous sections, the path strategies introduced by Pawlik and Augsten [30] generalize all state-of-the-art
algorithms for computing the tree edit distance. They consider the class of LRH strategies and show optimality. However,
LRH strategies limit the paths to be left, right, or heavy. We observe that allowing all paths leads to less expensive strategies.
Another drawback of the RTED algorithm is the fact that the computation of the optimal strategy requires more space than
executing the strategy, i.e., computing the actual tree edit distance. This limits the maximum size of the tree instances which
can be processed in given memory. We moreover observe that many subtrees resulting from the optimal strategy are small.
This is true for any input. The single-path functions used in RTED are inefficient in such cases and better solutions should be
devised.
Based on the issues in RTED, our goal is an algorithm with the following properties:
Memory efficient. The strategy computation should not be a main memory bottleneck. The optimal strategy must be
computed and stored within the memory bounds of the distance computation.
Allowing all paths in the strategy. The strategies should not be limited to only left, right, and heavy paths. The algorithm for
the optimal strategy should consider all paths and run in Oðn2 Þ time.
Fast for small subtrees. Executing the expensive single-path functions of RTED for small subtrees is a significant overhead
for the entire algorithm. Whenever possible, the distances of small subtrees should be computed efficiently.
4. AP-TED algorithm
Until now, only LRH strategies have been considered in literature [13,24,30,37]. They are limited to left, right and heavy
paths only. LRH strategies are only a fraction of all possible path strategies. There may exist non-LRH path strategies that
lead to better solutions. In principle, all possible path strategies must be checked for the best result. In this section we
present AP-TED, a new algorithm that computes the tree edit distance with the optimal all-path strategy. The core of AP-TED
is a new algorithm to compute the optimal all-path strategy efficiently in small memory. Then, similar to RTED, the strategy
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 161
is executed using single-path functions. We develop a recursive formula that computes the cost of the optimal all-path
strategy and an algorithm which implements the formula efficiently. In Section 5 we present techniques to improve the
memory requirements of the strategy computation in AP-TED.
The tree edit distance is computed by executing a sequence of so-called single-path functions [30]. The sequence is
defined by the strategy. The cost of computing the distance is the cost sum of executing all single-path functions specified by
the strategy. Depending on the path type (left, right, and inner), different single-path functions must be used. Each single-
path function has a different cost. To compute the distance with the lowest cost, all possible path strategies must be
checked.
The cost formula in Fig. 3 performs an exhaustive search in the space of all possible path strategies. At each recursive
step, either F or G is decomposed with a path from γ ðFÞ or γ ðGÞ, respectively. For each of the choices, the cost of the
resulting relevant subtrees must be computed to find the strategy with the minimal cost.
Given a path, the cost is the sum of the optimal costs for the relevant subtrees with respect to the path plus the cost of
executing the single-path function. For each path type we use the single-path function with the lowest processing cost
(number of subproblems which must be computed). For left and right paths we use the costs of ΔL and ΔR , respectively. Let
γ I ðFÞ ¼ fγ A γ ðFÞ: γ a γ L ðFÞ 4 γ aγ R ðFÞg be the set of all inner paths of tree F. For inner paths we use the cost of ΔI [30,32]. We
denote the cost of a single-path function by jΔL=R=I j.
Theorem 1. Given the single path functions ΔL , ΔR , and ΔI , the cost formula in Fig. 3 computes the cost of the optimal all-path
strategy.
Proof. The proof is by induction. Base case: Both trees are single nodes: jΔðF; GÞj ¼ 1 for all path types [30] and costðF; GÞ ¼ 1.
Inductive hypothesis: costðF 0 ; GÞ: F 0 A F γ, γ A γ ðFÞ and costðG0 ; FÞ: G0 A G γ, γ A γ ðGÞ are optimal. The cost for a path γ in F
consists of the execution cost for the respective single path function (jΔL=R=I ðF; GÞj if γ is a left/right/inner path) and the cost
sum of the relevant subtrees F 0 A F γ. The cost computation for a path in G is analogous. All paths in F and G are considered,
and we pick the path with the minimum cost, thus costðF; GÞ is optimal.□
In this section we present the AP-TED strategy algorithm (Algorithm 1), which computes the optimal all-path strategy.
The algorithm implements the all-path cost formula in Fig. 3 and is as efficient as the RTED strategy algorithm in terms of
runtime, but requires significantly less memory. Our early deallocation technique is implemented in lines 6, 7, and 38, and is
described in detail in Section 5.
For the all-path strategy, similarly to the RTED strategy, we compute the optimal path for each pair of subtrees. Different
from [30], however, at each step we have to evaluate a linear number of paths in both trees instead of only left, right, and
heavy. This makes the computation of the optimal all-path strategy much more challenging. A straightforward extension of
the RTED strategy algorithm, which evaluates a linear number of paths for each pair of subtrees, requires Oðn3 Þ time. This is
prohibitive because the strategy computation must not increase the overall complexity of the tree edit distance compu-
tation, which is Oðn2 Þ in the best case.
AP-TED strategy is similar to RTED strategy in that the sums in the cost formula are computed incrementally in a bottom-
up fashion and are stored in the cost arrays. The cost arrays of size jFjjGj are LF, RF, and IF (jFjjGj is the array size without our
early deallocation technique). LF, RF, and IF store the final cost values for each pair ðF v ; Gw Þ; v A F; w A G, for left, right, and best
inner path in Fv, respectively:
P
LF ½v½w ¼ F 0 A F v γ L ðF v Þ costðF
0
; Gw Þ,
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 163
p
c
Fig. 4. Subtree Fp, its children subtrees – solid line, and path subtrees – dotted line.
P
RF ½v½w ¼ F A F γ ðF Þ costðF 0 ; Gw Þ,
0
P v
R
v
IF ½v½w ¼ minγ A γ ðF Þ f F A F γ costðF 0 ; Gw Þg.
0 I
v
0
v
0
The cost arrays of size jGj are LG, RG, and IG. They store the cost sums between all relevant subtrees of G w.r.t. the corre-
sponding path and a specific subtree Fv:
P
LG ½w ¼ G A G γ ðG Þ costðG0 ; F v Þ,
0
P w
L
w
RG ½w ¼ G A G γ ðG Þ costðG0 ; F v Þ,
0
P w
R
w
IG ½w ¼ minγ A γ ðG Þ f G A G γ costðG0 ; F v Þg.
0 I
w
0
w
0
The algorithm iterates over each pair of subtrees ðF v ; Gw Þ in postorder of the nodes; v and w are postorder IDs of nodes in
F and G, respectively. The cost sums stored in the cost arrays are used to compute the minimal cost cmin and the corre-
sponding path γmin (lines 10–16). The cost sums for the parent nodes, i.e., for the pairs ðF pðvÞ ; Gw Þ and ðF v ; GpðwÞ Þ, are updated
þ
in the lines 17–36 (a’b stands for a’a þ b). Finally, the best path, γmin, is stored in the strategy array STR (line 37) which
encodes the optimal strategy.
The challenge is to implement lines 3 and 6 of the cost formula in Fig. 3, which require to find the best inner path in F and
G, respectively. While there is a single heavy path, there can be many inner paths. Our solution is to rewrite lines 3 and 6 of
the cost formula such that they can be computed incrementally while processing the children of a node with a constant time
operation for each child. To simplify the discussion, we focus on line 3, which computes the best inner path for tree F.
The key to our solution is the following observation. The relevant subtrees for a given path γ 0 A γ ðF p Þ in subtree Fp fall into
two disjoint categories of subtrees shown in Fig. 4:
(1) Children subtrees. All relevant subtrees rooted in the children of p, except the subtree rooted in child c0 , which is the child
on path γ 0 .
(2) Path subtrees. All relevant subtrees in F c0 for path γ″, which is the sub-path of γ 0 , i.e., a root-leaf path of F c0 .
We distinguish children and path subtrees and rewrite the cost sum as follows:
Ap Bc0 ;γ″
C c0
X zfflfflfflfflfflfflfflfflfflfflfflfflffl}|fflfflfflfflfflfflfflfflfflfflfflfflffl{
X zfflfflfflfflfflfflfflffl}|fflfflfflfflfflfflfflffl{ zfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl
X ffl}|fflfflfflfflfflfflfflfflfflfflfflfflfflfflffl ffl{
costðF 0 ; GÞ ¼ costðF c ; GÞ costðF c0 ; GÞ þ costðF ″ ; GÞ ð1Þ
F 0 A F p γ0 c A chðpÞ F ″ A F c0 γ ″
|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}
cost for children subtrees
cost for path subtrees
where p is a node in F, ch(p) denotes the children of p, γ A γ ðF p Þ, c0 is the child of p on path γ 0 (i.e., c0 A chðpÞ and c0 A γ 0 ), γ″ is
0
the sub-path of γ 0 that traverses F c0 (i.e., Nðγ 0 Þ⧹Nðγ″Þ ¼ fpg and Eðγ 0 Þ⧹Eðγ″Þ ¼ fðp; c0 Þg).
Using Eq. (1), we express RF ½pðvÞ½w and I F ½pðvÞ½w by formulas that are useful to compute these values incrementally. Let
P P
cr(p) denote the rightmost child of p, Ap ¼ c A chðpÞ costðF c ; GÞ, Bc0 ;γ″ ¼ F″ A F c0 γ″ costðF″; GÞ, C c0 ¼ costðF c0 ; GÞ:
RF ½pðvÞ½w ¼ BpðvÞ;γ R ðF pðvÞ Þ ¼ ApðvÞ C cr ðpðvÞÞ þ Bcr ðpðvÞÞ;γ R ðF cr ðpðvÞÞ Þ ¼ ApðvÞ C cr ðpðvÞÞ þ RF ½cr ðpðvÞÞ½w ð2Þ
I F ½pðvÞ½w ¼ min fBpðvÞ;γ 0 g ¼ ApðvÞ þ min f C c þ min fBc;γ ″ gg ¼ ApðvÞ þ min f C c þI F ½c½wg ð3Þ
γ 0 A γ I ðF pðvÞ Þ c A chðpðvÞÞ γ ″ A γ I ðF c Þ c A chðpðvÞÞ
Eqs. (2) and (3) are used in Algorithm 1 to incrementally compute the cost sums for the parent node p(v) while traversing
its children. The postorder traversal of the nodes ensures that we process all children of p(v) before node p(v) itself. The
child cr ðpðvÞÞ is traversed last. In Eq. (2), we compute the cost sum for the relevant subtrees w.r.t. the right path in F pðvÞ ,
which is stored in RF ½pðvÞ½w. We use the previously computed sum stored in RF ½cr ðpðvÞÞ½w. C cr ðpðvÞÞ is the value cmin for the
node cr ðpðvÞÞ, i.e., the minimal cost of computing the distance for the subtree pair ðF cr ðpðvÞÞ ; Gw Þ. The sum ApðvÞ is computed
incrementally by summing up the values cmin for each child of p(v) (cf. line 18). ApðvÞ is stored temporarily in RF ½pðvÞ½w until
the last child (rightmost child) is processed.
164 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
In Eq. (3), for each child of node p(v), we compute the value ctmp ¼ C c þ I F ½c½w, which is the sum of (a) a negative cost
cmin for the particular child c and (b) the cost sum of the best inner path in Fc stored in I F ½c½w. While traversing the children
of p(v), we compute the minimal value ctmp and store it in I F ½pðvÞ½w (cf. lines 20–21). The minimum is found when the last
child is traversed, i.e., the node cr ðpðvÞÞ. Then, RF ½pðvÞ½w and I F ½pðvÞ½w are updated with the correct cost values w.r.t. Eqs.
(2) and (3), respectively (cf. lines 24–25).
While traversing the children, the best inner paths in Fv and Gw seen so far are stored in the strategy array STR and in the
array Pathw , respectively, e.g., STR½pðvÞ½w stores the best inner path which corresponds to the minimal cost sum stored in
I F ½pðvÞ½w (line 22), Pathw ½pðwÞ stores the best inner path of the minimal cost sum stored in I G ½pðwÞ (line 32). After
traversing the last child, STR½pðvÞ½w and Pathw ½pðwÞ store the best inner paths corresponding to the minimal cost of
computing the distance between the subtrees ðF pðvÞ ; Gw Þ and ðF v ; GpðwÞ Þ, respectively. While traversing the nodes p(v) and
p(w), the best inner path is compared to the respective left and right paths, and the overall best path is stored in the strategy
array STR (line 37).
With the above analysis we show the correctness of the following theorems.
Theorem 2. AP-TED strategy (Algorithm 1) is correct and computes the optimal path strategy.
Proof. We show by induction that the cost arrays store correct values. Base case: For each pair of leaf nodes (v,w) the cost
arrays are zero (lines 6 and 9). This is correct due to F v γ F ¼ Gw γ G ¼ ∅, where γ F A γ ðFÞ and γ G A γ ðGÞ. Inductive hypothesis:
The values in the cost arrays for all children of node v are correct. We show that after processing all children, the values for v
are correct. We show this for the array IF. The proof for LF ; RF ; LG ; RG ; I G is similar.
With Eq. (3), for node v and its children ch(v),
I F ½v½w ¼ Av þ min f C c þI F ½c½wg:
c A chðvÞ
While processing the children of v we compute the summands as follows: (a) Av is computed by summing up the cost
cmin ¼ C c for every child c A chðvÞ. (b) The minimal value ctmp ¼ C c þ I F ½c½v is computed among all children of v. Due to the
induction hypothesis the value I F ½c½v is correct for every node c A chðvÞ. Thus, the value stored in I F ½v½w is correct.
The strategy array STR maps each pair of subtrees to a root-leaf path and thus is a strategy by Definition 1. Only the
minimum-cost paths are stored in the strategy array, thus STR stores the optimal strategy.□
Theorem 3. Time and space complexity of the AP-TED strategy algorithm are Oðn2 Þ, where n ¼ maxfjFj; jGjg.
Proof. Algorithm 1 iterates over each pair of subtrees ðF v ; Gw Þ, v A F; w A G, hence the innermost loop is executed jFjjGj times.
Inside the innermost loop a constant number of arithmetic operations and array lookups is performed. The costs of the
single-path functions in lines 10–15 are precomputed in OðjFj þ jGjÞ time and space. Four arrays of size jFjjGj and five arrays
of size jGj are used. Thus, the overall time and space complexity is OðjFjjGjÞ.□
The main memory requirement is a bottleneck of the tree edit distance computation. The strategy computation in RTED
exceeds the memory needed for executing the strategy. Our AP-TED strategy algorithm reduces the memory usage by at
least 2/3 and never uses more memory than the execution of the strategy. We achieve that by decreasing the maximum size
of the data structures used for strategy computation.
The asymptotic space complexity is quadratic for both computing the strategy and executing it (cf. Fig. 1(a)). However,
due to different constants and depending on the strategy, the strategy computation may use twice the memory of the
distance computation. This is undesirable since the strategy computation becomes a memory bottleneck of the overall
algorithm.
We analyse the memory usage of the strategy and distance computation steps. For the moment we ignore the lines in
Algorithm 1 that perform early deallocation (lines 6, 7, and 38). Then, the memory requirement of the strategy computation
is dominated by three cost arrays LF ; RF ; I F , and the strategy array STR, which take 4jFjjGj space in total.
The memory of the distance computation depends on the strategy. The tree edit distance is computed by single-path
functions which process a single path from the strategy. Different path types require a different amount of memory.
Independent of the strategy, a distance matrix of size jFjjGj is used.
(a) If only left and/or right paths are used in the strategy, the single path functions require only one array of size jFjjGj. Thus
2jFjjGj space is needed in total.
(b) If an inner path is involved in the strategy, an array of size jFjjGj and an array of size jGj2 are used, amounting to
2jFjjGj þjGj2 space in total.
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 165
We observe that the distance computation always uses less memory than the strategy. Thus, the strategy computation is the
limiting factor for the maximum tree size that can be processed in given memory.
The goal is to use less memory for the strategy computation than for the distance computation. The key observation is
that some of the rows in the cost arrays are no longer needed once they have been read. We develop an early deallocation
technique to minimize the size of the cost arrays.
The AP-TED strategy algorithm uses three cost arrays of quadratic size (LF, RF, IF), where each row corresponds to a node
v A F. The outermost loop (line 5) iterates over the nodes in F in postorder. In each iteration only two rows of the cost arrays are
accessed: the row of node v is read and the row of its parent p(v) is updated. Once the row of node v has been read it will not
be accessed later and thus can be deallocated. In our analysis we count the maximum number of concurrently needed rows.
We observe that a row which corresponds to a leaf node stores only zeros. Thus we store a single read-only row for all
leaf nodes in all cost arrays and call it leafRow (cf. line 6).
We define a set of four operations that we need on the rows of a cost array.
The same operations are synchronously applied to the rows of all cost arrays (LF, RF, IF). To simplify the discussion, we
consider a single cost array and sum up in the end.
AP-TED strategy (Algorithm 1) dynamically allocates and deallocates rows in the cost arrays. Lines 6, 7, and 38 implement
our technique. At each iteration step, i.e., for each node v A F in postorder, the following steps are performed on a cost array:
if v is not root of F
if row for p(v) does not exist: allocate(p(v)) (line 7)
read(v) (lines 10–15)
update(p(v)) (lines 17–36)
if v is not a leaf: deallocate(v) (line 38)
Depending on the position of a node in the tree, a specific sequence of operations must be performed. Fig. 5 shows the
possible sequences and assigns them to the respective node types.
Example 2. We study the number of concurrently need rows for the example tree in Fig. 6. The table in the figure shows the
operations performed for every node of the example tree and the rows stored in memory after each step. The subscript
numbers of the nodes represent their postorder positions. We iterate over the nodes in postorder, i.e., v1 ; v2 ; v3 ; v4 ; v5 ; v6 .
Depending on the node type, different operations are performed that lead to the following operation sequences: v1 S1,
v2 S1, v3 S3, v4 S2, v5 S4, v6 S0.
The cost array for the example tree has six rows (one for each node). From the table in Fig. 6 we see that a maximum of
only three rows (leafRow, rows for v3 and v4) need to be stored at the same time.
We now consider the general case and count the number of rows that must be kept in memory at the same time. Each of
the operation sequences allocates and/or deallocates rows. With sðvÞ A fS0; S1; S2; S3; S4g we denote the operation sequence
required for node v, with d(S), S A fS0; S1; S2; S3; S4g, we denote the difference between the number of allocated and deal-
located rows by a particular operation sequence S, i.e., dðS0Þ ¼ 0, dðS1Þ ¼ 1, dðS2Þ ¼ 0, dðS3Þ ¼ 1, dðS4Þ ¼ 0. We count the
maximum number of concurrently needed rows during the postorder traversal of tree F with the following formula:
( )
X
cnrðFÞ ¼ 1 þ max dðsðwÞÞ
vAF
w A F;w o v
S0 =
leaf non-leaf
S1 = allocate(p(v)), read( leaf Row), update( p(v))
leftmost child S1 S2
S2 = allocate(p(v)), read( v), update( p(v)), deallocate(v) not leftmost child S4 S3
S3 = read(v), update( p(v)), deallocate(v) root S0
S4 = read(leaf Row ), update( p(v))
The leafRow is always kept in memory. In addition, each node contributes with d(S) rows, where S is the operation sequence
required by the node. For the tree in Fig. 6 we have cnrðFÞ ¼ 1 þ maxf1; 1 þ 1; 1 þ1 1; 1 þ 1 1 þ 0; 1 þ 1 1 þ 0 þ 0;
1 þ 1 1 þ 0 þ 0 þ 0g ¼ 3.
We observe that only operation sequence S1 adds more rows than it deallocates, i.e., dðS1Þ 40. An upper bound on the
number of concurrently needed rows is given by the number of nodes that require operation sequence S1. The only type of
nodes that falls into operation sequence S1 is a leaf node which is the leftmost child of its parent.
Lemma 1. The maximum number of concurrently needed rows of each cost array is ⌊jFj=2c.
Proof. The maximum number of the concurrently needed rows for a tree F is the number of its leftmost-child leaf nodes,
which is bound by ⌊jFj=2c: All leftmost-child leaf nodes, must have a different parent. This results in ⌊jFj=2c different
parent nodes for ⌊jFj=2c leftmost-child leaf nodes. If jFj is odd, then ⌊jFj=2c þ⌊jFj=2c ¼ jFj 1 and we can add one more
leftmost-child leaf to the tree. The new node will either become the child or the left sibling of an existing leftmost-child leaf
node. Thus, by adding a new leftmost-child leaf we loose an existing one. If jFj is even, ⌊jFj=2c þ ⌊jFj=2c ¼ jFj. Thus, the
maximum number of the leftmost-child leaf nodes is ⌊jFj=2c.□
An example of a tree with ⌊jFj=2c leftmost-child leaf nodes is the right branch tree (a tree symmetric to the left branch
tree in Fig. 8(a)).
Lemma 2. The memory required for the strategy computation in AP-TED is 2:5jFjjGj, including the output (strategy array STR).
Proof. By expiring rows, with Lemma 1, the sizes of the cost arrays LF, RF, and IF are reduced from jFjjGj to at most 0:5jFjjGj.
The strategy array STR requires jFjjGj memory. The size of STR is not reduced since it stores the final strategy after the
algorithm terminates. Thus, the AP-TED strategy algorithm requires 2:5jFjjGj memory in the worst case.□
Using our early deallocation technique, AP-TED requires only 2:5jFjjGj memory for computing and storing the strategy.
Unfortunately, it is still 0:5jFjjGj above the lower bound for the distance computation. In the following section we show that
we can reduce the memory to 2jFjjGj by traversing the trees in a clever order.
We reduce the number of concurrently needed rows in the cost matrices to ⌈jFj=3⌉. Our solution is based on the fol-
lowing observation. The AP-TED strategy algorithm iterates over the nodes in postorder. One can think of a symmetric
algorithm which iterates over the nodes in right-to-left postorder.2 Then, the maximum number of concurrently needed
rows is equal to the maximum number of the rightmost-child leaves in a tree (instead of leftmost-child leaves). For the right
branch tree, which is the worst case for postorder, the right-to-left postorder algorithm needs only three rows, including the
leafRow. Thus, the direction of the tree traversal matters. Let lðFÞ and rðFÞ be the number of leftmost-child and rightmost-
child leaf nodes in F, respectively. We use the following rule: if lðFÞ orðFÞ, then use (left-to-right) postorder, otherwise use
right-to-left postorder to compute the strategy. We call the order resulting from our rule an adaptive order.
We observe that whenever lðFÞ is high, rðFÞ is low and vice versa. Moreover, in practice minflðFÞ; rðFÞg is typically much
below jFj=2. We show that the number of rows is bounded to ⌈jFj=3⌉.
The first step is to show that for any node v A F at most the rows for v and its ancestors, i.e., the nodes on the path from v
to the root, are concurrently stored in main memory.
Lemma 3. For any node v A F processed by AP-TED strategy (Algorithm 1), at most the rows for v and its ancestors are con-
currently stored in main memory.
Proof. We show the proof for the postorder traversal. The reasoning for the right-to-left postorder is similar.
Given a node v A F, we divide the nodes of F into four disjoint sets visualized in Fig. 7:
2
In right-to-left postorder, children are traversed from right to left before their parents. For example, traversing the tree in Fig. 6 in right-to-left
postorder gives the following node sequence: v5 ; v2 ; v3 ; v1 ; v4 ; v6 .
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 167
L R
v
Fig. 7. Node sets L (left), R (right), A (ancestors), and D (descendants) in the proof of Lemma 3.
We show that while traversing node v during the strategy computation, the only nodes for which a row may be allocated
are v and the nodes in A.
While node v is traversed, its row must be present in memory. All nodes in L and D precede v in postorder and have
already been processed. Thus, their rows have been deallocated and are no longer in memory. The row of a node is first
allocated when its leftmost child is processed. The nodes of R and all their descendants (which are also in R) succeed v in
postorder and no row is allocated. Also the nodes in A succeed v in postorder and have not been processed. However, some
nodes in A may have a leftmost child in L [ fvg, thus the respective row may be allocated.□
Lemma 4. For a given tree F, the strategy computation in AP-TED with adaptive order requires at most ⌈jFj=3⌉ rows in main memory.
Proof. The proof is by contradiction. We assume a node v that has k 4⌈jFj=3⌉ ancestors for which a row must be stored in
both the postorder and right-to-left postorder iteration of the tree.
Postorder. If a row is allocated for an ancestor a of v, then a has a leftmost child which is not an ancestor of v. Node v may
cause the row allocation for its parent p(v). For the remaining k 1 ancestors of v there must be k 1 leftmost children.
Right-to-left postorder. Similarly, for k1 ancestors of v there must be k 1 rightmost children which are not ancestors of v.
Thus, F has at least 3k 1 nodes (node v, k ancestors of v, k 1 leftmost children, k1 rightmost children) and we get the
following contradiction:
Proof. By Lemmas 3 and 4 we know that the maximum number of concurrently needed rows of each cost array is at most
⌈jFj=3⌉. The strategy array STR needs jFjjGj memory. This results in 3ððjFj=3ÞjGjÞ þ jFjjGj ¼ 2jFjjGj memory.□
6. AP-TED þ algorithm
The RTED algorithm computes the tree edit distance by executing the single-path functions for the subtree pairs resulting
from the strategy. We observe that when one of the input trees in a single-path function is small, the distance can be com-
puted more efficiently than with the existing single-path functions. We address two special cases, which are very frequent and
have a high impact on the runtime: one- and two-node trees. We present AP-TED þ , a new algorithm that improves over
previous algorithms, including AP-TED, in two ways. First, AP-TED þ uses a new, efficient single-path functions for the cases of
small subtrees. The new single-path functions significantly reduce memory and runtime compared to previous single-path
functions. Second, AP-TED þ leverages the lower costs of the new single-path functions to compute better strategies. AP-TED þ
first computes the optimal all path strategy considering the lower costs for small subtrees and then executes the strategy using
the new single-path functions for small subtrees and the old single-path functions for all other cases.
168 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
Let F and G be the input trees to the TED algorithm, cd(a), ci(b), cr ða; bÞ the costs of deleting a A F, inserting b A G, and
renaming the label of a to the label of b, respectively. Let F 0 and G0 be the subtrees of F and G, such that G0 is a one-node tree,
NðG0 Þ ¼ fwg, EðG0 Þ ¼ ∅, and jF 0 jZ 1. We assume that cr ðv; wÞ rcd ðvÞ þ ci ðwÞ for any node v A F 0 (otherwise rename will never
contribute to the minimal distance). Then, the distance between F 0 and G0 is computed with the following formula:
X
δ1node ðF 0 ; G0 Þ ¼ cd ðvÞ þmin0 fcr ðv; wÞ cd ðvÞg ð4Þ
vAF
v A F0
The distance between F and G0 is the cost of deleting all nodes in F 0 except the one for which the rename cost to the label
0
of the node in G0 is minimal. The rename cost is added to the final result.
δ1node ðF 0 ; G0 Þ can be evaluated in a single traversal of the bigger tree and does not require any data structures for
storing intermediate results. In order to implement δ1node ðF 0 ; G0 Þ as a single-path function, we must store the
resulting distances of all subtree pairs ðF 0v ; G0w Þ, v A F 0 , w A G0 , in distance matrix D [30]. Algorithm 2 implements Eq. (4) as a
single-path function.
Algorithm 2. Δ1node ðF 0 ; G0 Þ.
Input: trees F 0 ; G0 , jG0 j ¼ 1, NðG0 Þ ¼ fwg
1 costSum’0
2 for v A F 0 in postorder do
6
6 þ
3 6 costSum’cd ðvÞ
6
4 6
4 match’minfmatch; cr ðv; wÞ cd ðvÞg
5 DðF 0v ; G0w Þ’costSum match
Let now G0 be a two-node tree, NðG0 Þ ¼ fw1 ; w2 g, EðG0 Þ ¼ fðw2 ; w1 Þg. anc(v) is the set of ancestors of node v in F 0 without v.
Then, the distance between F 0 and G0 is computed with Eq. (5). We limit our discussion to settings with the following
properties: (a) jF 0 j Z 2 (otherwise δ1node can be used), (b) cr ðv; wÞ rcd ðvÞ þci ðwÞ for all nodes v A F 0 and w A G0 (otherwise
rename will never contribute to the minimal distance).
! 8
X < cr ðv1 ; w2 Þ ðcd ðv1 Þ þci ðw2 ÞÞ
0 0
δ2node ðF ; G Þ ¼ cd ðvÞ þ ci ðw1 Þ þ ci ðw2 Þ þ min0 cr ðv1 ; w1 Þ ðcd ðv1 Þ þci ðw1 ÞÞ þ min fcr ðv2 ; w2 Þ ðcd ðv2 Þ þci ðw2 ÞÞg ð5Þ
v AF : v2 A ancðv1 Þ
v A F0 1
Eq. (5) substitutes one or two pairs of deletion and insertion with a rename in order to find the minimal distance.
Algorithm 3. Δ2node ðF 0 ; G0 Þ.
Input trees F 0 ; G0 , jG0 j ¼ 2, NðG0 Þ ¼ fw1 ; w2 g, EðG0 Þ ¼ fðw2 ; w1 Þg
1 match: ½1::jF 0 j½0::2: boolean array initialized to false
2 for v∈F 0 in postorder do
6
3 6 DðF 0v ; G0w2 Þ←maxfjF 0v j; 2g
6
4 66 if match½v½2 then
6
5 6 ⌊DðF 0v ; G0w2 Þ←jF 0v j−2
6
6 66 else if match½v½0 ∨ match½v½1 then
6 0
7 6 0 0
6 ⌊DðF v ; Gw2 Þ←jF v j−1
6
8 6 else if v is leaf ∧ðlðvÞ ¼ lðw1 Þ∨lðvÞ ¼ lðw2 ÞÞ then
6
9 6 0 0
6 ⌊DðF v ; Gw2 Þ←1
6 0
10 6 DðF v ; Gw Þ←jF 0v j
0
6 1
11 6
6 if match½v½0∨lðvÞ ¼ lðw1 Þthen
12 6
6 ⌊DðF 0v ; G0w Þ←jF 0v j−1
6 1
13 6
6 if v is not root of F then
0
6
14 6 6
6 match½pðvÞ½0←match½v½0∨lðvÞ ¼ lðw1 Þ∨match½pðvÞ½0
66
15 6 6
4 4 match½pðvÞ½1←match½v½1∨lðpðvÞÞ ¼ lðw2 Þ∨match½pðvÞ½1
16 match½pðvÞ½2←match½v½2∨match½pðvÞ½2∨ððmatch½v½0∨lðvÞ ¼ lðw1 ÞÞ∧lðpðvÞÞ ¼ lðw2 ÞÞ
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 169
Algorithm 3 implements Eq. (5) for the unit cost model, where for any node v A F 0 and w A G0 : cd ðvÞ ¼ ci ðwÞ ¼ 1, cr ðv; wÞ ¼ 0
if the labels of v and w are equal and cr ðv; wÞ ¼ 1 otherwise. Algorithm 3 computes the distance in a single postorder
traversal of tree F 0 . It uses a two-dimensional boolean array match with three columns of length jF 0 j to store the intermediate
label matches:
For each node v A F 0 , the subtree distances ðF 0v ; G0w1 Þ and ðF 0v ; G0w2 Þ are stored in the distance matrix D. Finally, the entries in
array match for the parent of v are updated (lines 13–14): a label match in subtree F 0v is also valid for F 0pðvÞ .
In order to take full advantage of one- and two-node single-path functions, they must be considered in the strategy
computation. The reason is the difference in the computation costs (the number of computed distances). We first show that
our new single-path functions, Δ1node and Δ2node , always have a lower cost than the single-path functions used in RTED (ΔL ,
ΔR , and ΔI ). In the worst case, the costs of ΔL=R=I is cubic in the tree size, whereas the cost of our functions is always linear.
The cost of a single-path function, jΔðF; GÞj, for a pair of trees F and G, is the number of subproblems that must be computed.
Theorem 5. For any two trees, F and G, such that jGj r2, the following holds:
jΔ1=2node ðF; GÞj rjΔL=R=I ðF; GÞj
Proof. Both Δ1node ðF; GÞ and Δ2node ðF; GÞ are computed in a single traversal of tree F. For each node in F only a constant
number of arithmetic operations and table lookups is performed, then jΔ1=2node ðF; GÞj ¼ jFj. Depending on the context, for ΔL ,
ΔR , and ΔI the following lower bounds hold [30]:
P
(a) jΔL=R=I ðF; GÞj ZjFjjGj þ F 0 A F γ jΔL=R=I ðF 0 ; GÞj ZjFj,
(b) jΔL=R=I ðG; FÞj ZjGjjFj ZjFj.□
The second step is to include the costs of the new functions in our AP-TED strategy algorithm to obtain better strategies.
This is straightforward: when the minimal cost is computed in line 10 of Algorithm 1, we check the sizes of the subtrees and
use the costs of our new single-path functions if applicable.
7. Related work
Tree edit distance algorithms. The tree edit distance has a recursive solution, which decomposes the input trees into
smaller subtrees and subforests. The best known algorithms are dynamic programming implementations of this recursive
solution, where small subproblems are computed first. The first tree edit distance algorithm was proposed by Tai [34]. It
runs in Oðn6 Þ time and space where n is the number of tree nodes. The runtime complexity is given by the number of
subproblems that must be solved. Zhang and Shasha [37] significantly improve the complexity and achieve Oðn4 Þ time and
Oðn2 Þ space. Klein [24] uses so-called heavy paths, which guide the dynamic programming solution by ordering the sub-
problems resulting with Oðn3 log nÞ runtime and space complexity. Demaine et al. [13] developed an algorithm which
requires Oðn3 Þ time and Oðn2 Þ space. They moreover show that their solution is worst-case optimal. Pawlik and Augsten [30]
observe that the previous algorithms are efficient only for specific tree shapes and run into their worst case otherwise. They
propose the general framework for computing the tree edit distance shown in Fig. 1(a) and present the RTED algorithm.
RTED computes and executes the optimal LRH strategy and runs in Oðn3 Þ time and Oðn2 Þ space. Compared to previous
algorithms, it does not run into the worst case if a better strategy exists. AP-TED, the solution presented in this paper, enjoys
the same features but in addition (a) requires less memory and (b) executes the optimal all-path strategy, a superclass of
LRH strategies. In [32], Pawlik and Augsten further improve RTED and introduce a novel indexing scheme, called root
encoding, that stores and retrieves intermediate distance results in constant time. They develop a new single-path function,
ΔA , that combines the features of the previous solutions for left, right, and inner paths [13,30,37]. Thanks to the root
encoding, ΔA requires less memory for inner paths than ΔI as proposed by Demaine et al. [13]. We leverage these results in
our AP-TED algorithm and build the optimizations of ΔA for inner paths into our single-path function ΔI .
Path strategies. Each of the state-of-the-art algorithms uses a specific set of paths to decompose trees and order the
computation of subproblems in the dynamic programming solution. The algorithms by Zhang and Shasha [37], Klein [24]
and Demaine et al. [13] use hard-coded strategies which do not require the strategy computation step. The resulting
algorithms are efficient only for specific tree shapes. Dulucq and Touzet [14] compute a decomposition strategy in the first
step, then use the strategy to compute the tree edit distance. Since they only consider strategies that decompose a single
tree, the resulting algorithm runs in Oðn3 log nÞ time and space. On-the-fly strategies that decompose both trees were
170 M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173
introduced by Pawlik and Augsten [30]. The resulting algorithm, RTED, requires Oðn3 Þ time and Oðn2 Þ space, and their
strategy is shown to be optimal in the class of LRH strategies, which covers all previous solutions. Unfortunately, the strategy
computation in RTED requires more memory than the actual distance computation.
In a preliminary version of this paper [31], we tackle the memory problem of the strategy computation in RTED [30] and
prove an ⌈n=2⌉ upper bound for the number of rows in the cost matrices. This, however, is not enough to keep the strategy,
which is of quadratic size, in main memory between the strategy and distance computation steps. In this paper, we improve
over [31] as follows. (a) We decrease the upper bound for the number of rows to ⌈n=3⌉. This allows the strategy to be
computed and stored in main memory without increasing the overall memory requirements for the distance computation.
Thus, we fully solve the memory problem of the strategy computation. (b) We present the AP-TED algorithm to compute
TED with the optimal all-path strategy, which subsumes LRH strategies. Our new strategy algorithm runs in Oðn2 Þ time and
implements an early deallocation technique to reduce the memory requirements. (c) We introduce the AP-TED þ algorithm
with two new single-path functions for small subtrees, which are more efficient in terms of runtime and memory than the
single-path functions used in RTED. By considering the new functions in the strategy computation we further improve the
decomposition strategy.
Approximations. More efficient approximation algorithms for the tree edit distance have been proposed. Akutsu et al. [2]
approximate the tree edit distance with the string edit distance δS between the Euler string representation of the input trees.
They show that the tree edit distance is between δS =2 and ð2h þ1ÞδS , where h is the maximum tree height. By encoding small
subtrees into node labels, they further achieve the lower and upper bounds δS =6 and Oðn3=4 ÞδS , respectively. Aratsu et al. [4]
compute the string edit distance δB between so-called binary tree codes of the input trees and achieve the bounds δB =2 and
ðhþ 1ÞδB þ h. The binary tree code is computed from the binary tree representation by traversing the nodes in postorder and
concatenating the labels. Both solutions run in Oðn2 Þ time. Zhang proposes an upper bound, the constrained tree edit dis-
tance, which runs in Oðn2 Þ time and space [36]. Wang and Zhang [35] improve the space complexity of the constrained tree
edit distance to Oðn log nÞ with a technique similar to our approach for the strategy computation. Guha et al. [18] develop a
lower bound algorithm by computing the string edit distance between the preorder and postorder sequences of node labels
in Oðn2 Þ time and space. Augsten et al. [6] decompose the trees into pq-grams and propose a lower bound algorithm which
requires Oðn log nÞ time and O(n) space. Finis et al. [16] develop the RWS-Diff algorithm (Random Walk Similarity Diff) to
compute an approximation of the tree edit distance and an edit mapping in Oðn log nÞ time.
8. Experiments
In this section we experimentally evaluate AP-TED and AP-TED þ and compare them to RTED [30]. Our empirical eva-
luation on real-world and synthetic data confirms our analytical results: computing the strategy in AP-TED is as efficient as
in RTED, but requires significantly less memory. In particular, the strategy computation requires less memory than the actual
tree edit distance computation.
Set-up. All algorithms are implemented as single-thread applications in Java 1.7. We run the experiments on a 4-core Intel
i7 3.70 GHz desktop computer with 32 GB RAM.
The datasets. We test the algorithms on both synthetic and real world data. We generate synthetic trees of three different
shapes: left branch (LB), zigzag (ZZ) and full binary tree (FB) (Fig. 8). We also generate random trees (random) varying in
depth and fanout (with a maximum depth of 15 and a maximum fanout of 6).
We use three real world datasets with different characteristics. SwissProt3 is an XML protein sequence database with
50 000 medium sized and flat trees (average depth 3.8, maximum depth 4, average fanout 1.8, maximum fanout 346, and
average size 187). TreeBank4 is an XML representation of natural language syntax trees with 56 385 small and deep trees
(average depth 10.4, maximum depth 35, average fanout 1.5, maximum fanout 51, and average size 68). TreeFam5 stores
16 138 phylogenetic trees of animal genes (average depth 14, maximum depth 158, average fanout 2, maximum fanout 3,
and average size 95).
Memory measurement. We first measure the memory requirements for the strategy computation and compare it to the
memory needed for executing the strategy. We measure only the heap memory allocated by the Java Virtual Machine. The
non-heap memory varies between 3.5 MB and 5 MB for all test cases, including the tests on the largest trees.
Fig. 9(a)–(d) shows the memory requirements for different tree shapes and sizes. We use three datasets containing
synthetic trees of the same shape (LB, ZZ, and FB) and a dataset with randomly generated trees (random). Each synthetic
dataset contains trees with up to 40 000 nodes. The lines represent the memory usage in MB for strategy computation in
RTED and AP-TED, and the memory needed for the distance computation. In all cases AP-TED strategy requires significantly
less memory than RTED strategy (note the logarithmic scale). RTED uses much more (up to 200%) memory for the strategy
computation than for executing the strategy. On the contrary, AP-TED always uses less memory to compute the strategy. In
particular, we observe that most of the memory allocated by AP-TED strategy is used to store the strategy.
3
http://www.expasy.ch/sprot/
4
http://www.cis.upenn.edu/ treebank/
5
http://www.treefam.org/
M. Pawlik, N. Augsten / Information Systems 56 (2016) 157–173 171
Fig. 8. Shapes of the synthetic trees. (a) Left branch tree (LB), (b) full binary tree (FB) and (c) zig-zag tree (ZZ).
Fig. 9. Memory of strategy computation compared to executing the strategy on synthetic data and real-world datasets (Sprot and TreeBank). (a) Left branch
tree (LB), (b) full binary tree (FB), (c) zig-zag tree (ZZ), (d) random tree (random), (e) Sprot, and (f) TreeBank.
We performed a similar experiment on the real-world datasets Sprot and TreeBank (Fig. 9(e) and (f)). We pick pairs of similarly sized trees at regular size intervals and measure the memory of the strategy computation. The plotted data points correspond to the average size of a tree pair and the memory used in MB. We observe a similar behaviour as in the case of synthetic trees.
We further compute a similarity self join on a sample subset of similarly sized trees from the real-world datasets. The join is of the form D ⋈θ D, where D is a dataset and θ a tree edit distance predicate. The results are summarized in Table 1. The first three columns show the dataset, the average tree size, and the number of trees in the sample. We report the memory for the strategy computation in RTED, in AP-TED, and for the distance computation step. For each of the datasets, the strategy computation with RTED requires more memory than the distance computation, whereas AP-TED always uses less memory.
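For illustration, such a join can be sketched as a nested-loop self join with a distance threshold; the TreeEditDistance interface below is an assumed placeholder for any tree edit distance implementation (e.g., AP-TED) and is not an interface defined in this paper.

// A minimal sketch of the similarity self join D ⋈θ D.
import java.util.ArrayList;
import java.util.List;

public class SimilaritySelfJoin {

    interface TreeEditDistance {
        double compute(String t1, String t2);
    }

    // Returns the index pairs of all trees whose edit distance is at most tau.
    static List<int[]> selfJoin(List<String> trees, TreeEditDistance ted, double tau) {
        List<int[]> result = new ArrayList<int[]>();
        for (int i = 0; i < trees.size(); i++) {
            for (int j = i + 1; j < trees.size(); j++) {   // each unordered pair once
                if (ted.compute(trees.get(i), trees.get(j)) <= tau) {
                    result.add(new int[]{i, j});
                }
            }
        }
        return result;
    }
}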
Table 1
Average memory usage (in MB) and runtime (in milliseconds) for different datasets and tree sizes. The Rows columns report the number of concurrently needed rows in the cost arrays of the strategy computation.

Dataset         Avg. size   Trees   RTED strategy                 AP-TED strategy               Distance computation
                                    Rows      Memory    Time      Rows     Memory    Time       Memory    Time
treebank-100        98.7      50      98.7      0.57     0.02       5.3      0.53     0.03        0.57      1.43
treebank-200       195.4      50     195.4      1.10     0.07       6.0      0.72     0.09        0.88      6.27
treebank-400       371.3      10     371.3      2.86     2.14       6.3      1.30     2.31        1.77     26.99
sprot-200          211.0      50     211.0      1.21     0.15       3.0      0.70     0.09        0.89      4.92
sprot-400          395.0      50     395.0      3.12     2.25       3.0      1.31     2.16        1.92     15.44
sprot-1000         987.5      20     987.5     16.64    16.95       3.0      5.13    14.94        9.17    123.79
sprot-2000        1960.1      10    1960.1     63.20    64.85       3.0     17.80    55.11       33.10    502.88
treefam-200        197.7      50     197.7      1.16     0.15       9.6      0.74     0.15        0.88      9.86
treefam-400        402.6      50     402.6      3.33     2.52      12.3      1.46     2.37        2.05     55.48
treefam-1000       981.9      20     981.9     16.73    18.33      14.1      5.58    17.33        9.27    453.51

Fig. 10. Runtime difference between RTED strategy, AP-TED strategy, and the overall distance computation. (a) RTED vs. AP-TED strategy and (b) strategy vs. distance computation.

Table 2
Runtime (in seconds) of a similarity self join on different datasets.

Number of rows. We count the number of concurrently needed rows in the cost arrays during the strategy computation in AP-TED. We perform this experiment on the real-world datasets; the results are shown in Table 1. For the RTED strategy the number of rows is the size of the left-hand input tree (column 2, Avg. size). For the AP-TED strategy the number of rows ranges from about 5% of the tree size for the treebank-100 dataset down to 0.15% for the sprot-2000 dataset. Thus, in practice, our memory deallocation technique is much more effective than suggested by the theoretical bound of 33.3%.
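For illustration, these fractions follow directly from the row counts and average tree sizes in Table 1 (e.g., 5.3/98.7 and 3.0/1960.1); the small program below only reproduces this arithmetic.

// Illustrative arithmetic only: relating the AP-TED row counts to the
// average tree sizes reported in Table 1.
public class RowFraction {
    public static void main(String[] args) {
        System.out.printf("treebank-100: %.2f%%%n", 100 * 5.3 / 98.7);    // about 5%
        System.out.printf("sprot-2000:   %.2f%%%n", 100 * 3.0 / 1960.1);  // about 0.15%
    }
}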
Strategy runtime. We measure the time for computing and executing strategies; the results for different datasets are shown in the runtime columns of Table 1. In most cases, the AP-TED strategy computation is slightly faster than that of RTED. Due to its deallocation mechanism (cf. Section 5), AP-TED needs to allocate less memory for the strategy than RTED; the improved runtime stems from reusing memory instead of repeatedly allocating new memory on the heap. Compared to the distance computation, the strategy computation requires only a small fraction of the overall time. Fig. 10 shows how the strategy computation scales with the tree size for the Sprot dataset.
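The effect of reusing memory can be illustrated with a generic two-row dynamic program that swaps its row buffers instead of allocating a fresh row in every iteration; this sketch uses a placeholder recurrence and is not the AP-TED cost-array code.

// A minimal sketch of the memory-reuse idea in a row-based dynamic program.
public class RowReuse {
    static long[] dpWithReuse(int m, int n) {
        long[] prev = new long[n + 1];
        long[] curr = new long[n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                curr[j] = prev[j - 1] + 1;              // placeholder recurrence
            }
            long[] tmp = prev; prev = curr; curr = tmp; // reuse buffers, no new allocation
        }
        return prev;
    }
}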
Efficiency of the new single-path functions. We measure the efficiency of our new single-path functions, Δ1node and Δ2node. We perform a similarity self join on our real-world datasets (as in the memory experiment) and measure the runtime of the entire join on each dataset separately. The results are shown in Table 2. The second column shows the results for RTED, the third column for our AP-TED+ algorithm. AP-TED+ reduces the runtime by 5% for the treefam-400 dataset and by up to 57% for the sprot-200 dataset.
9. Conclusion
In this paper we develop two new algorithms for the tree edit distance: AP-TED and AP-TED+. The strategy computation is a main memory bottleneck of the state-of-the-art solution, RTED [30]: the memory required for the strategy computation can be twice the memory needed for the actual tree edit distance computation. Our AP-TED strategy algorithm reduces the memory by at least 2/3 compared to the strategy computation in RTED and never uses more memory than the distance computation. The experimental evaluation confirms our theoretical results. Our algorithm computes the optimal all-path strategy, which, in contrast to RTED, does not restrict the path type and considers all possible paths. Our AP-TED+ algorithm with its new, efficient single-path functions reduces the overall runtime of the tree edit distance computation by up to 57%.
Acknowledgements
This work is partially supported by the SyRA project of the Free University of Bozen-Bolzano, Italy.
References
[1] T. Akutsu, Tree edit distance problems: algorithms and applications to bioinformatics, IEICE Trans. Inf. Syst. E93-D (2) (2010) 208–218.
[2] T. Akutsu, D. Fukagawa, A. Takasu, Approximating tree edit distance through string edit distance, Algorithmica 57 (2) (2010) 325–348.
[3] K.F. Aoki, A. Yamaguchi, Y. Okuno, T. Akutsu, N. Ueda, M. Kanehisa, H. Mamitsuka, Efficient tree-matching methods for accurate carbohydrate database
queries, Genome Inform. 14 (2003) 134–143.
[4] T. Aratsu, K. Hirata, T. Kuboyama, Approximating tree edit distance through string edit distance for binary tree codes, Fundam. Inform. 101 (3) (2010)
157–171.
[5] N. Augsten, D. Barbosa, M. Böhlen, T. Palpanas, Efficient top-k approximate subtree matching in small memory, IEEE Trans. Knowl. Data Eng. (TKDE) 23
(8) (2011) 1123–1137.
[6] N. Augsten, M. Böhlen, J. Gamper, The pq-gram distance between ordered labeled trees, ACM Trans. Database Syst. (TODS) 35 (1).
[7] J. Bellando, R. Kothari, Region-based modeling and tree edit distance as a basis for gesture recognition, in: International Conference on Image Analysis
and Processing (ICIAP), 1999.
[8] S.S. Chawathe, Comparing hierarchical data in external memory, in: International Conference on Very Large Data Bases (VLDB), 1999.
[9] G. Cobéna, S. Abiteboul, A. Marian, Detecting changes in XML documents, in: International Conference on Data Engineering (ICDE), 2002.
[10] S. Cohen, Indexing for subtree similarity-search using edit distance, in: ACM SIGMOD International Conference on Management of Data (SIGMOD),
2013, pp. 49–60.
[11] T. Dalamagas, T. Cheng, K.-J. Winkel, T. Sellis, A methodology for clustering XML documents by structure, Inf. Syst. 31 (3) (2006) 187–228.
[12] D. de Castro Reis, P.B. Golgher, A.S. da Silva, A.H.F. Laender, Automatic web news extraction using tree edit distance, in: International World Wide Web
Conference (WWW), 2004, pp. 502–511.
[13] E.D. Demaine, S. Mozes, B. Rossman, O. Weimann, An optimal decomposition algorithm for tree edit distance, ACM Trans. Algorithms 6 (1).
[14] S. Dulucq, H. Touzet, Decomposition algorithms for the tree edit distance problem, J. Discret. Algorithms 3 (2–4) (2005) 448–471.
[15] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, M. Montperrus, Fine-grained and accurate source code differencing, in: ACM/IEEE International
Conference on Automated Software Engineering (ASE), 2014, pp. 313–324.
[16] J.P. Finis, M. Raiber, N. Augsten, R. Brunel, A. Kemper, F. Färber, RWS-Diff: flexible and efficient change detection in hierarchical data, in: ACM International Conference on Information and Knowledge Management (CIKM), 2013, pp. 339–348.
[17] M. Garofalakis, A. Kumar, XML stream processing using tree-edit distance embeddings, ACM Trans. Database Syst. (TODS) 30 (1) (2005) 279–332.
[18] S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, T. Yu, Approximate XML joins, in: ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
[19] A. Habrard, J.M.I. Quereda, D. Rizo, M. Sebban, Melody recognition with learned edit distances, in: Joint IAPR International Workshop on Structural,
Syntactic, and Statistical Pattern Recognition (SSPR/SPR), 2008, pp. 86–96.
[20] M. Hashimoto, A. Mori, Diff/ts: a tool for fine-grained structural change analysis, in: Working Conference on Reverse Engineering (WCRE), 2008.
[21] H. Heumann, G. Wittum, The tree-edit-distance, a measure for quantifying neuronal morphology, Neuroinformatics 7 (3) (2009) 179–190.
[22] S. Kamali, F.W. Tompa, Retrieving documents with mathematical content, in: ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR), 2013, pp. 353–362.
[23] Y. Kim, J. Park, T. Kim, J. Choi, Web information extraction by HTML tree edit distance matching, in: International Conference on Convergence
Information Technology (ICCIT), 2007, pp. 2455–2460.
[24] P.N. Klein, Computing the edit distance between unrooted ordered trees, in: European Symposium on Algorithms (ESA), 1998, pp. 91–102.
[25] P.N. Klein, S. Tirthapura, D. Sharvit, B.B. Kimia, A tree-edit-distance algorithm for comparing simple, closed shapes, in: ACM-SIAM Symposium on
Discrete Algorithms (SODA), 2000.
[26] F. Korn, B. Saha, D. Srivastava, S. Ying, On repairing structural problems in semi-structured data, Proc. VLDB Endow. 6 (9).
[27] K.-H. Lee, Y.-C. Choy, S.-B. Cho, An efficient algorithm to compute differences between structured documents, IEEE Trans. Knowl. Data Eng. (TKDE) 16
(8) (2004) 965–979.
[28] Z. Lin, H. Wang, S.I. McClean, Measuring tree similarity for natural language processing based information retrieval, in: International Conference on
Applications of Natural Language to Information Systems (NLDB), 2010.
[29] B. Ma, L. Wang, K. Zhang, Computing similarity between RNA structures, Theor. Comput. Sci. 276 (1–2) (2002) 111–132.
[30] M. Pawlik, N. Augsten, RTED: a robust algorithm for the tree edit distance, Proc. VLDB Endow. (PVLDB) 5 (4) (2011) 334–345.
[31] M. Pawlik, N. Augsten, A memory-efficient tree edit distance algorithm, in: International Conference on Database and Expert Systems Applications
(DEXA), 2014.
[32] M. Pawlik, N. Augsten, Efficient computation of the tree edit distance, ACM Trans. Database Syst. (TODS) 40 (1), Article 3.
[33] V. Springel, S.D.M. White, A. Jenkins, C.S. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J.A. Peacock, S. Cole, P. Thomas,
H. Couchman, A. Evrard, J. Colberg, F. Pearce, Simulations of the formation, evolution and clustering of galaxies and quasars, Nature 435 (2005)
629–636.
[34] K.-C. Tai, The tree-to-tree correction problem, J. ACM (JACM) 26 (3) (1979) 422–433.
[35] L. Wang, K. Zhang, Space efficient algorithms for ordered tree comparison, Algorithmica 51 (3) (2008) 283–297.
[36] K. Zhang, Algorithms for the constrained editing distance between ordered labeled trees and related problems, Pattern Recognit. 28 (3) (1995)
463–474.
[37] K. Zhang, D. Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. Comput. (SICOMP) 18 (6) (1989)
1245–1262.