Description
Bug
When converting HTML to Markdown, Docling drops in-page anchor links and heading IDs, so any table of contents or internal cross-references in the original HTML are lost. In the example below, the entire TOC paragraph is missing from the output. It would be great if anchor links were converted to Markdown link syntax and heading IDs were preserved, so that navigation within the document still works.
Specifically, links like <a href="#section1">Section 1</a> should convert to [Section 1](#section1), and headings with IDs like <h2 id="section1"> should convert to ## Section 1 {#section1}, using the heading attribute syntax supported by extended Markdown flavors such as Pandoc and Markdown Extra. This would keep the document structure intact and let readers jump between sections just like in the original HTML.
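To make the requested mapping concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It is not Docling's implementation and does not use any Docling API; the class `AnchorAwareConverter` and the function `to_markdown` are hypothetical names invented for illustration. It handles only `<p>`, `<a>`, and `<h1>`-`<h6>`, and it collapses whitespace, so a multi-line TOC paragraph ends up on a single line.

```python
from html.parser import HTMLParser


class AnchorAwareConverter(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical, not Docling code)
    that keeps <a href> links and heading id attributes."""

    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.lines = []      # finished Markdown lines
        self.buf = []        # text pieces of the block being built
        self.href = None     # href of the <a> currently open, if any
        self.link_text = []  # text collected inside the open <a>
        self.heading = None  # (level, id) of the heading currently open
        self.in_head = False # True while inside <head>, whose text is skipped

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag in self.HEADINGS:
            self.heading = (self.HEADINGS[tag], attrs.get("id"))
            self.buf = []
        elif tag == "p":
            self.buf = []
        elif tag == "a":
            self.href = attrs.get("href")
            self.link_text = []

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "a" and self.href is not None:
            # Emit the link in Markdown syntax, keeping the fragment as-is.
            text = "".join(self.link_text).strip()
            self.buf.append(f"[{text}]({self.href})")
            self.href = None
        elif tag in self.HEADINGS and self.heading is not None:
            level, heading_id = self.heading
            text = " ".join("".join(self.buf).split())
            # Append the {#id} attribute only when the heading had an id.
            suffix = f" {{#{heading_id}}}" if heading_id else ""
            self.lines.append(f"{'#' * level} {text}{suffix}")
            self.heading = None
            self.buf = []
        elif tag == "p":
            text = " ".join("".join(self.buf).split())
            if text:
                self.lines.append(text)
            self.buf = []

    def handle_data(self, data):
        if self.in_head:
            return
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.buf.append(data)


def to_markdown(html: str) -> str:
    """Convert a small HTML snippet to Markdown, one block per line."""
    parser = AnchorAwareConverter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    sample = '<p>TOC: <a href="#section1">Section 1</a></p><h2 id="section1">Section 1</h2>'
    print(to_markdown(sample))
```

The point of the sketch is only that both pieces of information (the `href` fragment and the heading `id`) are available at parse time and can be carried through to the Markdown output.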
Input (anchors.html)
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>Anchors Example</title>
</head>
<body>
<p>
TOC:
<a href="#section1">Section 1</a> |
<a href="#section2">Section 2</a> |
<a href="#section3">Section 3</a>
</p>
<h2 id="section1">Section 1</h2>
<p>Content for section 1.</p>
<h2 id="section2">Section 2</h2>
<p>Content for section 2.</p>
<h2 id="section3">Section 3</h2>
<p>Content for section 3.</p>
</body>
</html>
Actual output
## Section 1
Content for section 1.
## Section 2
Content for section 2.
## Section 3
Content for section 3.
Expected output
TOC:
[Section 1](#section1) |
[Section 2](#section2) |
[Section 3](#section3)
## Section 1 {#section1}
Content for section 1.
## Section 2 {#section2}
Content for section 2.
## Section 3 {#section3}
Content for section 3.
Steps to reproduce
$ docling anchors.html --to md
Docling version
Docling version: 2.70.0
Docling Core version: 2.61.0
Docling IBM Models version: 3.11.0
Docling Parse version: 4.7.3
Python: cpython-314 (3.14.2)
Platform: Windows-11-10.0.26200-SP0
Python version
Python 3.14.2