add ability to modify markdown files for image url to exported image … · homeylab/bookstack-file-exporter@b596668 · GitHub

Commit b596668

add ability to modify markdown files for image url to exported image path replacing
1 parent f80ed8f commit b596668

4 files changed

+145
-42
lines changed


README.md

Lines changed: 39 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,8 @@ Table of Contents
1111
- [Options and descriptions](#options-and-descriptions)
1212
- [Environment variables](#valid-environment-variables)
1313
- [Backup Behavior](#backup-behavior)
14+
- [Images](#images)
15+
- [Modify Markdown Files](#modify-markdown-files)
1416
- [Object Storage](#object-storage)
1517
- [Minio](#minio-backups)
1618

@@ -27,6 +29,7 @@ What it does:
2729
- Discover and build relationships between Bookstack `Shelves/Books/Chapters/Pages` to create a relational parent-child layout
2830
- Export Bookstack pages and their content to a `.tgz` archive
2931
- Additional content for pages, like their images and metadata, can be exported
32+
- The exporter can also [Modify Markdown Files](#modify-markdown-files) to replace image links with local exported image paths for a more portable backup
3033
- YAML configuration file for repeatable and easy runs
3134
- Can be run via [Python](#run-via-pip) or [Docker](#run-via-docker)
3235
- Can push archives to remote object storage like [Minio](https://min.io/)
@@ -244,6 +247,7 @@ More descriptions can be found for each section below:
244247
| `output_path` | `str` | `false` | Optional (default: `cwd`) which directory (relative or full path) to place exports. User who runs the command should have access to read/write to this directory. If not provided, will use current run directory by default |
245248
| `assets` | `object` | `false` | Optional section to export additional assets from pages. |
246249
| `assets.export_images` | `bool` | `false` | Optional (default: `false`), export all images for a page to an `image` directory within page directory. See [Backup Behavior](#backup-behavior) for more information on layout |
250+
| `assets.modify_markdown` | `bool` | `false` | Optional (default: `false`), modify markdown files to replace image links with local exported image paths. This requires `assets.export_images` to be `true` in order to work. See [Modify Markdown Files](#modify-markdown-files) for more information. |
247251
| `assets.export_meta` | `bool` | `false` | Optional (default: `false`), export of metadata about the page in a json file |
248252
| `assets.verify_ssl` | `bool` | `false` | Optional (default: `true`), whether or not to check ssl certificates when requesting content from Bookstack host |
249253
| `keep_last` | `int` | `false` | Optional (default: `None`), if exporter can delete older archives. valid values are:<br>- set to `-1` if you want to delete all archives after each run (useful if you only want to upload to object storage)<br>- set to `1+` if you want to retain a certain number of archives<br>- `0` will result in no action done |
@@ -261,9 +265,12 @@ General
261265
- `MINIO_ACCESS_KEY`
262266
- `MINIO_SECRET_KEY`
263267

264-
### Backup Behavior
268+
## Backup Behavior
269+
270+
### Export File
265271
Backups are exported in `.tgz` format and generated based off timestamp. Export names will be in the format: `%Y-%m-%d_%H-%M-%S` (Year-Month-Day_Hour-Minute-Second). *Files are first pulled locally to create the tarball and then can be sent to object storage if needed*. Example file name: `bookstack_export_2023-09-22_07-19-54.tgz`.
266272

273+
### General
267274
The exporter can also do housekeeping duties and keep a configured number of archives and delete older ones. See `keep_last` property in the [Configuration](#options-and-descriptions) section. Object storage provider configurations include their own `keep_last` property for flexibility.
268275

269276
For file names, `slug` names (from Bookstack API) are used, as such certain characters like `!`, `/` will be ignored and spaces replaced from page names/titles.
@@ -349,6 +356,37 @@ Empty/New Pages will be ignored since they have not been modified yet from creat
349356
350357
You may notice some directories (books) and/or files (pages) in the archive have a random string at the end, example - `nKA`: `user-and-group-management-nKA`. This is expected: resources with the same name were created in another shelf, and Bookstack appends a string to ensure uniqueness.
351358
359+
### Images
360+
361+
### General
362+
Images will be dumped in a separate `images` directory within the page directory they belong to. As shown earlier:
363+
364+
```
365+
bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/YKvimage.png
366+
bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/dwwimage.png
367+
bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/NzZimage.png
368+
bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/Mymimage.png
369+
```
370+
371+
> **Note:** you may see old images in your exports. This is because, by default, Bookstack retains images/drawings that are uploaded even if they are no longer referenced on an active page. Admins can run `Cleanup Images` in the Maintenance Settings or via [CLI](https://www.bookstackapp.com/docs/admin/commands/#cleanup-unused-images) to remove them.
372+
373+
### Modify Markdown Files
374+
**To use this feature, `assets.export_images` should be set to `true`**
375+
376+
The configuration item `assets.modify_markdown` can be set to `true` to modify markdown files so that image url links are replaced with local exported image paths. This makes your `markdown` exports much more portable.
377+
378+
Page (parent) -> Images (Children) relationships are created and then each image url is replaced with its own respective local export path. Example:
379+
```
380+
## before
381+
[![pool-topology-1.png](https://demo.bookstack/uploads/images/gallery/2023-07/scaled-1680-/pool-topology-1.png)](https://demo.bookstack/uploads/images/gallery/2023-07/pool-topology-1.png)
382+
383+
## after
384+
[![pool-topology-1.png](./images/pool-topology-1.png)](https://demo.bookstack/uploads/images/gallery/2023-07/pool-topology-1.png)
385+
```
386+
This allows the image to be found locally within the export files, so your `markdown` docs display all their images as they normally would.
387+
388+
**Note: This will work properly if your pages are using the notation used by Bookstack for Markdown image links, example: ` [![image alt text](Bookstack Markdown image URL link)](anchor/url link)` The `(anchor/url link)` is optional.**
389+
352390
## Object Storage
353391
Optionally, target(s) can be specified to upload generated archives to a remote location. Supported object storage providers can be found below:
354392
- [Minio](#minio-backups)
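
The before/after example in the README's "Modify Markdown Files" section can be sketched in a few lines of Python. This is an illustration only, not the exporter's code: `localize_image_links` and the `url_to_local` mapping are hypothetical names, and the sketch assumes each image URL appears verbatim in the page markdown.

```python
from typing import Dict


def localize_image_links(markdown: str, url_to_local: Dict[str, str]) -> str:
    """Replace each known image URL with its local export path (hypothetical helper)."""
    for url, local_path in url_to_local.items():
        # str.replace matches literally, so '.' or '?' in a URL needs no escaping
        markdown = markdown.replace(url, local_path)
    return markdown


# Input taken from the README's "before" example
before = (
    "[![pool-topology-1.png]"
    "(https://demo.bookstack/uploads/images/gallery/2023-07/scaled-1680-/pool-topology-1.png)]"
    "(https://demo.bookstack/uploads/images/gallery/2023-07/pool-topology-1.png)"
)
url_to_local = {
    "https://demo.bookstack/uploads/images/gallery/2023-07/scaled-1680-/pool-topology-1.png":
        "./images/pool-topology-1.png",
}

after = localize_image_links(before, url_to_local)
print(after)
```

Only the inner image URL is in the mapping, so the outer anchor link still points at the Bookstack host, which matches the README's "after" example.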

bookstack_file_exporter/archiver/archiver.py

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@
55

66
from bookstack_file_exporter.exporter.node import Node
77
from bookstack_file_exporter.archiver import util
8-
from bookstack_file_exporter.archiver.page_archiver import PageArchiver
8+
from bookstack_file_exporter.archiver.page_archiver import PageArchiver, ImageNode
99
from bookstack_file_exporter.archiver.minio_archiver import MinioArchiver
1010
from bookstack_file_exporter.config_helper.remote import StorageProviderConfig
1111
from bookstack_file_exporter.config_helper.config_helper import ConfigNode
@@ -49,23 +49,23 @@ def get_bookstack_exports(self, page_nodes: Dict[int, Node]):
4949
self._get_page_files(page, page_image_meta)
5050
self._get_page_images(page.file_path, page_image_meta)
5151

52-
def _get_page_files(self, page_node: Node, image_meta: List[str]):
52+
def _get_page_files(self, page_node: Node, image_meta: List[ImageNode]):
5353
"""pull all bookstack pages into local files/tar"""
5454
log.debug("Exporting bookstack page data")
5555
self._page_archiver.archive_page(page_node, image_meta)
5656

57-
def _get_page_image_map(self) -> Dict[int, List[str]]:
57+
def _get_page_image_map(self) -> Dict[int, List[ImageNode]]:
5858
if not self._page_archiver.export_images:
5959
log.debug("skipping image export based on user input")
6060
return {}
6161
return self._page_archiver.get_image_meta()
6262

63-
def _get_page_images(self, page_path: str, urls: List[str]):
64-
if not urls:
63+
def _get_page_images(self, page_path: str, img_nodes: List[ImageNode]):
64+
if not img_nodes:
6565
log.debug("page has no images to pull")
6666
return
6767
log.debug("Exporting bookstack page images")
68-
self._page_archiver.archive_page_images(page_path, urls)
68+
self._page_archiver.archive_page_images(page_path, img_nodes)
6969

7070
def create_archive(self):
7171
"""create tgz archive"""

bookstack_file_exporter/archiver/minio_archiver.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -96,7 +96,6 @@ def _get_stale_objects(self, file_extension: str) -> List[MinioObject]:
9696
# last copy that remains if local is deleted
9797
log.debug("Minio 'keep_last' set to negative number, ignoring")
9898
return []
99-
# keep_last > 0 condition
10099
to_delete = []
101100
if len(minio_objects) > self.keep_last:
102101
log.debug("Number of minio objects is greater than 'keep_last'")

bookstack_file_exporter/archiver/page_archiver.py

Lines changed: 100 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,61 @@
3131
# _MARKDOWN_IMAGE_REGEX= re.compile(r"\[\!\[^$|.*\].*\]")
3232
_MARKDOWN_STR_CHECK = "markdown"
3333

34+
class ImageNode:
35+
"""
36+
ImageNode provides metadata and convenience for Bookstack images.
37+
38+
Args:
39+
:img_meta_data: <Dict[str, Union[int, str]]> = image meta data
40+
41+
Returns:
42+
:ImageNode: instance with attributes to help handle images.
43+
"""
44+
def __init__(self, img_meta_data: Dict[str, Union[int, str]]):
45+
self.id: int = img_meta_data['id']
46+
self.page_id: int = img_meta_data['uploaded_to']
47+
self.url: str = img_meta_data['url']
48+
self.name: str = self._get_image_name()
49+
self._markdown_str = ""
50+
self.image_relative_path: str = f"./{_IMAGE_DIR_NAME}/{self.name}"
51+
52+
def _get_image_name(self) -> str:
53+
return self.url.split('/')[-1]
54+
55+
# def _get_markdown_str(self, img_details: Dict[str, Union[int, str]]) -> str:
56+
# if 'content' in img_details:
57+
# if _MARKDOWN_STR_CHECK in img_details['content']:
58+
# print(img_details['content'][_MARKDOWN_STR_CHECK])
59+
# return self._get_md_url_str(img_details['content'][_MARKDOWN_STR_CHECK])
60+
# return ""
61+
62+
@property
63+
def markdown_str(self):
64+
return self._markdown_str
65+
66+
def set_markdown_content(self, img_details: Dict[str, Union[int, str]]):
67+
self._markdown_str = self._get_md_url_str(img_details)
68+
# @markdown_str.setter
69+
# def markdown_str(self, img_details: Dict[str, Union[int, str]]) -> str:
70+
# self._markdown_str = self._get_md_url_str(img_details)
71+
72+
73+
def get_replace_str(self) -> str:
74+
"""return str for regex replace in page md content"""
75+
# return f"[![{self.name}]({self.image_relative_path})]"
76+
return self.image_relative_path
77+
78+
@staticmethod
79+
def _get_md_url_str(img_data: Dict[str, Union[int, str]]) -> str:
80+
url_str = ""
81+
if 'content' in img_data:
82+
if _MARKDOWN_STR_CHECK in img_data['content']:
83+
url_str = img_data['content'][_MARKDOWN_STR_CHECK]
84+
# check to see if empty before doing find
85+
if not url_str:
86+
return ""
87+
return url_str[url_str.find("(")+1:url_str.find(")")]
88+
3489
# pylint: disable=too-many-instance-attributes
3590
class PageArchiver:
3691
"""
@@ -75,13 +130,12 @@ def archive_page(self, page: Node,
75130
self._archive_page_meta(page.name, page.file_path, page.meta)
76131

77132
def _archive_page(self, page: Node, export_format: str, data: bytes,
78-
image_urls: List[str] = None):
133+
image_nodes: List[ImageNode] = None):
79134
page_file_name = f"{self.archive_base_path}/" \
80135
f"{page.file_path}/{page.name}{_FILE_EXTENSION_MAP[export_format]}"
81-
82-
# note yet implemented
83-
# if export_format == _MARKDOWN_STR_CHECK and image_urls and self.modify_md:
84-
# data = self._update_image_links(data, image_urls)
136+
# rewrite image links to local export paths when modify_markdown is enabled
137+
if self.modify_md and export_format == _MARKDOWN_STR_CHECK and image_nodes:
138+
data = self._update_image_links(data, image_nodes)
85139
self.write_data(page_file_name, data)
86140

87141
def _get_page_data(self, page_id: int, export_format: str):
@@ -96,7 +150,7 @@ def _archive_page_meta(self, page_name: str, page_path: str,
96150
bytes_meta = archiver_util.get_json_bytes(meta_data)
97151
self.write_data(file_path=meta_file_name, data=bytes_meta)
98152

99-
def get_image_meta(self) -> Dict[int, List[str]]:
153+
def get_image_meta(self) -> Dict[int, List[ImageNode]]:
100154
"""Get all image metadata into a {page_id: [ImageNode]} format"""
101155
img_meta_response: Response = common_util.http_get_request(
102156
self.api_urls['images'],
@@ -105,28 +159,14 @@ def get_image_meta(self) -> Dict[int, List[str]]:
105159
img_meta_json = img_meta_response.json()['data']
106160
return self._create_image_map(img_meta_json)
107161

108-
@staticmethod
109-
def _create_image_map(json_data: List[Dict[str, Union[str,int]]]) -> Dict[int, List[str]]:
110-
image_page_map = {}
111-
for image_node in json_data:
112-
image_page_id = image_node['uploaded_to']
113-
image_url = image_node['url']
114-
if image_page_id in image_page_map:
115-
image_page_map[image_page_id].append(image_url)
116-
else:
117-
image_page_map[image_page_id] = [image_url]
118-
return image_page_map
119-
120-
def archive_page_images(self, page_path: str, image_urls: List[str]):
162+
def archive_page_images(self, page_path: str, image_nodes: List[ImageNode]):
121163
"""pull images locally into a directory based on page"""
122164
# image_base_path = f"{self.archive_base_path}/{page_path}{_IMAGE_DIR_SUFFIX}"
123165
image_base_path = f"{self.archive_base_path}/{page_path}/{_IMAGE_DIR_NAME}"
124-
for image_url in image_urls:
125-
img_data: bytes = archiver_util.get_byte_response(image_url, self._headers,
166+
for img_node in image_nodes:
167+
img_data: bytes = archiver_util.get_byte_response(img_node.url, self._headers,
126168
self.verify_ssl)
127-
# seems safer to use this instead of image['name'] field
128-
img_file_name = image_url.split('/')[-1]
129-
image_path = f"{image_base_path}/{img_file_name}"
169+
image_path = f"{image_base_path}/{img_node.name}"
130170
self.write_data(image_path, img_data)
131171

132172
def write_data(self, file_path: str, data: bytes):
@@ -142,19 +182,32 @@ def gzip_archive(self):
142182
"""provide the tar to gzip and the name of the gzip output file"""
143183
archiver_util.create_gzip(self.tar_file, self.archive_file)
144184

145-
def _update_image_links(self, page_data: bytes, urls: List[str]) -> bytes:
185+
def _update_image_links(self, page_data: bytes, image_nodes: List[ImageNode]) -> bytes:
146186
"""regex replace links to local created directories"""
147-
# 1 - what to replace, 2 - replace with, 3 is the data to replace
148-
# re.sub(b'pfsense', b'lol', x.content)
187+
for img_node in image_nodes:
188+
img_meta_url = f"{self.api_urls['images']}/{img_node.id}"
189+
img_details = common_util.http_get_request(img_meta_url,
190+
self._headers, self.verify_ssl)
191+
192+
img_node.set_markdown_content(img_details.json())
193+
if not img_node.markdown_str:
194+
continue
195+
196+
# re_pattern_bytes = self._get_regex_expr(img_node.markdown_str)
197+
198+
# re_pattern_bytes = self._get_regex_expr(img_node.url)
199+
200+
# 1 - what to replace, 2 - replace with, 3 is the data to replace
201+
# re.sub(b'pfsense', b'lol', x.content)
202+
print(img_node.markdown_str)
203+
print(img_node.get_replace_str())
204+
# escape the URL so regex metacharacters such as '.' are matched literally
page_data = re.sub(re.escape(img_node.markdown_str.encode()), img_node.get_replace_str().encode(), page_data)
205+
# print(page_data)
206+
return page_data
149207

150208
# string to bytes
151209
# >>> k = 'lol'
152210
# >>> k.encode()
153-
pass
154-
155-
def _valid_image_link(self):
156-
"""should contain bookstack host"""
157-
pass
158211

159212
@property
160213
def file_extension_map(self) -> Dict[str, str]:
@@ -171,6 +224,19 @@ def verify_ssl(self) -> bool:
171224
"""return whether or not to verify ssl for http requests"""
172225
return self.asset_config.verify_ssl
173226

227+
# @staticmethod
228+
# def _get_regex_expr(image_str: str) -> bytes:
229+
# # regex_str = fr"\[\!\[^$|.*\]\({url}\)\]"
230+
# # print(regex_str)
231+
# return re.compile(image_str.encode())
232+
174233
@staticmethod
175-
def _get_regex_expr(url: str) -> re.Pattern:
176-
return re.compile(fr"\[\!\[^$|.*\].*{url}.*\]")
234+
def _create_image_map(json_data: List[Dict[str, Union[str,int]]]) -> Dict[int, List[ImageNode]]:
235+
image_page_map = {}
236+
for img_meta in json_data:
237+
img_node = ImageNode(img_meta)
238+
if img_node.page_id in image_page_map:
239+
image_page_map[img_node.page_id].append(img_node)
240+
else:
241+
image_page_map[img_node.page_id] = [img_node]
242+
return image_page_map
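
The three steps this file now performs (grouping image metadata by page, pulling the URL out of an image's markdown snippet, and substituting it in the page bytes) can be sketched on their own. This is a rough standalone sketch, not the project's implementation: `group_by_page`, `extract_md_url`, and `replace_literal` are hypothetical names, plain dicts stand in for `ImageNode`, and the snippet is assumed to have the form `![alt](url)`.

```python
import re
from collections import defaultdict
from typing import Dict, List, Union

ImageMeta = Dict[str, Union[int, str]]


def group_by_page(images: List[ImageMeta]) -> Dict[int, List[ImageMeta]]:
    """Group image metadata by the page id stored in 'uploaded_to'."""
    grouped = defaultdict(list)
    for img in images:
        grouped[img["uploaded_to"]].append(img)
    return dict(grouped)


def extract_md_url(markdown_snippet: str) -> str:
    """Return the text inside the first (...) pair, or "" if there is none."""
    start = markdown_snippet.find("(")
    if start == -1:
        return ""
    end = markdown_snippet.find(")", start + 1)
    if end == -1:
        return ""
    return markdown_snippet[start + 1:end]


def replace_literal(page_data: bytes, url: str, local_path: str) -> bytes:
    """Swap a URL for a local path; re.escape keeps '.' from acting as a wildcard."""
    return re.sub(re.escape(url.encode()), local_path.encode(), page_data)


pages = group_by_page([
    {"id": 1, "uploaded_to": 5, "url": "https://x.y/a.png"},
    {"id": 2, "uploaded_to": 5, "url": "https://x.y/b.png"},
    {"id": 3, "uploaded_to": 7, "url": "https://x.y/c.png"},
])
url = extract_md_url("![a.png](https://x.y/a.png)")
page = replace_literal(b"![a](https://x.y/a.png) https://xzy/a.png", url, "./images/a.png")
```

Checking both `find` results guards against a snippet with no parentheses, where an unguarded slice would silently return a truncated string. The local path here contains no backslashes; if it could, `bytes.replace` would be the simpler choice, since `re.sub` interprets backslash escapes in the replacement.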
