homeylab
diff --git a/README.md b/README.md
Lines changed: 39 additions & 1 deletion
diff --git a/bookstack_file_exporter/archiver/archiver.py b/bookstack_file_exporter/archiver/archiver.py
Lines changed: 6 additions & 6 deletions
diff --git a/bookstack_file_exporter/archiver/minio_archiver.py b/bookstack_file_exporter/archiver/minio_archiver.py
Lines changed: 0 additions & 1 deletion
diff --git a/bookstack_file_exporter/archiver/page_archiver.py b/bookstack_file_exporter/archiver/page_archiver.py
Lines changed: 100 additions & 34 deletions
@@ -11,6 +11,8 @@ Table of Contents
     - [Options and descriptions](#options-and-descriptions)
     - [Environment variables](#valid-environment-variables)
 - [Backup Behavior](#backup-behavior)
+    - [Modify Markdown Files](#modify-markdown-files)
 - [Object Storage](#object-storage)
     - [Minio](#minio-backups)
 
@@ -27,6 +29,7 @@ What it does:
 - Discover and build relationships between Bookstack `Shelves/Books/Chapters/Pages` to create a relational parent-child layout
 - Export Bookstack pages and their content to a `.tgz` archive
 - Additional content for pages like their images and metadata and can be exported
+- The exporter can also [Modify Markdown Files](#modify-markdown-files) to replace image links with local exported image paths for a more portable backup
 - YAML configuration file for repeatable and easy runs
 - Can be run via [Python](#run-via-pip) or [Docker](#run-via-docker)
 - Can push archives to remote object storage like [Minio](https://min.io/)
@@ -244,6 +247,7 @@ More descriptions can be found for each section below:
 | `output_path` | `str` | `false` | Optional (default: `cwd`) which directory (relative or full path) to place exports. User who runs the command should have access to read/write to this directory. If not provided, will use current run directory by default |
 | `assets` | `object` | `false` | Optional section to export additional assets from pages. |
 | `assets.export_images` | `bool` | `false` | Optional (default: `false`), export all images for a page to an `image` directory within page directory. See [Backup Behavior](#backup-behavior) for more information on layout |
+| `assets.modify_markdown` | `bool` | `false` | Optional (default: `false`), modify markdown files to replace image links with local exported image paths. This requires `assets.export_images` to be `true` in order to work. See [Modify Markdown Files](#modify-markdown-files) for more information. |
 | `assets.export_meta` | `bool` | `false` | Optional (default: `false`), export of metadata about the page in a json file |
 | `assets.verify_ssl` | `bool` | `false` | Optional (default: `true`), whether or not to check ssl certificates when requesting content from Bookstack host |
 | `keep_last` | `int` | `false` | Optional (default: `None`), if exporter can delete older archives. valid values are:<br>- set to `-1` if you want to delete all archives after each run (useful if you only want to upload to object storage)<br>- set to `1+` if you want to retain a certain number of archives<br>- `0` will result in no action done |
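
Review note: the `assets.modify_markdown` row above only takes effect when `assets.export_images` is also `true`. A minimal sketch of that dependency follows. `AssetSettings` and `markdown_rewrite_enabled` are hypothetical names used for illustration, not the exporter's actual config classes; the defaults mirror the table.

```python
from dataclasses import dataclass


@dataclass
class AssetSettings:
    """Hypothetical stand-in for the `assets` config section."""
    export_images: bool = False
    modify_markdown: bool = False
    export_meta: bool = False
    verify_ssl: bool = True


def markdown_rewrite_enabled(assets: AssetSettings) -> bool:
    """Link rewriting needs both flags: images must be exported to be linked locally."""
    return assets.export_images and assets.modify_markdown
```

With this shape, setting `modify_markdown` alone has no effect, which matches the requirement stated in the table.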
@@ -261,9 +265,12 @@ General
 - `MINIO_ACCESS_KEY`
 - `MINIO_SECRET_KEY`
 
-### Backup Behavior
+## Backup Behavior
+
+### Export File
 Backups are exported in `.tgz` format and generated based off timestamp. Export names will be in the format: `%Y-%m-%d_%H-%M-%S` (Year-Month-Day_Hour-Minute-Second). *Files are first pulled locally to create the tarball and then can be sent to object storage if needed*. Example file name: `bookstack_export_2023-09-22_07-19-54.tgz`.
 
+### General
 The exporter can also do housekeeping duties and keep a configured number of archives and delete older ones. See `keep_last` property in the [Configuration](#options-and-descriptions) section. Object storage provider configurations include their own `keep_last` property for flexibility. 
 
 For file names, `slug` names (from Bookstack API) are used, as such certain characters like `!`, `/` will be ignored and spaces replaced from page names/titles.
@@ -349,6 +356,37 @@ Empty/New Pages will be ignored since they have not been modified yet from creat
 
 You may notice some directories (books) and/or files (pages) in the archive have a random string at the end, example - `nKA`: `user-and-group-management-nKA`. This is expected and is because there were resources with the same name created in another shelve and bookstack adds a string at the end to ensure uniqueness.
 
+### Images
+
+### General
+Images are exported to a separate `images` directory within the directory of the page they belong to. As shown earlier:
 
+```
+bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/YKvimage.png
+bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/dwwimage.png
+bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/NzZimage.png
+bookstack_export_2023-11-20_08-00-29/programming/react/basics/images/Mymimage.png
+```
+
+> **Note:** You may see old images in your exports. This is because, by default, Bookstack retains images/drawings that are uploaded even if they are no longer referenced on an active page. Admins can run `Cleanup Images` in the Maintenance Settings or via [CLI](https://www.bookstackapp.com/docs/admin/commands/#cleanup-unused-images) to remove them.
+
+### Modify Markdown Files
+**To use this feature, `assets.export_images` should be set to `true`**
+
+Set the configuration item `assets.modify_markdown` to `true` to modify markdown files, replacing image URL links with local exported image paths. This makes your `markdown` exports much more portable.
+
+Page (parent) -> Images (children) relationships are created, and then each image URL is replaced with its respective local export path. Example:
+```
+## before
+[![pool-topology-1.png](https://demo.bookstack/uploads/images/gallery/2023-07/scaled-1680-/pool-topology-1.png)](https://demo.bookstack/uploads/images/gallery/2023-07/pool-topology-1.png)
+
+## after
+[![pool-topology-1.png](./images/pool-topology-1.png)](https://demo.bookstack/uploads/images/gallery/2023-07/pool-topology-1.png)
+```
+This allows the image to be found locally within the export files, so your `markdown` docs display all their images as they normally would.
+
+**Note: This will work properly if your pages use the notation Bookstack uses for Markdown image links, for example: `[![image alt text](Bookstack Markdown image URL link)](anchor/url link)`. The `(anchor/url link)` part is optional.**
+
 ## Object Storage
 Optionally, target(s) can be specified to upload generated archives to a remote location. Supported object storage providers can be found below:
 - [Minio](#minio-backups)
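
Review note: the before/after example in the "Modify Markdown Files" section of this README hunk can be reproduced with a short sketch. This is an illustration under stated assumptions, not the exporter's implementation: `localize_image_link` is a hypothetical helper, and the `images` directory name is taken from the README text.

```python
import re


def localize_image_link(page_md: bytes, image_url: str, image_dir: str = "images") -> bytes:
    """Replace one remote image URL in markdown bytes with a local relative path."""
    file_name = image_url.split("/")[-1]
    local_path = f"./{image_dir}/{file_name}".encode()
    # Escape the URL so characters such as '.' are matched literally.
    pattern = re.escape(image_url.encode())
    # A function replacement avoids any backslash interpretation in the new path.
    return re.sub(pattern, lambda _match: local_path, page_md)


scaled_url = ("https://demo.bookstack/uploads/images/gallery/2023-07/"
              "scaled-1680-/pool-topology-1.png")
full_url = "https://demo.bookstack/uploads/images/gallery/2023-07/pool-topology-1.png"
before = f"[![pool-topology-1.png]({scaled_url})]({full_url})".encode()
after = localize_image_link(before, scaled_url)
```

Only the inner image URL is rewritten; the outer anchor link still points at the Bookstack host, as in the README example.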
 
@@ -5,7 +5,7 @@
 
 from bookstack_file_exporter.exporter.node import Node
 from bookstack_file_exporter.archiver import util
-from bookstack_file_exporter.archiver.page_archiver import PageArchiver
+from bookstack_file_exporter.archiver.page_archiver import PageArchiver, ImageNode
 from bookstack_file_exporter.archiver.minio_archiver import MinioArchiver
 from bookstack_file_exporter.config_helper.remote import StorageProviderConfig
 from bookstack_file_exporter.config_helper.config_helper import ConfigNode
@@ -49,23 +49,23 @@ def get_bookstack_exports(self, page_nodes: Dict[int, Node]):
             self._get_page_files(page, page_image_meta)
             self._get_page_images(page.file_path, page_image_meta)
 
-    def _get_page_files(self, page_node: Node, image_meta: List[str]):
+    def _get_page_files(self, page_node: Node, image_meta: List[ImageNode]):
         """pull all bookstack pages into local files/tar"""
         log.debug("Exporting bookstack page data")
         self._page_archiver.archive_page(page_node, image_meta)
 
-    def _get_page_image_map(self) -> Dict[int, List[str]]:
+    def _get_page_image_map(self) -> Dict[int, List[ImageNode]]:
         if not self._page_archiver.export_images:
             log.debug("skipping image export based on user input")
             return {}
         return self._page_archiver.get_image_meta()
 
-    def _get_page_images(self, page_path: str, urls: List[str]):
-        if not urls:
+    def _get_page_images(self, page_path: str, img_nodes: List[ImageNode]):
+        if not img_nodes:
             log.debug("page has no images to pull")
             return
         log.debug("Exporting bookstack page images")
-        self._page_archiver.archive_page_images(page_path, urls)
+        self._page_archiver.archive_page_images(page_path, img_nodes)
 
     def create_archive(self):
         """create tgz archive"""
 
@@ -96,7 +96,6 @@ def _get_stale_objects(self, file_extension: str) -> List[MinioObject]:
             # last copy that remains if local is deleted
             log.debug("Minio 'keep_last' set to negative number, ignoring")
             return []
-        # keep_last > 0 condition
         to_delete = []
         if len(minio_objects) > self.keep_last:
             log.debug("Number of minio objects is greater than 'keep_last'")
 
@@ -31,6 +31,61 @@
 # _MARKDOWN_IMAGE_REGEX= re.compile(r"\[\!\[^$|.*\].*\]")
 _MARKDOWN_STR_CHECK = "markdown"
 
+class ImageNode:
+    """
+    ImageNode provides metadata and convenience for Bookstack images.
+
+    Args:
+        :img_meta_data: <Dict[str, Union[int, str]]> = image meta data
+
+    Returns:
+        :ImageNode: instance with attributes to help handle images.
+    """
+    def __init__(self, img_meta_data: Dict[str, Union[int, str]]):
+        self.id: int = img_meta_data['id']
+        self.page_id:  int = img_meta_data['uploaded_to']
+        self.url: str = img_meta_data['url']
+        self.name: str = self._get_image_name()
+        self._markdown_str = ""
+        self.image_relative_path: str = f"./{_IMAGE_DIR_NAME}/{self.name}"
+
+    def _get_image_name(self) -> str:
+        return self.url.split('/')[-1]
+    
+    # def _get_markdown_str(self, img_details: Dict[str, Union[int, str]]) -> str:
+    #     if 'content' in img_details:
+    #         if _MARKDOWN_STR_CHECK in img_details['content']:
+    #             print(img_details['content'][_MARKDOWN_STR_CHECK])
+    #             return self._get_md_url_str(img_details['content'][_MARKDOWN_STR_CHECK])
+    #     return ""
+
+    @property
+    def markdown_str(self):
+        return self._markdown_str
+    
+    def set_markdown_content(self, img_details: Dict[str, Union[int, str]]):
+        self._markdown_str = self._get_md_url_str(img_details)
+    # @markdown_str.setter
+    # def markdown_str(self, img_details: Dict[str, Union[int, str]]) -> str:
+    #     self._markdown_str = self._get_md_url_str(img_details)
+
+    def get_replace_str(self) -> str:
+        """return str for regex replace in page md content"""
+        # return f"[![{self.name}]({self.image_relative_path})]"
+        return self.image_relative_path
+
+    @staticmethod
+    def _get_md_url_str(img_data: Dict[str, Union[int, str]]) -> str:
+        url_str = ""
+        if 'content' in img_data:
+            if _MARKDOWN_STR_CHECK in img_data['content']:
+                url_str = img_data['content'][_MARKDOWN_STR_CHECK]
+        # check to see if empty before doing find
+        if not url_str:
+            return ""
+        return url_str[url_str.find("(")+1:url_str.find(")")]
+
 # pylint: disable=too-many-instance-attributes
 class PageArchiver:
     """
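
Review note: `_get_md_url_str` above slices between the first `(` and the first `)`. If the markdown string has no parentheses, `str.find` returns `-1` for both and the slice yields an unintended fragment. A hedged alternative sketch using a regular expression is shown below. `extract_md_url` is a hypothetical name, and the sample payload shape (`content.markdown` holding `![name](url)`) is an assumption inferred from the diff, not confirmed API output.

```python
import re
from typing import Any, Dict

# First parenthesised run without whitespace or ')' is treated as the URL.
_MD_URL_PATTERN = re.compile(r"\(([^)\s]+)\)")


def extract_md_url(img_data: Dict[str, Any]) -> str:
    """Return the URL inside the image's markdown string, or "" if absent."""
    markdown_str = img_data.get("content", {}).get("markdown", "")
    match = _MD_URL_PATTERN.search(markdown_str)
    return match.group(1) if match else ""


# Assumed payload shape, for illustration only.
sample = {"content": {"markdown": "![pool](https://demo.bookstack/uploads/pool.png)"}}
url = extract_md_url(sample)
```

Returning an empty string for missing or malformed content keeps the caller's `if not img_node.markdown_str: continue` check working unchanged.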
@@ -75,13 +130,12 @@ def archive_page(self, page: Node,
             self._archive_page_meta(page.name, page.file_path, page.meta)
 
     def _archive_page(self, page: Node, export_format: str, data: bytes,
-                      image_urls: List[str] = None):
+                      image_nodes: List[ImageNode] = None):
         page_file_name = f"{self.archive_base_path}/" \
             f"{page.file_path}/{page.name}{_FILE_EXTENSION_MAP[export_format]}"
-        
-        # note yet implemented
-        # if export_format == _MARKDOWN_STR_CHECK and image_urls and self.modify_md:
-        #     data = self._update_image_links(data, image_urls)
+        # rewrite image links only for markdown exports when markdown modification is enabled
+        if self.modify_md and export_format == _MARKDOWN_STR_CHECK and image_nodes:
+            data = self._update_image_links(data, image_nodes)
         self.write_data(page_file_name, data)
 
     def _get_page_data(self, page_id: int, export_format: str):
@@ -96,7 +150,7 @@ def _archive_page_meta(self, page_name: str, page_path: str,
         bytes_meta = archiver_util.get_json_bytes(meta_data)
         self.write_data(file_path=meta_file_name, data=bytes_meta)
 
-    def get_image_meta(self) -> Dict[int, List[str]]:
+    def get_image_meta(self) -> Dict[int, List[ImageNode]]:
         """Get all image metadata into a {page_number: [image_url]} format"""
         img_meta_response: Response = common_util.http_get_request(
             self.api_urls['images'],
@@ -105,28 +159,14 @@ def get_image_meta(self) -> Dict[int, List[str]]:
         img_meta_json = img_meta_response.json()['data']
         return self._create_image_map(img_meta_json)
 
-    @staticmethod
-    def _create_image_map(json_data: List[Dict[str, Union[str,int]]]) -> Dict[int, List[str]]:
-        image_page_map = {}
-        for image_node in json_data:
-            image_page_id = image_node['uploaded_to']
-            image_url = image_node['url']
-            if image_page_id in image_page_map:
-                image_page_map[image_page_id].append(image_url)
-            else:
-                image_page_map[image_page_id] = [image_url]
-        return image_page_map
-
-    def archive_page_images(self, page_path: str, image_urls: List[str]):
+    def archive_page_images(self, page_path: str, image_nodes: List[ImageNode]):
         """pull images locally into a directory based on page"""
         # image_base_path = f"{self.archive_base_path}/{page_path}{_IMAGE_DIR_SUFFIX}"
         image_base_path = f"{self.archive_base_path}/{page_path}/{_IMAGE_DIR_NAME}"
-        for image_url in image_urls:
-            img_data: bytes = archiver_util.get_byte_response(image_url, self._headers,
+        for img_node in image_nodes:
+            img_data: bytes = archiver_util.get_byte_response(img_node.url, self._headers,
                                                               self.verify_ssl)
-            # seems safer to use this instead of image['name'] field
-            img_file_name = image_url.split('/')[-1]
-            image_path = f"{image_base_path}/{img_file_name}"
+            image_path = f"{image_base_path}/{img_node.name}"
             self.write_data(image_path, img_data)
 
     def write_data(self, file_path: str, data: bytes):
@@ -142,19 +182,32 @@ def gzip_archive(self):
         """provide the tar to gzip and the name of the gzip output file"""
         archiver_util.create_gzip(self.tar_file, self.archive_file)
 
-    def _update_image_links(self, page_data: bytes, urls: List[str]) -> bytes:
+    def _update_image_links(self, page_data: bytes, image_nodes: List[ImageNode]) -> bytes:
         """regex replace links to local created directories"""
-        # 1 - what to replace, 2 - replace with, 3 is the data to replace
-        # re.sub(b'pfsense', b'lol', x.content)
+        for img_node in image_nodes:
+            img_meta_url = f"{self.api_urls['images']}/{img_node.id}"
+            img_details = common_util.http_get_request(img_meta_url,
+                                                         self._headers, self.verify_ssl)
+            
+            img_node.set_markdown_content(img_details.json())
+            if not img_node.markdown_str:
+                continue
+
+            # re_pattern_bytes = self._get_regex_expr(img_node.markdown_str)
+
+            # re_pattern_bytes = self._get_regex_expr(img_node.url)
+
+            # 1 - what to replace, 2 - replace with, 3 is the data to replace
+            # re.sub(b'pfsense', b'lol', x.content)
+            print(img_node.markdown_str)
+            print(img_node.get_replace_str())
+            # escape the URL so regex metacharacters such as '.' match literally
+            page_data = re.sub(re.escape(img_node.markdown_str.encode()),
+                               img_node.get_replace_str().encode(), page_data)
+        # print(page_data)
+        return page_data
 
         # string to bytes
         # >>> k = 'lol'
         # >>> k.encode()
-        pass
-
-    def _valid_image_link(self):
-        """should contain bookstack host"""
-        pass
 
     @property
     def file_extension_map(self) -> Dict[str, str]:
@@ -171,6 +224,19 @@ def verify_ssl(self) -> bool:
         """return whether or not to verify ssl for http requests"""
         return self.asset_config.verify_ssl
 
+    # @staticmethod
+    # def _get_regex_expr(image_str: str) -> bytes:
+    #     # regex_str = fr"\[\!\[^$|.*\]\({url}\)\]"
+    #     # print(regex_str)
+    #     return re.compile(image_str.encode())
+
     @staticmethod
-    def _get_regex_expr(url: str) -> re.Pattern:
-        return re.compile(fr"\[\!\[^$|.*\].*{url}.*\]")
+    def _create_image_map(json_data: List[Dict[str, Union[str,int]]]) -> Dict[int, List[ImageNode]]:
+        image_page_map = {}
+        for img_meta in json_data:
+            img_node = ImageNode(img_meta)
+            if img_node.page_id in image_page_map:
+                image_page_map[img_node.page_id].append(img_node)
+            else:
+                image_page_map[img_node.page_id] = [img_node]
+        return image_page_map
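
Review note: the if/else grouping in `_create_image_map` could also be written with `collections.defaultdict`. The sketch below shows the same page-to-images grouping on plain dictionaries. `group_images_by_page` and the sample records are hypothetical; only the `uploaded_to` and `url` keys come from the diff.

```python
from collections import defaultdict
from typing import Dict, List, Union

ImageMeta = Dict[str, Union[int, str]]


def group_images_by_page(json_data: List[ImageMeta]) -> Dict[int, List[str]]:
    """Group image URLs under the id of the page each image was uploaded to."""
    image_page_map: Dict[int, List[str]] = defaultdict(list)
    for img_meta in json_data:
        image_page_map[img_meta["uploaded_to"]].append(img_meta["url"])
    # Convert back to a plain dict so missing keys raise instead of silently appearing.
    return dict(image_page_map)


# Made-up records for illustration only.
records = [
    {"id": 1, "uploaded_to": 10, "url": "https://demo.bookstack/uploads/a.png"},
    {"id": 2, "uploaded_to": 10, "url": "https://demo.bookstack/uploads/b.png"},
    {"id": 3, "uploaded_to": 11, "url": "https://demo.bookstack/uploads/c.png"},
]
grouped = group_images_by_page(records)
```

The behavior is the same as the explicit membership check; the choice between the two is stylistic.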