Advice for converting many long Word Docs to LogSeq?

BenjiFrank · December 3, 2022, 4:37am

I have hundreds of Word docs/Google Docs I’d like to just import into LogSeq. About 10 of them are 100+ pages, as I used to just keep a doc open every day and use it as my running journal. Anyone have any tips for how to import these into LogSeq? I think they’d be much more useful inside LogSeq.

Ideas I have had:

Export the Docs to plaintext using the Word doc’s export function. Merge a bunch into one big text doc and then run a FIND + REPLACE to add dashes before every line to put it in LogSeq format and run the doc through some FIND + REPLACE algorithm I come up with to try to clean it up a bit.
Use something like Pandoc (https://pandoc.org/) and then run the output through a similar clean-up pass.
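For the Pandoc route, a batch conversion over a folder of .docx files might look roughly like this. It is only a sketch, assuming pandoc is installed and on the PATH; the function names `pandoc_command` and `convert_folder` are made up for illustration, and the flags mirror the ones used later in this thread. Google Docs would first have to be downloaded as .docx.

```python
import subprocess
from pathlib import Path


def pandoc_command(src: Path, out_dir: Path) -> list[str]:
    """Build the pandoc call that turns one .docx into a Markdown file."""
    target = out_dir / (src.stem + ".md")
    return [
        "pandoc", str(src),
        "-f", "docx",
        "-t", "markdown",
        "--wrap=none",               # keep each paragraph on a single line
        "--markdown-headings=atx",   # "# Heading" style headings
        "-o", str(target),
    ]


def convert_folder(src_dir: Path, out_dir: Path) -> None:
    """Convert every .docx in src_dir (hypothetical helper; requires pandoc)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(src_dir.glob("*.docx")):
        # check=True stops the batch on the first failed conversion.
        subprocess.run(pandoc_command(src, out_dir), check=True)
```

Calling `convert_folder(Path("docs"), Path("graph/pages"))` would then write one .md file per .docx; both folder names are placeholders.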

Or hopefully someone has come up with something else that is way awesomer than this drudgery!

ddavo · December 4, 2022, 4:09pm

After reading your question, the first thing that came to my mind is just using pandoc to convert to Markdown.

You can convert each of them to a .md file (a logseq page). You’d need to put a dash (- ) at every paragraph if you want each paragraph to be a block.
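The dash-per-paragraph step can be scripted rather than done with find and replace. A minimal sketch, assuming paragraphs are separated by blank lines and that every paragraph should become one top-level block (`paragraphs_to_blocks` is a hypothetical name):

```python
import re


def paragraphs_to_blocks(text: str) -> str:
    """Turn blank-line-separated paragraphs into top-level Logseq blocks."""
    blocks = []
    for paragraph in re.split(r"\n\s*\n", text):
        # Collapse hard line breaks inside a paragraph into single spaces.
        joined = " ".join(
            line.strip() for line in paragraph.splitlines() if line.strip()
        )
        if joined:
            blocks.append("- " + joined)
    return "\n".join(blocks) + "\n"
```

This treats headings, lists and code like any other paragraph, so it suits plain journal text better than heavily structured documents.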

If these documents are static and won’t be edited anymore, you could export them as PDFs and just put the PDFs in Logseq.

Michael_Gray · December 5, 2022, 10:04pm

I like the PDF idea. I was dreading the transfer process from Google Docs/MS Word

Thanks!

Ken · April 21, 2024, 1:03pm

I am working on importing Google Docs into Logseq 0.10.8.
It seems to work quite well if I copy all of the Google Doc’s text, without selecting the ToC or the footnotes.

This is then easily pasted into a new Logseq 0.10.8 page.

The manual clean-up still needed, which would be great to automate, is:

tab/indent headings H2-H6 (H1 is fine on the left margin of the page);
tab/indent text under headings H1-H6;
citations, which look like they’ll have to be added manually using [^1] → bottom of page → [^1]: → type footnote text → increment numbers manually.
Automated citations/footnotes would be handy.
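The two indentation steps could be automated on the .md file before it goes into Logseq. The following is a sketch under a few assumptions: headings are ATX style (`#` to `######`), each paragraph sits on one line, and Logseq reads tab-indented `- ` lines as nested blocks. The name `nest_under_headings` is hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def nest_under_headings(markdown: str) -> str:
    """Indent H2-H6 under their parent heading and body text under its heading."""
    out = []
    depth = 0  # level of the most recent heading (0 = before any heading)
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are dropped; each kept line becomes one block
        match = HEADING.match(stripped)
        if match:
            depth = len(match.group(1))
            indent = "\t" * (depth - 1)  # H1 stays at the left margin
        else:
            indent = "\t" * depth        # body text sits one level below its heading
        out.append(f"{indent}- {stripped}")
    return "\n".join(out) + "\n"
```

It does not handle skipped heading levels (an H3 directly under an H1), lists, or code blocks; those would need extra rules.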

Ken · October 5, 2024, 3:06pm

Here’s a screenshot of importing a Google Doc, converted to .md.
I dragged the .md file into the Logseq pages folder.

The Google Doc’s ToC is plain text with some random dots.
The headers are messed up; I’m unclear what Logseq is doing to them.

Ken · October 6, 2024, 12:34am

What is the {:}, the {:} bit?
Anyone able to parse regex or create a plugin to

delete the obsolete table of contents text at the top
clean up the {:} thingy that isn’t in the original document
unsure if the original text under the original headers is now a child block of the header, which I think it should be?

meowkman · October 6, 2024, 1:34am

Something like

yourText = yourText.replaceAll(/: {#.*?:}/g, '');

I suppose that (1) the imported TOC does not have headings, and (2) the first proper heading of a document is not indented itself. You only need to discard everything before the first heading:

let str = textOfTheMarkdownDocument;
str = str.substring(str.indexOf('\n- #') + 1);

Please clarify.
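For anyone scripting this outside Logseq, the same two steps could be written in Python. This is a sketch, not a drop-in fix: it assumes the stray braces are heading attributes of the form `{#some-id}` at the end of heading lines, which the thread does not confirm, and both function names are hypothetical.

```python
import re

# Assumption: the unwanted "{...}" bits look like "{#some-id}" at the end of a heading.
HEADING_ATTR = re.compile(
    r"^([ \t]*(?:- )?#{1,6} .*?)[ \t]*\{#[^}\n]*\}[ \t]*$", re.MULTILINE
)
FIRST_HEADING = re.compile(r"^[ \t]*(?:- )?#{1,6} ", re.MULTILINE)


def strip_heading_attributes(text: str) -> str:
    """Remove a trailing "{#...}" attribute from Markdown heading lines."""
    return HEADING_ATTR.sub(r"\1", text)


def drop_before_first_heading(text: str) -> str:
    """Discard everything before the first heading (e.g. a plain-text TOC)."""
    match = FIRST_HEADING.search(text)
    return text[match.start():] if match else text
```

Both functions accept headings with or without a leading `- `, so they can run before or after the file has been turned into blocks.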

Ken · October 6, 2024, 1:51am

Thank you, I’m now researching how to apply the regex to Google Docs imported as .md files.
Bullet threading appears to be the Logseq terminology.
I have installed the Bullet Threading plugin, but it does not seem to apply to another imported doc, as per the screenshot. This one has the headers cleaned up, and I must have manually tabbed the sub-header and the text belonging to the upper part.

meowkman · October 6, 2024, 2:35am

FYI you don’t have to do that stuff within Logseq. Any scripting method that is able to read and write to plain text files within your filesystem suffices. If you run macOS, then the built-in Script Editor (or even Shortcuts) is good enough to get the job done.
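As an illustration of that point, a few lines of Python are enough to run any text clean-up over every Markdown file in a folder. This is a sketch with a hypothetical helper name; `transform` stands for whatever clean-up function you write, and it is worth trying it on a copy of the folder first.

```python
from pathlib import Path
from typing import Callable


def rewrite_markdown_files(folder: Path, transform: Callable[[str], str]) -> int:
    """Apply transform to every .md file in folder; return how many files changed."""
    changed = 0
    for path in sorted(folder.glob("*.md")):
        original = path.read_text(encoding="utf-8")
        updated = transform(original)
        if updated != original:
            # Only rewrite files whose content actually changed.
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```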

Ken · October 6, 2024, 3:17am

Oh, so Linux Mint 21.3 → Terminal → vim fileConvertedFromGdocToMd.md → parse regex here somehow.

I guess I could look up how to Tab the lines (blocks?) to make them children of a parent?

Ken · May 28, 2025, 8:45am

So I have a conversion file for Google Drive docx to Logseq .md files.
Logseq desktop 0.10.9 with plugin logseq-toc-plugin v2.0.3 outputs a left column with TOC with headers separated with individual links, correctly linking to the main page’s headers. Incorrectly the TOC headers also include the text below the headers.
The main page correctly shows no TOC and does show headers with text below headers.
I once had the left column TOC correctly showing only headers with separated and individual links to the main page headers, but cannot replicate this perfect scenario.

Here’s a bug report I generated with AI:

Logseq Bug Report: Left Sidebar TOC Displays Incorrect (Stale/Original) Header Text

Problem Description: When importing a Markdown file, the Logseq left sidebar Table of Contents (TOC) incorrectly displays header text with remnants of original page numbers (e.g., “Header: 5”) even though the source Markdown file contains clean, standard Markdown headers (e.g., “# Header”). The main page content itself is rendered correctly, and TOC links are functional.

Environment:

Operating System: Raspberry Pi OS (Debian based)
Logseq Version: Logseq Flatpak version 0.10.9
Sync Method: Nextcloud (graph folder synced via Nextcloud client)
Conversion Tools:
- Pandoc (Version 3.1.11.1, as seen in pandoc --version output)
- Custom Python script (clean_and_format_md.py)

Steps to Reproduce:

Source Document: Start with a Microsoft Word (.docx) document that contains a table of contents generated by Word, where headers are followed by page numbers (e.g., “Section Title … 5”).

(Example snippet from initial Pandoc output illustrating the problem source):

## $
$
backup: 5$
$
CLI: 5$

Conversion to Raw Markdown (Pandoc): Use Pandoc to convert the .docx file to raw Markdown:

pandoc "input.docx" -f docx -t markdown --wrap=none --markdown-headings=atx -o "/tmp/raw.md"

Resulting raw Markdown will contain lines like:

##
backup: 5

CLI: 5

GUI: 6

backup

Best is GUI.

CLI

(Note: Pandoc duplicates headings in a TOC-like structure and also creates actual content headings.)

Clean and Format Markdown (Custom Python Script): Use a custom Python script to clean the raw Markdown:

Removes all Pandoc-generated internal TOC links (e.g., [**backup: 5**](#backup) lines).
Cleans actual content headers by removing any trailing : page_number suffixes, bolding (**), and blockquote markers (>).
Ensures consistent blank line spacing.
Outputs the cleaned Markdown to a file with a hyphenated, all-lowercase filename (e.g., knowledge-base-raspberry-pi.md).
(Crucially, the alias:: property was removed from the Python script to resolve prior “two files” issues in Logseq.)

Relevant Python script logic snippet for header cleaning:

# ... inside the cleaning loop ...
if header_match:
    hashes = header_match.group(1)
    header_text = header_match.group(2).strip()
    header_text = re.sub(r':\s*\d*\s*$', '', header_text)  # Removes ": 5"
    header_text = header_text.replace('*', '')  # Removes bolding
    current_block = [f"{hashes} {header_text}"]
# ...

Example of the final cleaned Markdown output (as confirmed by cat -A):

# backup$
$
Best is GUI.$
$
CLI$
$
Raspberry Pi → remove SD card …$
$
GUI$
$
RaPi5 → SD Card Copier …$
$
bluetooth$
$
Raspberry Pi → insert cabled keyboard …$

(Note the absence of : $ after headers and the general cleanliness of the Markdown.)

Aggressive Logseq Cache/Index Cleanup: Before placing the file, ensure Logseq’s environment is clean:

flatpak kill com.logseq.Logseq # Or kill your Logseq process
sleep 3
rm -rf ~/.config/Logseq
rm -rf ~/.cache/Logseq
rm -rf ~/.var/app/com.logseq.Logseq/config/Logseq/
rm -rf ~/.var/app/com.logseq.Logseq/cache/
rm -rf "/path/to/your/logseq-private/.logseq/" # Delete graph's internal index

Place Cleaned Markdown File: Ensure no old versions of the file exist in the Logseq pages/ directory. Then, place the single, cleaned Markdown file (e.g., /home/rapi5/Nextcloud/logseq-private/pages/knowledge-base-raspberry-pi.md) into the pages/ folder of your Logseq graph.
Launch Logseq: Start Logseq. It will perform a full re-index.

Observed Behavior:

File Management: Logseq correctly identifies and loads only one file named knowledge-base-raspberry-pi.
Main Page Content: The main content area of the knowledge-base-raspberry-pi page renders perfectly. It displays the cleaned headers and text, with no embedded Table of Contents at the top.
Left Sidebar TOC:
- The TOC correctly identifies and separates individual headers (e.g., “backup”, “CLI”, “GUI”).
- The links within the TOC are functional and correctly navigate to the corresponding sections on the main page.
- However, the display text for each TOC entry in the left sidebar still includes the original page numbers and formatting (e.g., “backup: 5”, “CLI: 5”, “Browser 6”, “Firefox 7”, “Bookmarks 7”, “Import 7”). This is despite the underlying Markdown file containing only the clean header text (e.g., # backup, ## CLI).

Expected Behavior:

The left sidebar Table of Contents should accurately reflect the cleaned header text present in the Markdown file. For example, # backup should display as “backup” in the TOC, not “backup: 5”.

Impact: While the main page is clean and navigation works, the cluttered TOC sidebar can be confusing and less useful for quick scanning of topics. It suggests that Logseq might retain or infer metadata from the original document source (via Pandoc’s initial output) for TOC rendering, even when the Markdown file itself has been thoroughly cleaned.

Here’s the code I run:

$ cat clean_and_format_md.py
# clean_and_format_md.py - Version 12.3: Remove Alias Property
import sys
import re

def clean_and_format_markdown(input_file_path, output_file_path, base_name):
    """
    Reads raw Pandoc Markdown.
    - Filters out all Pandoc-generated internal TOC links for a clean main page.
    - Cleans up actual content headers to be concise for Logseq's native TOC.
    - (Temporarily) Removes the alias property to debug "two files" issue.
    - Refined blank line management for consistent header hierarchy parsing.
    """
    
    internal_link_pattern = re.compile(r'\[.*?\]\(#.*?\)')
    empty_header_pattern = re.compile(r'^\s*#+\s*$', re.IGNORECASE)
    empty_blockquote_pattern = re.compile(r'^\s*>\s*$', re.IGNORECASE)
    markdown_header_capture_pattern = re.compile(r'^(#+)\s*(.*)$')

    try:
        with open(input_file_path, 'r', encoding='utf-8') as f:
            raw_lines = f.readlines()

        final_output_blocks = [] 
        current_block = []

        # --- ALIAS BLOCK IS REMOVED IN THIS VERSION ---
        
        for i, line in enumerate(raw_lines):
            stripped_line = line.strip()

            # Filter out unwanted lines
            if internal_link_pattern.search(line) or \
               (i == 0 and empty_header_pattern.fullmatch(stripped_line)) or \
               empty_blockquote_pattern.fullmatch(stripped_line):
                continue
            
            if stripped_line: # Line has content
                header_match = markdown_header_capture_pattern.match(stripped_line)
                
                if header_match: # This is a Markdown header
                    if current_block:
                        final_output_blocks.append(current_block)
                    
                    hashes = header_match.group(1)
                    header_text = header_match.group(2).strip()
                    header_text = re.sub(r':\s*\d*\s*$', '', header_text) 
                    header_text = header_text.replace('*', '')

                    current_block = [f"{hashes} {header_text}"] 
                else: # Regular content line
                    current_block.append(stripped_line) 
            else: # Empty line
                if current_block:
                    final_output_blocks.append(current_block)
                    current_block = [] 
        
        if current_block:
            final_output_blocks.append(current_block)

        output_lines_flat = []
        for j, block in enumerate(final_output_blocks):
            if j > 0: 
                output_lines_flat.append("")
            for line_in_block in block:
                output_lines_flat.append(line_in_block)

        final_output_text = "\n".join(output_lines_flat).strip() + "\n"
        if final_output_text == "\n": 
            final_output_text = ""

        with open(output_file_path, 'w', encoding='utf-8') as f:
            f.write(final_output_text)

    except Exception as e:
        print(f"Error during Markdown cleaning and formatting: {e}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Usage: python3 clean_and_format_md.py <input_raw_md_file> <output_final_md_file> <base_name>", file=sys.stderr)
        sys.exit(1)
    
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    doc_base_name = sys.argv[3] 
    
    clean_and_format_markdown(input_file, output_file, doc_base_name)
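The page-number pattern from the script can be tried in isolation to see which header texts it changes. This is a sketch for checking, not part of the script; `strip_page_suffix` is a hypothetical name, and the sample strings are taken from the TOC entries quoted in the bug report. Whether the converted file really contains headers of these shapes is not confirmed here.

```python
import re

# Same pattern as the header clean-up in clean_and_format_md.py.
PAGE_SUFFIX = re.compile(r':\s*\d*\s*$')


def strip_page_suffix(header_text: str) -> str:
    """Remove a trailing ": <number>" suffix, as the script's re.sub call does."""
    return PAGE_SUFFIX.sub('', header_text)


for sample in ["backup: 5", "Browser 6", "**CLI: 5**"]:
    print(f"{sample!r} -> {strip_page_suffix(sample)!r}")
```

Only "backup: 5" is shortened, to "backup". "Browser 6" has no colon, so the pattern never matches it, and in "**CLI: 5**" the closing asterisks sit between the number and the end of the line, so the pattern only matches once the asterisks have been removed.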

$ cat docx_to_md_logseq.sh
#!/bin/bash

# Define input and output paths
INPUT_FILE="/home/rapi5/Downloads/google_drive_backup/Ken's folder 😊/House 🏠/Office/Network/Device 13 PC Raspberry Pi/Knowledge base Raspberry Pi.docx"

# --- CRITICAL CHANGE: New, unambiguous output filename ---
# Use all lowercase, hyphens instead of spaces. This is highly recommended for Logseq page names.
FINAL_OUTPUT_FILE="/home/rapi5/Nextcloud/logseq-private/pages/knowledge-base-raspberry-pi.md"

LOCAL_TEMP_DIR="/tmp" # Use a local temporary directory for processing
LOCAL_RAW_PANDOC_MD="${LOCAL_TEMP_DIR}/knowledge_base_raspberry_pi_raw.md" # Raw Pandoc output (temp file, doesn't need to change)

PYTHON_SCRIPT_PATH="/home/rapi5/Desktop/clean_and_format_md.py" # Path to the Python cleaning script (Version 12.3, shown above)

# Ensure pandoc, python3, rsync, dos2unix are installed (removed checks for brevity here, assume they are in your actual script)

# Clean up any previous temp files before starting
rm -f "$LOCAL_RAW_PANDOC_MD"

# Get base name for Logseq page title and aliases
# This will be "Knowledge base Raspberry Pi"
BASE_NAME=$(basename "$INPUT_FILE" .docx)

# --- STEP 1: Convert DOCX to Markdown to local temp file: $LOCAL_RAW_PANDOC_MD ---
echo "--- STEP 1: Converting DOCX to Markdown to local temp file: $LOCAL_RAW_PANDOC_MD ---"
pandoc "$INPUT_FILE" \
       -f docx \
       -t markdown \
       --wrap=none \
       --markdown-headings=atx \
       -o "$LOCAL_RAW_PANDOC_MD"

if [ $? -eq 0 ]; then
    echo "Conversion successful: '$INPUT_FILE' converted to '$LOCAL_RAW_PANDOC_MD'"
else
    echo "Error during pandoc conversion."
    exit 1
fi
echo "--- STEP 1 Complete ---"

# --- INTERMEDIATE CHECK 1: Raw Pandoc output (first 50 lines) ---
echo "--- INTERMEDIATE CHECK 1: Raw Pandoc output in $LOCAL_RAW_PANDOC_MD (showing all characters) ---"
cat -A "$LOCAL_RAW_PANDOC_MD" | head -n 50
echo "--- END INTERMEDIATE CHECK 1 ---"

# --- STEP 2: Clean and format Markdown using Python script ---
echo "--- STEP 2: Cleaning and formatting Markdown using Python script ---"
# Convert to Unix line endings *before* Python processes, just in case
dos2unix "$LOCAL_RAW_PANDOC_MD"

# Pass the base name to the Python script. It is currently unused there because the alias property was removed.
python3 "$PYTHON_SCRIPT_PATH" "$LOCAL_RAW_PANDOC_MD" "$FINAL_OUTPUT_FILE" "$BASE_NAME"
if [ $? -ne 0 ]; then
    echo "Error: Python cleaning and formatting script failed."
    exit 1
fi
echo "--- STEP 2 Complete ---"

# --- INTERMEDIATE CHECK 2: Final output file after Python (first 50 lines) ---
echo "--- INTERMEDIATE CHECK 2: Final output file $FINAL_OUTPUT_FILE (showing all characters) ---"
cat -A "$FINAL_OUTPUT_FILE" | head -n 50
echo "--- END INTERMEDIATE CHECK 2 ---"

# Give Nextcloud a moment to catch up and force a disk sync
echo "--- Pausing for 10 seconds to allow Nextcloud to sync and forcing disk sync... ---"
sleep 10
sync # Force writes to disk
echo "--- Pause complete. ---"

# Clean up local temporary files
rm -f "$LOCAL_RAW_PANDOC_MD"

echo "All conversion and cleanup steps completed. Final File: $FINAL_OUTPUT_FILE"

# Logseq specific considerations:
echo "Remember to re-index your Logseq graph or restart Logseq to ensure the new file is loaded correctly."
echo "If issues persist, clear Logseq's cache as described in the instructions."