Advice for converting many long Word Docs to LogSeq?

I have hundreds of Word docs/Google Docs I’d like to just import into LogSeq. About 10 of them are 100+ pages, as I used to keep a doc open every day and use it as my running journal. Anyone have tips for how to import these into LogSeq? I think they’d be much more useful inside LogSeq.

Ideas I have had:

  • Export the docs to plain text using Word’s export function. Merge a bunch into one big text file, then run a FIND + REPLACE to add a dash before every line (to put it in LogSeq format), and run the file through some FIND + REPLACE routine I come up with to try to clean it up a bit.
  • Use something like Pandoc (https://pandoc.org/) and then run the output through a similar cleanup routine
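For what it’s worth, here’s a minimal command-line sketch of the Pandoc route. Filenames are placeholders; the pandoc step is shown as a comment (it needs a real .docx on disk), and the dash step is demonstrated on a tiny sample file:

```shell
# Step 1: convert a Word doc to Markdown (run where the .docx lives):
#   pandoc notes.docx -f docx -t gfm -o notes.md
# (-t gfm keeps the output plain; pandoc's default markdown writer can
#  append {#id} attributes to headings, which show up as clutter in Logseq)

# Step 2: prefix every non-empty line with "- " so each paragraph
# becomes a Logseq block. Demonstrated on a small sample file:
printf 'First paragraph\n\nSecond paragraph\n' > sample.md
sed -i 's/^\(.\)/- \1/' sample.md
cat sample.md
```

The sed pattern only touches lines that have at least one character, so blank lines between paragraphs are left alone.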

Or hopefully someone has come up with something else that is way awesomer than this drudgery!


After reading your question, the first thing that came to my mind is just using pandoc to convert to Markdown.

You can convert each of them to a .md file (a Logseq page). You’d need to put a dash (- ) at the start of every paragraph if you want each paragraph to be a block.

If these documents are static and won’t be edited any more, you could export them as PDFs and just put the PDFs in Logseq


I like the PDF idea. I was dreading the transfer process from Google Docs/MS Word

Thanks!

I am working on importing Google Docs into Logseq 0.10.8.
Seems to work quite well by copying all the Google Doc text, without selecting the ToC or the footnotes.

This is then easily pasted into a new Logseq 0.10.8 page.

The manual clean-up work needed, which it would be great to somehow automate, is:

  1. tab/indent headings H2-H6; H1 is fine on the left margin of the page;

  2. tab/indent text under headings H1-H6;

  3. Citations look like they’ll have to be added manually: [^1] in the text → bottom of page → [^1]: → type the footnote text → increment the numbers by hand.
    Automated citations/footnotes would be handy.
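Steps 1 and 2 look scriptable. Here is a rough awk sketch, assuming every line of the converted file already starts with "- " (as pandoc output with dashes prepended would): each heading gets (level - 1) tabs, and body lines get one tab more than the last heading seen. The sample file is made up for illustration.

```shell
printf -- '- # Top\n- intro text\n- ## Sub\n- sub text\n' > flat.md
awk '
/^- #/ {                                   # a heading block
  line = substr($0, 3)                     # drop the leading "- "
  level = 0
  while (substr(line, level + 1, 1) == "#") level++
  indent = ""
  for (i = 1; i < level; i++) indent = indent "\t"
  print indent "- " line                   # H1 flush left, H2 one tab, ...
  body = indent "\t"                       # children go one level deeper
  next
}
/^- / { print body $0; next }              # body line under the last heading
{ print }                                  # blank lines etc. pass through
' flat.md > indented.md
cat indented.md
```

With the sample input, "intro text" ends up one tab under "# Top", and "sub text" two tabs under "## Sub".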

Here’s a screenshot of importing a Google Doc, converted to .md.
I dragged the .md file into the Logseq pages folder.

The Google Doc’s ToC comes through as plain text with some random dots.
The headers are messed up; I’m unclear what Logseq is doing to them.

What is the {:} bit?
Anyone able to write a regex or create a plugin to

  1. delete the obsolete table of contents text at the top

  2. clean up the {:} thingy that isn’t in the original document

  3. unsure whether the original text under the original headers is now a child block of the header, which I think it should be?

Something like

yourText = yourText.replaceAll(/: {#.*?:}/g, '');
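If the {:} residue is in fact pandoc’s auto-generated heading attribute (pandoc can append things like {#my-heading} after a heading), the same cleanup also works outside Logseq with sed. The file and heading names here are made up for the demo:

```shell
printf '# My Heading {#my-heading}\n' > page.md   # sample of the residue
sed -i -E 's/ \{#[^}]*\}//g' page.md              # strip " {#...}" attributes
cat page.md
```

After the run, the heading line is just "# My Heading".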

I suppose that (1) the imported ToC does not have headings, and (2) the first proper heading of a document is not indented itself. You only need to discard everything before the first heading:

let str = textOfTheMarkdownDocument;
str = str.substring(str.indexOf('\n- #') + 1);
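The same "keep everything from the first heading onward" idea can be done outside Logseq with a one-line sed range, assuming headings come through as lines starting with "- #". The sample input is invented:

```shell
printf -- '- ToC entry 1\n- ToC entry 2\n- # First Heading\n- body text\n' > page.md
sed -n '/^- #/,$p' page.md > trimmed.md   # print from the first heading to end
cat trimmed.md
```

Everything before the first "- #" line (the pasted ToC) is dropped.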

Please clarify.


Thank you, I’m now researching how to apply the regex to Google Docs imported as .md files.
Bullet threading appears to be the Logseq terminology.
I have installed the Bullet Threading plugin; however, it does not seem to apply to another imported doc, as per the screenshot. This is with the headers cleaned up, and I must have manually tabbed the sub-header and the text belonging to the part above.

FYI you don’t have to do that stuff within Logseq. Any scripting method that is able to read and write to plain text files within your filesystem suffices. If you run macOS, then the built-in Script Editor (or even Shortcuts) is good enough to get the job done.
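For example, on any system with a shell, the cleanup can be run over every page in the graph’s folder at once (close Logseq first, and keep a backup). Directory and file names below are illustrative, not Logseq requirements:

```shell
mkdir -p pages
printf '# Title {#title}\n' > pages/demo.md   # stand-in for an imported page
cp -r pages pages.bak                         # back up before bulk-editing
for f in pages/*.md; do
  sed -i -E 's/ \{#[^}]*\}//g' "$f"           # strip pandoc-style heading attributes
done
cat pages/demo.md
```

The backup copy keeps the originals intact in case the regex eats something it shouldn’t.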


Oh, so Linux Mint 21.3 → Terminal → vim fileConvertedFromGdocToMd.md → apply the regex here somehow.

I guess I could look up how to tab the lines (blocks?) to make them children of a parent?
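On that last point: in Logseq’s Markdown files a child block is just a "- " line indented one tab deeper than its parent, so "tabbing" lines can itself be a sed one-liner. A toy example that makes lines 2-3 children of line 1 (the file and line range are made up):

```shell
printf -- '- parent\n- child one\n- child two\n' > demo.md
sed -i '2,3s/^/\t/' demo.md   # indent lines 2-3 one tab -> children of line 1
cat demo.md
```

The same address-range trick generalizes: find the line range under each heading, then prepend the right number of tabs.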