Option to Make Parser Respect Standard Markdown

ramen · March 22, 2021, 12:22am

Hi,

As far as I’ve seen, Logseq saves its files with the extension .md, but the content of the files, although human-readable text, does not really respect Markdown conventions in its formatting.

E.g., the start of each “bullet” seems to be indicated by the string "## ", but standard Markdown already has a way to delimit such blocks: if they are interpreted as paragraphs, a blank line indicates their start. As per John Gruber’s Markdown syntax documentation:

A paragraph is simply one or more consecutive lines of text, separated by one or more blank lines.

(I.e. a double press of return creates a new block/paragraph)
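As a rough illustration of that rule, here is a minimal Python sketch that splits text into blocks at blank lines. The function name `split_paragraphs` is made up for this example and is not part of Logseq; it also assumes Unix line endings and treats lines containing only spaces or tabs as blank, as Gruber’s documentation does.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs separated by one or more blank lines.

    Hypothetical helper for illustration only, not Logseq code.
    """
    # A newline followed by one or more "blank" lines, where a blank
    # line may still contain spaces or tabs.
    chunks = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    return [chunk.strip() for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    sample = "First line\nsecond line\n\nNext block\n\n\n  \nLast"
    for block in split_paragraphs(sample):
        print(repr(block))
```

Each returned string would then correspond to one block, with single line breaks inside a block preserved.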

I’ve seen that the #'s make higher levels of indentation work nicely, but I think maybe using a number of dashes “-” to indicate the indentation level would not deviate as much from other flavors of markdown. (This would also free up #'s to indicate headers that separate bullet points which could be nice to have.)

I get that logseq will likely need its own flavor of markdown but as of right now I don’t really think the files it saves should end in .md, as bullets aren’t even just markdown headers if they contain more than a single line of text.

Zhiyuan_Chen · March 22, 2021, 3:38am

Hi, ramen! Thanks for your advice!
We’re currently working on a Markdown export feature, which exports headings (‘##’) as bullet lists (‘-’) and ignores properties in the output. It’ll respect Markdown conventions much better. We’re also considering accepting bullet lists (‘-’) in place of headings.
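
To make the idea concrete, a conversion of this kind could look roughly like the Python sketch below. This is not the actual Logseq exporter: the name `headings_to_bullets`, the mapping of ‘##’ to the top list level, the two-space indent, and the dropping of blank lines are all assumptions made for illustration, and property handling is left out.

```python
import re

# Assumption: "##" is the top-level bullet and each extra "#" nests one level.
BASE_LEVEL = 2
INDENT = "  "
HEADING = re.compile(r"^(#{2,})\s+(.*)$")


def headings_to_bullets(text: str) -> str:
    """Rewrite heading-style blocks as a nested '-' bullet list.

    Hypothetical sketch, not the Logseq export implementation.
    """
    out = []
    depth = 0
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Heading depth decides how far the bullet is indented.
            depth = len(match.group(1)) - BASE_LEVEL
            out.append(f"{INDENT * depth}- {match.group(2)}")
        elif line.strip():
            # Continuation line: align it under the text of its bullet.
            out.append(f"{INDENT * depth}  {line.strip()}")
        # Blank lines are dropped to keep the list tight.
    return "\n".join(out)


if __name__ == "__main__":
    source = "## Parent\nmore text\n### Child\n#### Grandchild\n## Sibling"
    print(headings_to_bullets(source))
```

Continuation lines are indented to sit under the bullet text, which is how standard Markdown keeps multi-line content inside a single list item.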

andrea · March 22, 2021, 12:58pm

Would the markdown only be generated during a manual export or also used for the content of the files?
The latter would be preferable as it means we can work with the same directory from multiple applications without the need to export manually.

ramen · March 22, 2021, 4:14pm

@Zhiyuan_Chen Hi, thanks for your reply! The export feature sounds very nice to have! Still, I think that @andrea is right. It would be great to have true markdown for the underlying data, as one would then be able to make use of the vast library of already available markdown parsers and processors (e.g. pandoc for converting markdown to beamer presentations).

Maybe one could modularize the logseq parser? I.e. allow the option to use any token to indicate new bullet points. I’d strongly favor blank lines to indicate new bullet points because for my workflow it is key to interpret bullet points as paragraphs of continuous text. Anyway, I’m very impressed by logseq so far and I’m hopeful for its future!

Zhiyuan_Chen · March 23, 2021, 2:47am

For now, the underlying data is still in the heading (‘##’) format with some PROPERTIES syntax (from org-mode). We plan to use true Markdown for the underlying data. So currently, only manually exported files are true Markdown (this feature will be released soon).

Zhiyuan_Chen · March 23, 2021, 2:51am

Good idea! A customizable parser sounds great; we’ll do some research on this!

andrea · March 23, 2021, 8:30am

Makes sense. In the future it would be nice if the underlying data were in Markdown, with the properties stored as comments, e.g.:

<!--- comment -->

When the underlying data does not match the exported data, much of the benefit of having the data as actual files is lost; it becomes more like an internal database that needs to be manually exported.
Anyway, I’m still excited about this feature and happy that it will be released soon. Great work!

psynikal · January 5, 2023, 7:48pm

What is the state of this now? For context, I’m trying to import a few thousand Markdown documents and would like to have them mostly as one giant block (rather than having them interpreted using Logseq’s custom Markdown format).

lewisia · February 18, 2025, 1:38am

I’m also curious about this.

I see marked in the package.json, but I don’t know enough about the Logseq codebase yet to know if it’s used as the universal parser. There’s also the mldoc parser in the Logseq GitHub organization (mentioned as being used here and in the main readme), but it looks nearly abandoned.

Is there a main parser used by Logseq internals and frontend, and if so, is there a place in the repo where that code is centered? I’d love to contribute if I can; I’m still getting up and running with the codebase and ClojureScript :/

Specifically, I noticed a bug in Logseq’s parsing of inline code blocks that I wanted to fix (like this GFM example yet with more backticks): can't escape 3 backticks with a 4 backtick wrapper: ```lang \n code \n ``` (trailing text)
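
For reference, CommonMark and GFM require a closing fence to be at least as long as the opening one, so a wrapper longer than any backtick run in the content should never be closed early. Below is a minimal Python sketch of picking such a wrapper; `wrap_code` is a made-up helper for illustration, not Logseq or mldoc code, and it is deliberately conservative in counting backtick runs anywhere in the content rather than only at line starts.

```python
import re


def wrap_code(content: str, min_fence: int = 3) -> str:
    """Wrap content in a backtick fence longer than any backtick run inside it.

    Hypothetical helper for illustration only.
    """
    # Length of the longest run of consecutive backticks in the content.
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    fence = "`" * max(min_fence, longest + 1)
    return f"{fence}\n{content}\n{fence}"


if __name__ == "__main__":
    three = "`" * 3
    inner = f"{three}lang\ncode\n{three}"
    # The inner block uses 3 backticks, so the wrapper gets 4.
    print(wrap_code(inner))
```

A spec-conforming parser should render the wrapped result as a single code block whose content still shows the inner three-backtick fence.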

Having a compartmentalized parser that follows a spec like GFM with Logseq-specific behavior encapsulated in an extension could make Logseq easier to develop by separating/externalizing general Markdown parsing issues while making it easier to isolate and address Logseq specific ones.

I know that could mean major refactoring, depending on the approach Logseq currently takes to parsing. It’s also a massive ask for an outsider as a first comment, really sorry about that. Either way, I’d be curious to hear the latest on how Logseq is doing parsing!