Knowledge Management for Tags / Tag Hierarchies

Let’s continue the discussion on the other thread Would a rich commitment to hierarchies and classification be an anathema to Logseq culture? - #10 by gax
I’ll post there shortly.

I realise that it’s been some time since this thread opened, but if the topic of tagging is still being discussed may I add a comment.

I’ve searched the entire forum, but it appears nobody has thought of using the model employed by WordNet as a structure for defining tags; I would recommend taking a look, not least because it is very well tested.

For those who may not have encountered it, WordNet is in essence a structure for defining natural language concepts – words, phrases – and relationships between them. Think of it as a “metamodel” for language (not just English, any language). We describe everything using language, so the meta model concepts are generalisable to all fields of endeavour. WordNet is (indeed cannot be) complete in respect of every discipline – history, medicine, biology, chemistry etc. – but it can be extended into them without having to add to its “language metamodel”.

From the point of view of tagging, the key elements of WordNet that I would focus on are synonyms, parts of speech, hyper- and hyponyms, and mero- / homonyms. There is more, but all tags I have come across relate to either things (the tag is a noun) or actions (the tag is a *verb) and these concepts suffice. If there are live examples of parts of speech other than nouns and verbs being used as tags I’d like to hear about them.

WordNet recognises that the same concept may be described different words, i.e. synonyms. In fact its core construct is the “synset”, the group of words that all describe the same thing. A big part of the problem with tagging as an information discovery tool is that different people at different times use different synonyms for the same concept. When we are looking to find something we are basically asking the question, what do I know about X? X is a concept, like a ship or a fight or a battle, not necessarily any specific term (word). In Logseq a tag is a note, and a note can easily define synset (i.e. a list of words), it ought to be relatively straight forward to identify the note containing the given “tag word” and link to it rather than straight to a note having that word as its title.

Words, of course, may have multiple meanings. English (for example) freely uses nouns as verbs; there is a big conceptual difference between a fight and to fight or a fly and to fly. Synsets have descriptions of the concept they define (there are four for fly as a noun and 14 for fly as a verb!). Tagging to the synset rather than the word resolves the ambiguity making subsequent query more precise.

Hyper- and hyponyms are respectively generalisations and specialisations of concepts. Thus “dog” (in the sense of a member of the genus Canis that has been domesticated by man since prehistoric times; occurs in many breeds) has “canine” (any of various fissiped mammals with nonretractile claws and typically long muzzles) as a hypernym and various breeds of dog as hyponyms. Importantly, WordNet differentiates between a concept and an instance of that concept. “Labrador Retriever” is a hyponym of “dog”; my dog “Pepper” is an instance hyponym of “Labrador Retriever”.

Finally – at least from the perspective of tagging – WordNet describes mereological (i.e. part-part of) relationships; thus the concept of an (internal combustion) engine has part meronyms of camshaft and cylinder. Cylinder and camshaft have engine as their holonym.

Anyway, a suggestion that I feel is worth investigating further.

3 Likes

Interesting! I’d never heard of WordNet before.
I found the wikipedia article helpful WordNet - Wikipedia .

There has been another suggestion by @menelic to add properties to every link:

This is something of a superset of this proposal, I wonder if both could be merged.

1 Like

Cosma looks very interesting, I’d not come across it before or Jugglr for that matter. Just goes to show the value of this forum in cross pollinating ideas. Thanks.

1 Like

I think there’s a conceptual problem here in how people store information and take notes rather than a technological one to be solved by features.

It’s key to understand that extensive keyword tagging in PKM graphs is detrimental. Exactly the kind of keyword tagging that you use in databases lke Zotero to find stuff will completely ruin your PKM system (been there, done that :flushed:).

Here is a discussions about that topic that I found extremely helpful: https://www.reddit.com/r/ObsidianMD/comments/n7m5gx/backlinks_vs_tags/

Key quote:

Extensive content-based tagging is a known anti-pattern because tags create a weak association at best between notes.
By using content-based tags you are making yourself feel that you are creating associations but you are still really shifting the burden to your future self to figure out why the notes are associated.
This becomes a significant problem when you have a larger corpus and your tag ontology begins growing beyond control.

That said, I’d totally vote for qualified semantic links in the form of something like [[qualifier::note]], e. g. [[because::tags only create a weak association between notes]] or [[parent::product management MOC]] Currently I am sometimes using qualifier tags (paired with tag-specific CSS for making them visually stand out) as a workaround. (Disadvantage: separate qualifier tags are not coupled with the referred note, just weakly associated by position.)

PS: tags already are hierarchical if you want to. Because tags and pages are equal in Logseq. That feature is already there.

2 Likes

Isn’t this proposal enough to encode more relational information? The references we have now would be weak connections while relations specified by properties would be stronger connections in the knowledge graph.

1 Like

Yes and no. For some of the use cases I read here it totally seems to make sense (as far as I can judge). From my own perspective not so much, though. Here’s why:

In contrast to qualified links page properties stored in the header are not contextual and not specific. I can neither create nor discover them in writing. Moreover, if there is more than one relationship modelled via page properties I cannot know or indicate which of these is meaningful in my note’s context. You need to model, manage, and keep all relationships in the specific page header rather than having them spelled out where they actually matter. This obstructs one of Logseq’s core features, namely emerging connections between notes within context.

I think that we have two significantly different – and moreover: conceptionally incompatible – mental models at play here in how we approach working with knowledge:

  1. Relational Database (RDB): distinct (semi-)structured objects of more or less similar types with specific key/value fields (modeled as tables) storing data and modelling relationships
  2. Graph: unstructured objects of varying types with relationships expressed as edges between graph nodes. (Nodes can have key/value properties, ideally storable anywhere in the node)

Several of the contributors here seem to approach Logseq from a RDB perspective and not from a note graph perspective. Fair enough. But if you approach a graph-based tool with a RDB perspective you won’t get far (and vice versa). You can observe similar tendencies in the Obsidian neighbourhood where people are using DataView to tweak the system into a classical database (with all relevant data stored as key/value table in the header rather than the note body). To me that’s like trying to swim faster by putting on the best running shoes because they work so well on land :wink:

TL;DR: If you need a RDB, use a RDB and not a graph. The only tool I know that does both fairly well is Notion. And it does so by encapsulating the RDB blocks and keeping RDB records and graph nodes separate from each other. That way you can interact with both using an appropriate interface etc. without shoehorning one concept badly into the other.

PS: I totally acknowledge that there’s a warranted need combining both models in one ‘Integrated Thinking Environment’, and being able to use the fitting one depending on which kind of content you deal with. I just strongly object against turning any decidedly graph-based tool into a mishmash that tries to do both at once – and consequently neither well because the fundamental models involuntarily are clashing. That said: how about following Notion in that respect and implementing an RDB model on a block basis (perhaps using CSV instead of MD for storage)?

3 Likes

So if I understand it correctly you want to enrich references with a property. But the syntax you mentioned would create a page for a single statement.

Instead I would extend my approach to blocks:

- property:: <block-id>
  A statement.

for example:

- because:: 1234-5678-1234-5678
  A statement.

The block with ID 1234-5678-1234-5678 contains another statement that in this example is the reasoning that justify the first statement.

What do you think?

Yes, Logseq derives a graph from references but I don’t see why Logseq should prefer this model over the RDB one. I think Logseq just provides very general and abstract tools and each user tries to come up with a data structure that work for them.

But I think both graph UI/UX and RDB UI/UX like queries are not yet ripe in Logseq. I hope Logseq will develop both and the tools to integrate them.

Isn’t this already the case? Queries work on a block basis by default: the results of queries are blocks and their properties can be used in queries to filter or in the visualization (columns of a table at the moment).

No necessarily (see Introduction to Semantic MediaWiki - semantic-mediawiki.org – where I borrowed the syntax): the first part before the :: separator would actually create a page or block property with the second part as the property’s value.

The significant difference lies in the storage format of the data (and consequently how you can edit/mangle/query/… it). In an RDB you have a fixed set of fields in a fixed tabular structure. When you add or delete a field to the DB schema this impacts all records in the table. And you can only store content that adheres to the RDB table schema (a massive benefit if your data is structured). The only common aspect is that in both cases you can store and query key/value pairs per record.

1 Like

Which is most probably why the database engine on which Logseq is built is a graph database: It needs to be highly malleable :slight_smile:

The first answer here (sql - Comparison of Relational Databases and Graph Databases - Stack Overflow) is explains it better than I could quickly:

There actually is conceptual reasoning behind both styles. Wikipedia on the relational model and graph databases gives good overviews of this.

The primary difference is that in a graph database, the relationships are stored at the individual record level, while in a relational database, the structure is defined at a higher level (the table definitions).

This has important ramifications:

  • A relational database is much faster when operating on huge numbers of records. In a graph database, each record has to be examined individually during a query in order to determine the structure of the data, while this is known ahead of time in a relational database.
  • Relational databases use less storage space, because they don’t have to store all of those relationships.

Storing all of the relationships at the individual-record level only makes sense if there is going to be a lot of variation in the relationships; otherwise you are just duplicating the same things over and over. This means that :point_right:graph databases are well-suited to irregular, complex structures.:point_left: But in the real world, most databases require regular, relatively simple structures. This is why relational databases predominate.

(note the highlight :))

2 Likes

Oh OK, so from the linked PDF:

The page is about the book “The Picture of Dorian Gray” and somewhere in the page the author “Oscar Wilde” is mentioned.

So I suppose the syntax for “Oscar Wilde” would be something like:

[[author::Oscar Wilde]]

That would be treated by Logseq as a link like [[Oscar Wilde]].

But in this example author:: clearly refers to the page The Picture of Dorian Gray. But I think this wouldn’t work in Logseq that is block-based.

I think a block like this in Logseq would make sense:

- [[title::The Picture of Dorian Gray]] is a [[type::book]]
  written by [[author::Oscar Wilde]].

Rendered, with links, as:

The Picture of Dorian Gray is a book written by Oscar Wilde.

But at the same time that block gain the properties:

type:: book
title:: The Picture of Dorian Gray
author:: Oscar Wilde

This is brilliant! Now I really want this implemented!

7 Likes

A bit better, isn’t it? :relaxed:

Still, while the format above works well for SMW’s page-centric model it does not yet feel completely fit for Logseq’s block model to me. But I have a hunch that with another round of deeper hands-on thinking we could actually come up with a model & syntax that’s truly block-based.

Basically we would somehow need to explicitly declare the left-hand side of the relation, as well. Your solution of using title:: for that purpose might come close. On first sight it’s utterly elegant from my perspective :slight_smile:

But there’s a catch (sigh…): does [[title:: The Picture of Dorian Gray]] then create a title:: property for the block (I assume not, because it would break the current file storage model) or for the page [[The Picture of Dorian Gray]]. How do we know then that the other properties declared in that block are either properties of the page [[The Picture of Dorian Gray]] or of the block? This is tricky.

They are all properties of a block, that’s just how we add data to the database in Logseq. If a block is the first one of a page its properties become the properties of the page (basically a page is a block with a name). Here there is an example of a page with properties and a block used to add an entry to the database:

This is the title of the page.md
- title:: This is the title if the page
  another-page-property:: ...
- type:: book
  title:: The Picture of Dorian Gray
  author:: Oscar Wilde

So you can just use blocks with all the properties you want, even title::
If you want to use a page to define a DB entry you can do so: but if you use the special property title:: you must be OK with that being the title of the page (and the associated file name).


At the moment if I write this block:

- [[The Picture of Dorian Gray]] is a [[book]] written by [[Oscar Wilde]]

This block link together three pages without specify any kind of relation, they are all just mentions.


If I want to store the above info with the RDB model I would write this block instead:

- type:: book
  title:: The Picture of Dorian Gray
  author:: Oscar Wilde

These two methods are redundant; the former is better for human reading and fits into a longer text; the latter is a RDB entry.

I’m saying that with the proposed syntax it would be possible to specify a DB entry with a sentence, so just one block that fits in a longer text and also encode the relations:

- [[title::The Picture of Dorian Gray]] is a [[type::book]] written by [[author::Oscar Wilde]].

So basically “Oscar Wilde” is the author of a block, this block is defined as a “book” and again this block has a property, “title”, that is “The Picture of Dorian Gray”.

I hope this makes clear how we use Logseq with a RDB model.

Anyway thank you very much for making me aware of this!

2 Likes

P.S.

It would be really cool if an algorithm (maybe some Machine Learning) could automatically suggest references and properties.

For example if I write:

- The Picture of Dorian Gray is a book written by Oscar Wilde

it would ask me if I want to turn it into this:

- [[title::The Picture of Dorian Gray]] is a [[type::book]] written by [[Oscar Wilde]]

So that with a query I can for example display a table with all the blocks with type::book and see the above as an entry:

Title Author
The Picture of Dorian Gray Oscar Wilde

Even the opposite would be cool: if I have some kind of structured data (MySQL DB, CSV, JSON etc) I could programmatically generate Markdown files, for example:

The Picture of Dorian Gray.md
- [[title::The Picture of Dorian Gray]] is a [[type::book]] written by [[Oscar Wilde]].

and

Oscar Wilde.md
- type:: author
- ## Books
  - Oscar Wilde wrote the following books:
    {{ query (and (property type [[book]]) (property author [[Oscar Wilde]]) }}
3 Likes

jq can do this easily :slight_smile: See GitHub - 0dB/dayone-json-to-obsidian: Update Obsidian vault from Day One (“DayOne”) JSON using command line scripts. for an example how to. I recently converted the JSON export of my pinboard bookmarks to markdown files that way. Having ca 23k bookmarks from ca. 15 years of bookmarking with pinboard lying around as individual MD files turned out to be a somewhat dead end approach, though :grimacing:

2 Likes

@brsma can I write a proposal for [[property::reference]] syntax or do you have any other doubt?

Before any proposals are written, may I make a couple of points?

First and foremost, giving a link a type using the syntax [[property::reference]] will break interoperability between Logseq and other Markdown editors that interpret [[ … ]] as a Wikilink. Logseq uses (fairly standard) markdown files as persistent storage backing a more transient graph database. Moreover, Logseq continuously monitors the markdown files so that if you edit one using an external editor it updates immediately, not on the next restart. Clearly Logseq’s designers thought interoperability sufficiently important to forego some of the benefits of just using a database. If we’re not going to stick with standard markdown, Logseq might as well just use a database.

Second, is the suggestion that the syntax [[property:reference]] constrains the reference to be to a note that is somehow defined to be an appropriate target? In a relational database this can be implemented using referential integrity, i.e. a constraint requiring that a foreign key in one table be an extant primary key in some other table, but this requires a type system and, at least as yet, Logseq is entirely untyped. There are, for example, no constraints on the values you give to purportedly the same property in different notes; it is entirely legitimate, albeit nonsensical, to define the author:: [[Oscar Wilde]] (a link) in one note, author:: Oscar Wilde (a string) in a second; and author:: 42 (a string that look like a number) in a third.

Logseq might well benefit from a type system, but I suggest that typing links and properties are closely related and any proposal should consider both. That said, I don’t see adding a type system as inordinately difficult. In a functional language like Clojure types are ultimately defined as functions; fundamentally all that is needed is a mechanism for calling Clojure code when certain events occur on a block.

1 Like

That’s a good point and my reply is don’t use this syntax if you want to keep interoperability, just as we do with tons of custom syntax that Logseq adds to Markdown.

It doesn’t seem the case at all for me. If I open my Logseq graph with Obsidian the pages are a mess of custom syntax. Basically only wikilinks are in common with Obsidian. By your reasoning we shouldn’t use queries or block references for example. Sorry but Logseq is not built for interoperability at all, you need to export your pages in standard Markdown manually.

But I agree that if an user adopts the proposed syntax then even compatibility with Obsidian’s wikilinks is lost and I haven’t a solution other than the following, if it’s possible to implement it without ambiguities and UI/UX issues:

- title:: [[The Picture of Dorian Gray]] is a class:: [[book]] by author:: [[Oscar Wilde]].

But the above properties:: must take as value only the following [[wikilink]] and ignore , that usually separates multiple values. This can be confusing to the user.

The idea is that [[property::value]] is exactly the same as property:: [[value]] but inline. There wouldn’t be a way to write the equivalent of property:: value and I don’t see why it would be needed since Logseq interprets [[value]] and value as the same, as you explained when mentioning types.

And about types, I don’t see what would be the advantage and it would make everything more complicated for the user. For me Logseq does a good job in understanding what the user means (42 is an integer etc).

Yours is an interesting reply anyway, thanks.

1 Like

I suspect we use Logseq in different ways, I like using Typora as an editor. Whilst it is odd to see every block as a bullet point, it is at least correctly rendered and links work. Altering the link syntax will break that, which may be perfectly acceptable, and as you say one doesn’t have to use it.

Regarding Logseq’s syntax for queries, you’re right. In my mind it should use the fairly widely accepted triple backtick+language syntax, which is interpreted as a code block for inline execution (see, for example, R Markdown). The advantage is that it’s backwardly compatible with the double backtick syntax for rendering a code block; editors that can’t launch an interpreter just render the code. Given that queries are written as a data structure defining the operations to be undertaken that is passed to a function, it wouldn’t be too difficult to convert {{query (….) }} to (query (…)) — i.e. a Clojure form — within a triple backtick code block. I suspect that’s what’s going on under the hood in any case.

I know this is slightly off topic, but giving Logseq the capability to execute arbitrary Clojure code blocks on the occurrence of various internal events would be enormously useful. It is, however, a matter for another thread I feel.