Different ways to structure data

and_yet_it_moves · May 8, 2023, 6:18am

I’m struggling with this as well.

I agree with previous points that there seems to be some convergence in the note-taking field towards a “mixed strategy” including

bidirectional [[wikitextlike-links]]
categorization hierarchy
tagging

I have observed this particularly within the Obsidian community.

Goal, desired properties

One way to frame the problem would be: With an increasing number of notes, it’s not that easy to find a convention or structure that

is low friction
- meaning: fast entry and low maintenance
- would be helped by: allowing for adding structure/semantics incrementally[1][2]
fights duplication
- e.g. “do I already have a note related to what I’m about to write now?”
- relates to entity linking, single point of truth, linked data vocabulary
allows heavy linking
- e.g. in LogSeq, linking to blocks has many drawbacks vs. using ordinary [[page links]]
allows at least some rudimentary strategy for invariant enforcement/schema adherence/integrity checks (be it partly or fully manual)

…this list goes on and on…

…and here, far down - with regards to short-term feasibility, definitely not desirability! - we’re approaching holy grails such as

connecting our personal knowledge base, with full structure and semantics, to a global, distributed one
- prototype example: https://graph.global (by Mek of Open Library/Internet Archive)
- implemented, somewhat messy, example: https://anagora.org
applying machine reasoning to our knowledge base

This is elaborated on further in the closing section - but first: LogSeq.

LogSeq functionality considerations

I’m not sure more “special features” of LogSeq would take us further in this regard. As has already been pointed out: structure is a very personal preference.

I think the best LogSeq can do right now is to provide as many general, use-case-agnostic capabilities as possible, and to do it as well as possible - e.g. allowing user customization through general and stable querying functionality, a solid properties:: implementation, and, also of importance, a customizable interface (example feature request).

That would allow experimentation. Among the user base, various strategies for structure could evolve, and the community as a whole can get inspiration and gain knowledge. Some convergence could be expected with time.

My current LogSeq schema

Only partly implemented, and in no way a perfect solution.

It consists basically of the following:

A hierarchy of categories

encodes: is-a relation
through: namespaces

example: [[vehicle/boat/submarine]]

The above page has a
alias:: submarine
for allowing shorter [[submarine]] link names.
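
To make the is-a encoding concrete, here is a minimal Python sketch (run outside LogSeq) of how a namespaced page name implies its chain of parent categories. The function names category_ancestors and is_a are hypothetical, chosen only for illustration, and this is not how LogSeq itself resolves namespaces.

```python
def category_ancestors(page_name: str) -> list[str]:
    """Return the parent categories implied by a namespaced page name, nearest first."""
    parts = page_name.split("/")
    # Every proper prefix of the path is an ancestor category.
    return ["/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def is_a(page_name: str, category: str) -> bool:
    """True if page_name is the category itself or lies below it in the namespace tree."""
    # The trailing "/" avoids false hits such as "vehicle/boathouse" under "vehicle/boat".
    return page_name == category or page_name.startswith(category + "/")


print(category_ancestors("vehicle/boat/submarine"))
```

So a submarine is a boat, and a boat is a vehicle, purely by virtue of the page name.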

I have chosen not to adhere to strict subtyping for my category taxonomies. Yes, Barbara is left unsatisfied as a consequence. The opposite choice here probably could allow for some clever tricks on how page properties could be utilized. But how the possible LogSeq queries that could perhaps make use of that would look… I don’t even want to think about. Even less, debug them.

Instance-to-category assignment

encodes: instance-of relation
through: a type:: page property with the category as target

example: page [[Boaty McBoatface]] has a page property
type:: [[vehicle/boat/submarine]]

Then, each category page has a query that lists all its instances.
example: page [[vehicle/boat/submarine]] has a query
{{query (page-property type <% current page %>)}}
which will list all submarines:

Boaty McBoatface
HSwMS Östergötland
…

drawbacks:

refactoring is very tedious
the LogSeq bug of not resolving <% current page %> when page is opened in the right sidepane is a great nuisance
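
For those who want to cross-check the instance lists outside LogSeq, the same lookup can be sketched in Python. This is only a rough sketch under assumptions: pages are given as a plain dict of page name to Markdown text, the type:: property sits on a line of its own, and the function name instances_by_category is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the property is written on its own line as "type:: [[a]], [[b]]".
TYPE_PROPERTY_RE = re.compile(r"^\s*(?:-\s+)?type::\s*(.+)$", re.MULTILINE)
PAGE_LINK_RE = re.compile(r"\[\[(.+?)\]\]")


def instances_by_category(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each category page to the sorted names of the pages typed with it."""
    index = defaultdict(list)
    for name, text in pages.items():
        match = TYPE_PROPERTY_RE.search(text)
        if match is None:
            continue  # page has no type:: property
        for category in PAGE_LINK_RE.findall(match.group(1)):
            index[category].append(name)
    return {category: sorted(names) for category, names in index.items()}


pages = {
    "Boaty McBoatface": "type:: [[vehicle/boat/submarine]]\n",
    "HSwMS Östergötland": "type:: [[vehicle/boat/submarine]], [[military_thing/naval_vessel]]\n",
}
print(instances_by_category(pages)["vehicle/boat/submarine"])
```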

I use faceted classification:

meaning: we can have multiple, distinct, hierarchical taxonomies, and the final classification of the page will be the intersection of the assigned category node for each taxonomy hierarchy
this is possible in LogSeq since properties can have multiple values, we get optional support for
example: page [[HSwMS Östergötland]] has type:: [[vehicle/boat/submarine]], [[military_thing/naval_vessel]]
in the field of information science, faceted classification is generally considered a very good thing - and it does bring a lot of benefits to my LogSeq classification system
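
The "intersection" idea can be sketched in a few lines of Python. This is an illustration under assumptions, not something LogSeq provides: page_types is a hand-built dict from page name to its assigned categories, and classified_under is a hypothetical helper name.

```python
def classified_under(page_types: dict[str, set[str]], *facets: str) -> list[str]:
    """Return pages that fall under every given facet (the category itself or a sub-category)."""

    def matches(assigned: set[str], facet: str) -> bool:
        return any(c == facet or c.startswith(facet + "/") for c in assigned)

    return sorted(
        page
        for page, assigned in page_types.items()
        if all(matches(assigned, facet) for facet in facets)
    )


page_types = {
    "Boaty McBoatface": {"vehicle/boat/submarine"},
    "HSwMS Östergötland": {"vehicle/boat/submarine", "military_thing/naval_vessel"},
}
# Boats that are also military things: only pages matching both taxonomies remain.
print(classified_under(page_types, "vehicle/boat", "military_thing"))
```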

I don’t use polyhierarchies:

(and it wouldn’t be possible if using LogSeq namespaces for the category tree)
meaning: every category node has at most 1 parent node
a beneficial consequence is that it allows short, pragmatic node/category names, while we’re still conforming to the all-some rule: it’s generally easy to append an additional hierarchical level, with a short name, in order to allow further refinement/specificity
example: a category path /military_thing/naval_vessel/ can, because of no polyhierarchies, be read equivalently as /military_thing/military_thing--naval_vessel/. All military_thing--naval_vessels are military_things - so /military_thing/naval_vessel/ conforms to the all-some rule.

Leaning towards fewer and longer pages

benefits:

hierarchy enforcement
faster entry
faster re-factoring
- moving stuff around within the block hierarchy of a page is easy and fast

drawbacks:

linking to blocks is inferior to linking to pages (and the need increases as we reduce page granularity)
the target of a tag can’t be a block - it can only be a page, so this is a general drawback of less fine-grained pages
the still-present, much-too-old LogSeq UI bug of sometimes not displaying the full page is a terrible friction point (…this bug manifests itself both in the main pane and side pane!)

extensively using, for better overview within a page:

headings (# …, ## …, ### …)
folding

Tagging

i.e.

page tags, through a tags:: myTag page property
tagging of individual blocks, through #myTag

used for: less formal, sometimes ad hoc, additional structure, such as marking a page as belonging to some bigger area, or collecting a number of related pages (instances or categories) together

benefit:

easy to tag not just pages, but also individual blocks

“Authority control”

For each page, I try to add some page property where the value is some URL pointing to some external reference for the intended scope of the page. Usually this is a link to a Wikipedia article. The purpose is to attach an identifier to the page that I can use if, at some later point in time, the page name isn’t enough for me to quickly determine what the intended scope of the page is. This can be seen as some rudimentary/poor man’s linked data vocabulary or authority file connection.

Example: When re-visiting the page [[grammar]] I might ask: does it refer to my internal LogSeq grammar? To grammar in natural language? To database grammars? To help comes the page property wikipedia:: https://en.wikipedia.org/wiki/Formal_grammar which answers the question.
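
Since this is a manual convention, a small integrity check can help. Below is a minimal Python sketch that lists pages lacking any URL-valued property. It assumes pages are available as a dict of page name to Markdown text and that such a property sits on its own line as key:: https://…; the name pages_without_authority_link is hypothetical.

```python
import re

# Assumption: an "authority" property is any "key:: http(s)://..." line, e.g. wikipedia:: https://...
URL_PROPERTY_RE = re.compile(r"^\s*(?:-\s+)?[\w-]+::\s*https?://\S+", re.MULTILINE)


def pages_without_authority_link(pages: dict[str, str]) -> list[str]:
    """Return the sorted names of pages that have no URL-valued page property."""
    return sorted(name for name, text in pages.items() if not URL_PROPERTY_RE.search(text))


pages = {
    "grammar": "wikipedia:: https://en.wikipedia.org/wiki/Formal_grammar\n- some notes\n",
    "frame (AI)": "alias:: frames (AI)\n- some notes\n",
}
print(pages_without_authority_link(pages))
```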

Further comments

this system is most of all just a convention I have for myself, in order to have an established standard for how to enter and structure information in LogSeq. Its intention is not to allow a lot of “implementation” such as using a lot of clever queries etc. I use queries very sparingly. I’ve gone down that route a few times, but the unstable state of LogSeq is just too prohibitive (bugs, and partly incomplete documentation)
about is-a and instance-of relations
- difference between the two: the type-token distinction
- I don’t use “category” pages differently from “instance” pages, e.g. pages of both kinds can contain contents, and can be used as tags
I make heavy use of aliases
- often I want contents for similar but perhaps not distinct concepts on the same page, so terms for those concepts would be alias-ed to the same page
- I usually provide aliases for alternative spellings, for singular+plural forms, and for both abbreviated
  and unabbreviated versions (yes, due to combinatorics this can indeed end up with a page having a lot of aliases)
- this makes linking easier
  - I can just add [[…]]'s around all occurrences in various texts to get linking
  - I can easily find the pages to link by using the “Unlinked References” section
I avoid page names/aliases that are too general, and that are often used in the general language
- example: for the concept of frames in symbolic AI I don’t have a page name/alias “frame”, but stick to a qualified page name such as [[frame (AI)]]
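
The "add [[…]]'s around all occurrences" step can be sketched as a small Python helper. This is only an illustration under assumptions: names holds every page name and alias, matching is case-insensitive, and link_mentions is a hypothetical name. It skips text directly after [[ or directly before ]], but it does not handle a name sitting in the middle of a longer existing link.

```python
import re


def link_mentions(text: str, names: list[str]) -> str:
    """Wrap unlinked occurrences of the given page names/aliases in [[...]]."""
    # Longest names first, so "submarines" is preferred over "submarine".
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(
        r"(?<!\[\[)(?<!\w)(" + alternatives + r")(?!\w)(?!\]\])",
        re.IGNORECASE,
    )
    return pattern.sub(r"[[\1]]", text)


print(link_mentions("Two submarines and a [[submarine]].", ["submarine", "submarines"]))
```

Having both singular and plural forms as aliases is what makes the plural occurrence linkable here without any stemming logic.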

My current graph has:

~500 pages
~18k lines
~123k words
(for pages/*.md, so excluding journals - but I don’t use them much)

The ultimate, non-existing, note-taking system

Here, I’m leaving the LogSeq domain. This section is on note-taking systems in general.

A possible ultimate goal: a note taking system that is formal, fully semantic, fully linked (internally and externally), type-safe and invariant-enforced.

There is currently no such note-taking tool.

I regard the quest for a total knowledge representation system, as in the general information science sense, with full structure and semantics, as a perhaps unsolved problem.

“Note-taking system” might sound innocent, but it is probably one of the harder domains to model. It would need to be all-encompassing: we take notes on facts, ideas, thoughts, beliefs, possibilities, to mention a few. It includes relations that are, e.g.: temporal, probabilistic, causal, conditional. Sometimes a connection/link would need to be specified along all of these dimensions in order to be fully described. Any of these relationship types is not unlikely to send shivers down the spine of a practicing ontologist. That’s about link/relationship types. Another link/relationship stratum would be arity - bi-directional links only would be a limitation (relates to hypergraphs further down). Yet another would be the possibility of linking not only to notes/nodes/entities - but also to some set of such (relates to metagraphs further down).

One direction would be some complete and fully-specified ontology. It would need to include all possible link relationships as well. To grasp the vastness of such a potential ontology, we can have a look at a published ontology for cultural heritage sites. That’s a 240-page pdf - for a reasonably narrow domain.

Another, but partially overlapping, perspective would be to see our data as a knowledge graph or graph database. Well, LogSeq could possibly be described as a knowledge graph. But if we want a knowledge graph that is fully-semantic and fully-typesafe it starts getting complicated. For full generality and expressivity the graph model won’t suffice. Probably not even its generalization, hypergraphs. Probably rather the generalization of those - metagraphs. This is following the lines of thinking of Ben Goertzel - see further down.
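
To illustrate the step from graph to hypergraph to metagraph, here is a minimal Python sketch of a link type with arbitrary arity whose members can be notes, sets of notes, or other links. All names here (Note, Link, mentions, the relation labels) are hypothetical and for illustration only; this is not the data model of OpenCog, Hode or TypeDB.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Note:
    name: str


@dataclass(frozen=True)
class Link:
    relation: str                 # the link/relationship type
    members: tuple[Target, ...]   # arbitrary arity, not just two endpoints


# A link member may be a note, another link, or a set of such targets.
Target = Union[Note, Link, frozenset]


def mentions(target: Target, note: Note) -> bool:
    """True if the note appears anywhere inside the target, however deeply nested."""
    if isinstance(target, Note):
        return target == note
    if isinstance(target, Link):
        return any(mentions(member, note) for member in target.members)
    return any(mentions(member, note) for member in target)  # a set of targets


boaty = Note("Boaty McBoatface")
ostergotland = Note("HSwMS Östergötland")
submarine = Note("vehicle/boat/submarine")

plain = Link("instance-of", (boaty, submarine))                             # ordinary graph edge
about_set = Link("instance-of", (frozenset({boaty, ostergotland}), submarine))  # points to a set
about_link = Link("believed-by", (plain, Note("me")))                       # a link about a link

print(mentions(about_link, boaty))
```

An ordinary graph only needs the first kind of link; the other two are where hypergraph and metagraph expressivity comes in.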

One obvious inspiration in the graph-based realm is the Semantic Web, with its roots in the early 2000s. Impractical, never really fully realized, and using the much-dreaded XML for everything - but, it is impressively rigorous, extensively researched, and very well specified. Within technology, it probably has a world-record in the (no. of published papers)/(actual practical use) category. The intersection of Semantic Web technology and personal note-taking has seen several product attempts, and not surprisingly: even more academic papers. See for example Max Völkel’s thesis (alt.) or his later papers. These are the best sources I have found so far on formalizing link relationships and their types in the area of personal note-taking. For a lighter take on the subject, see for example Jonathan Reeves blog post.

Among initiatives along other but related routes, and more recently, we have e.g. Hode by Jeff Brown. Hode can be described as a note-taking DSL in the form of a hypergraph editor (text+GUI). It is implemented in Haskell, so a checkmark on “type safe” would probably be an understatement. It is/was a single-developer tool, or prototype. It is now abandoned.

An interesting approach, that has not been tried yet as far as I know, would be to implement a note taking system in TypeDB: a fully type safe graph database with built-in schema. That’s a good start, and it gets even better as we encounter such gems as an included inference engine and hypergraph capabilities. The TypeDB+note-taking idea has been mentioned by the creator of Hode, and also by e.g. Robert Haisfield.

One could also take a category theory approach, e.g. with ologs (“knowledge representation with category theory”). Or - one step closer to note-taking, utilizing the Categorical Query Language (CQL) (my understanding: “a graph database language with category theory”). It is created by David Spivak et al., who were also behind the ologs. Ologs have been mentioned before on the forum, e.g. in A meta-graph as a set of linked graphs by @gax, who also briefly mentions CQL. He also presents a take on LogSeq vs graph databases.

For completeness we should also mention frames (my understanding: “knowledge representation inspired by object-oriented modeling”) as a possible note-taking data model.

This section should end with what is perhaps currently the pinnacle of formal, graph-based knowledge representation: Ben Goertzel’s knowledge representation model ([2]). It has its home within his OpenCog symbolic AI system. This model, and implementation, (my understanding:) has very far-reaching expressivity, is fully formalized and totally type-safe, has total invariant/schema enforcement, includes its own ontology, includes/allows all the relationship types I mentioned previously, allows inference (and includes the inference engine), and has relationship expressivity on the metagraph level: relationships have arbitrary arity, and can point not only to entities but to arbitrary sets of entities.