Building a self-hostable sync implementation

GreenTeaRoll · October 6, 2023, 6:06pm

Hi there!

I had commented in another thread about this, but starting a new thread for some more focused discussion.

I’m interested in building an open-source, self-hostable implementation of the sync service. I haven’t started any work on this yet. Some relevant questions:

Is there a specification of the API somewhere (gRPC, OpenAPI, Google Doc, etc), or is the code the de facto specification?
Do the Logseq folks want to contribute anything here (mainly docs/knowledge, and possibly the small changes required in Logseq to allow setting a sync server URL), or will this be a pure reverse engineering effort?
Has anyone else started work in this domain/on a similar project?

I’m a heavy Logseq user, and I’ve started integrating more workflows into it (namely task management) and building tooling on top of it (mainly using the HTTP API server), and it’s gotten to the point where I’d like a robust mechanism to sync between my devices, instead of the Git-based backups I’ve been doing so far.

gacuxz · October 10, 2023, 7:09am

I’ve been using a self-hosted SyncThing setup for about half a year across 3 computers and a smartphone. The only issue is that you should not keep Logseq open on different devices at the same time. It may lead to an “out of sync” state because of some Logseq files.

MoneroDude · October 12, 2023, 11:34am

Same here. It works so far. I do not keep both the laptop AND the mobile logseq apps open at the same time, anyways.

DeepReef11 · October 12, 2023, 3:55pm

I host my own Nextcloud and sync Logseq folders. It works well except for Android (there seems to be an option in Logseq, but it didn’t work last time I tried), which requires another sync solution like FolderSync or Syncthing as mentioned.

If this is only for syncing the Logseq folder across multiple devices, then solutions already exist. Any syncing solution should work fine. For a more server-oriented solution, a Raspberry Pi can certainly handle it; even Nextcloud can be hosted on one.

If you mean a contribution feature, that’s something I would be very interested in, and I would be willing to contribute in my spare time. Nextcloud is able to tell that there are 2 versions of a file, which you can manually merge afterward, so it is not a big problem.

randomnote · October 12, 2023, 6:59pm

I offer my help from a devops perspective if needed.

DeepReef11 · October 18, 2023, 6:40pm

After some research, I’ve found out that Logseq doesn’t have a WebDAV integration for Android (which prevents syncing with Nextcloud directly). Here is a related thread on Nextcloud integration.

For WebDAV integration, Magisk rclone could be a solution for rooted Android devices, allowing files to be edited on the cloud directly. I tried it, but I wasn’t able to set it up properly. This is not a user-friendly solution.

There’s also a tricky way to sync Logseq with Nextcloud, but downloads to the Android device have to be triggered manually (uploads are automatic). Here is how; a better feature might come later.

GreenTeaRoll · October 18, 2023, 8:58pm

I think folks may be losing the thread here: yes, there are file system-based ways to do syncing (Git, WebDAV, iCloud, SyncThing, etc). They have sharp edges that can lead to data loss/corruption/confusing merges. I imagine this is why Logseq Sync, the built-in feature with a dedicated backend, exists in the first place.

The official sync backend is great, and it’s nice that it has E2E encryption out of the box, but there are benefits to being able to self-host the full sync pipeline by one’s self:

Syncing offline/on air-gapped networks
Full stewardship over your data
Can limit attack surface by not hosting on the public internet (e.g. Tailscale or Nebula)
Not depending on an external service, which may have downtime, high latency from one’s location, etc, etc.
With an open-source backend, one can make tweaks and add any features they deem useful
Being able to audit the code

So the goal of this thread is to figure out what needs to happen to bring a self-hosted implementation of the Logseq Sync backend into existence. This thread has had time to marinate and I don’t think anyone on the thread is part of the Logseq team, so I’m going to assume that building an open-source version will require looking at the client code and building a compatible backend implementation for that.

I’ve got a few other projects I’d like to get out of the way first, but I should have a chance to start work on this this weekend, and I’ll create a GitHub repo then.

RichardJActon · October 20, 2023, 1:42pm

Absolutely agree, this kind of self-hostable sync option which can handle keeping the state consistent better than just synced folders is essential to a smooth experience for any self-hoster who has some less techie people using infrastructure that they run.

(I use git for sync ATM and that has its advantages in terms of a complete change history, but it’s very manual, especially on Android via Termux. Longer history retention is something that a self-hosted option could also add, but which would potentially be a cost issue for the main sync server.)

PS I’m not a real dev just a bioinformatician but I’d be happy to help out with docs and maybe some testing

GreenTeaRoll · October 24, 2023, 3:03am

Okay, I’ve started working on this a bit. Not much to show for it, but:

bcspragu/logseq-sync - Where I’m starting to work on this. Right now it’s just a simple server that serves one endpoint for WebSockets and another for everything else, just to log traffic sent by the client. I have some tweaking to do to get a local Logseq client to play nice with self-signed certificates (or just plain HTTP on localhost)
A branch of Logseq - It doesn’t work yet, but the idea is to add the option to the settings page to specify a custom sync server URL, instead of the current ones.
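
For anyone who wants to try the same traffic-logging approach, here is a minimal sketch using only the Python standard library. It is not the code from the repo: the handler, the `captured` list, and the `probe` helper are all made up for illustration. It records every request and flags WebSocket handshakes (which arrive as a GET with an `Upgrade: websocket` header), but it does not complete the WebSocket upgrade or speak TLS.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Every request the client sends is recorded here as a dict.
captured = []


class LoggingHandler(BaseHTTPRequestHandler):
    """Records each request, then answers with an empty JSON object."""

    def _record(self):
        length = int(self.headers.get("Content-Length") or 0)
        body = self.rfile.read(length) if length else b""
        captured.append({
            "method": self.command,
            "path": self.path,
            # A WebSocket handshake is a GET carrying "Upgrade: websocket".
            "websocket": self.headers.get("Upgrade", "").lower() == "websocket",
            "body": body,
        })
        payload = b"{}"
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    do_GET = _record
    do_POST = _record

    def log_message(self, fmt, *args):
        pass  # silence the default stderr log; `captured` is the log here


def start_server(port=0):
    """Starts the logger on localhost in a background thread (port 0 = any free port)."""
    server = ThreadingHTTPServer(("127.0.0.1", port), LoggingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(server, path, body):
    """Sends one POST to the logger (bypassing any proxy) and returns what it captured."""
    port = server.server_address[1]
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    req = urllib.request.Request(f"http://127.0.0.1:{port}{path}", data=body, method="POST")
    with opener.open(req) as resp:
        resp.read()
    return captured[-1]
```

Pointing a real client at something like this would still need the client-side URL setting and certificate handling mentioned above; the sketch only shows the logging half.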

I’ve also got a doc with a few pages of notes on the shape of the API, but nothing ready to share yet. I’ll clean it up and put it in the logseq-sync repo itself. The good news is that I think the API will be pretty simple to replicate, since the whole thing is end-to-end encrypted, the server is kinda just shuttling data + events back and forth.

I still need to understand the various endpoints and the role of IPC and the format of WebSockets stuff, but I think that should come fairly quickly. One thing I was surprised to find is that there is at least one package in the Logseq codebase where source is not available, which complicates my effort to understand things a bit.

randomnote · October 25, 2023, 10:18am

It would be great if we could get some hints from the devs instead of reverse-engineering the client/server communication.

Thanks

GreenTeaRoll · October 25, 2023, 1:20pm

While I agree, I also think it’s totally reasonable for them to not join in here. Aside from the time/money cost of assisting, Sync is the main hosted/paid for feature of Logseq, so there’s somewhat of a conflict of interest there.

Also, the @logseq/rsapi package is the one that handles lots of the syncing details, and it’s obfuscated, which takes effort to do. The thread I had linked to indicates the team intends to open source it, but that has yet to materialize.

DeepReef11 · October 28, 2023, 12:51pm

Here’s what I’ve found on syncing for Android. This isn’t really an advancement for this topic, but I wish to draw attention to the matter and encourage development in self-hosted syncing.

I’ve been using FolderSync with a cloud storage (Nextcloud in my case). FolderSync is able to sync through WebDAV, but it is quite slow. I’ve been able to make it almost 10 times faster than it was. Here’s my config:

Filtering out bak using a filter: folder name equals bak.
With the above filter, scheduled syncing becomes less battery-intensive, so I schedule a sync every 15 minutes (could be 5 for more intensive use).
Instant push syncs what is being done on the mobile device more quickly and thus helps prevent erasing progress from another device.
When a file is modified on both sides, I use “Always use remote” because I put more trust in the desktop device, but this is likely overkill.
I also disable deletion. This prevents all file deletion, as FolderSync will always sync deleted files back, so it is quite irritating and also likely overkill.
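
To make the first rule concrete, here is a rough Python sketch of what a “folder name equals bak” filter does. It is not FolderSync’s implementation; `files_to_sync` and `make_demo_graph` are hypothetical helpers, and the demo folder layout is made up for illustration.

```python
import os
import tempfile


def files_to_sync(root, excluded_dirs=("bak",)):
    """Returns file paths under `root` (relative, sorted), skipping excluded folder names."""
    result = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            result.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(result)


def make_demo_graph():
    """Builds a throwaway folder layout for trying the filter (paths are made up)."""
    root = tempfile.mkdtemp()
    for rel in ("pages/todo.md", "journals/2023_10_28.md", "logseq/bak/pages/todo.md"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write("- demo\n")
    return root
```

Calling `files_to_sync(make_demo_graph())` lists the page and journal files but nothing under the bak folder, which is why the scheduled sync has less to scan.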

I’ve also found Tasker, which is able to launch FolderSync. I stopped using it since it was only a free 7-day trial (I thought it was free), but the app is quite cheap. Here’s what I had:

An event on file modification in the pages folder (in the Logseq folder) that triggers a sync after a 3-minute wait.
Same, but for the journals folder.
Opening the Logseq app triggers a sync. So every time Logseq becomes active, it launches a sync (only once). Switching from other apps back to Logseq also triggers this sync.

I will be testing out collaboration on mobile device with the foldersync configuration. Of course, this won’t be real collaboration but just working on the same logseq data.

fivestones · October 29, 2023, 9:58pm

I went looking for this code and found https://www.npmjs.com/package/@logseq/rsapi?activeTab=code which is MIT licensed. Is this the same rsapi you’re referring to? I know from the thread you linked it looks like they haven’t gotten around to open sourcing it, but maybe they actually did open source it and just never mentioned it on that thread?

However, there are other npm packages like @logseq/rsapi-darwin-arm64 which only have a binary and no code that I can find (the license says MIT, but I guess that’s for the binary, not the code…?)

GreenTeaRoll · October 29, 2023, 10:04pm

Indeed, as you’ve found, the main @logseq/rsapi is just a wrapper for platform-specific code (see index.js), which is a binary blob. It’s only “open-source” in the most literal sense, in what is required to build the repo locally.

I’m actually about to push a big update to my repo, I’ve figured out lots of details and endpoints and workings and documented a bunch of them. I believe the rsapi package is actually Rust code compiled to WASM.

fivestones · October 29, 2023, 10:20pm

Two things:

I’m super excited about your effort here. This is something I’ve been hoping and wishing for for at least a year (especially every time I run into some git error in my current sync setup between my Mac and my iPhone!).
If you want to try to get what you’re doing merged back into the codebase, there was some openness from the team (cnrpman) to this last year when people were talking about a fork that could work with a self hosting sync option here:

We are welcome to more deployment options. I’m happy to do code-review if you’d like to submit PRs addressing this
But it won’t be easy, basically it’s something similar to VSCode-remote / Codespaces.
To achieve this, may need to implement the fs protocol and a head-less server.

GreenTeaRoll · October 29, 2023, 10:29pm

Glad to hear the team could be amenable to it! I just pushed some documentation on the Logseq Sync API, I think I know enough of the surface now and how it fits together to start actually building a simple (probably in-memory to start) implementation, which I’ll get to this week.

GreenTeaRoll · November 2, 2023, 6:17pm

Another update: I’ve got enough of the API implemented that I can actually create a remote graph from a local Logseq instance that is pointing at the open-source backend. All code is in the repo. The next two big pieces are:

Getting WebSockets working correctly, there’s definitely something wrong with them at the moment
Adding a real (persistent) database, probably SQLite
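
As a rough illustration of the second item, here is a minimal sketch of a SQLite-backed store in Python. It is not the schema or code from the repo: the `BlobStore` class, the `blobs` table, and the graph and file names are all hypothetical. It assumes the server only ever sees already-encrypted bytes, so it stores them as opaque blobs without inspecting them.

```python
import os
import sqlite3
import tempfile


class BlobStore:
    """Persists opaque (already encrypted) file blobs per graph in SQLite."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS blobs ("
            " graph TEXT NOT NULL,"
            " file_path TEXT NOT NULL,"
            " content BLOB NOT NULL,"
            " PRIMARY KEY (graph, file_path))"
        )
        self.db.commit()

    def put(self, graph, file_path, content):
        # INSERT OR REPLACE keeps only the latest blob for each (graph, path).
        self.db.execute(
            "INSERT OR REPLACE INTO blobs (graph, file_path, content) VALUES (?, ?, ?)",
            (graph, file_path, content),
        )
        self.db.commit()

    def get(self, graph, file_path):
        row = self.db.execute(
            "SELECT content FROM blobs WHERE graph = ? AND file_path = ?",
            (graph, file_path),
        ).fetchone()
        return row[0] if row else None

    def close(self):
        self.db.close()


# Usage: data written before a "restart" is still there afterwards.
db_path = os.path.join(tempfile.mkdtemp(), "sync.db")
store = BlobStore(db_path)
store.put("my-graph", "pages/todo.md", b"\x00ciphertext")
store.close()

reopened = BlobStore(db_path)
restored = reopened.get("my-graph", "pages/todo.md")
reopened.close()
```

The point of the sketch is only the difference from an in-memory map: closing and reopening the store stands in for a server restart, and the blob survives it.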

I’m hoping to have those finished by the end of this weekend, at which point folks could tentatively start testing it.

GreenTeaRoll · November 30, 2023, 6:51pm

Sorry to just be continuously pinging this thread, but one big update: As of this morning, I think the OSS Logseq Sync backend has everything it needs to be actually usable. The last remaining piece was persistence, which I submitted this morning.

The next step is to engage with the Logseq team for the stuff that needs to change on the Logseq client side of things, which is mainly just adding additional configuration settings in the (advanced) settings. I’m happy to make these changes, but I want some sort of confirmation they’ll be accepted before I start learning ClojureScript and Rum.

Aside from that, there’s also a parallel discussion about protocol-level changes we could have. For example, right now it’s tightly coupled to AWS + S3, and I haven’t touched anything regarding AWS Cognito integration yet (meaning you still need a Logseq account to use self-hosted sync).

What’s the appropriate forum for this discussion? Is it here, or GitHub Issues/Discussions, or somewhere else? Happy to engage with the team wherever makes sense.

isidoreisou · December 21, 2023, 10:10am

Hello,

I’m just wondering if I’m at the best place to follow the progress of this project?

All the best and good luck

GreenTeaRoll · December 21, 2023, 2:14pm

The GitHub repo is probably the best place to follow along, I (try to) keep the README up to date with the latest progress.

I’ve mostly just been dragging my feet on the (fairly minor) changes to Logseq itself, but since I haven’t gotten much engagement from the Logseq team here, I’ve started a discussion on GitHub.