How to properly use the S3 API to connect to Cells

I’m opening a new topic on this issue because I have been posting here and there on similar issues, in vain trying to figure out a solution/hack for my current issue.

It all started a year ago when I suddenly realised that I couldn’t connect CellsSync to Cells 2.X. At that time I had several issues with my overall configuration; I presumed that once I could fix those, CellsSync would work again (as it did under 1.0). Many of the errors listed on that old thread have been fixed. Nevertheless, CellsSync continues to be stubborn and refuses to connect (or even to write anything pertinent in the logs that might give me a clue as to what is wrong…).

Nowadays, compiling CellsSync from scratch fails… this is not Pydio’s fault, but rather something that popped up in bbolt a while ago and has not been fixed yet:

go get: github.com/etcd-io/bbolt@none updating to
        github.com/etcd-io/bbolt@v1.3.5: parsing go.mod:
        module declares its path as: go.etcd.io/bbolt
                but was required as: github.com/etcd-io/bbolt

Unfortunately, when starting with a clean install, go will drop a fatal error at this point and not create a proper go.mod with all dependencies… so the workaround mentioned above cannot be applied. My guess is that when bbolt finally gets this one-year-old issue fixed, I’ll be able to compile it from scratch again (currently testing the compilation under macOS, ARMv8, and Ubuntu Linux/x86_64, all with the same results). In the meantime, we have to stick to pre-built binaries.

So, learning from the XY Problem, here is what I’m actually trying to accomplish:

Old-timer Pydio users remember that Pydio 8 had several connectors to many different storage systems and cloud providers. Under Cells, things have changed: the Community Edition gives access to just two storage systems, namely local filesystem and remote S3 storage. There are a few more connectors on the Enterprise Edition, which also includes the awesome Cells Flows application, with which it should be possible to connect to basically everything and design precisely what one wants to do. For all of us who cannot afford the Enterprise Edition (or aren’t able to persuade the management), it means using what we’ve got and adding a lot of shell scripts around the Cells ecosystem.

In my system, I need to synchronise one folder — one storage — with Microsoft’s OneDrive. I’ve explained why elsewhere (some members of our team are unable/not allowed to use OneDrive, but have no problem using Cells). Also, it’s rather large — about 56 GBytes at last count.

There are, at first sight, a few alternatives to do so. Here is what I’ve come up with:

  1. Synchronise locally on my Mac with OneDrive, and then use CellsSync to sync everything with Cells. It takes longer, uses up unnecessary bandwidth, but it’s a reasonable alternative, especially if most content doesn’t change all the time (which is the case). Unfortunately, that requires a working CellsSync configuration, something which I cannot get to work under 2.2.3.
  2. Instead of using the full-blown CellsSync on the systray etc., just run the CLI version as a background dæmon on the Mac. Sadly, it suffers from the exact same problem: for the moment, I cannot get cells-sync to compile at all, including the CLI version.
  3. Replace cells-sync with cells-client (a.k.a. cec). This ‘works’… for an initial upload of 56 GBytes, it’s fine. But cells-client just makes copies, not synchronisation (that’s what cells-sync is for!). It would be insane to copy everything over and over again when only a handful of small files are changed!
  4. On the server-side, mount the OneDrive remotely and let Cells ‘see’ it as a regular filesystem.

I’ve been focusing on 4. simply because I don’t want to ‘waste’ more bandwidth on my home connection. There are, however, several issues with that approach.

Mounting a remote OneDrive folder under Linux is not exactly rocket science. There are a few tools specifically for that, but the Swiss Army Knife for such tasks is most definitely rclone. Written in Go, it’s both a package/library for accessing a plethora of cloud-based storage services and a command-line tool available on all platforms (at least, those that can compile things with Go, which are ‘almost all’). It can mount cloud-based storage locally (via FUSE, also available as a Go library) or it can use direct copy, move, etc. commands, of which the most interesting is, obviously, remote synchronisation. In fact, rclone's tagline is “rsync for cloud storage”.

There are two main gotchas with this approach.

First, the rclone devs warn about mounting filesystems with FUSE. Allegedly, it’s not such a great idea — it seems that FUSE is still not very stable when dealing with remote filesystems (one might argue that this is true for all remotely-mounted filesystems…).

Secondly, mounting the OneDrive is not really the issue here. For all purposes, you could just copy everything — better, sync it — to one folder and move it to a Cells storage, or make the Cells storage point to a directory where you use rclone to sync with OneDrive. Setting up any of these scenarios is quite easy; the problem is actually how to tell Cells that files have changed.

Cells, as you all know, provides a mechanism to re-index/resync a datasource. This can be launched from the Web interface manually or, on the Enterprise Edition, automated with Cells Flows; on the Community Edition, you can use Cells Client (cec) to force a resync as well, so in theory, you could launch cec storage resync-ds <my-datasource-name> from cron. There is apparently no mechanism to figure out if the resync succeeded or not, but at least you can do it remotely.

So, in theory, all that is needed is to launch two jobs from cron: first, use rclone to sync OneDrive with the datasource’s directory; then launch cec to resync it with Cells.
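For illustration only, the two-step job described above could be wrapped in a small Python script along these lines. The remote name, the local path, and the datasource name are placeholders I made up, and the sketch assumes that rclone and cec are on the PATH and already configured. The runner is injectable purely so that the flow can be tried out without touching real data.

```python
import subprocess

# Placeholders (assumptions for this sketch, not values from a real setup).
RCLONE_REMOTE = "onedrive:Shared"        # hypothetical rclone remote and folder
LOCAL_DIR = "/srv/cells/data/onedrive"   # hypothetical datasource directory
DATASOURCE = "my-datasource-name"        # name of the Cells datasource


def build_commands(remote=RCLONE_REMOTE, local_dir=LOCAL_DIR, datasource=DATASOURCE):
    """Return the two commands in the order they must run."""
    return [
        ["rclone", "sync", remote, local_dir],
        ["cec", "storage", "resync-ds", datasource],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in turn and stop at the first non-zero exit code.

    Returns the exit code of the last command that was actually run.
    """
    code = 0
    for cmd in commands:
        code = runner(cmd).returncode
        if code != 0:
            break
    return code
```

A cron entry would then call something like run_pipeline(build_commands()). Note that a zero exit code from cec presumably only says that the resync was requested, not that it completed, which matches the limitation mentioned earlier.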

As you can see, this is grossly inefficient (even if it works!), since, even though rclone will do a ‘real’ sync — i.e. just copying any files that have changed, if any — cec will force a resync of all files, even if none (or few!) have changed!

And while it’s also theoretically possible to figure out how many files rclone has changed in a sync session (and skip cec storage resync-ds if none have changed), there will always be cases where the whole directory needs to be resynced even if just one file changed (and didn’t change much). Sadly, this case is quite frequent in our setup: people are constantly auto-saving their Word or Excel files to OneDrive (this is pretty much built-in), but most of the time these changes are very small, and usually limited to one or two files at a time.
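One way to decide whether the resync can be skipped, without depending on rclone's own output, is to compare the state of the local directory before and after the rclone run. The sketch below is my own assumption about how that could be done (the function names are hypothetical); it only looks at file size and modification time, so it will not notice a change that preserves both.

```python
import os


def snapshot(root):
    """Map each file's path (relative to root) to its (size, mtime_ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            state[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return state


def changed_paths(before, after):
    """Return the sorted paths that were added, removed, or modified."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Taking a snapshot before the rclone run and another one after it, an empty result from changed_paths means the resync can be skipped. As noted above, this only helps on runs where nothing changed at all; a single modified file still triggers the full resync.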

As mentioned at the very beginning, in our scenario, we’re talking about 56 GBytes of disk space, scattered among tons of files, some of which are very small, others being reasonably large images and videos. On average, assuming that the server is not overloaded with other things, it will take approximately three minutes to do a full resync. It’s conceivable that using different hardware I could get a little better performance than that, but it’s the order of magnitude that counts: after the first sync, OneDrive/rclone will take, most of the time, just a few milliseconds to sync the changed files (and sometimes, when there are a few more files to sync, it might take a few seconds). Cells’ ‘full resync’ is, therefore, at least two orders of magnitude slower, and wastes a lot of resources. In fact, assuming that rclone runs, say, every five minutes (a trade-off to account for Cells), and that Cells takes three minutes to do a full resync, it means that in the worst-case scenario the server will be doing little else but resyncing continuously: soon after it finishes one batch, it goes into the next one (as said, that can be optimised by only launching cec if rclone found anything to sync, but, even so, during most of the day, these changes will be constantly happening).
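To put a number on that worst case, here is a trivial calculation using the figures above. It assumes that every rclone run finds something to sync and therefore triggers a full resync, and that runs never overlap.

```python
def resync_duty_cycle(resync_minutes, interval_minutes):
    """Fraction of wall-clock time spent resyncing, capped at 1.0."""
    return min(resync_minutes / interval_minutes, 1.0)


# Three-minute resync, launched every five minutes.
print(f"{resync_duty_cycle(3, 5):.0%}")  # prints 60%
```

So with these numbers the server would be busy resyncing for roughly three minutes out of every five, and any slowdown under load pushes that figure towards 100%.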

That is, sadly, for our particular scenario, completely unacceptable. We simply cannot afford to waste so many resources on this particular server, just to keep Cells happy.

There are other options to consider.

So far, I have just explored the first kind of storage that Cells implements, namely, filesystem-based. We still have S3 as an option, and here things look a little brighter.

Cells is, essentially speaking, an application running on top of MinIO, a robust and incredibly efficient storage system. MinIO is free and open-source, cloud-ready (that includes distributing the load among many instances), and exposes a complete Amazon S3 API; according to some of their benchmarks, it’s consistently faster than Amazon’s own offering. Because it’s also written in Go, MinIO can be deployed anywhere, and it exposes a reasonably simple library interface for embedding inside applications such as Cells. It also includes a lot of CLI utilities.

But so far I haven’t been able to put any of those to work with Cells, either.

I guess I’m stumped for now… any suggestions?