Cells Admin Structured Datasource Checksum

Sometimes it is inconvenient that file hashes are generated while the datasource is synchronized, for example when the datasource is remote cloud storage mounted locally with Rclone: hashing takes a lot of time (with terabytes of data, or simply many files) and uses a lot of bandwidth.
So I think it would be nice to add an option to skip the checksums of new files:
cells admin datasource resync -d remote --skip-checksum
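For context, this is the kind of setup I mean (the remote name and mount point are just examples):

rclone mount remote: /mnt/remote --daemon

With a mount like that, every checksum Cells computes means pulling the entire file across the network first, which is where all the time and bandwidth go.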

Aye, I totally agree :slight_smile: and subscribe to that request!

I also use Rclone to keep some of my datasources 'in sync' with remote cloud storage (80,000+ "nodes" in one of them!), and while my server is reasonably powerful (and has very few users), it struggles to complete a "full" sync. I guess the reason is those file hashes that have to be recalculated before the storage becomes available again and its files are shown?

To be honest, I told my few users never to rely on Pydio Cells for up-to-date information; eventually, things will get sync'ed, but that might happen only after a few days (sometimes weeks), because I need to run the sync process manually, mostly since I have no idea how long it will take, and it's extremely I/O-intensive while it runs. It's not the kind of thing I can run from crontab every 5 minutes or so (on a different datasource, with "only" ±20,000 nodes, Cells takes about 3 minutes to complete a full sync).
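Just to make that concrete: what I'd love to be able to schedule safely is a crontab entry along these lines (binary path and datasource name are examples):

*/5 * * * * /opt/pydio/cells admin datasource resync --datasource=remote

With syncs that take minutes to hours and hammer the disks, though, runs like that would just pile up on top of one another.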

I’d certainly welcome any quasi-instantaneous way of presenting freshly written files on structured storage to the users :slight_smile: When they drop a file into a storage space that is managed by Pydio Cells but also used by other applications (including Rclone!), they expect the file to appear “instantly”, since that’s what happens when they upload it via the web interface (or via cec/cells-client). However, not everybody accesses the storage the same way. The main reason is the struggle I’ve had getting third-party S3-compatible tools to ‘talk’ to Pydio Cells; those that manage to establish an authenticated connection (e.g. Cyberduck) then fail in unpredictable ways (new directories are created as zero-byte files, while old ones seem to work fine). Needless to say, I cannot get everything and everybody to simply switch to Pydio-only tools; and the number of tools that can “talk” S3 at all is rather small (some have issues with practically every non-Amazon S3 API provider, and Pydio Cells is no exception in those cases).
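For what it’s worth, by ‘talking S3’ I mean pointing a generic S3 client at the Cells gateway, roughly like this (endpoint and bucket name are placeholders; this is precisely the kind of call I cannot get most tools to make reliably):

aws --endpoint-url https://cells.example.com s3 ls s3://io/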

On the other hand, a filesystem is a filesystem, and any tool (even plain, old, rock-solid rsync) is able to read/write to it. The issue is then getting Pydio Cells to ‘know’ that the filesystem has changed without requiring a ‘full resync’.

Oh well. There are obviously no perfect tools :slight_smile: but Pydio Cells comes very, very close to perfection; it’s just those tiny limitations that are so annoying :slight_smile:

In an ideal world, Pydio Cells would use the inotify interface provided by the underlying OS to ‘know’ when a file inside structured storage had changed on disk, and launch its internal rehashing/syncing algorithm just for that file, exactly like almost every other cloud storage provider does. Granted, many use their own proprietary protocols to do their magic, and Pydio Cells, with unstructured data, does the same; again, the issue is that it’s not easy to use Pydio Cells’ own internal protocol with existing tools, especially those that are closed-source and offer only a limited set of ways to access remote storage (often not much beyond S3).
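Just to sketch the idea in shell terms with inotify-tools (an illustration only, not how Cells is wired internally; the path is a placeholder):

# Watch a structured datasource recursively, reacting per changed file:
inotifywait -m -r -e create,modify,delete,move /data/cells/remote |
while read -r dir event file; do
    # A real integration would re-hash and re-index only "$dir$file"
    # here, instead of triggering a full datasource resync.
    echo "changed: $dir$file ($event)"
done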

Then again, it’s not easy to get users to install Yet Another Cloud Drive app (assuming it would even work in their case). They usually limit themselves to whatever their desktop provides natively, sometimes going as far as adding Google Drive and/or Dropbox, but that’s it. It’s up to those running the servers to manage a way to get the files into Pydio (not something for the users to worry about).

I’m an atypical user and well aware of that, but a couple of years ago I struggled to keep a dozen cloud storage services, tied to different accounts, all in sync from my PowerBook Mac. It ultimately became impossible when using multiple accounts with the same provider (most allow just one account, sometimes optionally one personal plus one business account, but that’s as far as most providers go). Also, those cloud storage providers routinely (and wrongly) assume they are “the only ones” doing remote sync, which means they start to conflict with each other, subtly at first, until one of them finally breaks the whole filesystem, which then gets instantly sync’ed to all providers and, from there, to every other account sharing the same folders! It’s a mess when that happens.

My personal alternative was actually simple: use just one storage provider, the one appropriate for my home Synology NAS. From there, the NAS connects to as many accounts on as many different providers as I wish, and keeps them all in sync. If something fails on my side, the process gets interrupted at the NAS, and I just need to clean up my own machine, not worry about “the rest of the world”. It’s simple, yet it works well, at least for those providers that Synology directly supports; and of course Pydio Cells is not one of them. The alternative would be to use S3 to access Pydio Cells, and that ought to work, if only I could get anything (besides cec, that is) to connect to Cells over the S3 API reliably and without annoying hiccups. So far I haven’t managed that sort of miracle; all I can do is use good old rsync to keep the NAS in sync with the remote server where I run Pydio Cells. Of course, that requires using structured datasources, and it depends on a full sync before such files become available on the Pydio Cells web backend (either directly or via the Pydio-specific tools, e.g. cells-client, with or without GUI).
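The rsync half of that is trivial, for the record; something along these lines (paths and hostname made up for the example), so the bottleneck really is only the Cells-side resync:

rsync -az --delete /volume1/shared/ user@cells.example.com:/data/cells/remote/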

There are also some use cases that call for a ‘quick’ resync (minutes, not days or weeks). For example, I use some structured datasources as a place to store backups from other systems running on the same server. The idea is to be able to retrieve those backups off-site through a very well designed UI (the web backend for Pydio Cells). For now, that is only possible via the very slow full-resync mechanism. It’s good enough for the purpose, though, since users who need immediate access to the backups know they can grab them directly from the filesystem instead of waiting an undetermined time for the full resync to finish.
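That part is easy to script, by the way; a hypothetical post-backup hook could just drop the archive into the datasource and kick off the resync explicitly (all paths and the datasource name below are made up), and that final resync step is exactly what a --skip-checksum option, or inotify support, would make bearable:

#!/bin/sh
# Hypothetical post-backup hook: copy the new archive into the structured
# datasource, then ask Cells to resync so it appears in the web backend.
rsync -a /var/backups/app/ /data/cells/backups/
/opt/pydio/cells admin datasource resync --datasource=backups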