Sometimes it is inconvenient that file hashes are generated while the datasource is synchronized, for example when the datasource is remote cloud storage locally mounted with Rclone: it takes a lot of time (with terabytes of data or many files) and uses a lot of bandwidth.
So I think it would be nice to add an option to skip checksumming new files:
```
cells admin datasource resync -d remote --skip-checksum
```
Aye, I totally agree and subscribe to that request!
I also use Rclone to keep some of the datasources "in sync" with remote cloud storage (80,000+ "nodes" in one of these!) and, while my server is reasonably powerful (and has very few users), it struggles to complete the "full" sync. I guess the reason is all those file hashes that need to be recalculated before the storage becomes available again and its files are shown…?
To be honest, I told my few users never to rely on Pydio Cells for up-to-date information; eventually, things will get sync'ed, but that might happen only after a few days (sometimes weeks), because I need to run the sync process manually, mostly because I have no idea how long it will take, and it's extremely I/O-intensive while it runs. It's not the kind of thing I can run from crontab every 5 minutes or so (on a different datasource, with "only" ±20,000 nodes, Cells takes about 3 minutes to complete a full sync).
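(If I ever did automate it, I'd at least guard against overlapping runs. Something like this hypothetical crontab entry is what I'd try; the schedule, lock file and log path are made up, and the datasource name is just the one from the example above:)

```
# Hypothetical crontab entry: resync the "remote" datasource hourly,
# but skip the run entirely if the previous one is still going (flock -n
# fails immediately instead of queueing behind the held lock).
0 * * * * flock -n /tmp/cells-resync.lock cells admin datasource resync -d remote >>/var/log/cells-resync.log 2>&1
```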
I'd certainly welcome whatever quasi-instantaneous way of presenting freshly written files on structured storage to the users. When they drop a file into that storage space, which is managed by Pydio Cells but used by different applications (including Rclone!), they expect the file to appear "instantly", since that's what happens when they upload the file via the web interface (or via cec/cells-client).

However, not everybody accesses the storage in the same way. The main reason is the struggle I've had getting third-party S3-compatible tools to "talk" to Pydio Cells; those that manage to establish an authenticated connection (e.g. Cyberduck) then fail in unpredictable ways (new directories are created as zero-byte files, while old ones seem to work fine). Needless to say, I cannot get everything and everybody simply to switch to Pydio-only tools; even the number of tools that are able to "talk" S3 at all is fairly small, and some have issues with practically every non-Amazon S3 API provider… Pydio Cells is no exception in those cases.
On the other hand, a filesystem is a filesystem, and any tool (even plain, old, rock-solid rsync) is able to read/write to it. The issue is then getting Pydio Cells to "know" that the filesystem has changed without requiring a "full resync".
Oh well. There are obviously no perfect tools, but Pydio Cells comes very, very close to perfection… it's just those tiny limitations that are so annoying.
In an ideal world, Pydio Cells would use the inotify interface provided by the underlying OS to "know" when a file inside structured storage has changed on disk, and launch its internal rehashing/syncing algorithm just for that file, exactly like almost every other cloud storage provider does. Granted, many use their own proprietary protocols to do their magic, and Pydio Cells, with unstructured data, does the same; again, the issue here is that it's not easy to use Pydio Cells' own internal protocol with existing tools, especially those that are closed-source and only offer a limited way of accessing remote storage (often not much beyond S3…).
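(In the meantime, a crude user-space approximation can be scripted with inotifywait from inotify-tools. This is only a sketch with made-up paths: it still triggers a resync of the whole datasource, since I'm not aware of any per-file resync flag, and re-establishing recursive watches on an 80,000-node tree is itself slow, so it's a stopgap at best:)

```
#!/bin/sh
# Sketch: watch the folder backing a structured datasource and trigger a
# resync whenever something changes on disk. inotifywait exits after the
# first matching event, so the loop restarts it each time; events that
# arrive during the sleep/resync window are simply missed.
WATCH_DIR=/data/remote    # made-up path backing the structured datasource
DS_NAME=remote            # datasource name, as in the example above

while inotifywait -r -e close_write,create,delete,move "$WATCH_DIR"; do
    sleep 5               # let bursts of writes settle before resyncing
    cells admin datasource resync -d "$DS_NAME"
done
```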
Then again, it's not easy to get users to install Yet Another Cloud Drive app (assuming one would even work in their case). They usually limit themselves to whatever their desktop provides natively, sometimes going as far as adding Google Drive and/or Dropbox, but that's it. It's up to those running the servers to manage a way to get the files into Pydio (not for the users to worry about).
I'm an atypical user and well aware of that, but a couple of years ago I had to struggle with keeping a dozen cloud storage services, tied to different accounts, all in sync, managed from my PowerBook Mac. It ultimately became impossible when using multiple accounts for the same provider (most just allow one account, sometimes optionally one personal account plus one business account on the same machine, but that's as far as most providers go). Also, those cloud storage providers routinely (and wrongly) assume they're "the only ones" doing remote sync, which means they start to conflict with each other, at first in subtle ways, until finally one of them breaks the whole filesystem, which then gets instantly sync'ed to all providers and, from there, to all other accounts sharing the same folders! It's a mess when that happens…
My personal alternative was actually simple: I just use one storage provider, the one appropriate for my home Synology NAS. From there, the NAS simply connects to as many accounts on as many different providers as I wish, and keeps them all in sync. If something fails on my side, the process gets interrupted at the NAS, and I just need to clean up my own machine, not worry about "the rest of the world". It's simple yet works well; at least for those providers that Synology directly supports, and, of course, Pydio Cells is not one of them. The alternative would be to use S3 to access Pydio Cells, and that ought to work, if only I could get anything (besides cec, that is) to connect to Cells via the S3 API, reliably and without annoying hiccups. So far, I haven't managed that sort of miracle; all I can do is use good old rsync to keep the NAS in sync with the remote server where I run Pydio Cells, but of course that requires structured datasources, and depends on a full sync before such files become available on the Pydio Cells web backend (either directly or via the Pydio-specific tools, e.g. cells-client, with or without GUI…).
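(For the record, the rsync leg is the easy part; it's the step afterwards that hurts. Roughly, with made-up paths, hostnames and user names:)

```
# Push from the NAS to the folder backing the structured datasource on
# the server, then ask Cells to pick up the changes. Only the resync
# step is slow; the rsync itself transfers just the deltas.
rsync -av --delete /volume1/shared/ user@server:/data/remote/
ssh user@server 'cells admin datasource resync -d remote'
```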
There are also some use cases that imply a "quick" resync (minutes, not days/weeks). For example, I use some structured datasources as a place to store backups from other systems running on the same server. The idea is to make those backups retrievable off-site through a very well-designed UI (the Pydio Cells web backend). For now, that is only possible via the very slow full resync mechanism; good enough for the purpose, however, since users requiring immediate access to the backups know they can grab them directly from the filesystem (if needed) instead of waiting an indeterminate time for the full resync to finish…
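(Concretely, that workflow amounts to something like this; the database, paths and datasource name are all placeholders for illustration:)

```
# Nightly dump written straight into the folder backing a structured
# datasource, followed by the (slow) full resync so the file eventually
# shows up in the Cells web UI.
pg_dump -Fc mydb > /data/backups/mydb-$(date +%F).dump
cells admin datasource resync -d backups
```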