Confused about unstructured data and synchronisation

Hello there,

I admit I’m a bit confused about what is ‘expected behaviour’ under Pydio Cells Home Edition 3.0.7…

Here is the issue I’ve got: my organisation has 80 GB of files accumulated over the past 16 years or so. At some point in the past, this was just a Windows-shared folder on someone’s computer. As the organisation started having lots of telecommuters (that was in 2006, not 2020!), we ended up having a remote file server that could be accessed by FTP (Dropbox was only invented in 2007).

Over the years, the stored files only grew, and we moved the data across different systems (starting with remotely mounting the data via WebDAV, for example). At the same time, more and more cloud providers launched their synchronisation services, and, naturally enough, each person has their ‘favourite’ sync service, and we need them somehow to talk to each other.

In parallel, there are also the many automated backup systems. Some of them are rather primitive and can just use simple filesystem copies, or a tool that works with the filesystem only (e.g. rsync, unison). When moving to Pydio, this was not a problem, since Pydio operated directly on the filesystem, and with a bit of tweaking, we even got Pydio Cells fully operational. There is limited connectivity to other protocols in the Home Edition, though (just S3), so we needed to rely on third-party tools to shuffle data around the many systems and place it in a folder where Cells could ‘find’ it.

With Cells v3, the so-called ‘structured storage’ was released, which makes sense if you start from scratch and all interactions go through Pydio. Unfortunately, over the past two years, I have been unable to find a one-size-fits-all solution. I have tried to use both cells-sync and cells-client to accomplish the syncing aspect, but, alas, I can’t compile them on all platforms I need to support (believe me, I’ve tried; it’s not easy, especially because Cells does not yet support modules… and it’s quite tricky to make substantial changes to the code without relying on modules).

I have tried other approaches, such as treating Pydio Cells as an S3-compatible container; it runs on top of minio anyway. I have many tools that ‘speak’ S3 (and I’m always coming across new ones!), but, sadly, none of them is able to communicate fully with Pydio Cells. Some cannot even authenticate; some can, but get stumped by the notion that there is only one ‘hidden’ bucket, io (which has fewer than three characters, and thus breaks the Amazon bucket naming rules), even if it can be fully accessed via data instead. Some tools, like, say, Cyberduck, are able to connect and even list the contents of io and the filesystem inside, but… they have serious problems when trying to create new folders (new files work well): the best Cyberduck can achieve is to write a zero-byte text file with the folder’s intended name. Deleting existing files, strangely enough, only takes effect after a few refreshes. That rules out any sort of remote synchronisation, at least until I find a tool that works with Cells natively on different platforms (there might be one just around the corner, who knows…).

(As a side note: the culprit is not minio. All the tools I’ve experimented with work flawlessly with minio, using the S3 protocol. It’s just that Cells adds ‘something extra’ on top of minio — I have no idea exactly what — and that ‘extra’ is enough to break compatibility with all the tools I’ve experimented with, including minio’s own CLI!)

So, currently, the only way to keep the files in sync is to do it manually (or rather, with an automated script running periodically): people will drop files inside a folder, and, eventually, Pydio Cells will get the command to attempt a re-sync. As you can imagine, re-syncing thousands of files of varying sizes takes quite some time, but it’s better than nothing; nobody, after all, is expecting instant synchronisation anyway. A few minutes is acceptable. So long as nothing gets lost in the process…
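To make the ‘nothing gets lost’ part checkable, the periodic script could take a snapshot of the folder on each run and compare it with the previous one. The following is only a minimal sketch in Python: the names `snapshot` and `diff_snapshots` are made up for illustration, it looks at the filesystem only, and it does not talk to Cells at all (triggering the actual re-sync is left to whatever command the script already uses).

```python
import os


def snapshot(root):
    """Map each file path (relative to root) to its (size, mtime_ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue
            rel = os.path.relpath(full, root)
            state[rel] = (st.st_size, st.st_mtime_ns)
    return state


def diff_snapshots(before, after):
    """Return sorted lists of added, removed and changed paths."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(
        path for path in set(before) & set(after)
        if before[path] != after[path]
    )
    return added, removed, changed
```

A run would keep the previous snapshot (in memory or in a file), call `snapshot` again, and only ask for a re-sync when `diff_snapshots` reports something; an unexpected entry in the removed list is a hint that a file went missing between runs.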

Because each person will therefore access the data differently — depending on whatever technology they use — it’s very hard to track down common faults. In almost every case, there is some other tool that is to blame (or a user who has something configured incorrectly).

But sometimes (such as with S3 access to Cells) it’s clear that the issue is deep in the core of Cells; sometimes, I can figure out what to do based on the documentation (or the forums!), changing/tweaking the configuration in some subtle way.

And sometimes I just get stumped.

So, to recap: I’ve got a filesystem directory which is accessed by many different tools; Pydio Cells is just one of them; so I really need unstructured storage, which I think is what is meant when the following option is checked (but greyed out!) on the Advanced Settings tab for a particular storage folder:

☑ Keep data structure on storage

As far as I understand, in this scenario, the files themselves will be in the expected folder, and they can be freely manipulated, copied, removed, etc. Once Cells re-syncs the folder, it then stores its own index under ~/data/.minio.sys/buckets. So far, so good (I guess), until I got someone telling me that they had uploaded quite a lot of files via the Web interface, but I couldn’t see them.

They were not in the filesystem.

But they certainly showed up on the Web interface, and they were clearly ‘there’, that is, there is no question that Pydio Cells has full access to the data (and not just an index pointing to empty data, as reported on a different thread). Other users can log in via the Web interface and have no problem previewing and/or downloading the files. Photos have all the EXIF metadata, etc. There are thumbnails for each image, as expected. So… everything seems to be working flawlessly… but…

Where are the files?

The strangest thing is that files uploaded to the filesystem directory will, after a resync, appear on the Web interface; but the reverse is not true: files uploaded via the Web interface will not be written to the expected folder. I can only speculate where they are actually being written, but it’s certainly not where I expect them.
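One way to pin down exactly which files exist only ‘inside’ Cells would be to compare a listing of the folder on disk with a listing of object keys obtained from whatever client can see them. The sketch below is an assumption-laden illustration in Python, not anything Cells provides: `local_relative_paths` and `compare_listings` are hypothetical names, the remote keys are assumed to be plain strings exported from some client, and hidden entries (such as a `.minio.sys` directory) are skipped on the assumption that they are internal.

```python
import os


def local_relative_paths(root):
    """Return the set of non-hidden files under root, as '/'-separated relative paths."""
    paths = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: hidden directories hold internal data; do not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.startswith("."):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            paths.add(rel.replace(os.sep, "/"))
    return paths


def compare_listings(local_paths, remote_keys):
    """Split paths into those seen only on disk and those seen only remotely."""
    local = set(local_paths)
    # Keys ending in '/' are treated as folder markers and ignored.
    remote = {key.lstrip("/") for key in remote_keys if key and not key.endswith("/")}
    return {
        "only_on_disk": sorted(local - remote),
        "only_in_cells": sorted(remote - local),
    }
```

Anything listed under `only_in_cells` would be a candidate for ‘uploaded via the Web interface but never written to the folder’, which at least turns the mystery into a concrete list of file names to report.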

Interestingly, while I was doing some tests, before finishing off this post, I tried one new setting — having read/write as the default access setting — and wondered if that would have ‘fixed’ the S3 access. It didn’t, but… I made a very interesting discovery. Using Cyberduck via the S3 protocol, I can see those files that are not on the filesystem, as well as the ones that are. I was guessing that whatever the Web GUI sees, the S3 subsystem sees as well, and it seemed to be the case, until I uploaded a file via Cyberduck using S3.

This time, that file did appear on the filesystem directory as well!

That’s a mystery piling on top of another mystery. Clearly, at the internal level, something very strange is going on…

What would you recommend I try next? Should I simply dump that workspace and storage, create a new set, and copy all the existing files to the new directory (and resync it)? The idea is that starting with a clean slate might provide some better results — or is that just wishful thinking?

I might also try out the v4 RC (yay for that!), just to see if there are any substantial changes to this odd behaviour… especially after the update tool makes all sorts of changes here and there (database, filesystem, metadata…). Who knows, it might work… on the other hand, the system may finally break down, therefore giving me even more reasons to tread carefully…