.pydio file should not be 'exported' via S3

Hi (again)!

Under Pydio Cells 4+, when using structured datasources (regular files and directories, as opposed to the flat-file mechanism), every directory contains a .pydio text file holding a single UUID; I believe these files are generated automatically whenever Cells needs to synchronise with the external filesystem, mapping directories to whatever data structures Cells uses internally.

Such files are not visible in the web interface (nor should they be!). PydioSync naturally doesn’t show them either, and neither does the command-line tool cec. All of this is expected and works as intended.

However, when using the S3 API, these files are actually visible (though not really accessible). Here are a few examples.

First, Cyberduck. Cyberduck is notoriously good at connecting to all sorts of things, although I’m not a fan of its UI. Nevertheless, it’s one of the very few GUI tools I have that can connect to Cells via S3 (most, unfortunately, cannot).

Here is what I can see when listing a directory inside my ‘Personal Files’ folder:

Note that the .pydio file is greyed out and not actually accessible, merely visible. For instance, right-clicking on it to get more information only yields the following box:

A few examples from command-line tools that use the S3 API:

1. rclone:

$ rclone ls 'cells:/data/personal-files/Publicly shared items on forums etc/'
       36 .pydio
  4882356 Screenshot 2019-07-21 at 17.28.26.png
  1083700 Screenshot 2021-10-12 at 00.08.05.png

rclone also ‘knows’ about the existence of .pydio. We can get a bit more information using the lsjson command:

$ rclone lsjson 'cells:/data/personal-files/Publicly shared items on forums etc/'
[
2022/11/01 14:45:38 ERROR : : Entry doesn't belong in directory "" (same as directory) - ignoring
2022/11/01 14:45:38 NOTICE: .pydio: Failed to read metadata: object not found
2022/11/01 14:45:38 NOTICE: .pydio: Failed to read metadata: object not found
{"Path":".pydio","Name":".pydio","Size":36,"MimeType":"application/octet-stream","ModTime":"2022-11-01T14:45:38.483508000Z","IsDir":false,"Tier":"STANDARD"},
{"Path":"Screenshot 2019-07-21 at 17.28.26.png","Name":"Screenshot 2019-07-21 at 17.28.26.png","Size":4882356,"MimeType":"application/octet-stream","ModTime":"2019-07-21T16:50:37.000000000Z","IsDir":false,"Tier":"STANDARD"},
{"Path":"Screenshot 2021-10-12 at 00.08.05.png","Name":"Screenshot 2021-10-12 at 00.08.05.png","Size":1083700,"MimeType":"application/octet-stream","ModTime":"2021-10-12T09:31:30.000000000Z","IsDir":false,"Tier":"STANDARD"}
]

As expected, removing .pydio doesn’t work, even though rclone reports no errors:

gwyneth@mymac:~$ rclone delete 'cells:/data/personal-files/Publicly shared items on forums etc/.pydio' --verbose
gwyneth@mymac:~$ rclone ls 'cells:/data/personal-files/Publicly shared items on forums etc/'
       36 .pydio
  4882356 Screenshot 2019-07-21 at 17.28.26.png
  1083700 Screenshot 2021-10-12 at 00.08.05.png
gwyneth@mymac:~$

When using rclone’s sync command (which is what I hope to be able to use with Cells some day!), this causes real problems:

gwyneth@mymac:~/Downloads$ mkdir sync-test
gwyneth@mymac:~/Downloads$ cd sync-test/
gwyneth@mymac:~/Downloads/sync-test$ rclone sync 'cells:/data/personal-files/Publicly shared items on forums etc/' .
2022/11/16 13:33:08 ERROR : : Entry doesn't belong in directory "" (same as directory) - ignoring
2022/11/16 13:33:49 ERROR : .pydio: Failed to copy: failed to open source object: SerializationError: failed to unmarshal error message
        status code: 404, request id: , host id:
caused by: UnmarshalError: failed to unmarshal error message
        00000000  3c 68 74 6d 6c 3e 0d 0a  3c 68 65 61 64 3e 3c 74  |<html>..<head><t|
00000010  69 74 6c 65 3e 34 30 34  20 4e 6f 74 20 46 6f 75  |itle>404 Not Fou|
00000020  6e 64 3c 2f 74 69 74 6c  65 3e 3c 2f 68 65 61 64  |nd</title></head|
00000030  3e 0d 0a 3c 62 6f 64 79  3e 0d 0a 3c 63 65 6e 74  |>..<body>..<cent|
00000040  65 72 3e 3c 68 31 3e 34  30 34 20 4e 6f 74 20 46  |er><h1>404 Not F|
00000050  6f 75 6e 64 3c 2f 68 31  3e 3c 2f 63 65 6e 74 65  |ound</h1></cente|
00000060  72 3e 0d 0a 3c 68 72 3e  3c 63 65 6e 74 65 72 3e  |r>..<hr><center>|
00000070  6e 67 69 6e 78 3c 2f 63  65 6e 74 65 72 3e 0d 0a  |nginx</center>..|
00000080  3c 2f 62 6f 64 79 3e 0d  0a 3c 2f 68 74 6d 6c 3e  |</body>..</html>|
00000090  0d 0a                                             |..|

caused by: expected element type <Error> but have <html>
2022/11/16 13:33:49 ERROR : Local file system at /Users/gwyneth/Downloads/sync-test: not deleting files as there were IO errors
2022/11/16 13:33:49 ERROR : Local file system at /Users/gwyneth/Downloads/sync-test: not deleting directories as there were IO errors
2022/11/16 13:33:49 ERROR : Attempt 1/3 failed with 1 errors and: failed to open source object: SerializationError: failed to unmarshal error message
        status code: 404, request id: , host id:
caused by: UnmarshalError: failed to unmarshal error message
        00000000  3c 68 74 6d 6c 3e 0d 0a  3c 68 65 61 64 3e 3c 74  |<html>..<head><t|
00000010  69 74 6c 65 3e 34 30 34  20 4e 6f 74 20 46 6f 75  |itle>404 Not Fou|
00000020  6e 64 3c 2f 74 69 74 6c  65 3e 3c 2f 68 65 61 64  |nd</title></head|
00000030  3e 0d 0a 3c 62 6f 64 79  3e 0d 0a 3c 63 65 6e 74  |>..<body>..<cent|
00000040  65 72 3e 3c 68 31 3e 34  30 34 20 4e 6f 74 20 46  |er><h1>404 Not F|
00000050  6f 75 6e 64 3c 2f 68 31  3e 3c 2f 63 65 6e 74 65  |ound</h1></cente|
00000060  72 3e 0d 0a 3c 68 72 3e  3c 63 65 6e 74 65 72 3e  |r>..<hr><center>|
00000070  6e 67 69 6e 78 3c 2f 63  65 6e 74 65 72 3e 0d 0a  |nginx</center>..|
00000080  3c 2f 62 6f 64 79 3e 0d  0a 3c 2f 68 74 6d 6c 3e  |</body>..</html>|
00000090  0d 0a                                             |..|

caused by: expected element type <Error> but have <html>
2022/11/16 13:33:49 ERROR : : Entry doesn't belong in directory "" (same as directory) - ignoring
2022/11/16 13:34:33 ERROR : .pydio: Failed to copy: failed to open source object: SerializationError: failed to unmarshal error message
        status code: 404, request id: , host id:
caused by: UnmarshalError: failed to unmarshal error message
...etc. repeats at least 3 times, one attempt per minute or so, and finally aborts...
gwyneth@mymac:~/Downloads/sync-test$ ls -la
total 5.7M
drwxr-xr-x    4 gwyneth staff  128 Nov 16 13:33  .
drwxr-xr-x+ 189 gwyneth staff 6.0K Nov 16 13:32  ..
-rw-r--r--    1 gwyneth staff 4.7M Jul 21  2019 'Screenshot 2019-07-21 at 17.28.26.png'
-rw-r--r--    1 gwyneth staff 1.1M Oct 12  2021 'Screenshot 2021-10-12 at 00.08.05.png'
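
One possible client-side workaround (untested against my own setup) is rclone’s filtering flags, e.g. rclone sync --exclude '.pydio' …, which should skip the offending entries before any copy is attempted. The filtering itself boils down to pattern-matching on basenames; here is a minimal Python sketch of the same idea, with a function name of my own invention:

```python
import fnmatch

def apply_excludes(paths, patterns=(".pydio",)):
    """Drop any path whose basename matches an exclude pattern,
    mimicking what a client-side filter such as rclone's --exclude
    does before transferring anything."""
    kept = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if not any(fnmatch.fnmatch(name, pat) for pat in patterns):
            kept.append(path)
    return kept
```

Of course, this only hides the symptom on one client; every other S3 consumer would need the same special-casing.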

It’s worth explaining that Cells sits behind an nginx reverse proxy; the weird HTML error messages come from nginx, not Cells. S3 wraps its errors in XML (hence rclone’s complaint that it ‘expected element type <Error> but have <html>’), and that’s why the error decoding fails: no matter how much I tweak the configuration, I cannot get nginx to pass the backend’s error body through. To make matters worse, S3 can even return some errors inside a 200 OK response, with the error element buried in the body. Whoever came up with that incredibly stupid idea of breaking the HTTP conventions for sending back errors should have been shot on sight, a decade ago…

Anyway, it’s too late to complain, and I’m sure that one day I’ll find a workaround (essentially, I’m trying to force nginx to pass the backend’s error body through untouched every time it gets a reply MIME-tagged as application/vnd.api+json, but that doesn’t seem to be working).
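
For reference, this is the sort of nginx fragment I have been experimenting with (upstream address is a placeholder; this is a sketch of the idea, not a known-good configuration). The relevant directive is proxy_intercept_errors, which, when off, lets upstream error bodies pass through instead of being replaced by nginx’s own HTML error pages:

```nginx
# Sketch only: the upstream address is a placeholder.
location / {
    proxy_pass              http://127.0.0.1:8080;
    proxy_http_version      1.1;
    proxy_set_header        Host $host;

    # Pass Cells' own error bodies through unchanged, rather than
    # substituting nginx's HTML "404 Not Found" / "405 Not Allowed" pages.
    proxy_intercept_errors  off;
}
```

A 404/405 rendered by nginx itself often means the request never reached the backend at all (for instance, a method handled by a location that nginx serves directly), which is worth checking too.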

2. s3cmd

s3cmd is a popular free and open-source command-line tool written in Python. It was originally designed to work only with Amazon S3 (there was nothing else at the time… aye, it’s that old!), but nowadays it can connect to almost any S3-compatible cloud storage system. Configuring it is quite simple, with just a few standard options: destination URL, access key, secret key (the usual gatewaysecret), v4 signatures. Access is quite fast, but it also shows the .pydio files:

$ s3cmd ls 's3://io/personal-files/Publicly shared items on forums etc/' -v --progress --stats
                          DIR  s3://io/personal-files/Publicly shared items on forums etc/
2020-02-26 17:40           36  s3://io/personal-files/Publicly shared items on forums etc/.pydio
2019-07-21 16:50      4882356  s3://io/personal-files/Publicly shared items on forums etc/Screenshot 2019-07-21 at 17.28.26.png
2021-10-12 09:31      1083700  s3://io/personal-files/Publicly shared items on forums etc/Screenshot 2021-10-12 at 00.08.05.png

But naturally enough you cannot remove it:

$ s3cmd rm 's3://io/personal-files/Publicly shared items on forums etc/.pydio'
ERROR: Error parsing xml: mismatched tag: line 6, column 2
ERROR: b'<html>\r\n<head><title>405 Not Allowed</title></head>\r\n<body>\r\n<center><h1>405 Not Allowed</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
ERROR: S3 error: 405 (Not Allowed)

Note that, once again, the error is mistakenly rendered as HTML by nginx (see the comments above), which is what produces the XML-parsing error shown.

Discussion

The existence of a .pydio file is crucial for structured datasources to work. In ancient times, files starting with a dot were deemed ‘invisible’, in the sense that they could not be seen or changed by ‘normal means’. To the best of my knowledge, this is still the default behaviour of Apple’s macOS Finder, for instance: you need to explicitly turn on the ability to see files starting with a dot.

However, contemporary usage of files and folders starting with a dot has increased dramatically, for many reasons; perhaps the most important is that, historically, they are hard to remove by accident with a ‘plain’ command in a terminal (e.g. rm * will not delete a .profile by mistake). The sheer number of dot-files these days is also conveniently ‘hidden’ by command-line tools such as the venerable ls.
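
A quick demonstration of that glob behaviour (assuming default shell settings, i.e. dotglob disabled):

```shell
# Dot-files survive a bare glob: `*` does not match names starting
# with a dot under default shell settings.
tmpdir=$(mktemp -d)
cd "$tmpdir"
touch .pydio visible.txt
rm -- *       # removes visible.txt only
ls -A         # prints: .pydio
```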

But because such files are so convenient for storing all sorts of things, almost all (if not all) remote-access protocols will naturally list them and treat them as regular files (which is exactly what they are). Any special rules that apply to them, from the perspective of the software that created them and treats them as ‘special’, will therefore not be ‘known’. This is best illustrated with a GUI-based remote tool (over SFTP, SMB, NFS, WebDAV, S3, or whatever protocol it uses) attempting to remove an entire directory tree: the tool will not ‘know’ which dot-files are ‘special’ and will treat them as plain files instead. Such a tool might therefore wreak havoc on a Cells installation that assumes its structured datasources have been populated with .pydio files (AFAIK identifying each directory with a UUID).

The reverse is of course also true: simply adding a new directory under a structured datasource will not work remotely (at least not without a resync), since Cells will have no prior information about that directory if its unique UUID hasn’t been written to .pydio.

One major issue I’ve found happens when attempting to remove the enclosing directory remotely, outside of Pydio’s own tools.

While this file is, to a degree, ‘protected’ by Cells, in the sense that it is neither visible nor accessible via Cells-specific tools (GUI, cec, sync, etc.), the same is not true when the Cells repositories are accessed directly through a non-Pydio protocol: anything from writing to those directories directly, locally or remotely, to accessing the workspace via the built-in S3 API.

Moreover, how a change or removal of the .pydio file is handled depends on the tool that attempts it. A local command such as rm .pydio will obviously remove the file immediately and forever (especially if executed as root!), since it acts on the underlying filesystem. Pydio-specific applications, from cells-sync (GUI or CLI) to cells-client, handle .pydio files properly, never allowing them to be read or written; but deleting the directory containing such a file gives an unpredictable error (which, in my case, is not even ‘caught’ at the reverse-proxy level, which in turn produces a ‘wrong’ error, HTML instead of XML, thus breaking the communication protocol even further). The same usually applies when attempting to rename the directory or to sync with it (by overwriting its contents).

While this issue does not exist with unstructured datasources, since those are written to disk in a format not directly accessible to non-Pydio tools, there are quite good reasons for keeping structured datasources around, even if accessing them is noticeably slower than unstructured ones. This was already true for Pydio 8, which naturally had similar issues. Pydio Cells can be deployed as a sophisticated front-end to a filesystem that is not exclusive to Cells itself, but rather serves several different purposes and is read and written by a plethora of other tools; those tools may not be able to communicate with Cells directly, and therefore have no choice but to access the filesystem itself.

In my use-case scenarios, I use the Cells datasources for several distinct purposes:

  1. As a ‘central repository’ for many different remote synchronisation protocols. My users all have different ways of synchronising their data: they might use Microsoft OneDrive, Google Drive, Dropbox, iCloud, Synology Drive, NFS, SMB, or even AWS. While many of those protocols are natively ‘spoken’ by Cells Enterprise, we poor Home Edition users only get S3 and WebDAV beyond Pydio’s own communication protocol and API. In some cases it might be possible to adapt existing solutions to ‘speak native Cells’ (when the tool requiring access to the datasource is actively developed in-house and can therefore be adapted to Pydio Cells’ own API), but most end-users will rely on their favourite communication protocol instead. Cells needs to adapt, and that is only possible via regular filesystem-synchronisation tasks; meaning that, unlike with the S3 and WebDAV protocols, synced files or directories will not be immediately accessible to other end-users.
  2. As a master backup service for network servers. While many hosting providers offer additional backup storage in geographically distinct locations, such storage can be expensive, especially with a lot of data to back up. In my case, everything gets synced to some special directories, which, in turn, are visible via Pydio Cells. The problem here is that Cells users will most likely not ‘see’ the latest backups unless they manually resync the datasource, something most of them are not even allowed to do by themselves, thus requiring manual sysadmin intervention.
  3. As a source for all kinds of media files for a streaming service. I run a (small) streaming service for very specific projects which cannot afford a ‘real’ streaming service and, of course, wish to avoid the ads inserted by popular public streaming platforms (such as YouTube). ‘Real’ streaming services are costly, even for small projects, and the free public ones limit how the stream is provided: in most cases there is an absolute requirement to access the stream from a web browser (so that ads and tracking cookies can easily be injected), and most ‘workarounds’ are routinely patched by those providers. But certain use-cases really require access to the ‘raw’ stream, so to speak, via RTSP/HLS, Shoutcast/Icecast, or some such streaming protocol. Instead of installing a full-blown commercial streaming solution, I simply use Pydio Cells for end-users to upload their media, plus a reasonably simple tool that reads those files and feeds them to a streaming server, using combinations of FOSS tools. Again, in this scenario, end-users may wish to remove files and/or directories, or replace them with different versions, and expect all those changes to be immediately reflected in Pydio.

Anyway, my point was just to explain that a lot of the above can already be achieved with the existing version of Pydio Cells. The devil, as they say, is in the details. Making the .pydio file invisible to whatever backends communicate with Pydio (S3, WebDAV, direct API…) would go a long way towards getting the above functionality working. Note that it’s not enough to change the file’s permissions; from the perspective of any backend it really must not exist, so that if a client app attempts to remove, say, an ‘empty’ directory (one containing only a .pydio file, even an ‘invisible’ one), that doesn’t return an error but instead removes both the directory and the .pydio file.
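
To make the ‘empty’ directory case concrete: a client that knew about the marker could handle the removal as sketched below (purely illustrative Python of my own; no Pydio tool behaves this way today):

```python
import os

def remove_client_empty_dir(path):
    """Remove a directory that a client legitimately sees as empty:
    delete the hidden .pydio marker first (if present), then the
    directory itself. Hypothetical client-side logic, not part of
    any Pydio API."""
    marker = os.path.join(path, ".pydio")
    if os.path.isfile(marker):
        os.remove(marker)
    os.rmdir(path)  # still fails if real content remains
```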

The same, of course, applies to those special cases where a directory is to be overwritten. Because it still contains one file, an error is usually given, and so things like remotely syncing whole directories will fail, especially if the process attempts to overwrite, rename, or delete existing directories (which, from the remote perspective, are empty; they just contain a .pydio file on the server’s side, after all).

Conversely, when creating a new directory (or a whole hierarchy), it should get automatically populated with .pydio files. I believe that this actually happens in most cases.
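
Incidentally, the 36-byte size shown in the listings above matches the canonical textual form of a UUID. A minimal sketch of what such auto-population could look like; this is my own illustration, not Cells’ actual code:

```python
import pathlib
import uuid

def ensure_pydio(directory):
    """Create a .pydio marker holding a fresh UUID if the directory
    doesn't have one yet, and return the UUID it contains.
    Illustrative only; not how Cells itself provisions the file."""
    marker = pathlib.Path(directory) / ".pydio"
    if not marker.exists():
        marker.write_text(str(uuid.uuid4()))
    return marker.read_text()
```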

None of that, AFAIK, is possible with the current tools, and it’s usually hard, if not impossible, to change the behaviour of existing clients connecting via, say, S3 or WebDAV so that they go ahead with a removal/rename operation even when it fails with an error.

A possible rewrite of the code could instead keep a copy of the storage hierarchy (the UUIDs with their respective directory entries) in separate storage: either on disk (in a flat file, using, for example, JSON to replicate the hierarchy) or in the database itself. The idea is that the storage space itself would then be an exact replica of what users see in their own client apps, with no extraneous .pydio files or other ‘pseudo-files’. This comes at the cost of making the maintenance of that tree structure expensive; it would also require a background ‘disk check’ monitor watching for any changes made across the many directories and updating either the database or the flat file (outside the storage space) accordingly. That means extra code… and a different file-sync procedure on the Cells server side, so I’m not even sure whether this should be considered a ‘bug’ fix or a whole new way of accessing the storage filesystem, requiring a lot of additional development work…
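
A minimal sketch of that alternative design, with hypothetical function names of my own: walk the datasource, map each relative directory path to a UUID, and persist the map in a JSON flat file kept outside the storage space:

```python
import json
import os
import uuid

def build_directory_index(root):
    """Map each directory (relative to `root`) to a UUID, replicating
    on the side what the per-directory .pydio files encode today."""
    index = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        index[os.path.relpath(dirpath, root)] = str(uuid.uuid4())
    return index

def save_index(index, path):
    # The flat file lives *outside* the datasource, so the storage
    # tree itself stays free of pseudo-files.
    with open(path, "w") as fh:
        json.dump(index, fh, indent=2)
```

The hard part, as noted, is keeping this index consistent with out-of-band filesystem changes, which is exactly what the background monitor would be for.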

Hi, interesting subject!

We did think of keeping the .pydio files somewhere else but it quickly raised the questions you briefly state, so we went with the KISS approach :slight_smile: plus we had plenty on our plates with the sync side of the structured datasources.

But definitely worth an analysis to see what would be required to change imo.

Greg