Import from a remote source

Describe your issue in detail

I’d like to import data already located remotely (here on Google Docs, …)

What version of Cells are you using?

latest

What is the server OS? Database name/version? Browser name or mobile device description (if issue appears client-side)?

Debian

What steps have you taken to resolve this issue already?

That’s a topic I can’t find much information on.
It often happens that data already exists remotely and users want to avoid a painful download/upload procedure.
I use a “flat/opaque” object storage as a backend.

What’s the advised way for Pydio to import GoogleDocs data straight into my cell?

I plan to do an rclone mount on the server filesystem, then add a read-only structured local datasource in Pydio Cells and copy files/folders using the UI.
Is this the best and advised way to handle this issue?
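
For illustration, the mount I have in mind looks roughly like this (remote name, mount point and flags are placeholders I haven’t validated against Cells yet):

    # assumed rclone remote named "gdrive:", mounted read-only where Cells can reach it
    rclone mount gdrive: /mnt/google-docs --read-only --daemon --vfs-cache-mode minimal

and /mnt/google-docs (or a subdirectory of it) would then be declared as a structured local datasource in the admin console.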

Hello,

If you would like to import data already located in Google, you may use “Google Cloud Storage” as a datasource in Cells.

Otherwise, you can use Cells to expose an SFTP service, then use rclone to synchronize the data between the two:

GoogleDocs <===rclone==> sftp (cells)
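
As a rough illustration (remote names and the target path are placeholders; you would first define a “drive” remote and an “sftp” remote pointing at the Cells gateway in your rclone config):

    # one-way copy from the Google remote into a Cells workspace exposed over SFTP
    rclone copy gdrive: cells-sftp:/personal-files/imported --progress

    # or keep the destination in sync with the source (test with --dry-run first)
    rclone sync gdrive: cells-sftp:/personal-files/imported --dry-run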

Both Google Cloud Storage and SFTP are Enterprise features. I plan to go Enterprise at some point in the future, but importing isn’t a feature I like to see paywalled. If I go the rclone-mount way, I can avoid SFTP (though direct Python access to Google Storage would probably have been more efficient).

Reading and exploring further on this:

  1. When mounting (from NFS, Samba, rclone, …), one would consider using the parent directory, but pointing a datasource at a parent directory leads to “Export path should not have any sub-mounts”.

  2. Assuming (for now) we are fine importing only a subdirectory as a datasource, we get hit by
    Cells checking that the parent folder is writable before creating the datasource (why can’t we add a truly read-only datasource, and not only a read-only workspace? :warning:)

  3. Let’s make this writable by building an overlayfs on top of the parent directory and giving Pydio permission on it:
    mount -t overlay overlay -o lowerdir=where-rclone-mount-google,upperdir=_upper,workdir=_workdir pydio-mntpoint && chown -R pydio pydio-mntpoint, then point the datasource to pydio-mntpoint/subdirectory (remember point #1 above; see the consolidated command sketch after the hierarchy below).

  4. Here I got “object service local2 is already pointing to XXX, make sure to avoid using nested paths for different datasources” (XXX being the parent of pydio-mntpoint/), but removing and recreating the datasource got rid of this.

  5. From here I could create a read-only Workspace, but the copy option is disabled (the workspace must be made read-write even if the goal is just to copy files out into another workspace) :warning:

  6. Side note: even after deactivating the datasource, the umount is impossible (“target is busy”); the datasource must be deleted first, which is, to say the least, very inconvenient :warning:

Overall, this sounds doable with the below sample hierarchy:

├── _upper        # overlayfs upper dir (to make it seemingly writable to Pydio)
│   └── origin
│       ├── a.txt
│       ├── b.txt
│       └── subdirectory
│           ├── c.txt
│           └── d.txt
├── _workdir    # overlayfs
│   └── work
├── one-level   # This is the intermediary directory needed by Pydio
│   └── origin  # This is where you would mount your SMB/NFS/Google/rclone-mount/...
│       ├── a.txt
│       ├── b.txt
│       └── subdirectory
│           ├── c.txt
│           └── d.txt
└── pydio-mntpoint  # The overlayfs mount-point
    └── origin      # The directory to present to Pydio when creating a datasource
        ├── a.txt
        ├── b.txt
        └── subdirectory
            ├── c.txt
            └── d.txt
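
For completeness, here is the (untested) command sequence I would use to produce the hierarchy above; the rclone remote name is a placeholder:

    mkdir -p one-level/origin _upper _workdir pydio-mntpoint

    # 1. mount the remote data read-only under the intermediary directory
    rclone mount gdrive: one-level/origin --read-only --daemon

    # 2. overlay it so that Pydio sees a writable tree
    mount -t overlay overlay -o lowerdir=one-level,upperdir=_upper,workdir=_workdir pydio-mntpoint

    # 3. give the pydio user ownership of the merged view
    chown -R pydio pydio-mntpoint

    # 4. in the admin console, point the new datasource at pydio-mntpoint/origin

The chown triggers overlayfs copy-up, which is presumably why _upper/origin ends up mirroring the tree.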

[tbc]


Ouchie. Ok, I have been struggling with this issue for a long while as well (although, in my case, I need such a mount point to be read-write), and had no idea that it took such a complex and convoluted approach to achieve in Cells something which, otherwise, seems pretty simple to do.

I have a question, though. Using this approach, will Pydio Cells’ built-in auto-indexing mechanism still pick up anything that is dropped into the datasource? It looks like it might be the case. My problem is that all my futile attempts so far have required Cells to reindex the whole datasource, something which, in my case, takes more than a few minutes and therefore cannot be synced in quasi-real time (the way rsync, Rclone, or other similar synchronisation tools are able to do).

And of course reindexing everything is a very resource-intensive procedure; my otherwise reasonably powered bare-metal server is overwhelmed for the minutes it takes. As such, I reindex only occasionally, and my users know that the only way to get things to appear in Pydio instantly is to upload them via the web pages (they usually resist any of my suggestions to use Pydio Sync instead, and truthfully I cannot blame them).
