Synchronizing / indexing system files?

I was hoping to use Cells as a basic file viewer for my CentOS server. All the files are being created by the system, not through Cells.

I created a data source pointing at my folder and added it to a new workspace. However, when I add files to the folder, they do not show up automatically in Cells; to make them appear, I have to re-synchronize the datasource. I used Pydio 8 Community a bit before this, and that seemed to handle this fine.

So, I have a few questions:

  1. Does synchronizing also index the files in the database, for faster searching?
  2. Is there a way to automate this, and if so, what is the best way?
  3. Do all the files need to be re-indexed every time this happens, or does it only need to synchronize the new files?

Question 3 is especially important because the folder is going to have tens of thousands of text files, and having to re-index them all every time likely won’t be practical, and I’ll have to find a different solution.

Thank you.

  1. Yes. Synchronizing is the indexation process; it does not create a copy of the data.
  2. Yes, you can automate it from the CLI (see the example just below):
    /home/cells data sync --service=pydio.grpc.data.sync.datasource_name --path=/
  3. If you change files directly on the file system, re-indexation is required. However, Cells will only sync the parts of the directory that changed.
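
To make the service flag concrete: for a datasource named mydata (a hypothetical name, substitute your own), the datasource name is appended to the service identifier, so the call would look like:

    /home/cells data sync --service=pydio.grpc.data.sync.mydata --path=/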

Oh, that’s great. So if I set up a cron task for ‘/home/cells data sync --service=pydio.grpc.data.sync.datasource_name --path=/’ (with my paths) to run every 5 minutes or so, it will check the folder for new files only and add just those new files to the database? That should mean performance stays alright even with folders holding 50,000 text files or so? Just confirming.

Is there a way to see which files are indexed, or which files are being indexed as the sync is running?

No way 🙂

A cron job is OK, but it should run as the “pydio” user. The interval should be longer than 5 minutes, depending on the size of the data; for 50k text files, I think 20 minutes is fine.
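
Putting that together, a minimal /etc/cron.d sketch (assuming, as above, the Cells binary at /home/cells and a datasource named datasource_name; both are placeholders for your setup):

    # /etc/cron.d/cells-sync: re-index the datasource every 20 minutes, as the "pydio" user
    */20 * * * * pydio /home/cells data sync --service=pydio.grpc.data.sync.datasource_name --path=/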

Why 20 minutes? If the bulk of the files are already indexed and it’s only scanning for new files, would it have that much of a performance impact?