Cells purging question

Hey, we’ve been trialling Cells for a while now and have noticed that files are not being purged. The retention policy is set to “30 days max”. The documentation doesn’t say I need to set up a cron job as I had to do for Pydio v8, so I assumed it would do its thing automatically. Am I missing something?

Hello @scott.bentley,

If you are referring to versioning, then you should only have the last 30 versions of a file (one for each of the previous 30 days).

To make sure I understand your request: you want files that are 30 days old to be permanently deleted from your Cells?

Thanks @zayn,

Yes, we want the file to be permanently deleted. We do not use the product as a file management tool and specifically do not want users storing tons of documents forever. Our use-case is to provide file transfer and limited collaboration between staff and clients. Purging files regularly is necessary to (a) relieve staff from having to remember to do it themselves and (b) ensure we aren’t using any more disk space than absolutely necessary for our use-case purposes.

Hey @zayn, just wanted to check in on this and see if you have anything else to add. Is purging a feature that is likely to be implemented in the future? Should I attempt to write a plugin of some sort? Where would I start if I wanted to write a plugin that would perform file purging?

Thanks,

Scott

A plugin development guide with some boiler-plate code would be, indeed, very nice to have.

One way would be to use the SDK to write a custom task that performs the deletion (run from cron or a systemd timer).

The Go SDK lets you perform CRUD operations on resources, so you can list and delete data (for example, list the nodes, check whether each one is more than X days old, and delete those that are).

If you are stuck, tell me and I’ll write a snippet for you.

You could also script something with the cells-client

Thanks Zayn. I was looking into this and then COVID hit. I’m back looking at it again now though!

So, I installed the cec client and wanted to list all existing cells, but I don’t see how to do this. If I run “cec ls” I only get the workspaces and cells of the authenticated user (admin, in this case). I need a listing of ALL cells and/or files so that I can then delete those older than 30 days. Would you be able to provide me with a snippet that might accomplish this?

Thank you so much, and I hope you and your loved ones have been well throughout this pandemic!

Scott

Hey @zayn , I don’t want to be pushy and I understand if there’s more pressing issues you need to respond to, however if there is an example of using the cec to find and remove files from accounts other than the logged in account, could you please point it out to me?

Otherwise, can you please advise how I can find files that are older than a number of days and administratively clear them for ALL accounts?

Also, a related question, how do I force an expiry on all public shares so the user cannot create shares without at least a minimum expiry of, for example, 60 days?

Hello @scott.bentley,

Here is a Go snippet to help you get started. It lists nodes with ListAdminTree from the admin tree service (which lets you see all the nodes), and I have added comments showing where to add the functions you need.

package main

import (
	"log"
	"path/filepath"
	"strconv"
	"time"

	cells_sdk "github.com/pydio/cells-sdk-go"
	"github.com/pydio/cells-sdk-go/client/admin_tree_service"
	"github.com/pydio/cells-sdk-go/example/cmd"
	"github.com/pydio/cells-sdk-go/models"
)

var (
	config = &cells_sdk.SdkConfig{
		Url:        "https://my-cells.com",
		ClientKey:  "cells-front",
		User:       "admin",
		Password:   "",
		SkipVerify: false,
	}
)

func main() {
	ctx, cli, err := cmd.GetApiClient(config)
	if err != nil {
		log.Fatalf("Could not GetApiCLient, cause: %v\n", err)
	}

	params := &admin_tree_service.ListAdminTreeParams{Body: &models.TreeListNodesRequest{
		// Lists all the nodes under the personal datasource (meaning, personal-files/admin, personal-files/johndoe, etc...)
		Node: &models.TreeNode{Path: "personal"},
		Recursive: true,
	}, Context: ctx}

	// ListAdminTree lists all the nodes
	result, err := cli.AdminTreeService.ListAdminTree(params)
	if err != nil {
		log.Fatal(err)
	}

	for _, n := range result.Payload.Children {

		// ignores .pydio files
		if filepath.Base(n.Path) == ".pydio" {
			continue
		}

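		// Parse the node's MTime (Unix seconds) and compute its age
		i, err := strconv.ParseInt(n.MTime, 10, 64)
		if err != nil {
			log.Printf("skipping %s: invalid MTime %q", n.Path, n.MTime)
			continue
		}
		age := time.Since(time.Unix(i, 0))

		// Check whether the node is older than 30 days
		if age > 30*24*time.Hour {
			// Add the code that deletes the node here (see TreeService).
			// nil is only a placeholder; replace it with real delete parameters.
			cli.TreeService.DeleteNodes(nil)
		}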

	}
}

@zayn thanks that looks great! Of course, now I need to figure out how to use GO as I’ve never touched it before.

Was it your intention that this snippet should be used to create a separate app/script or is this something that could be turned into a plugin? I don’t see any resources in the documentation about developing plugins and I was kind of hoping to make this something that could be used from within Cells itself so other admins wouldn’t need to use command line scripts for this purpose.

Also, I see that you’ve used AdminTreeService to list all nodes, and TreeService to delete nodes. Can I do this with the REST API? Can an admin delete any node using TreeService as long as they have the path reference?

Thanks so much, and sorry to be asking so many questions lol

Hey @scott.bentley,

Yes, all of those operations are available through the REST API as well. Sorry that I wrote the snippet in Go; it is the main language I use for my scripts.

You could write the same script in bash or in other languages, for instance Java (Cells Java SDK).

Adding this as a plugin would be possible. There is no direct documentation for it, but you could study the code and see how it is done for the other plugins.

Hey @zayn,

I’ve been playing with Postman and the Cells API trying to make this work and I’ve come a long way but stumbled at the finish line. I have managed to use the a/tree/admin/list endpoint to list all the files under cells belonging to users, as you can see in this screenshot:

And in this second screenshot you can see that I am trying to delete one of these files from the “cellsdata” storage datasource, under the user “scott.bentley@hhangus.com”, however, the response says I cannot access the workspace. Now, I’m doing this as the “admin” account, so there should not be permissions issues involved. I’m wondering if the issue has to do with the fact that “cellsdata” is a storage “datasource” and not a “workspace”. Can you please let me know if I’m doing something wrong, or provide guidance?

Hello @scott.bentley,

cellsdata is actually the name of the datasource that holds all the cells, and the API to delete nodes (/tree/delete) only works with paths that do not come from the admin view (meaning the admin list).

Unfortunately, I just realized that there is no API to delete using the admin paths.

Another solution, suggested to me by the devs, is to create a workspace with all of your datasources as roots and then use the usual API (/a/tree/stat) on that workspace to list the nodes and perform the actions you need on them.

Sorry if I misled you with the AdminTreeService API; it seems this API is only used for specific cases in the application.

Thanks @zayn

I actually had been considering trying that as I noticed that the Workspaces configuration would allow me to make any disk location part of the root. I was reluctant to try it though as I wasn’t sure what effect it might have on permissions or other settings. If you think this is a good idea, I’ll try it out, thanks.

Getting back to the idea of a plugin, is there documentation of any sort that would help me understand how to write one, or where to even start? I looked at the GitHub code, and as I have zero experience with whatever framework* you’re using and have never even used Go, I’m more than a little lost as to where to begin.

* What framework are you using, anyway?

@zayn,

Ok, so I’ve made progress!

(1) I created a workspace called “Cells Data” with the slug “cellsdataws” with Read Only basic permissions.

(2) Assigned Read/Write permissions to the Administrator Role.

(3) I tried to list the workspace contents using a/tree/stat but this only lists the content of one node at a time, which would require me to write a recursive script to run through all the nodes to find what I want. Instead of this, I am using a/tree/admin/list to actually find contents of the DATASOURCE:cellsdata with the LEAF filter, and this is great because I can get a full list of all the files and their MTime from a single API call.

(4) I used a/tree/delete to delete the node, however, because the node must reside in a workspace I replaced the datasource name (cellsdata) with the workspace slug (cellsdataws) in the path and it works…sort of. So, it actually moved the node into the recycle bin, which is not quite what I wanted.

(5) I used a/tree/delete to delete the recycle_bin and that cleared it out too.
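For anyone repeating steps (3) and (4) in code, the path rewrite and the delete request body might look roughly like the Go sketch below. The local struct types and the JSON key names (Nodes, Recursive, Path) are assumptions modelled on the SDK field names seen in this thread; the helper names toWorkspacePath and buildDeleteBody are made up for the example. Check the real request shape against your Cells version before sending anything.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Local stand-ins for the request shape. The JSON keys are an assumption
// based on the Go field names used elsewhere in this thread.
type treeNode struct {
	Path string `json:"Path"`
}

type deleteRequest struct {
	Nodes     []treeNode `json:"Nodes"`
	Recursive bool       `json:"Recursive"`
}

// toWorkspacePath swaps a leading datasource name for a workspace slug,
// e.g. "cellsdata/a/b.txt" -> "cellsdataws/a/b.txt". It returns false
// when the path does not start with the datasource name.
func toWorkspacePath(adminPath, datasource, slug string) (string, bool) {
	p := strings.TrimPrefix(adminPath, "/")
	if p == datasource {
		return slug, true
	}
	if !strings.HasPrefix(p, datasource+"/") {
		return "", false
	}
	return slug + p[len(datasource):], true
}

// buildDeleteBody converts admin-list paths into a JSON delete request,
// silently skipping paths that are outside the datasource.
func buildDeleteBody(adminPaths []string, datasource, slug string) ([]byte, error) {
	req := deleteRequest{Nodes: []treeNode{}}
	for _, ap := range adminPaths {
		if wp, ok := toWorkspacePath(ap, datasource, slug); ok {
			req.Nodes = append(req.Nodes, treeNode{Path: wp})
		}
	}
	return json.Marshal(req)
}

func main() {
	body, err := buildDeleteBody(
		[]string{"cellsdata/alice/report.pdf", "otherds/ignored.txt"},
		"cellsdata", "cellsdataws",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

As described in step (4), a delete issued this way only moves the node to the recycle bin, so a second request against the recycle bin is still needed to free the space.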

So, this works!

Thank you sooo much!

Hello,

Thank you for this script; it helped me get started with Go and purge some files. However, I am stuck at the “deleting specific nodes” part.

I assumed that DeleteNodes(nil) would not delete just the files I need, so I tried something like:

	deleteList := tree_service.NewDeleteNodesParams()
	deleteList.SetBody(&models.RestDeleteNodesRequest{
		Nodes:     []*models.TreeNode{},
		Recursive: false,
	})

	// <loop over the datasource/folder to select the nodes to purge>
	//   <if the node was selected>
	//       deleteList.Body.Nodes = append(deleteList.Body.Nodes, n)

	if len(deleteList.Body.Nodes) > 0 {
		ok, err := cli.TreeService.DeleteNodes(deleteList)
		fmt.Println("DeleteNodes returned:", ok, "error:", err)
	}

This returns 404 unknown. Which fundamental detail did I miss to achieve deleting only specific nodes?

Actually, looking into the Cells logs, they state that the requested workspace was not found, but the node I passed to delete is one returned by the tree list service. \ 0.0 /
Regards

Partial answer to myself.
The delete fails, as the log message states, because the node path returned by the ListAdminTree service starts with a DATASOURCE and not a workspace. Replacing the datasource with a workspace name “fixes” the issue.

Now I need to figure out how to find the workspace of a given node from its full path starting with a datasource.
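Until there is an API answer, one workaround is a hand-maintained table from datasource roots to workspace slugs, resolved by longest prefix. The Go sketch below is only that: the table contents are hypothetical (the "personal/shared-team" entry in particular is invented to show nested roots), and nothing here queries Cells.

```go
package main

import (
	"fmt"
	"strings"
)

// workspaceRoots maps a datasource-side root to a workspace slug.
// Hand-maintained and hypothetical: adjust it to your own setup.
var workspaceRoots = map[string]string{
	"cellsdata":            "cellsdataws",
	"personal":             "personal-files",
	"personal/shared-team": "team", // invented entry, shows a nested root
}

// resolveWorkspacePath rewrites a datasource-based path into a
// workspace-based one, using the longest matching root in the table.
func resolveWorkspacePath(adminPath string, roots map[string]string) (string, bool) {
	p := strings.Trim(adminPath, "/")
	best := ""
	for root := range roots {
		matches := p == root || strings.HasPrefix(p, root+"/")
		if matches && len(root) > len(best) {
			best = root
		}
	}
	if best == "" {
		return "", false
	}
	return roots[best] + p[len(best):], true
}

func main() {
	for _, p := range []string{
		"personal/johndoe/a.txt",
		"personal/shared-team/b.txt",
		"unknown/c.txt",
	} {
		wp, ok := resolveWorkspacePath(p, workspaceRoots)
		fmt.Println(p, "->", wp, ok)
	}
}
```

Matching on whole path segments (root followed by "/") keeps a datasource named "personal" from accidentally capturing one named "personal2".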

hey @sbs

I actually wrote a shell script to do this for us. It moves/deletes the files from their location on disk and then uses the Cells CLI tool to run a datasource synchronization that repairs the database. I know it’s not the “API based solution” I was hoping for when I posted this question, but it does work well.

This is the script:

#!/bin/bash
#
# This script will move files older than 30 days to recyclebin
# NOTE files inside folders with a name that starts with "/lts-" will not be moved.
# "lts-" folders are considered "Long Term Storage" and are exempt from this script.
#
shopt -s globstar

if [ "$USER" != "pydio" ]; then
        echo "Please run this script as pydio user"
        #exit
fi

#Empty the previous recyclebin
rm -f /home/pydio/recyclebin/*

#Obtain timestamp -30 days
pastdate=$(date -d '-30 days' '+%s')
#Loop through all files/folders
for file in /home/pydio/.config/pydio/cells/data/cellsdata/**
        do
                #Use perl to match file paths that do not contain "/lts-"
                match=$(echo "${file}" | perl -0777 -pe 's/^(?:(?!\/lts-).)+$/match/i')
                #Get type, directory | regular file
                filetype=`stat -c %F "$file"`
                #If its a good path and file then
                if [ "$match" == "match" ] && [ "$filetype" == "regular file" ]; then
                        #Get the mtime timestamp
                        filemtime=`stat -c %Y "$file"`
                        #If the file mtime is older or equal to the comparison date, delete it
                        if [ $filemtime -le $pastdate ]; then
                                echo "$file"
                                mv "$file" /home/pydio/recyclebin/
                                #rm "$file"
                        fi
                fi
        done

/home/pydio/cells admin resync --datasource=cellsdata
echo "****"
echo "Script finished. Please verify cellsdata resync in the logs."
echo "****"

Hi @scott.bentley ,

Thanks for this. I was using scripts with Pydio 8, and seeing that there was a REST API for this kind of thing gave me an opportunity to dive into yet another language.

I have added a few constraints to my scripts (folders to ignore, a size-based purging policy), so I would be happy to complete them with a better solution than hardcoding a string replace to handle the workspace <-> datasource issue.

I wish there was “readable” documentation for this API. Hopefully someone will point me to the page explaining how to properly find the base workspace from a file path using the API.

BTW: the /admin/tree/ls documentation page (https://pydio.com/en/docs/developer-guide/cells-admin-files-ls) states that file names are datasource based.

Hello,

In case anyone is looking for a script that deletes files older than X days, here is one.
It is based on the cec command.

#!/bin/bash

# The cec utility must be installed, configured, and available to the pydio user
# See https://pydio.com/fr/docs/developer-guide/cells-client

# Options for the cec command: run cec --help to list them
cec_opt="--details"

# Name of the Pydio storage
cec_storage="common-files"

# Retention period in days
days_keep="3"

# Format the output to show the file info (type, ID, name, size and creation date), removing the headers, the empty lines and the +, - and | characters
# Example output:
#  File     80a03e0642ed466a8d522bbd8c5efed8   rich.rtf    1.7 kB   27 minutes ago
#  File     55283ea788e84694810f7362d97df107   test.txt    1 B      1 day ago
#  File     072e00ca5f44446689f21a7381aca387   yolo.xlsx   4.2 kB   25 days ago

# Run the command and store its formatted output in the 'output' variable
output=$(cec ls $cec_storage $cec_opt | sed '1,5d' | tr -d '+-' | grep -v '^$' | sed 's/|/ /g')

# Split the output into lines and loop over each line
while IFS= read -r line; do
  # Split the line into columns, using whitespace as the separator
  columns=($line)

  # File name, in column 3
  filename=${columns[2]}

  # File size, in columns 4 and 5
  size=${columns[3]}
  unit=${columns[4]}

  # Age (number and time unit), in columns 6 and 7
  number=${columns[5]}
  time_unit=${columns[6]}
  # Join the columns from the 6th onward, separated by spaces
  date_full=$(IFS=' '; echo "${columns[*]:5}")

  # Print the file name / date / size
  #echo "File name: $filename, file size: $size $unit, date: $date_full"
  # Check whether the time unit is days
  if [ "$time_unit" == "days" ];then
    # Check whether the number of days reaches the retention period defined above
    if [ "$number" -ge "$days_keep" ];then
      echo "Deleting file $filename, older than $days_keep days!"
      # Delete the file with the cec command
      cec rm -f "$cec_storage/$filename"
    fi
  fi
    fi
  fi

done <<< "$output"