I am encountering a strange issue once my Cells instance has been running for a while. At random, all edit actions (delete, rename, edit, etc.) stop working. The HTTP API times out (visible in Chrome devtools) and the logs show an odd jobs-related error.
If I restart, it works fine again for an indeterminate number of hours.
I have no idea how to proceed. This just randomly started happening.
Cells v4.1.0 (Home edition)
Revision: d7276aacaea3f7c35ece7c893f4098b89bdc5d90
2023-02-14T20:05:52.378Z ERROR pydio.rest.jobs Rest Error 500 {"error": "rpc error: code = Canceled desc = latest balancer error: service is not known by the registry", "SpanUuid": "cf027ed9-35d0-430d-ab1d-eb7869cf3215", "RemoteAddress": "xxx.xxx.xxx.xxx", "UserAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0", "ContentType": "application/json", "HttpProtocol": "HTTP/2.0", "UserName": "flexusma", "UserUuid": "a2bb80d9-9803-4ec9-97e6-d00f451cb087", "GroupPath": "/m4fx/", "Profile": "admin", "Roles": "ROOT_GROUP,d23a93b0-fb4e-43f1-8315-48f328faa8c7,ADMINS,a2bb80d9-9803-4ec9-97e6-d00f451cb087"}
I've just found another similar-looking error log, but from a different logger (pydio.rest.tree):
GroupPath : /m4fx/
HttpProtocol : HTTP/2.0
JsonZaps : {"ContentType":"application/json"}
Level : error
Logger : pydio.rest.tree
Msg : Rest Error 500 - rpc error: code = Canceled desc = latest balancer error: service is not known by the registry
RemoteAddress : xx.xx.xx.xx
SpanUuid : ed221a83-98d5-4e0a-bd56-6e6ed5612db6
Ts : 1676658235
UserAgent : Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/110.0
UserName : flexusma
UserUuid : a2bb80d9-9803-4ec9-97e6-d00f451cb087
I think I found the cause of this. I had a large workspace (~250 GB) with large video files that I had previously turned indexing off on. It seems an index job was still running when the VM was restarted and something broke. I removed the datasource and workspaces, but the scheduler seems to be crashing on an unfinished job or something similar. I don't know if there is a way to flush the jobs queue.
Seems like clearing the job history/cache solved the weird issue. Jobs are running fine again:
mv /var/cells/services/pydio.grpc.jobs/tasklogs.bleve/ /var/cells/services/pydio.grpc.jobs/tasklogs.bleve.bck
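For anyone hitting the same symptoms, here is roughly the procedure I used, as a sketch: it assumes Cells runs as a systemd service named cells and that the working directory is /var/cells (adjust the service name and paths to your install). The tasklogs.bleve directory only holds the scheduler's task-log history index, and Cells recreates it on the next start.

# Stop Cells so nothing is writing to the task-logs index
sudo systemctl stop cells

# Move the bleve task-logs index aside; it is rebuilt empty on startup
sudo mv /var/cells/services/pydio.grpc.jobs/tasklogs.bleve/ \
        /var/cells/services/pydio.grpc.jobs/tasklogs.bleve.bck

# Start Cells again and watch the logs to confirm the scheduler comes back clean
sudo systemctl start cells
sudo journalctl -u cells -f

Once everything looks healthy you can delete the .bck directory; keeping it around first means you can restore it if anything else breaks.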