Registry error using etcd - v4

Hello,

I have been building a Cells v4 cluster using the third-party services described in https://pydio.com/en/docs/cells/v4/external-services

When I run Cells on two instances, the second instance cannot access the AWS S3 datasources. Also, when either Cells instance is stopped, the other instance receives the stop request as well, so both stop.

Cells Configuration

  Package:      Pydio Cells Enterprise Distribution
  Version:      4.0.4
  BuildTime:    10 Nov 22 15:36 +0000
  Git Commit:   b84bc90532aee48a22ea430a36ed21dfccda5aff
  Go Version:   go1.19.2
  OS/arch:      linux/amd64

Drivers:
  Registry:     etcd://etcd01:2379/registry
  Broker:       nats://nats01:4222/broker
  Config:       etcd://etcd01:2379/config
  Vault:        etcd://etcd01:2379/vault
  Keyring:      vault://vault01:8200/kv/keyring
  Certificates: vault://vault01:8200/caddycerts
  Cache:        redis://redis01:6379/cache
  ShortCache:   pm:///shortcache

Networking:
  Hostname:     ********
  Advertise:    127.0.0.1

Monitoring:
  Metrics:      false
  Profiles:     false

Build Settings:
  -compiler:    gc
  -trimpath:    true
  CGO_ENABLED:  1
  GOARCH:       amd64
  GOOS:         linux
  GOAMD64:      v1

I traced the issue to the etcd registry: it does not occur when the registry is left at its default setting.

The etcd cluster contains two nodes running on separate EC2 Ubuntu servers.

Errors in log of instance 2 on start:
There is already a running instance of pydio.grpc.data.index.pydiods1. Do not start now, watch registry and postpone start
WARN [ClientsPool] cannot find datasource, retrying in 2s... {"ds": "pydiods1", "retries": 0}

Log of instance 1 when instance 2 stops:
INFO pydio.server.manager Delete event received for pydio.grpc.data.index.pydiods1|pydio.grpc.data.index.pydiods1|pydio.grpc.data.index.pydiods1, debounce server Restart4

Cells Version: 4.0.4

Edit:
I have redeployed Pydio Cells from scratch using fresh external services, and the problem persists.

Hi,

For the stop, I’ve been working on similar problems over the last two days. Fixes will be available in the next version.

For the start, you shouldn’t get that “cannot find datasource” message. It might mean that communication between the two servers is not working correctly. Are they part of a private network, and can the two servers communicate with one another?

If you don’t mind, can you please share what you’re trying to achieve with the cluster and tell us a bit more about your setup? It’s quite interesting for us to get feedback on how it’s used and what we can do to improve things in the future.
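A quick way to check this from either server is a plain TCP connection test. The snippet below is only a minimal sketch: `can_connect` is a hypothetical helper name, and the host and port you pass to it are your own values (the other instance's private IP and a port a Cells service is actually listening on), not something Cells prescribes.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For example, running `can_connect("<private IP of the other instance>", <port>)` on each server tells you whether that direction is open. A `False` result usually points at a security group or firewall rule rather than at Cells itself.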

Thanks,
Greg

Hi Greg,

Thanks for getting back to me.

When you say the two servers, do you mean the two servers running Cells?

My method is to set up the external services and run cells configure against them. Once Cells is configured correctly and working on a single instance, I create an image and start a new instance from that image. This way I have two working Cells instances accessing the same services.
I had not seen anything saying that the instances need to be able to communicate with each other. They are part of a private network. Initially they couldn’t communicate with each other; they now can via private IP, but nothing has changed.

I am trying to create an auto scaling system using AWS ASG.

Thanks,
Cameron

Thanks for that.

I did mean the two servers running cells.

Your method is perfectly fine. However, services do need to communicate across the two instances of Cells. gRPC calls are made between services, and while priority is given to services running on the same instance, the app can choose to use another instance in some cases. The services for a datasource always run on only one instance at a time, because concurrent access to the indexed data can lead to corruption.

If the two instances can communicate with one another, your problem might be that they are using the wrong advertise address. The advertise address is the address that the registry gives to a gRPC client within a service, so that it can reach a specific service on any instance. By default the advertise address is equal to the bind address (so 127.0.0.1), except when the bind address is 0.0.0.0, in which case the advertise address is resolved automatically. You can see all advertised addresses in the “Services” section of the console.
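To make that rule concrete, here is a small sketch of the decision as described above. It is not the actual Cells code: the function name and the `resolve` callback (which stands in for whatever Cells does to discover the host's address) are assumptions for illustration.

```python
from typing import Callable, Optional


def effective_advertise_address(
    bind_address: str,
    advertise_address: Optional[str],
    resolve: Callable[[], str],
) -> str:
    """Illustrative model of how the advertised address is picked."""
    # An explicitly configured advertise address always wins.
    if advertise_address:
        return advertise_address
    # Binding to all interfaces triggers automatic resolution.
    if bind_address == "0.0.0.0":
        return resolve()
    # Otherwise the advertise address simply mirrors the bind address.
    return bind_address
```

With the defaults shown in your output (Advertise: 127.0.0.1), the result is the loopback address, so a client on the other instance that looks up a service would be told to dial 127.0.0.1, which is itself. That would be consistent with the “cannot find datasource” retries.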

You could try to set the bind address to 0.0.0.0 so that the advertise address is resolved directly, or set the advertise address to whatever private IP your instance has.
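If you are not sure which private IP the instance has (for example when it is launched from an image), something like the following can be used to look it up at boot. This is a generic sketch, not part of Cells: `primary_ipv4` is a hypothetical helper, and the probe address is an arbitrary private-range address used only to make the OS pick an outgoing interface.

```python
import socket


def primary_ipv4(probe_host: str = "10.255.255.255", probe_port: int = 1) -> str:
    """Return the local IPv4 address the OS would use for outgoing traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packet; it only selects a route,
        # which lets us read back the local address chosen for it.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route: fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()
```

The returned value is what you would then set as the advertise address for that instance.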

Hopefully that’s not too cryptic :)

Thanks,
Greg