When I run Cells on two instances, the second instance can’t access the AWS S3 datasources. Also, when either Cells instance is stopped, the other instance receives the stop request too, so both stop.
I traced the issue to the registry service: the problem does not occur when the registry service is set to default.
The Etcd cluster contains two nodes running on separate EC2 Ubuntu servers.
Errors in the log of instance 2 on start:
There is already a running instance of pydio.grpc.data.index.pydiods1. Do not start now, watch registry and postpone start
WARN [ClientsPool] cannot find datasource, retrying in 2s... {"ds": "pydiods1", "retries": 0}
Log of instance 1 when instance 2 stops:
INFO pydio.server.manager Delete event received for pydio.grpc.data.index.pydiods1|pydio.grpc.data.index.pydiods1|pydio.grpc.data.index.pydiods1, debounce server Restart4
Cells Version: 4.0.4
Edit:
I have deployed Pydio Cells from scratch using fresh external services, and the problem persists.
For the stop, I’ve been working on similar problems for the last two days. Fixes will be available in the next version.
For the start, you shouldn’t get the message “cannot find datasource”. It might mean that communication between the two servers is not working correctly. Are they part of a private network, and can the two servers communicate with one another?
If you don’t mind, can you please share what you’re trying to achieve with the cluster and tell us a bit more about your setup? It’s quite interesting for us to get feedback on how it’s used and what we can do to improve things in the future.
When you say the two servers, do you mean the two servers running Cells?
My method is to set up the external services, then run cells configure against them. Once Cells is configured correctly and working on a single instance, I create an image and start a new instance from that image. This way I have two working Cells instances accessing the same services.
I had not seen anywhere that the instances need to be able to communicate with each other. They are part of a private network. They couldn’t communicate with each other before; they now can via private IP, but nothing has changed.
I am trying to create an auto scaling system using AWS ASG.
Your method is perfectly fine. Services need to communicate between the two instances of Cells. gRPC calls are made between services, and while priority is given to communication between services sharing the same instance, the app can choose to use another instance in some cases. Services for a datasource always run on only one instance at a time, because access to indexed data from several instances can lead to corruption.
If the two instances can communicate with one another, your problem might be that they are using the wrong advertise address. The advertise address is the address that the registry gives to a gRPC client within a service, so that it can reach a specific service on any instance. By default the advertise address is equal to the bind address (so 127.0.0.1), except if the bind address is 0.0.0.0; in that case the advertise address is resolved. You can see all advertised addresses in the “Services” section of the console.
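To confirm that one instance can actually reach a port on the other, a plain TCP connect check is usually enough. The sketch below is only an illustration: the function name can_connect is hypothetical, and the peer IP and port in the usage comment are placeholders that you would replace with the private IP and the gRPC port your own instances use.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example usage (placeholder values, not taken from this thread):
#   can_connect("10.0.0.12", 8001)
# Run it from instance 1 against instance 2, and the other way round.
```

A successful check only shows that the port is reachable at the network level (security groups, routing). It does not show that the registry is advertising that address.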
You could try to set the bind address to 0.0.0.0 so that the advertise address is resolved directly, or set the advertise address to whatever private IP your instance has.
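The rule described above can be sketched in a few lines. This is not the Cells implementation, only an illustration of the described behaviour: the names advertise_address and resolve_private_ip are hypothetical, and the way the address is "resolved" here (asking the OS which local IPv4 address it would use for outbound traffic) is an assumption.

```python
import socket
from typing import Callable, Optional


def resolve_private_ip() -> str:
    """Best-effort guess at this host's outbound IPv4 address (assumed resolver)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only selects a local address.
        s.connect(("10.255.255.255", 1))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        s.close()


def advertise_address(
    bind: str,
    explicit_advertise: Optional[str] = None,
    resolver: Callable[[], str] = resolve_private_ip,
) -> str:
    """Pick the address other instances would be told to use."""
    if explicit_advertise:
        # An explicitly configured advertise address always wins.
        return explicit_advertise
    if bind == "0.0.0.0":
        # Wildcard bind: a concrete, routable address has to be resolved.
        return resolver()
    # Default: advertise whatever the service is bound to.
    return bind
```

With a bind address of 127.0.0.1 and no explicit advertise address, this returns 127.0.0.1, which a second instance cannot use to reach the first. That matches the symptom where instance 2 cannot find the datasource.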