Nginx as a reverse proxy in front of Cells: what's the correct form of passing errors back?

This is perhaps an extremely basic question which might have been answered elsewhere, but I couldn’t find a matching answer… so here it goes:

I run Cells 4.0.2 behind nginx, which is used as a reverse proxy for the specific domain (let’s call it files.mydomain.tld). This is not the only thing that nginx does, however; it’s serving some 30+ other sites as well. Acting as a reverse proxy for Cells is just one of the many tasks it does; it’s not even the only instance of reverse proxying. And, last but not least, it’s not being used only for reverse proxying HTML content; some reverse-proxied applications also provide APIs that return text/plain or application/json or whatever content they want. None have issues doing so.

Cells, however, has a slightly different behaviour in this situation. Most JSON-based errors propagate correctly to whatever client is calling… but some don’t. It’s very hard for me to figure out what exactly is going on, but this seems to be more likely to happen when using one of the exposed APIs, namely, the S3 API (which I’ve tested most).

Indeed, after looking at some logs (trying to connect with, say, rclone, or s3cmd, or even Cyberduck…), every now and then the client will complain about ‘malformed JSON response’, ‘invalid error format’, ‘expecting , got ’; the exact message varies considerably depending on what client is doing the call, but the point is that nginx is returning its own HTML-based error in those circumstances, as opposed to the expected JSON-based error (e.g. nginx just sees a 200 message). I could grumble a lot about the way JSON-based APIs deal with errors, but I won’t; too many APIs out there rely on these semantics…

In fact, I suspect that nginx will not know what to do if it gets a 502 Bad Gateway (meaning that Cells, for some reason, hasn’t replied, or replied in a way that nginx didn’t like) except for passing it along, using its own built-in templates (or some user-generated ones). The same might happen with more ‘serious’ errors that aren’t exactly visible at the Cells level; in fact, all errors with 5XX will, in one way or another, signal nginx that something with Cells is wrong, but nginx has no way to know what it is.

It’s quite different if Cells itself figures out something like an authentication error, or a file that doesn’t exist; whatever level in the stack is addressed, Cells will know what to do: send back an error message to the user (which will be JSON-based). The only problem is if Cells itself ‘breaks’ with an unexpected error, such as a panic/fatal error that severs the connection abruptly. Nginx doesn’t know what to do in such scenarios and just outputs an HTML error — which will utterly confuse the client.
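For the errors that Cells does manage to produce, nginx normally relays the upstream response unchanged, because `proxy_intercept_errors` defaults to `off`. A minimal sketch that makes this explicit in the proxied location, assuming nothing in the included configuration files turns interception on (the address placeholder is the one used in the configuration posted further down):

```nginx
location / {
        # With interception off (the nginx default), any response Cells
        # sends with a status of 300 or above is relayed as-is, body included.
        proxy_intercept_errors off;
        proxy_pass https://<internal private IPv4>:8443;
}
```

Error pages that nginx generates itself, such as a 502 when the upstream connection fails, are not governed by this directive.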

Now, I’m pretty sure that dealing with such errors is something happening all the time at the vast global scale of JSON-based API endpoints out there, most of which will be behind reverse proxies, many of those being nginx. It’s quite reasonable to expect that there is a specific configuration for nginx to properly handle those two kinds of errors, e.g. errors that Cells is aware of and is able to send back as a properly-formatted JSON-based error, and ‘gateway errors’ of some sort that mean that the connection has been terminated for some unexpected reason — so Cells never manages to generate an error that nginx might catch. Nginx steps in and emits its own error.
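One commonly used pattern for the second kind of error, sketched here under the assumption that a JSON body is what the client wants, is to let `error_page` route the gateway errors that nginx generates itself to a named location that returns JSON. The location name `@gateway_json` and the body text are hypothetical; neither Cells nor nginx defines them.

```nginx
server {
        # ... existing vhost settings ...

        # Gateway errors produced by nginx itself (upstream unreachable,
        # connection dropped, timeout) are routed to the named location below.
        error_page 502 504 @gateway_json;

        location @gateway_json {
                default_type application/json;
                # Note: in this sketch both 502 and 504 are reported as 502.
                return 502 '{"error":"upstream unavailable"}';
        }
}
```

Errors that Cells formats itself are untouched by this, as long as `proxy_intercept_errors` stays at its default of `off`.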

This should be prevented from happening, since the client obviously is not expecting a ‘traditional’ HTTP error, but rather a JSON-based one, and this will break (or at least utterly confuse) the client. This might be one of the reasons why I can connect via the S3 API with some clients (those that log their ‘surprise’ at a badly-formatted error, but which still persist in doing something useful), while others never managed to connect (because getting an HTTP error, wrapped in HTML as opposed to JSON, breaks them completely). This has baffled me for a long time, since the S3 protocol is rather well-known, and MinIO’s own implementation is so close to the gold standard (set by Amazon!) that it’s a reference for everybody else (in other words, anything based on MinIO should be 99.9% compliant with whatever an S3 client expects to get); I could expect some errors, obviously, but not ‘breaking errors’ that do not respect the JSON-based error protocol of the API, and which clients are obviously not compelled to implement.

In fact, some clients, AFAIK, may actually force the S3 endpoint to output only JSON, by indicating on the request headers that they will only accept the application/json MIME type (or something equivalent). Consequently, if nginx replies with anything that is, say, text/html or text/plain, such clients will either ignore it (and attempt to continue the communication, in the expectation that the ‘glitch’ was temporary), or, well, simply refuse to keep the connection up.

I even tried to trick nginx to do the following: when it gets a connection requesting content in JSON (and not HTML), then it returns a personalised error message — but using the ‘standard’ JSON-error semantics. If the request was for HTML, it replies with the ‘usual’ HTML error page. I even did the same for text/plain and XML — just for the sake of completeness.
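A sketch of how this kind of negotiation can be expressed in nginx: a `map` on the `Accept` header picks a file extension, and the `error_page` URI uses that variable. The variable name `$error_ext`, the `/errors/` prefix and the directory `/var/www/errors` are hypothetical, and the sketch assumes static files such as `502.json`, `502.xml`, `502.txt` and `502.html` exist there.

```nginx
# http context: pick a file extension from the request's Accept header.
# Regular expressions are tried in order, so text/html comes first to keep
# browsers (which also list application/xml) on the HTML page.
map $http_accept $error_ext {
        default                         html;
        "~*text/html"                   html;
        "~*application/json"            json;
        "~*(application|text)/xml"      xml;
        "~*text/plain"                  txt;
}

server {
        # ... existing vhost settings ...

        # The error_page URI may contain variables.
        error_page 502 /errors/502.$error_ext;
        error_page 504 /errors/504.$error_ext;

        location /errors/ {
                internal;        # not reachable by direct client requests
                root /var/www;   # hypothetical; serves /var/www/errors/502.json etc.
        }
}
```

The original status code is kept because `error_page` is used without `=`, and the Content-Type follows from the file extension via `mime.types`.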

This still didn’t work. My guess is that nginx simply fails to get any kind of error from Cells — the only error it’s aware of is that somehow the connection was cut while Cells was attempting to emit a valid formatted error, but for unknown reasons, nginx will fail to recognise even that. That makes sense if Cells simply crashes without returning anything, so nginx will not know what to return; when at last the connection is dropped after the usual timeout, nginx will just emit a generic gateway error.
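To check a guess like this, it may help to log what the upstream returned next to what nginx sent to the client. A sketch, assuming a hypothetical format name `upstream_debug` and a hypothetical log file name; the `$upstream_*` variables are standard nginx ones:

```nginx
# http context
log_format upstream_debug '$remote_addr [$time_local] "$request" '
                          'status=$status upstream_status=$upstream_status '
                          'upstream_addr=$upstream_addr '
                          'request_time=$request_time '
                          'upstream_response_time=$upstream_response_time';

# server context (the Cells vhost); may sit next to the existing access_log
access_log <path/to/vhost>/upstream-debug.log upstream_debug;
```

The matching entries in the vhost’s error_log (for instance ‘upstream prematurely closed connection’ or ‘upstream timed out’) usually indicate whether Cells dropped the connection or simply did not answer in time.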

Last but not least, I have restricted access to the way I’m ‘allowed’ to configure a vhost under nginx. Sure, I can use a specific configuration in nginx just for Cells, but I don’t want to go that route, or my automation tools will break. And that, for now, is not a good idea (the server runs many production websites side-by-side with a lot of beta-testing software, such as Cells).

Any suggestions for dealing with this issue?

For the sake of completeness, I’m adding the current configuration, redacted here and there:

Partial, redacted nginx configuration
# Generic settings for all vhosts
user <web server user>;
worker_processes auto;
worker_rlimit_nofile 16384;
pid /run/nginx.pid;
pcre_jit on;

events {
        worker_connections 1024;
        multi_accept on;
}

http {
        map $remote_addr $ip_anonym1 {
                default 0.0.0;
                "~(?P<ip>(\d+)\.(\d+)\.(\d+))\.\d+" $ip;
                "~(?P<ip>[^:]+:[^:]+):" $ip;
        }

        map $remote_addr $ip_anonym2 {
                default .0;
                "~(?P<ip>(\d+)\.(\d+)\.(\d+))\.\d+" .0;
                "~(?P<ip>[^:]+:[^:]+):" ::;
        }

        map $ip_anonym1$ip_anonym2 $ip_anonymized {
                default 0.0.0.0;
                "~(?P<ip>.*)" $ip;
        }

        log_format anonymized '$ip_anonymized - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent"';

        ##
        # Basic Settings
        ##

        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
#       keepalive_timeout 650;
        types_hash_max_size 2048;
        server_tokens off;
        charset utf-8;

        server_names_hash_bucket_size 256; # must be here for certbot check (gwyneth 20200510)

        # Cloudflare
        # see https://github.com/pothi/wordpress-nginx/blob/master/globals/cloudflare.conf
        # List of valid IP addresses for Cloudflare will be saved /etc/nginx/conf.d/cloudflare-ip-list.conf;
        # and included later in the configuration.

        include /etc/nginx/mime.types;
        default_type application/octet-stream;

        ##
        # Optimizations
        ##
        keepalive_requests 100000;
        client_max_body_size 256M;

        # TLS/SSL settings are configured on /etc/nginx/conf.d/ssl.conf

        ##
        # Logging Settings
        #
        # Note that all individual vhosts will override these!
        # (gwyneth 20220312)
        ##
        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log;
        # Gzip configured on /etc/nginx/conf.d/gzip.conf
        # Brotli configured on /etc/nginx/conf.d/brotli.conf

        ##
        # Many other unique configurations for this server
        ##
        include /etc/nginx/conf.d/*.conf;

        # The options below are for the Cloudflare configuration.
        # This might one day be set on another file, but cloudflare-ip-list.conf
        # is being overwritten every time, so we keep it here (gwyneth 20220312)
        real_ip_header          CF-Connecting-IP;
        real_ip_recursive       on;

        ##
        # Virtual Host Configs
        ##
        include /etc/nginx/sites-enabled/*;
}

# This is for the vhost running Pydio Cells:
server {
        listen <external IPv4>:80;
        listen [<external IPv6>]:80;
        listen <external IPv4>:443 ssl http2;
        listen [<external IPv6>]:443 ssl http2;

        ssl_certificate <path/to/vhost>/ssl/le.crt;
        ssl_certificate_key <path/to/vhost>/ssl/le.key;
        ssl_trusted_certificate <path/to/vhost>/ssl/le.crt;     # can be the same

        server_name <virtual host name>;

        root    <filepath/to/vhost>;
        disable_symlinks if_not_owner from=$document_root;

# probably superfluous
        index index.html index.htm index.xhtml standard_index.html;

        error_log <path/to/vhost>/error.log;
        access_log <path/to/vhost>/access.log combined;

# Needed because *some* Pydio Cells services require access to `.pydio` files...
        location ~ /\. {
                allow all;
        }

        location = /favicon.ico {
                log_not_found off;
                access_log off;
                expires max;
                add_header Cache-Control "public, must-revalidate, proxy-revalidate";
        }

        location = /robots.txt {
                allow all;
                log_not_found off;
                access_log off;
        }

        client_max_body_size 200M;
        proxy_buffering off;
        proxy_request_buffering off;
        proxy_http_version 1.1;
        proxy_ssl_server_name on;

        location / {
                proxy_connect_timeout 300;
                proxy_pass https://<internal private IPv4>:8443;
                proxy_pass_request_headers on;
        }

        location /ws/ {
                proxy_pass https://<internal private IPv4>:8443;
                proxy_set_header Upgrade $http_upgrade;
                proxy_set_header Connection "upgrade";
                proxy_read_timeout 86400;
        }
}

server {
            listen <external IPv4>:33060 ssl http2;
            listen [<external IPv6>]:33060 ssl http2;
            ssl_certificate     <path/to/vhost>/ssl/le.crt;
            ssl_certificate_key <path/to/vhost>/ssl/le.key;
            ssl_protocols       TLSv1.3 TLSv1.2;
            ssl_ciphers         ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:E>
            ssl_prefer_server_ciphers off;
            ssl_dhparam /etc/nginx/cert/dhparam.pem;

            location / {
                proxy_ssl_verify off;
                grpc_pass grpcs://<internal private IPv4>:33060;
            }

            error_log <path/to/vhost>/log/proxy-grpc-error.log;
            access_log <path/to/vhost>/log/proxy-grpc-access.log;
}

A few points worth mentioning…

  • This is far from the complete configuration; there is a lot more, mostly for security purposes, which I’ve omitted.
  • You might notice that I use a templating system (and if you use the same one you’ll recognise the style :) ). As mentioned at the very beginning, there are things I can do, and others that I cannot.
  • I’m running Cloudflare in front of nginx, so I have a double reverse proxy in place :) (trust me, it’s quite worth the effort).
  • Although not strictly necessary, the communication with the internal IPv4 address to which Cells is bound is made via https. That means that I can run some tests even if I turn off Cloudflare and nginx. It’s probably superfluous, though.
  • For the sake of convenience, in the redacted version it looks as if the SSL certs are below the document root, which would be a security flaw. In reality, they’re stored elsewhere. I was just lazy while copying & pasting things :)
  • The above configuration uses, naturally enough, the ‘official’ guidelines for using nginx as a reverse proxy. Note that there are two vhosts configured, one for the default HTTP ports (80/443), and another for gRPC (33060).

There are, however, a few differences. Pydio recommends binding to a private address other than localhost (i.e. 127.0.0.1 or ::1). Looking at some past issues, it seems that some things get broken when using the r and others get broken when using a new private address (‘alias’) bound to the Ethernet interface. In my special scenario, I have stumbled upon a limitation: the datacentre I use explicitly forbids sending out packets to private networks, and they expect that one’s firewall will block all such attempts (this is a recent enforcement). Granted, there are ways to write firewall rules to take that into account, namely, by punching a hole through the firewall to allow access to only one private network address. At some point in the past, I actually had things set up like that, until my provider told me otherwise: I would have to drop any firewall rule that allowed any kind of traffic towards (or from) a private address configured outside my machine, and allow only those configured locally. Oh well, my firewall config was already becoming more and more complex, so I reverted it to just the basics, and assigned Cells to the local loopback device…

Note, however, that the whole issue with these errors is entirely independent of whatever IP address Cells binds to.

Also, if there is any interest, I can post my non-working attempt at getting nginx to emit errors in HTML/XML/JSON, depending on what the client accepts. Unfortunately, my approach doesn’t seem to have any effect: I still get HTML-formatted errors, no matter how nginx is configured, so ultimately I reverted the changes back to what I’ve posted here…