Quay relies on the database as the single source of consistent truth for repositories, images, permissions, action logs, upload tracking, etc. Breaking this strict consistency can have disastrous effects. We recommend running the Quay database (MySQL or Postgres) in a master-slave setup with automatic replication and failover, as well as automatic backups.
The MySQL database should be fairly large. Storage requirements vary with the number of users, registries, etc., but 50 GB is a good starting point. Note that most of the database data will likely be user/audit activity logs, which can be purged or scraped as needed. MySQL backups should be automatic and frequent, and MySQL should have automatic failover.
Redis is fairly small unless users are using builds, in which case a few tens of GB should be sufficient.
Why is it you recommend that users use hosted storage (like S3 or similar), as opposed to an on-premises SAN / NAS, for an HA setup?
A few reasons, the most important being that hosted storage (or an equivalent like Ceph RADOS) allows blob data to be downloaded via signed URLs, which results in far better performance than having the registry stream the data. Our georeplication support also requires that all storage engines be accessible by all nodes, which is much easier with independent storage systems than with locally mounted file systems. Finally, most hosted storage services already handle backup and scaling, which reduces complexity.
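To illustrate the signed-URL idea, here is a minimal sketch in Python. It is not the scheme S3, Ceph RADOS, or Quay actually uses; the function names (`sign_url`, `verify_url`), the query parameters (`expires`, `signature`), and the example URL are all hypothetical. The point is only that the registry can hand the client a time-limited URL, and the storage service can validate it on its own without the registry streaming any bytes.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def sign_url(base_url: str, secret: bytes, expires_at: int) -> str:
    """Return base_url with an expiry time and an HMAC signature appended.

    Hypothetical scheme for illustration only; real storage services
    define their own signing formats.
    """
    payload = f"{base_url}\n{expires_at}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires_at, "signature": signature})
    return f"{base_url}?{query}"


def verify_url(signed_url: str, secret: bytes, now: int) -> bool:
    """Check the signature and expiry the way a storage service might."""
    parsed = urlparse(signed_url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    # Recompute the signature over the same payload the signer used.
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    payload = f"{base_url}\n{expires_at}".encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison, then reject anything past its expiry.
    return hmac.compare_digest(signature.encode(), expected.encode()) and now < expires_at
```

In this sketch the registry only computes a short signature per download; the blob itself travels directly from storage to the client.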
Do all nodes in the cluster share the same storage for hosting the registry, as opposed to using separate storage, and replicating the content?
They don't need to share storage, but all storage locations must be accessible from every node for our georeplication feature to function, since replication is asynchronous and runs in the background (to ensure consistency and performance).
For multiple locations you say that Quay storage (geo)replication should be used. Is this better than using a backend solution (like EMC Isilon) which would replicate at the block level?
Up to you. If you do use a backend solution, it must be globally consistent for consistency in the registry to be maintained. Otherwise, a user in one region could push, and a user in another region could immediately try to pull, which would fail because the blocks had not yet been replicated. In contrast, if you use Quay's georeplication and the data has not yet been replicated to the second user's local region, they will pull from the initial storage region; while this is slower, it is fully consistent and does not result in an error.
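The fallback behavior described above can be sketched roughly as follows. This is an illustration of the idea only, not Quay's implementation; the function name `pick_pull_location`, its parameters, and the region names are hypothetical.

```python
from typing import Set


def pick_pull_location(replicated_to: Set[str], local_region: str, origin_region: str) -> str:
    """Choose which storage region should serve a pull.

    replicated_to: regions that already hold a copy of the layer
                   (hypothetical bookkeeping for this sketch).
    local_region:  the region closest to the pulling user.
    origin_region: the region the layer was originally pushed to.
    """
    if local_region in replicated_to:
        # Background replication has caught up: serve from the nearby copy.
        return local_region
    # Not replicated yet: slower, but consistent and never an error.
    return origin_region


# A layer pushed in us-east and pulled from eu-west before replication finishes:
print(pick_pull_location({"us-east"}, "eu-west", "us-east"))
# The same pull once replication has completed:
print(pick_pull_location({"us-east", "eu-west"}, "eu-west", "us-east"))
```

With block-level replication that is not globally consistent, there is no equivalent of the second branch, which is why the immediate pull can fail.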
Does QE georeplication replicate data at the file or block level?
File, which is known as a "layer" in Docker parlance.
I gather that all cluster nodes will use the same MySQL database, but will they all share the same Redis instance too?
Yes, although Redis is only used for build logs and user events, so it requires minimal performance or uptime guarantees. If it goes down, you lose in-flight build logs and the Quay tutorial stops functioning, but everything else continues without issue.
Is it true that you don’t recommend using a cluster for Redis, as it’s not critical, and doesn’t store persistent data?
It's not that we don't recommend it; it's just not required. If you have an easy clustered solution that gives you HA "for free", you might as well use it.
I assume you recommend using Docker containers rather than traditional VMs for the QE cluster nodes?
QE runs under Docker; where that Docker runs and how you scale it is up to you. We, of course, recommend Kubernetes, but that's our own opinion.