"How Quay handles disaster recovery and/or high availability?"
At all tiers of Quay Enterprise (QE), we recommend running at least two or three QE instances in your data center to ensure high availability, ideally with autoscaling to replace any instances that go down. Quay stores all its state in the database and the backing storage engine(s), so when the app servers sit behind a load balancer, losing an instance only interrupts the requests that were in progress on it.
"How does geo-replication work?"
Geo-replication applies to all storage engines we support, with the exception of local (NFS) storage. Replication between different storage engines is also supported: you could have your primary storage be GCS, with a secondary being S3 (or Ceph, Swift, etc.). We generally recommend having the same storage type (with the same credentials) across multiple data centers, as we can then automatically "fast-path" the copying by calling the storage engines themselves to do the copy work. However, if the storages are not homogeneous in this way, we fall back to streaming the data from one storage system to the other via the app servers.
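The choice between the fast path and the streaming fallback can be pictured with a small Python sketch. This is only an illustration of the rule described above, not Quay's actual code: the `StorageLocation` type, its fields, and the `choose_copy_strategy` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLocation:
    """Hypothetical description of one storage location (not a Quay type)."""
    name: str            # e.g. "us-east"
    engine: str          # e.g. "s3", "gcs", "ceph", "swift", "local"
    credentials_id: str  # stand-in for "uses the same credentials"


def choose_copy_strategy(source: StorageLocation, destination: StorageLocation) -> str:
    """Return which copy path the rule in the text would select."""
    # Local (NFS) storage is the one engine excluded from geo-replication.
    if "local" in (source.engine, destination.engine):
        raise ValueError("local (NFS) storage cannot take part in geo-replication")

    # Same engine type and same credentials: the engine can copy by itself.
    if (source.engine == destination.engine
            and source.credentials_id == destination.credentials_id):
        return "engine-side-copy"

    # Otherwise the data has to be streamed through the app servers.
    return "stream-via-app-server"


s3_east = StorageLocation("us-east", "s3", "prod-key")
s3_west = StorageLocation("us-west", "s3", "prod-key")
gcs_eu = StorageLocation("eu-west", "gcs", "gcs-key")

print(choose_copy_strategy(s3_east, s3_west))  # engine-side-copy
print(choose_copy_strategy(s3_east, gcs_eu))   # stream-via-app-server
```

In practice this is why matching storage types and credentials across data centers is recommended: the slower streaming path is only taken when the two sides differ.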
The database needs to be accessible in all regions. Quay does not support master-master database setups, so we recommend a single master in one (centrally located) region, with all sites pointing to it, and a read-replica in hot standby, ideally configured for automatic failover. Longer term, our roadmap includes wiring the local read-replica(s) into Quay, so that Quay itself can (if configured) fail over to the read-replica for read-only operations if the master goes down.
"What about DR/HA of the data in MySQL/PostgresQL?"
Redis is used as a non-durable cache, so you can run a single instance, with or without auto-replacement; if it goes down, everything except the tutorial and some logging will continue working as-is. When Quay app containers are running in different AWS regions, they need to connect to the same Redis, as we store logs in there centrally. That being said, if you are not using the build support in the registry, then you could (theoretically) run different Redis instances without any issues, but this isn't a supported design.
For the database, we recommend using a service that has automatic failover and backup, such as RDS. If you are looking for a purely on-premises solution, you'll need to set up your own failover and backup services. The database is the one source of truth for all metadata stored in the registry, so it needs to be a single master.
For storage, we generally recommend a storage engine with automatic backups and built-in HA (like S3 or GCS).
"Does “geo-replication” imply we can run a HA cluster of 2+ Quay nodes in different data centers?"
Yes, and that's its primary use case. If you have users in multiple locations, you can enable geo-replication for a storage engine (or engines) in each location, either for all images or on a per-namespace basis. Once it is enabled, you run a QE cluster (our recommendation is a minimum of 2-3 machines) in each data center, pointing to the single global database master, and you specify (by env var or config field) the "preferred" storage location for that region. When a user pushes to the QE nodes in that region, the system automatically places the layer data in the preferred storage engine. Once the data is placed, a background worker copies it to all other required regions. If a user in another region pulls the image, they receive the closest copy available: if the image has already been copied to their local region, they receive the local copy. Otherwise, they receive it from afar, which is slower but ensures that pulls always work.
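The push and pull behavior described above can be modeled in a few lines of Python. This is a simplified sketch for illustration only; `Layer`, `place_on_push`, and `location_for_pull` are hypothetical names, and the rule for choosing among several remote copies (alphabetical order here) is an assumption made to keep the example deterministic, not how Quay ranks locations.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """Hypothetical record of where one layer's data currently lives."""
    digest: str
    locations: set = field(default_factory=set)


def place_on_push(layer: Layer, preferred: str, required: list) -> list:
    """Store the layer in the region's preferred location on push.

    Returns the locations a background worker would still need to copy to.
    """
    layer.locations.add(preferred)
    return [loc for loc in required if loc not in layer.locations]


def location_for_pull(layer: Layer, preferred: str) -> str:
    """Serve the local copy if it exists, otherwise fall back to a remote one."""
    if not layer.locations:
        raise LookupError(f"no stored copy of {layer.digest}")
    if preferred in layer.locations:
        return preferred
    # Assumption: any remote copy will do; pick one deterministically.
    return sorted(layer.locations)[0]


layer = Layer("sha256:abc")
pending = place_on_push(layer, "us-east", ["us-east", "eu-west"])
print(pending)                                # ['eu-west']
print(location_for_pull(layer, "eu-west"))    # us-east (remote, slower)

layer.locations.add("eu-west")                # background copy finished
print(location_for_pull(layer, "eu-west"))    # eu-west (local copy)
```

The point the sketch makes is that a pull never waits on replication: until the background copy lands, the remote copy is served instead.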