In short: Galera doesn't meet the consistency requirements that are standard for an ACID-compliant database such as MySQL (despite claims to the contrary), and this results in spurious and frustrating errors.
Quay relies on the database as the single source of consistent truth for repositories, images, permissions, action logs, upload tracking, etc. Breaking this hard consistency can have disastrous effects: in the past, customers who tried to use Galera saw large-scale, recurring failures that completely disappeared once they moved off Galera to a more traditional master-slave failover model.
As a concrete example: Docker performs uploads in a two-step process, so Quay tracks blobs being uploaded in an internal table to ensure that any Quay node can handle either step. In the first step, the upload row is created; a second HTTP request is then made with the upload's ID to actually send us the binary data. Where Galera was in use, users would see a consistent set of HTTP 400 errors when pushing images, because the write that added the upload row had not yet been fully replicated to all instances when the subsequent HTTP request arrived on another Quay node; that node would look up the row, get back no results from its local Galera instance, and fail. While this is a fairly innocuous issue, it is also highly annoying, and it is far from the most concerning issue we've seen when running against Galera; others include garbage collection inconsistencies and loss of real data.
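
The race described above can be reproduced with a toy model. The following is a minimal sketch, not Quay's code and not Galera's replication protocol: the class and function names (`LaggingCluster`, `handle_upload_data`) are hypothetical, replication is modelled as an explicit `replicate()` call, and the 202 success status is an assumption chosen for illustration.

```python
import uuid


class Node:
    """One database instance holding its own local copy of the uploads table."""

    def __init__(self):
        self.uploads = {}


class LaggingCluster:
    """Toy multi-master cluster (hypothetical): a write is visible on the node
    that received it immediately, and on the other nodes only after
    replicate() has been called."""

    def __init__(self, size):
        self.nodes = [Node() for _ in range(size)]
        self.pending = []  # (upload_id, origin node index) not yet replicated

    def insert_upload(self, node_index):
        """Step 1 of a push: create the upload row on one node."""
        upload_id = str(uuid.uuid4())
        self.nodes[node_index].uploads[upload_id] = {"bytes_received": 0}
        self.pending.append((upload_id, node_index))
        return upload_id

    def replicate(self):
        """Copy every pending row to all nodes that do not have it yet."""
        for upload_id, origin in self.pending:
            row = self.nodes[origin].uploads[upload_id]
            for node in self.nodes:
                node.uploads.setdefault(upload_id, dict(row))
        self.pending.clear()


def handle_upload_data(cluster, node_index, upload_id):
    """Step 2 of a push: the receiving node looks the row up locally."""
    if upload_id not in cluster.nodes[node_index].uploads:
        return 400  # row not visible on this node yet, so the push fails
    return 202  # assumed success status for this sketch


cluster = LaggingCluster(size=2)
upload_id = cluster.insert_upload(node_index=0)

status_before = handle_upload_data(cluster, 1, upload_id)  # hits the other node
cluster.replicate()
status_after = handle_upload_data(cluster, 1, upload_id)
print(status_before, status_after)
```

In this model the second request fails only when it lands on a node other than the one that took the write, and only until replication catches up. With a single writable master, both steps read and write the same instance, so the window does not exist.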
If your goal is to ensure that Quay remains highly available, we recommend running your database (MySQL or Postgres) in a master-slave setup, with automatic replication and failover, as well as automatic backup. We have been running Quay.io on such a setup for nearly 4 years now with no appreciable downtime due to database failure; in the one case where the database did fail, it simply failed over to the secondary and we were back up in under a minute. We are also working to add a feature to Quay Enterprise to allow for active use of read replicas, which would ensure that if the master database instance is in failover, pulls could continue unabated by falling back to the read replicas until the master returns to service.
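
The read-replica fallback idea can be sketched as follows. This is only an illustration of the general pattern under stated assumptions, not the Quay Enterprise feature itself: `FakeConnection`, `DatabaseUnavailable`, and `read_with_fallback` are hypothetical names, and a real deployment would use its database driver's connection and error types instead.

```python
class DatabaseUnavailable(Exception):
    """Raised by a (hypothetical) connection whose instance is down."""


class FakeConnection:
    """Stand-in for a database connection, used only for this sketch."""

    def __init__(self, name, rows, available=True):
        self.name = name
        self.rows = rows
        self.available = available

    def fetch(self, key):
        if not self.available:
            raise DatabaseUnavailable(self.name)
        return self.rows.get(key)


def read_with_fallback(primary, replicas, key):
    """Serve a read from the master if possible, else from the first
    reachable read replica. Writes are deliberately not handled here."""
    for conn in [primary, *replicas]:
        try:
            return conn.name, conn.fetch(key)
        except DatabaseUnavailable:
            continue  # this instance is down, try the next one
    raise DatabaseUnavailable("no database instance reachable")


rows = {"library/ubuntu": "manifest-digest-placeholder"}
master = FakeConnection("master", rows, available=False)  # mid-failover
replica = FakeConnection("replica-1", rows)

served_by, value = read_with_fallback(master, [replica], "library/ubuntu")
print(served_by, value)
```

Only reads (pulls) are routed this way; writes (pushes) still need the master, which is why the fallback covers the failover window rather than replacing failover.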