There is no "Disaster Recovery" document per se. Since the Quay Enterprise container itself is stateless, all disaster recovery is punted to the persistence mechanisms, i.e. the DB and storage. Just about the worst thing that could happen is the DB getting out of sync with storage; that's basically the only thing that's not on your list. During normal operation we prevent it as best we can, but if you ran with the wrong config, one that pointed to a different bucket for example, it could happen, and it would be an actual disaster.
The /health/endtoend endpoint on the Quay Enterprise hostname lets you check the status of the services (database, redis, registry_gunicorn, storage) along with a status_code. The status code is the important part. Error 500 most likely indicates storage problems. Error 503 points to network issues such as wrong SSL termination. Error 504 means gunicorn can't get a worker (a connection problem, no free workers available, or a blocked worker), or that a request to the database from a downstream service took too long because of a spike in network latency.
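As a rough illustration of the check described above, here is a minimal Python sketch that queries /health/endtoend and maps the HTTP status code to the likely cause listed in this article. The function names, the timeout value and the hostname quay.example.com are hypothetical, and the assumption that the endpoint returns a JSON body is not confirmed here, so the sketch falls back to None if the body cannot be parsed. The entry for 200 is also an assumption; the article only covers 500, 503 and 504.

```python
import json
import urllib.error
import urllib.request

# Likely causes per status code, taken from the description above.
# The 200 entry is an assumption added for completeness.
LIKELY_CAUSES = {
    200: "request succeeded; checked services should be healthy",
    500: "most likely a storage problem",
    503: "network issue, such as wrong SSL termination",
    504: "gunicorn could not get a worker, or a database request was too slow",
}


def interpret_status(code: int) -> str:
    """Return the likely cause for a health-check status code."""
    return LIKELY_CAUSES.get(code, "not covered by this article; check the logs")


def check_health(hostname: str, timeout: float = 10.0):
    """Query /health/endtoend and return (status_code, parsed_body_or_None)."""
    url = f"https://{hostname}/health/endtoend"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code and body are still available.
        code, body = err.code, err.read()
    try:
        payload = json.loads(body)  # assumes a JSON body; layout not confirmed
    except ValueError:
        payload = None
    return code, payload


# Example usage (hypothetical hostname):
#   code, payload = check_health("quay.example.com")
#   print(code, interpret_status(code))
#   print(payload)
```

If the connection itself fails (DNS, refused connection, TLS handshake), urllib raises URLError instead of returning a status code, which is itself a hint that the problem is on the network side rather than in Quay Enterprise.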