-
Take a backup (see Create a backup of etcd)
-
To create a backup (a snapshot) of the current status of your cluster, first download the new version of etcdctl from the website:
wget https://github.com/coreos/etcd/releases/download/v3.2.14/etcd-v3.2.14-linux-amd64.tar.gz
tar xvf etcd-v3.2.14-linux-amd64.tar.gz
-
Once untarred, the folder will contain the new version of the etcdctl executable. To create a snapshot, run the following command:
ETCDCTL_API=3 ./etcdctl snapshot save snapshot.db
-
This will create a snapshot.db file in the current directory. Do not omit the ETCDCTL_API environment variable: it defines the version of the API that etcdctl will use to connect to the etcd server.
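To keep successive backups from overwriting one another, the snapshot can be saved under a timestamped name and then inspected. The following is only a minimal sketch: the function names snapshot_name and backup_etcd and the etcd-snapshot prefix are illustrative assumptions, not part of the original procedure. It relies on etcdctl snapshot status to print the hash, revision, and size of the saved file.

```shell
#!/usr/bin/env bash

# Build a timestamped snapshot file name (hypothetical naming scheme).
# Usage: snapshot_name [prefix]
snapshot_name() {
  local prefix="${1:-etcd-snapshot}"
  printf '%s-%s.db\n' "$prefix" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Save a snapshot under a timestamped name, show its status, and print
# the file name. Assumes ./etcdctl from the extracted tarball is in the
# current directory, as in the command above.
backup_etcd() {
  local file
  file="$(snapshot_name "$@")"
  ETCDCTL_API=3 ./etcdctl snapshot save "$file" &&
    ETCDCTL_API=3 ./etcdctl snapshot status "$file" -w table &&
    printf '%s\n' "$file"
}
```

Running backup_etcd with no arguments would then produce a file such as etcd-snapshot-<UTC timestamp>.db in the current directory.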
-
Copy the backup to a secure location with scp
-
-
Remove bad etcd instance from cluster
- Ssh into healthy etcd instance
ETCDCTL_API=3 etcdctl --endpoints "https://127.0.0.1:2379" --cacert /etc/ssl/etcd/ca.crt member remove <bad_member_id>
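The <bad_member_id> value comes from the member list. As a hedged sketch, the helper below pulls the ID for a given member name out of etcdctl member list output. The function name member_id_by_name and the member name etcd-1 are hypothetical, and the parsing assumes the default comma-separated output (ID, status, name, peer URLs, client URLs).

```shell
#!/usr/bin/env bash

# Print the ID of the member with the given name.
# Reads `etcdctl member list` output on stdin; assumes the default
# comma-separated format: ID, status, name, peer URLs, client URLs.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Example (run on a healthy member; "etcd-1" is a placeholder name):
#   ETCDCTL_API=3 etcdctl --endpoints "https://127.0.0.1:2379" \
#     --cacert /etc/ssl/etcd/ca.crt member list | member_id_by_name etcd-1
```

If the helper prints nothing, the name did not match any member and the list should be checked by eye before removing anything.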
-
Taint the bad instance
-
In the tectonic-installer directory, following https://coreos.com/tectonic/docs/latest/install/aws/aws-terraform.html (or our internal docs per cluster):
terraform taint --module etcd aws_instance.etcd_node.<bad_etcd_member_instance_number>
-
-
Plan to verify reconstruction of instance
terraform plan platforms/aws
-
Apply after verification
terraform apply platforms/aws
-
Ssh into healthy etcd instance
-
Add the new node to the etcd cluster
ETCDCTL_API=3 etcdctl --endpoints "https://127.0.0.1:2379" --cacert /etc/ssl/etcd/ca.crt member add <name_of_new_etcd_member> --peer-urls='https://<fqdn_of_new_etcd_member_instance>:2380'
Save the output of this command, as you will need to add it to the etcd-member.service drop-in.
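The output of member add normally ends with a few ETCD_* variable assignments, such as ETCD_INITIAL_CLUSTER. A small helper like the one below can extract a single value from the saved output. This is a sketch under assumptions: the function name member_add_value and the file name member-add.txt are hypothetical, and the parsing assumes lines of the form VARIABLE="value".

```shell
#!/usr/bin/env bash

# Print the value of one ETCD_* variable from saved `member add` output.
# Reads the saved output on stdin and expects lines like NAME="value".
# Usage: member_add_value VARIABLE < member-add.txt
member_add_value() {
  sed -n "s/^$1=\"\(.*\)\"\$/\1/p"
}

# Example (member-add.txt is a placeholder for wherever the output was saved):
#   member_add_value ETCD_INITIAL_CLUSTER < member-add.txt
```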
-
-
Ssh into new instance
-
Ensure the etcd-member service is stopped
sudo systemctl stop etcd-member.service
-
Edit /usr/lib/systemd/system/etcd-member.service:
sudo vim /usr/lib/systemd/system/etcd-member.service
-
Add the following flag to the end of the ExecStart=/usr/lib/coreos/etcd-wrapper command:
--initial-cluster-state="existing"
-
Note: be sure to add a \ to the end of the previous line. The last line of the command does not need one. Example:
[Service]
Environment="ETCD_IMAGE=quay.io/coreos/etcd:v3.1.8"
Environment="RKT_RUN_ARGS=--volume etcd-ssl,kind=host,source=/etc/ssl/etcd \
  --mount volume=etcd-ssl,target=/etc/ssl/etcd"
ExecStart=
ExecStart=/usr/lib/coreos/etcd-wrapper \
  --name=etcd \
  --advertise-client-urls=https://server.example:2379 \
  --cert-file=/etc/ssl/etcd/server.crt \
  --key-file=/etc/ssl/etcd/server.key \
  --peer-cert-file=/etc/ssl/etcd/peer.crt \
  --peer-key-file=/etc/ssl/etcd/peer.key \
  --peer-trusted-ca-file=/etc/ssl/etcd/ca.crt \
  --peer-client-cert-auth=true \
  --initial-advertise-peer-urls=https://server.example:2380 \
  --listen-client-urls=https://0.0.0.0:2379 \
  --listen-peer-urls=https://0.0.0.0:2380 \
  --initial-cluster-state="existing"
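A missing backslash is easy to overlook when appending a flag. As a rough sanity check, the sketch below flags any line that starts with -- whose preceding non-empty line does not end in a backslash. The function name check_continuations is hypothetical, and the check is a simple heuristic, not a full systemd unit parser.

```shell
#!/usr/bin/env bash

# Report flag lines whose previous non-empty line lacks a trailing "\".
# Exits non-zero if any such line is found.
# Usage: check_continuations /path/to/unit-file
check_continuations() {
  awk '
    # A line beginning with "--" continues a command, so the line before
    # it must end in a backslash (optionally followed by blanks).
    /^[ \t]*--/ && prev !~ /\\[ \t]*$/ {
      printf "line %d: previous line lacks a trailing backslash\n", NR
      bad = 1
    }
    NF { prev = $0 }
    END { exit bad + 0 }
  ' "$1"
}

# Example:
#   check_continuations /usr/lib/systemd/system/etcd-member.service
```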
-
-
Reload the systemd daemon
sudo systemctl daemon-reload
-
Purge /var/lib/etcd on the new node
sudo rm -rf /var/lib/etcd
-
Start the etcd-member service
sudo systemctl start etcd-member.service
-
Watch the logs to verify that the service comes up healthy
journalctl -u etcd-member.service
-
-
Verify health on another healthy instance
-
Ssh into healthy etcd instance
-
Verify cluster health (new node healthy)
ETCDCTL_API=3 etcdctl --endpoints "https://127.0.0.1:2379" --cacert /etc/ssl/etcd/ca.crt endpoint health
-
Verify the status of all members
ETCDCTL_API=3 ./etcdctl --endpoints "<ALL_etcd_endpoints_comma_separated>" --cacert /etc/ssl/etcd/ca.crt endpoint status -w table
-
Verify health of all members
ETCDCTL_API=3 ./etcdctl --endpoints "<ALL_etcd_endpoints_comma_separated>" --cacert /etc/ssl/etcd/ca.crt endpoint health -w table
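A freshly added member can take a little while to catch up, so a single health check may fail at first. The following sketch retries a check command until it succeeds. The function name wait_healthy and the retry budget in the example are assumptions, and it relies on etcdctl endpoint health returning a non-zero exit status while an endpoint is unhealthy.

```shell
#!/usr/bin/env bash

# Retry a command until it succeeds or the attempts run out.
# The command is always tried at least once.
# Usage: wait_healthy ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_healthy() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (hypothetical budget: 30 attempts, 5 seconds apart):
#   wait_healthy 30 5 env ETCDCTL_API=3 etcdctl \
#     --endpoints "https://127.0.0.1:2379" \
#     --cacert /etc/ssl/etcd/ca.crt endpoint health
```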
-
-
Comments
Hi Kyle,
Thanks for the guide. I managed to get it working, though with some changes.
Like this:
We probably don't need the trailing backslash if it is the last flag of the command.
I couldn't find the wrapper execution in /usr/share, so I changed it here:
I could call the member list only by doing this:
And re-add the member like this:
Where n is the node that you want to recover