Restore and migration
NetFoundry Self-Hosted provides scripts for restoring your Ziti network from a Velero backup. You can also use the backup and restore workflow to migrate an installation to a new cluster.
Restoring from backup
The ./velero/velero_restore.sh script steps through the following:
- Checks if AWS credentials are set.
- Installs the Velero plugin into the velero namespace in the cluster if not already installed.
- Displays the list of available backups for selection.
- Restores the resources based on the selection.
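To inspect the available backups before running the script, the Velero CLI can list them directly. This is a standard Velero command, assuming the CLI is installed and pointed at the cluster:
velero backup get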
For Velero to restore the Ziti controller PVC from the backup, it must first delete the existing PVC. The restore script
prompts before doing so. If n is selected, the restore skips the PVC but restores all other resources, because by default
Velero skips restoring any resource that already exists. See the Velero restore reference documentation for more information.
Run the restore script and follow the prompts to select which backup to restore from:
./velero/velero_restore.sh
Restores can also be run manually if you need to use specific Velero flags:
velero restore create --from-backup <BACKUP NAME>
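For example, a restore can be scoped to specific namespaces, or told to update resources that already exist. The --existing-resource-policy flag requires a recent Velero release, and the namespace list here is illustrative:
velero restore create --from-backup <BACKUP NAME> \
  --include-namespaces ziti,cert-manager,support \
  --existing-resource-policy update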
Migrating to a new cluster
Migration uses the same backup and restore workflow to move a NetFoundry Self-Hosted installation from one cluster to another.
Step 1: Back up the existing cluster
- Load the AWS credentials into the environment
- Install Velero:

K3s
velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket <S3_BUCKET_NAME> --features=EnableRestic --default-volumes-to-fs-backup --use-node-agent \
  --backup-location-config region=us-east-1 --snapshot-location-config region=us-east-1 \
  --secret-file <credentials-file>

EKS
velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket <S3_BUCKET_NAME> --features=EnableCSI --use-volume-snapshots=true \
  --backup-location-config region=us-east-1 --snapshot-location-config region=us-east-1 \
  --secret-file <credentials-file>

- Back up all resources in all namespaces, including persistent volumes (a verification sketch follows this list):

velero backup create <backup-name> --include-cluster-resources

- Destroy the existing cluster
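Before destroying the old cluster, it's worth confirming the backup completed successfully; velero backup describe is a standard Velero command:
velero backup describe <backup-name> --details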
Step 2: Restore to the new cluster
- Create a new cluster
- Load the AWS credentials into the environment
- Install Velero on the new cluster (same commands as above)
- Run the restore script and follow the prompts to select the backup to restore (progress can be watched with the Velero CLI, as shown after the notes below):

./velero/velero_restore.sh
- EKS: The DNS addresses for the controller advertise address and the router advertise address must be updated with the new Load Balancer addresses.
- K3s: The new cluster should use the same node configuration and default storage class.
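While a restore is running, its status can be checked with standard Velero commands; <restore-name> is whatever name Velero assigned to the restore:
velero restore get
velero restore describe <restore-name>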
Verifying the restore
Check that all deployments have come back online in the following namespaces:
kubectl get deployments -n ziti
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
ziti-controller   1/1     1            1           78m
ziti-router-1     1/1     1            1           78m

kubectl get deployments -n cert-manager
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
cert-manager              1/1     1            1           78m
cert-manager-cainjector   1/1     1            1           78m
cert-manager-webhook      1/1     1            1           78m
trust-manager             1/1     1            1           78m

kubectl get deployments -n support
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
grafana            1/1     1            1           5m7s
kibana-kb          1/1     1            1           5m7s
logstash           1/1     1            1           5m7s
rabbitmq           1/1     1            1           5m7s
ziti-edge-tunnel   1/1     1            1           5m6s
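Deployments aren't the only workloads to check: the Elasticsearch statefulset in the support namespace (referenced in the known issues below) should also come back online:
kubectl get statefulsets -n support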
Known issues after restore
Different issues can arise depending on which Kubernetes provider is being used.
Common
- For most installations, it's necessary to restart the ziti-edge-tunnel deployment in the support namespace, since the tunneler will likely come back online before the Ziti controller (a restart sketch follows this list).
- If the DNS address changes for the Ziti controller advertise address or the edge router advertise address, it may take a few minutes for client resources to come back online. For any hosting router or identity, a process restart will accelerate recovery.
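Restarting the tunneler is a standard rollout restart; the deployment name and namespace come from the installation described above:
kubectl rollout restart deployment ziti-edge-tunnel -n support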
EKS
- The Load Balancer addresses will likely change after restoring from backup. The DNS addresses for the controller advertise address and the router advertise address will need to be updated. The ziti-router-1 deployment will not come back online until it can successfully reach the controller over its advertise address. This is normal in a restore scenario.
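The new Load Balancer hostnames can be read from the service list; a minimal check, assuming the controller and router are exposed as LoadBalancer services in the ziti namespace:
kubectl get svc -n ziti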
K3s
- The trust-manager deployment in the cert-manager namespace can encounter an issue where it doesn't start back up after a restore. If this error exists in the logs (a log-check sketch follows this list), update the deployment to correct the problem:

Error: container has runAsNonRoot and image has non-numeric user (cnb), cannot verify user is non-root

To fix this error, run:

kubectl edit deployment/trust-manager -n cert-manager

Add the following line under the securityContext block:

securityContext:
  # add this
  runAsUser: 1000

Save the file and restart the deployment:

kubectl rollout restart deployment trust-manager -n cert-manager

- The elasticsearch-es-elastic-nodes statefulset can encounter an issue where it doesn't start back up after a restore, causing Kibana to show "Kibana server is not ready yet." To fix:

kubectl rollout restart statefulset elasticsearch-es-elastic-nodes -n support
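Whether the trust-manager error above is present can be confirmed from the deployment logs before editing anything:
kubectl logs deployment/trust-manager -n cert-manager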
Stalled restore jobs
If the restore appears to have worked but the restore job seems hung and never completes:
kubectl delete restore -n velero <restore name>
# If the delete command hangs, cancel it and restart the Velero deployment:
kubectl rollout restart deployment velero -n velero
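If restores repeatedly stall, the Velero server logs usually show why; this is standard kubectl and not specific to NetFoundry:
kubectl logs deployment/velero -n velero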