Large resource backup config parameters
Clusters with a large number of Kubernetes resources vary widely in their resource and system configurations. So that the solution can fit this wide range of configurations, you can alter the ConfigMap parameters.
Add the parameters specified in the table below to the stork-controller-config ConfigMap in the kube-system namespace, and alter the values as required to suit your configuration:
Parameter | Default Value | Usage |
---|---|---|
large-resource-size-limit | 1 MB | large-resource-size-limit: "819200" (value in bytes; this sets the size limit to 800 KB) |
resource-count-limit | 500 | resource-count-limit: "200" |
restore-volume-backup-count | 25 | restore-volume-backup-count: "22" |
restore-volume-sleep-interval | 20 s | restore-volume-sleep-interval: "1m" or restore-volume-sleep-interval: "53s" |
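As a minimal sketch, the Usage column above could translate into the following entries under the data section of the stork-controller-config ConfigMap. The values are the illustrative ones from the table, not recommendations, and the fragment assumes you merge these keys into the existing ConfigMap (for example with kubectl edit configmap stork-controller-config -n kube-system) rather than replacing any keys that are already there.

```yaml
# Fragment of the stork-controller-config ConfigMap (kube-system namespace).
# Values are the examples from the Usage column; adjust them for your cluster.
data:
  # Size limit in bytes: 800 * 1024 = 819200 (800 KB)
  large-resource-size-limit: "819200"
  # Number of resources queried at a time (default 500)
  resource-count-limit: "200"
  # Number of volumes restored in a single batch (default 25)
  restore-volume-backup-count: "22"
  # Pause between two restore batches (default 20 s)
  restore-volume-sleep-interval: "1m"
```

All values are quoted because ConfigMap data values must be strings.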
The behavior of these parameters is explained below:
Large-resource-size-limit: If etcd's message size in the cluster is configured lower than the default value of 1.5 MB, alter this parameter's value to match the cluster-wide setting. Specify the new value in bytes.
Resource-count-limit: If the number of resources overloads the Kubernetes API server, you may see the following errors in the Stork log, and the backup operation can eventually time out:
time="2023-04-22T04:22:49Z" level=debug msg="Monitoring storage nodes"
time="2023-04-22T04:23:55Z" level=warning msg="gatherResourceInChunks: failed to list resources"
time="2023-04-22T04:23:55Z" level=error msg="Error getting resources: the server was unable to return a response in the time allotted, but may still be processing the request" ApplicationBackupName=<application-backup-name> ApplicationBackupUID=<application-backup-uid> Namespace=<namespace-name> ResourceVersion=<resource-version>
time="2023-04-22T04:23:55Z" level=error msg="Error backing up resources: the server was unable to return a response in the time allotted, but may still be processing the request" ApplicationBackupName=<application-backup-name> ApplicationBackupUID=<application-backup-uid> Namespace=<namespace-name> ResourceVersion=<resource-version>
time="2023-04-22T04:23:55Z" level=error msg="Error backing up volumes: the server was unable to return a response in the time allotted, but may still be processing the request" ApplicationBackupName=<application-backup-name> ApplicationBackupUID=<application-backup-uid> Namespace=<namespace-name> ResourceVersion=<resource-version>
To troubleshoot this scenario, lower the number of resources queried at a time from the default value of 500 to a smaller number, such as 200 or 300.
Restore-volume-backup-count: This configuration parameter defines the number of volumes that are restored in a single batch. If the restore process fails with a device busy error, one probable cause is that too many PVCs were supplied to the restore process in a single batch, so the backend storage system fails with a device busy error. Here is a sample error message displayed in the user interface for this scenario:
Restore failed for volume: cloudsnap Restore id:<restore_id> for <backup-name> did not succeed: [createRestoreDestinationVol, Failed to create restore vol err:Volume (Name: <pvc-name>)] create failed error: Volume is busy on Node-not-assigned, processingNode <node-name>]
Alter the default value of this parameter to a value below 25 as a troubleshooting measure.
Restore-volume-sleep-interval: This parameter sets the time interval between two consecutive batches of volumes being restored. Increase the value above the default of 20 s to lengthen the pause between restore batches.
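For the device busy scenario described above, the two restore parameters can be tuned together. The fragment below is only a sketch: the batch size of 10 and the 1 minute interval are assumed starting points chosen for illustration, not values taken from the product documentation, and the right numbers depend on your backend storage system.

```yaml
# Fragment of the stork-controller-config ConfigMap (kube-system namespace).
# Illustrative values only; tune them against your own storage backend.
data:
  # Restore fewer volumes per batch than the default of 25
  restore-volume-backup-count: "10"
  # Wait longer than the default of 20 s between batches
  restore-volume-sleep-interval: "1m"
```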
Large resource NFS backups and restores
KDMP job pods consume more memory during large resource backup and restore operations to NFS backup locations. As a result, you may see out-of-memory alerts or failures of the NFS job pods that run on each target cluster. In these scenarios, Portworx by Pure Storage recommends increasing the CPU and memory limits by adding the following parameters to the kdmp-config ConfigMap, which resides in the kube-system namespace on the target cluster:
Name | Default value | Usage |
---|---|---|
KDMP_NFSEXECUTOR_LIMIT_CPU | 0.5 | KDMP_NFSEXECUTOR_LIMIT_CPU: "1" |
KDMP_NFSEXECUTOR_LIMIT_MEMORNFS | 1.5Gi | KDMP_NFSEXECUTOR_LIMIT_MEMORNFS: "3Gi" |
Note that these parameters are not present in the kdmp-config ConfigMap by default. When you edit the ConfigMap with the kubectl command, refer to the Usage column in the table above to set the parameters. The values shown there are double the default CPU and memory limits.
For example, consider a cluster with 4 nodes and 50,000 resources consisting of ConfigMap and Secret resource types. The maximum memory limit (KDMP_NFSEXECUTOR_LIMIT_MEMORNFS) required to back up and restore data in such an environment is approximately 3Gi. Note that this is an approximate value; the actual memory required may vary depending on your environment and configuration.
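As a minimal sketch, the NFS executor limits from the table above could be added under the data section of the kdmp-config ConfigMap as follows. The parameter names and values are copied from the table, and the fragment assumes the keys are added alongside any existing entries (for example with kubectl edit configmap kdmp-config -n kube-system on the target cluster).

```yaml
# Fragment of the kdmp-config ConfigMap (kube-system namespace, target cluster).
# Both keys are absent by default and must be added manually.
data:
  # CPU limit for the NFS executor job pods (default 0.5)
  KDMP_NFSEXECUTOR_LIMIT_CPU: "1"
  # Memory limit for the NFS executor job pods (default 1.5Gi)
  KDMP_NFSEXECUTOR_LIMIT_MEMORNFS: "3Gi"
```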