Troubleshoot Portworx on Kubernetes
Troubleshoot problems
The following sections provide troubleshooting tips for common problem areas:
Portworx Node is down
ssh into a cluster node that has kubectl installed and configured with your kubeconfig, then check the Kubernetes cluster status using kubectl to ensure the cluster nodes are in the Ready status:
kubectl get node -o wide
If a node is not ready, describe that node to see why and take corrective action:
kubectl describe node <nodename>
If the previous command does not help identify the problem, log in as root and run the journalctl command on the node in question:
journalctl -u kubelet
If the Kubernetes cluster is healthy, check Portworx alerts using pxctl from the node, either through ssh or using kubectl exec. Alerts may help you understand why the Portworx node is down:
pxctl alerts show
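If you prefer kubectl exec over ssh, the following sketch assumes Portworx is deployed in kube-system with the pod label name=portworx and that pxctl is at its default location inside the pod (/opt/pwx/bin/pxctl); adjust for your installation:
PX_POD=$(kubectl get pods -n kube-system -l name=portworx -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system $PX_POD -- /opt/pwx/bin/pxctl alerts show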
You can also run the pxctl status command to check the status on the respective node where Portworx is running:
pxctl status
If you find no useful information in the pxctl status output, check your Portworx pods to confirm they are up and running:
kubectl get pods -n <name-space> -l name=portworx
If necessary, describe the respective Portworx pod to identify the problem:
kubectl describe pods <px-podname> -n <name-space>
If necessary, check the journalctl logs from the node in question to further help identify the problem:
journalctl -lfu portworx*
Check all Portworx pods running in kube-system or another namespace and confirm they are up and running:
kubectl get pods -n <name-space>
Describe the respective pod running in kube-system or another namespace to identify the problem:
kubectl describe pod <podname> -n <name-space>
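If the describe output is not conclusive, the pod logs are often the quickest way to spot the failure; for example:
kubectl logs <podname> -n <name-space>
Add --previous to view the logs of the prior container instance if the pod is crash-looping.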
Portworx logs report "Node is not in quorum", kvdb error: "context deadline exceeded"
ssh into the respective nodes and run pxctl status on each node to check the Portworx cluster status:
pxctl status
- If running internal KVDB, check the KVDB cluster members and confirm their health status using pxctl:
pxctl service kvdb members
- If quorum has been lost, perform the following before contacting technical support:
  - Save px-diags on each affected node (captures all logs):
    pxctl service diags -a
  - Make backups of your config maps for px-bootstrap and px-cloud-drive:
    kubectl get cm -n kube-system | grep px
    kubectl get cm <px-bootstrap> -n kube-system -o yaml > px-bootstrapbkp.yaml
    kubectl get cm <px-cloud-drive> -n kube-system -o yaml > px-cloud-drivebkp.yaml
  - Collect the KVDB endpoints using pxctl:
    pxctl service kvdb endpoints
  - Contact technical support (see below)
- If using external etcd, check your external etcd cluster status.
- The Portworx container will fail to come up if it cannot reach etcd. For etcd installation instructions, refer to this doc.
- The etcd location specified when creating the Portworx cluster needs to be reachable from all nodes.
- For external etcd, run curl <etcd_location>/version from each node to ensure reachability. For example: curl "http://192.168.33.10:2379/version"
- If you deployed etcd as a Kubernetes service, use the ClusterIP instead of the kube-dns name. Portworx nodes cannot resolve kube-dns entries because Portworx containers run on the host network (see the example after this list).
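To find the ClusterIP, you can query the etcd Service directly. This is a sketch that assumes the Service is named etcd and runs in the default namespace; substitute the names used in your deployment:
kubectl get svc etcd -n default -o jsonpath='{.spec.clusterIP}'
curl "http://<cluster-ip>:2379/version"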
Portworx pxctl cluster summary reports Status "Online", StorageStatus "(StorageDown)" "Full or Offline"
- Identify the node and the storage pool in question by running pxctl status (ssh into the respective node):
pxctl status
- From the same node, inspect the pool to identify the disk device that makes up the pool:
pxctl service pool show
- While logged in as root, identify why the disk is failing by running dmesg:
dmesg | grep error
To correct the problem:
- Remove or replace the drive following these instructions: Remove or replace
- If the pool is full, follow these instructions: Expand your storage pool size
Performance related
- Use the Grafana dashboards to identify performance bottlenecks across volumes, pools, nodes, the network, and other components.
- Refer to the following performance tuning document: Tune Performance
- There are many performance tuning enhancements in the latest release of Portworx. Please see: Portworx release notes
PVC Controller pod failed to start
If you are running Portworx on a managed Kubernetes service provider and run into a port conflict with the PVC controller, you can override the default PVC controller ports using the portworx.io/pvc-controller-port and portworx.io/pvc-controller-secure-port annotations on the StorageCluster object:
apiVersion: core.libopenstorage.org/v1
kind: StorageCluster
metadata:
  name: portworx
  namespace: kube-system
  annotations:
    portworx.io/pvc-controller-port: "10261"
    portworx.io/pvc-controller-secure-port: "10262"
...
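If the StorageCluster is already deployed, one way to set these annotations without editing the full spec is kubectl annotate. This sketch assumes the StorageCluster is named portworx and installed in kube-system, as in the example above:
kubectl -n kube-system annotate storagecluster portworx \
  portworx.io/pvc-controller-port="10261" \
  portworx.io/pvc-controller-secure-port="10262" --overwrite
The Portworx Operator should then reconcile the change and restart the PVC controller with the new ports.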
Collect Portworx logs
Run the following command on the suspect or affected nodes running Portworx:
pxctl service diags -a
Note:
Include these logs when contacting Portworx support, along with generated diags located in /var/cores/<node-x-x-diags>-<timestamp>.tar.gz
Generate stack traces
Portworx support will occasionally request stack traces to help you troubleshoot. Enter the following command on the troubled node to create a *.stack file in the /var/cores directory with the latest timestamp:
pxctl service diags --profile
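To confirm the file was generated, list the newest files in /var/cores (a generic filesystem check, not a pxctl command):
ls -lt /var/cores/*.stack | head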
Contact support
View your options for contacting support by visiting the Portworx support page:
Portworx support