Erlang cluster on Kubernetes: Fatal - Certificate Expired
The nodes in the cluster stopped talking to each other at some point. I only noticed this afternoon after investigating some other problem.
Looking in the logs, I see the following:
% kubectl --namespace erlclu logs erlclu-54b96fdb7c-tdtcl
2023-01-07T14:50:58.126136+00:00 info: Supervisor: {<0.21079.3>,tls_dyn_connection_sup}. Started: id=sender,pid=<0.21080.3>.
2023-01-07T14:50:58.126674+00:00 info: Supervisor: {<0.21079.3>,tls_dyn_connection_sup}. Started: id=receiver,pid=<0.21081.3>.
2023-01-07T14:50:58.228820+00:00 notice: TLS server: In state certify received CLIENT ALERT: Fatal - Certificate Expired
That’s a bit of a surprise; the pods were restarted this morning during an Ubuntu upgrade, so they should have been provisioned with brand-new certificates.
So, I guess the question needs answering: which certificate has expired, and where did I screw that up?
Things to investigate
I’ll draw up an initial list of things to investigate:
- Do the nodes and pods have the correct time?
- What’s the expiry on the root CA?
- What are the expiries on the mTLS certificates?
CA certificate
The easiest thing to check is the CA certificate, so let’s do that:
% kubectl --namespace erlclu get secret erlclu-ca-key-pair -o json | \
jq -r '.data."tls.crt"' | \
base64 -d > erlclu-ca.crt
% openssl x509 -in erlclu-ca.crt -text -noout
Certificate:
...
Validity
Not Before: Dec 2 08:25:09 2022 GMT
Not After : Jan 1 08:25:09 2023 GMT
...
Well, there’s your problem. Apparently the certificate was issued with a 30-day expiry and expired 6 days ago. Notably, 30 days is what `openssl req -x509` gives you when `-days` isn’t specified.
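For future checks, `openssl x509 -checkend` gives a scriptable yes/no answer without eyeballing the full `-text` dump. A minimal sketch, run against a freshly generated throwaway CA so it’s self-contained (the `demo-ca` names are placeholders, not the real cluster files):

```shell
# Generate a throwaway CA the same way as the real one; with no -days
# flag, `openssl req -x509` defaults to 30 days of validity.
openssl ecparam -name prime256v1 -genkey -noout -out demo-ca.key
openssl req -new -x509 -key demo-ca.key -sha256 \
    -subj "/CN=demo CA" -out demo-ca.crt

# Print just the validity window...
openssl x509 -in demo-ca.crt -noout -startdate -enddate

# ...and get a pass/fail exit code: -checkend N exits non-zero if the
# certificate expires within the next N seconds (0 = already expired).
openssl x509 -in demo-ca.crt -noout -checkend 0 && echo "still valid"
openssl x509 -in demo-ca.crt -noout -checkend 2592000 \
    || echo "expires within 30 days"
```

The exit code makes this easy to drop into a cron job or liveness check later.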
Node certificate
How to get the validity of the node certificates?
% kubectl --namespace erlclu exec -it deployments/erlclu -- /bin/sh
Defaulted container "erlclu" out of: erlclu, erlclu-init (init)
/ $ cd /certs
/certs $ openssl x509 -in tls-dist.crt -text -noout
Certificate:
...
Validity
Not Before: Jan 7 09:44:27 2023 GMT
Not After : Apr 7 09:44:27 2023 GMT
...
Three months, apparently. That seems to be the default for cert-manager if the request doesn’t specify `duration` or `renewBefore`.
I can live with that.
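If three months ever turns out to be wrong for this cluster, the certificate request can say so explicitly. A sketch of the relevant cert-manager `Certificate` fields (the resource and issuer names here are assumptions, not the actual deployment’s; only `duration` and `renewBefore` are the point):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: erlclu-tls-dist       # assumed name
  namespace: erlclu
spec:
  secretName: erlclu-tls-dist # assumed secret name
  duration: 2160h             # 90 days (also the cert-manager default)
  renewBefore: 360h           # start renewal 15 days before expiry
  issuerRef:
    name: erlclu-issuer       # assumed issuer name
    kind: Issuer
  commonName: erlclu
```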
Do the pods have the correct time?
I suspect that various K8s weirdness would have occurred if not, but here’s how to check anyway:
% kubectl --namespace erlclu \
get pods -l app=erlclu \
-o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
xargs -I {} \
kubectl --namespace erlclu exec {} -c erlclu -- date
Sat Jan 7 17:49:50 UTC 2023
Sat Jan 7 17:49:51 UTC 2023
Sat Jan 7 17:49:51 UTC 2023
Sat Jan 7 17:49:51 UTC 2023
Sat Jan 7 17:49:51 UTC 2023
Given that – at the time of writing – it’s Sat Jan 7 17:49:51 UTC 2023, that seems OK.
Immediate Actions
The obvious immediate fix is to reissue the CA certificate and restart the pods. But not until we’ve completed the investigation; this isn’t a business-critical system.
Reissue the CA certificate
At this time, I’m not going to issue a longer-lived CA certificate, because I want it to break. This will remind me to find some time to look at “Further Actions” below.
# Create a new keypair and certificate
# (no -days flag, so it defaults to 30 days again; that's deliberate)
openssl ecparam -name prime256v1 -genkey -noout -out erlclu-ca.key
openssl req -new -x509 -key erlclu-ca.key -sha256 \
-subj "/C=GB/L=London/O=differentpla.net/CN=erlclu CA" -out erlclu-ca.crt
# Recreate the K8s secret
kubectl --namespace erlclu delete secret erlclu-ca-key-pair
kubectl --namespace erlclu create secret tls erlclu-ca-key-pair \
--cert=erlclu-ca.crt \
--key=erlclu-ca.key
# Nuke the pods and redeploy everything
kubectl --namespace erlclu delete deployment erlclu
kubectl --namespace erlclu apply -f k8s/deployment.yaml
Further Actions
- Did clustering break 6 days ago and I just didn’t notice? Or did it keep going until I restarted the pods this morning?
  - Investigate metrics for cluster membership. This is done.
- Investigate and characterise what Erlang distribution does when the CA expires.
  - Nodes already in the cluster stay connected. New nodes can’t connect to the cluster.
- Ditto, but for node certificate expiration.
- Can we alert from the init container if the CA certificate has expired? Or just fail to start the pod, and then alert on that?
- Can we fail pod startup if our certificate (or the CA) has expired?
- Does that suggest a custom distribution module?
- What are our options for CA expiry? How do we roll the CA certificate without downtime?
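On the “fail pod startup” idea: a minimal sketch of a guard the init container could run. The function name and the `/certs/ca.crt` path are assumptions, not the actual deployment layout:

```shell
# Exits non-zero if the given certificate is already past notAfter;
# -checkend 0 asks "does this expire within 0 seconds?".
check_cert_not_expired() {
    openssl x509 -in "$1" -noout -checkend 0 > /dev/null
}

# In the init container this might look like (path is an assumption):
#   check_cert_not_expired /certs/ca.crt || {
#       echo "CA certificate expired; refusing to start" >&2
#       exit 1
#   }
```

Failing here would leave the pod stuck in `Init:CrashLoopBackOff`, which is straightforward to alert on.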