We expected our wildcard domain certificate (*.hourfleet.com) to expire later this month, on 19th August, and had scheduled the renewal for this week. We had misread the date: the certificate actually expired on 9th August. That date lapsed, causing last night's outage.
Due to weekend family obligations, we were unable to resolve the issue as quickly as we would have liked once we were alerted to the outage at midnight (NZT).
Creating and deploying the renewed certificate was fairly straightforward. With the internet latency inherent in the process, it took us about 60 minutes to create the certificate and deploy it to all services.
However, the certificate refused to work correctly between the back end web services behind Hourfleet’s API, even though it worked just fine on our other sites, such as https://www.hourfleet.com and https://registry.hourfleet.com.
It turned out (after extensive investigation) that the new SSL certificate we obtained from SSL.COM is now certified by a chain of new certification authorities (CAs), and the certificates for those authorities were not installed by default on our back end web servers. As a result, our web services could not talk to each other securely and their SSL connections failed, which prevented us from restoring the service.
We learned from Azure documentation (and a plethora of related technical articles) that, as well as specifying the new domain certificate in our configuration, we also needed to include the new root CA certificates. Otherwise, consumers of our services cannot connect securely until those certificates have been pre-loaded onto each back end server. Now that they are in our configuration, they are deployed automatically to the back end servers whenever we release a new version of Hourfleet.
Evidently, we had not encountered this issue before, and had been getting away without this extra configuration for the 5+ years we have been deploying Hourfleet, because Azure servers have the previous chain of root certificates that SSL.COM was using installed by default. That is why this practice had eluded us for so long.
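For readers who want to check for the same kind of failure, here is a minimal Python sketch of a handshake test that can be run from a machine that consumes a service. It is not the tooling we used, and our services do not run on Python; the function names, the hint wording and the host name in the usage line are hypothetical. The numeric codes are OpenSSL's standard verification error codes, and code 20 ("unable to get local issuer certificate") is what an OpenSSL-based client would typically report when a CA certificate is missing from the local trust store.

```python
import socket
import ssl
from typing import Optional

# A few OpenSSL verification error codes that matter for certificate outages.
# The hint text is our own wording, not OpenSSL's.
VERIFY_HINTS = {
    10: "certificate has expired: renew the domain certificate",
    20: "issuer is not in the local trust store: install the CA certificates on this machine",
    21: "unable to verify the first certificate: the chain sent by the server may be incomplete",
}


def explain_verify_code(code: int) -> str:
    """Turn an OpenSSL verification error code into a short hint."""
    return VERIFY_HINTS.get(code, f"unrecognised verification error (code {code})")


def check_tls(host: str, port: int = 443,
              extra_ca_file: Optional[str] = None,
              timeout: float = 10.0) -> str:
    """Attempt a verified TLS handshake and describe the outcome.

    extra_ca_file is an optional PEM bundle of CA certificates to trust in
    addition to the machine's default trust store.
    """
    context = ssl.create_default_context()
    if extra_ca_file is not None:
        context.load_verify_locations(cafile=extra_ca_file)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok: certificate chain verified"
    except ssl.SSLCertVerificationError as exc:
        return "failed: " + explain_verify_code(exc.verify_code)
    except OSError as exc:
        # Covers refused connections, timeouts and other TLS errors.
        return f"failed: could not complete the connection ({exc})"
```

Calling `print(check_tls("api.example.com"))` on each consuming server, first without and then with `extra_ca_file` pointing at the new CA bundle, would show whether the trust store is the problem.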
Our configuration has now been updated, and our certificate renewal procedures have been revised to avoid this pitfall in the future.
We have also set up a series of reminders for the run-up to the current certificate's expiry, so that we don't suffer the same fate and cause another outage in 12 months' time, when the current certificate is due to expire.
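One way to avoid misreading an expiry date is to read it from the certificate the server actually presents, rather than from a calendar entry. The Python sketch below illustrates that idea under stated assumptions: it is not our reminder system, the 30-day threshold and all function names are hypothetical, and the date in the usage note is made up for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone

# Assumed threshold: start warning this many days before expiry.
WARN_WITHIN_DAYS = 30


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the 'notAfter' field of the certificate served by host.

    The handshake is verified, so this raises an error if the certificate
    has already expired or is not trusted by this machine.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between now (timezone-aware) and the certificate's expiry."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def needs_reminder(days_left: int) -> bool:
    """True once the certificate is inside the warning window."""
    return days_left <= WARN_WITHIN_DAYS
```

For example, `days_until_expiry("Aug  9 00:00:00 2030 GMT", datetime(2030, 7, 30, tzinfo=timezone.utc))` returns 10, which is inside the warning window. A scheduled job could pass the result of `fetch_not_after("www.hourfleet.com")` to the same function each day and raise a reminder when `needs_reminder` is true.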