This chapter gives you a number of hints that you should bear in mind when you design your first PKI. Please don't skip this section if you are an experienced PKI administrator. We try to list the big traps here, so if you know of another major problem, please add it.
This section lists hardware issues that caused problems for some PKIs during production use. It does not discuss performance issues.
One of the biggest problems of PKI systems is time. There are two different kinds of computers: online and offline. The usual administrator's logic is that a network-connected computer can use a time server. The question is whether you can trust this time server. A time server raises two problems. First, does the timestamp really come from the time server? Second, is the time source of the time server itself trustworthy? The connection to a time server can be secured with tunnel technologies like IPsec, but the real problem is the time source. Most time servers ultimately rely on a radio clock, which receives an unsigned time signal from a radio station. This signal is very weak and can therefore be faked easily. So network time sources are really insecure.
Now that the online computers have lost their apparent advantage, we can discuss offline technologies. Radio clocks are problematic for the reason given above, so we do not have to discuss them again. In addition, radio signals often cannot be received inside buildings made of good reinforced concrete, because such buildings act as really nice Faraday cages. So what are the alternatives? First, you need a trustworthy time source. Simply take a digital watch and compare its time with several other clocks: clocks on the Internet, the teletext of your TV, a radio clock, the sun ;-), GPS and any other source you can find. Second, you transfer the time from your watch to the computer. Last but not least, you have to check the drift and the clock itself on the computer. The drift is a small and easy-to-handle problem. The clock itself is a much bigger problem. If your computer is always connected to a power outlet, then the clock should only drift. Please remember this if you put your CA on a notebook and the notebook into a safe. Several newer notebooks have really bad CMOS batteries, which results in a wrong time after every reboot. As you can see, time is not trivial.
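The comparison of several time sources and the drift check described above can be sketched in a few lines of Python. This is only a minimal illustration, not part of OpenCA: the function names, the two-second tolerance and the sample readings are all assumptions made for the example. The idea is to take the median of the differences you noted by hand, so that a single wrong or faked source cannot shift the result, and to express the drift in parts per million.

```python
from statistics import median


def consensus_offset(offsets_s, tolerance_s=2.0):
    """Return the median offset and the names of sources that disagree with it.

    offsets_s maps a source name to (source time - watch time) in seconds,
    as noted by hand. tolerance_s is an assumed threshold for this sketch.
    """
    mid = median(offsets_s.values())
    outliers = sorted(
        name for name, off in offsets_s.items() if abs(off - mid) > tolerance_s
    )
    return mid, outliers


def drift_ppm(offset_start_s, offset_end_s, elapsed_s):
    """Clock drift in parts per million between two measurements.

    Both offsets are (computer time - reference time) in seconds, so a
    positive result means the computer clock runs fast.
    """
    return (offset_end_s - offset_start_s) / elapsed_s * 1e6


# Hypothetical readings; the source names and values are made up.
readings = {"gps": 0.4, "teletext": 1.1, "radio_clock": 0.6, "web_clock": 47.0}
offset, suspicious = consensus_offset(readings)
print("median offset:", round(offset, 2), "s; suspicious sources:", suspicious)

# Hypothetical drift check: the offset grew by 2.592 s within 30 days.
print("drift:", round(drift_ppm(0.0, 2.592, 30 * 24 * 3600), 3), "ppm")
```

A source that is reported as suspicious is not automatically the faked one. It only tells you which clock you should look at before you set the time on the CA machine.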
The most common hardware crashes are cooling and disk failures. You should of course have a backup of all important data, especially of ALL issued certificates. Never lose a certificate, or you must revoke the complete CA. Backups are a nice thing, but recovering from a backup costs some time. This has two consequences. First, you must have a detailed plan, including a time schedule, for how to recover from a backup. Second, you should be able to tolerate disk crashes. RAIDs are sometimes expensive, but they help a lot (ask your SAN admins :) ).
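Because a single lost certificate is so expensive, it can help to check regularly that the backup really contains everything. The following Python sketch shows one possible way to do that and is not an OpenCA feature: it assumes that the issued certificates are stored as files below one directory and that the backup is a plain copy of that directory. The function names and the directory layout are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file path relative to root to the SHA-256 of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def backup_problems(live_dir, backup_dir):
    """Return (missing, changed) file names, both as sorted lists.

    missing: files that exist in live_dir but not in backup_dir.
    changed: files that exist in both but have different contents.
    """
    live = file_digests(live_dir)
    backup = file_digests(backup_dir)
    missing = sorted(set(live) - set(backup))
    changed = sorted(
        name for name in live.keys() & backup.keys() if live[name] != backup[name]
    )
    return missing, changed
```

If both lists are empty, every file of the live directory is present in the backup with identical contents. The check says nothing about how long a restore takes, so it does not replace the recovery plan.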
Usually you will notice it if your laptop crashes. A crashing offline computer can also be detected by visual monitoring. A crashing online component of a PKI is a problem because important information is no longer available, for example new certificates and CRLs. Your services are offline too, and this includes SCEP. If a public interface of the security infrastructure is down, you will get problems with your users in the future, because trust in a running infrastructure is lost very quickly. Please note that a simple ping is not enough. You cannot detect a crashed web server or OpenCA Perl server with a ping. Software bugs can also cause 100 percent load. I know this problem really well from our web mail programs.
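A check that goes further than a ping can be sketched as follows. The Python function below only uses the standard library and is an illustration, not something that OpenCA ships: it fetches a page over HTTP with a timeout and, optionally, looks for a marker text in the answer. The function name, the timeout and the marker text are assumptions for this example, and you have to choose a URL that really exercises your own installation.

```python
import http.client
import urllib.error
import urllib.request


def service_is_healthy(url, expected_text=None, timeout_s=5.0):
    """Return True if url answers with HTTP 200 within timeout_s seconds.

    If expected_text is given, the response body must also contain it.
    A host that still answers pings but whose web server has crashed or
    hangs under full load fails this check.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            if response.status != 200:
                return False
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, broken reply, bad URL.
        return False
    return expected_text is None or expected_text in body
```

You could call it, for example, as service_is_healthy("https://pki.example.org/pub/", expected_text="OpenCA"), where both the URL and the marker text are hypothetical. The timeout is the important part, because a server that is stuck at 100 percent load often still accepts the connection but never sends an answer.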