Nov 3rd, 2009

Thermal Failure Protection

When you have a server (and let's hope it is a real one and not a ZhongGuanCun job) you should enable thermal protection, because A/C units do fail, as do fans and other servers. Thermal protection will shut your systems down gracefully once they get too hot, which prevents damage to them and to surrounding devices like UPS batteries.

The following is a screenshot of a log from one client, who has a fairly large rack and a few servers for their thin client deployment. In this case the A/C failed and the servers shut down gracefully.

[Screenshot: thermalfailure]

The log shows the server first noting the rising temperature and alerting CANDIS through our monitoring system. Then, when the failure threshold was met, it issued a shutdown command to the OS and powered off the server. All this happened on a long weekend, and we were alerted to the final shutdown as well.
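For anyone who wants to see the two-stage behaviour spelled out, here is a rough Python sketch of the logic: alert at a warning threshold, shut down at a failure threshold. It is only an illustration, not the server's firmware or our monitoring code. The threshold values, the `decide` and `replay` names, and the sample readings are all made up for the example; real thresholds come from your hardware vendor's management settings.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    ALERT = "alert"        # notify the monitoring system
    SHUTDOWN = "shutdown"  # graceful OS shutdown, then power off


# Hypothetical thresholds in degrees Celsius (assumptions, not vendor values).
WARN_C = 35.0
FAIL_C = 45.0


def decide(temp_c, warn_c=WARN_C, fail_c=FAIL_C):
    """Map a single temperature reading to the action it should trigger."""
    if temp_c >= fail_c:
        return Action.SHUTDOWN
    if temp_c >= warn_c:
        return Action.ALERT
    return Action.OK


def replay(readings, warn_c=WARN_C, fail_c=FAIL_C):
    """Walk a series of readings and record each change into ALERT or SHUTDOWN.

    Stops at the first SHUTDOWN, since the server would be powered off.
    """
    events = []
    last = Action.OK
    for temp_c in readings:
        action = decide(temp_c, warn_c, fail_c)
        if action != last and action != Action.OK:
            events.append((temp_c, action))
        last = action
        if action is Action.SHUTDOWN:
            break
    return events


if __name__ == "__main__":
    # Made-up readings that mimic a room heating up after an A/C failure.
    for temp_c, action in replay([24, 30, 36, 41, 46, 50]):
        print(f"{temp_c} C -> {action.value}")
```

The sketch only records what should happen. In a real deployment the alert and the shutdown are carried out by the server's management hardware and your monitoring system, not by a script like this.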

Worked like clockwork. Then again, I walk into many IDCs in China and see flashing red failure LEDs that seem to be on forever. Do SysAdmins in China not set up hardware monitoring? Do they not care? Why is this? Why would you want a failure to occur, or to interrupt your free time or your clients' business?