We want to monitor liberty to make sure everything is working properly, and to be alerted
when it is not.
Internal monitoring
Both the hardware and the software have the capability to check and report on various conditions. We should make use of this, and send mail if things are out-of-range. Things to monitor this way might include:
- Load average
- Free disk space
- Temperature and fan speed
- Hard disk health (SMART)
- RAID status
- Unusual firewall/service activity
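A minimal sketch of two of the checks above (load average and free disk space), assuming Python is available on liberty. The thresholds, the function names, and the choice of `/` as the path to check are illustrative assumptions, not agreed settings. The script only prints warnings; if it is run from cron, cron normally mails any output to the job's owner, which would cover the "send mail" part.

```python
import os
import shutil

# Hypothetical thresholds; tune these for the actual machine.
MAX_LOAD_PER_CPU = 2.0     # warn above this 1-minute load per CPU
MIN_FREE_FRACTION = 0.10   # warn when less than 10% of a filesystem is free


def check_load(load1, cpus, max_per_cpu=MAX_LOAD_PER_CPU):
    """Return a warning string if the 1-minute load is too high, else None."""
    limit = max_per_cpu * cpus
    if load1 > limit:
        return f"load average {load1:.2f} exceeds {limit:.2f}"
    return None


def check_disk(path, total, free, min_free=MIN_FREE_FRACTION):
    """Return a warning string if the free fraction is too low, else None."""
    if total > 0 and free / total < min_free:
        return f"{path}: only {free / total:.0%} free"
    return None


def collect_warnings(paths=("/",)):
    """Run both checks against the local machine and return all warnings."""
    warnings = []
    load1, _, _ = os.getloadavg()  # Unix only
    warning = check_load(load1, os.cpu_count() or 1)
    if warning:
        warnings.append(warning)
    for path in paths:
        usage = shutil.disk_usage(path)
        warning = check_disk(path, usage.total, usage.free)
        if warning:
            warnings.append(warning)
    return warnings


if __name__ == "__main__":
    # Print only when something is wrong, so a cron job stays silent otherwise.
    for line in collect_warnings():
        print(line)
```

Temperature, SMART, and RAID status would need hardware-specific tools and are left out of this sketch.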
External monitoring
We also want some monitoring done externally -- that is, running on computers
other than liberty. For one, to really be sure everything is working, you need to check things from off the system. More importantly, a really serious problem (kernel crash, power loss, network down, etc.) will prevent liberty from notifying us in the first place.
Things we might want to monitor this way:
- ping response
- TCP response (simple "Can I connect?" probes)
- SMTP banner (connect, make sure you get the proper identification)
- HTTP page request (make sure we can fetch some known URL)
- SMTP mail flow (will mail forwarded through the system make it to the other end?)
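The simpler probes in the list above (TCP connect, SMTP banner, HTTP fetch) could be sketched roughly as follows, to be run from a machine other than liberty. This is an illustration only: the function names, timeouts, and the port list (SSH, SMTP, HTTP) are assumptions, and it does not attempt ping or the end-to-end mail flow test.

```python
import http.client
import socket
import urllib.request


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def banner_ok(banner, expected_name):
    """Check an SMTP greeting: a 220 reply that names the expected host."""
    return banner.startswith("220") and expected_name in banner


def smtp_banner(host, port=25, timeout=5.0):
    """Return the first line of the SMTP greeting, or None on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = sock.recv(1024)
    except OSError:
        return None
    if not data:
        return None
    return data.decode("ascii", errors="replace").splitlines()[0]


def http_ok(url, timeout=10.0):
    """Return True if fetching url yields an HTTP 200 response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        return False


def run_checks(host, url, smtp_name):
    """Run all probes and return a list of failure descriptions."""
    failures = []
    for port in (22, 25, 80):  # SSH, SMTP, HTTP
        if not tcp_probe(host, port):
            failures.append(f"{host}:{port} refused or timed out")
    banner = smtp_banner(host)
    if banner is None or not banner_ok(banner, smtp_name):
        failures.append(f"unexpected SMTP banner: {banner!r}")
    if not http_ok(url):
        failures.append(f"could not fetch {url}")
    return failures
```

For example, `run_checks("liberty.example.org", "http://www.gnhlug.org/", "liberty.example.org")` would return an empty list when everything responds; `liberty.example.org` here is a placeholder for liberty's real host name.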
Monitoring hosts
Anyone doing any external monitoring, please record your IP address and host name, and what you're monitoring:
- 192.0.2.69 - somebox.example.com
  - ping
  - TCP probes of HTTP, SSH, SMTP
  - requesting www.gnhlug.org home page
- Bill's Intermapper on adelphia
  - ping
  - TCP probes of HTTP, SSH, SMTP, HTTPS, DNS
- Cole's Nagios from 64.34.179.90 and 64.34.182.198 (approx. every 15 minutes)
  - ping
  - TCP probes of HTTP
  - (more to come)
-- ColeTuininga - 01 May 2006
Comments
Panic? ;-) Seriously, I would expect the person to troubleshoot it if they can, and notify someone who can if they can't. Use the -sysadmin list to keep everyone informed. The one issue I see is problems requiring either physical access or MV assistance; other than calling you, Bruce, our options are limited there.
-- BenScott - 21 May 2006
What should happen when one of us discovers something wrong?
-- BruceDawson - 02 May 2006