Log
DATE        EVENT

08/03/2009  LaCie drive went down, listing empty directory contents. dmesg revealed:

                scsi 5 : destination target 1, lun 0
                command: Test Unit Ready: 00 00 00 00 00 00
                ieee1394: sbp2: reset requested
                ieee1394: sbp2: Generating sbp2 fetch agent reset

            The cause is unknown. Although there is much discussion online about faulty
            ieee1394 chips producing this error, the proposed fix, i.e. adding
            "options sbp2 serialize_io=1 max_speed=2" to modprobe.conf, has not really helped.

03/11/2009  Disconnected swt118 to reduce the load on one of the power strips, which keeps
            overloading.

01/12/2009  The power supply on swt101 went down; this node hosts the external hard disk.

06/05/2007  The cluster crashed and had to be rebooted; reason unknown.

03/12/2007  swt106 crashed and login was not possible; it had to be restarted.

01/26/2007  "(handle_mpd_output 359)" and "(handle_mpd_output 368)" errors were reported again
            yesterday after running mpdboot. The solution appears to be making sure that no
            remnants of the crashed mpd ring, such as python or mpiexec, are still running.
            If these processes are running, they should be killed before running mpdboot
            again. mpdcleanup alone will not take care of this.

01/17/2007  Trouble logging into some slave nodes was reported. Although ping was successful,
            logging into these nodes was confirmed to be impossible. Consequently, all slave
            nodes were rebooted.

01/05/2007  "(handle_mpd_output 359): failed to ping mpd on swt[...].internal.uci.edu" was
            reported. In an attempt to fix this, the /etc/hosts file was checked and
            corrected; it contained an entry for a nonexistent node. Also, /etc/mpd.conf on
            swt100 was made consistent with the other nodes, though this is unrelated to the
            error since swt100 is not involved in computations.

01/05/2007  Node swt101 crashed and had to be rebooted. Later, nodes swt102 and swt103 were
            rebooted as well.
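
The entry about the recurring "(handle_mpd_output ...)" errors says that leftover python or mpiexec processes should be killed before mpdboot is run again. The following is a minimal sketch of how such leftovers might be listed on a node. It is not part of the original procedure: the helper name find_leftovers, the use of "ps -eo pid,comm", and matching on the start of the command name are all assumptions made for illustration, and it requires Python 3.7 or later. It only prints candidates and kills nothing, since unrelated python processes would match as well.

```python
import os
import subprocess

# Process names the log entry mentions as remnants of a crashed mpd ring.
LEFTOVER_NAMES = ("python", "mpiexec")


def find_leftovers(ps_output, names=LEFTOVER_NAMES):
    """Return (pid, command) pairs from `ps -eo pid,comm` style output
    whose command name starts with one of `names` (hypothetical helper)."""
    leftovers = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header line and anything that is not "<pid> <command>".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, comm = int(parts[0]), parts[1].strip()
        if comm.startswith(tuple(names)):
            leftovers.append((pid, comm))
    return leftovers


if __name__ == "__main__":
    result = subprocess.run(
        ["ps", "-eo", "pid,comm"],
        capture_output=True, text=True, check=True,
    )
    for pid, comm in find_leftovers(result.stdout):
        # This script is itself a python process, so leave it out.
        if pid != os.getpid():
            print(pid, comm)
```

Run on each affected node, the script prints one "pid command" pair per candidate; the operator can then decide which of them really belong to the crashed ring and kill those by hand before running mpdboot again.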