NodeManager bug

There were some hundred nodes with hung NMs. Their /var/log/nm<.gz.*> logs looked like the following:

Sat Sep 15 10:01:50 2007: bwmon:  Found 271 running HTBs
Sat Sep 15 10:01:50 2007: bwmon: Found 1 new slices
Sat Sep 15 10:01:50 2007: bwmon: Found 0 slices that have htbs but not in dat.
Sat Sep 15 10:01:50 2007: bwmon Slice utah_elab_31230 doesn't have xid. Must be delegated. Skipping.
Sat Sep 15 10:01:50 2007: bwmon: Found 1 dead slices
Sat Sep 15 10:01:50 2007: bwmon: removing dead slice 1186
Sat Sep 15 10:01:51 2007: bwmon: now 270 running HTBs
Sat Sep 15 10:02:08 2007: bwmon: Saving 270 slices in /var/lib/misc/bwmon.dat
Sat Sep 15 10:20:19 2007: Traceback (most recent call last):
  File "/usr/share/NodeManager/nm.py", line 82, in run
    GetSlivers(plc)
  File "/usr/share/NodeManager/nm.py", line 36, in GetSlivers
    data = plc.GetSlivers()
  File "/data/build/tmp/NodeManager-1.5-4.planetlab-root//usr/share/NodeManager/plcapi.py", line 86, in wrapper
    return function(*params)
  File "/usr/lib/python2.4/xmlrpclib.py", line 1096, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.4/xmlrpclib.py", line 1383, in __request
    verbose=self.__verbose
  File "/data/build/tmp/NodeManager-1.5-4.planetlab-root//usr/share/NodeManager/safexmlrpc.py", line 21, in request
    raise xmlrpclib.ProtocolError(host + handler, -1, str(e), '')
ProtocolError: <ProtocolError for boot.planet-lab.org:443//PLCAPI//: -1 >

Why the XML-RPC call would throw an exception is beyond me. The API, AFAIK, never does that unless it's heavily loaded, but the CPU load on these machines is so low it makes no sense to me.
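
One clue buried in the traceback: safexmlrpc.py isn't reporting a real HTTP status at all. It catches some lower-level exception e and re-raises it as a ProtocolError with a fake code of -1 and str(e) as the message, and the message after the -1 in the logged error is empty, which suggests the underlying exception carried no text whatsoever. A quick sanity check of that reading; the hostname and handler are taken from the logged error, and the rest just reconstructs the raise on safexmlrpc.py line 21:

import xmlrpclib

# Rebuild the error the same way safexmlrpc.py line 21 does, but with an
# empty errmsg standing in for str(e).
err = xmlrpclib.ProtocolError('boot.planet-lab.org:443//PLCAPI//', -1, '', '')

# Prints "<ProtocolError for boot.planet-lab.org:443//PLCAPI//: -1 >",
# i.e. exactly the uninformative line in the log above, so whatever blew
# up underneath (socket timeout, SSL error, ...) stringified to nothing.
print err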

Regardless, adding a try/except around the GetSlivers call in nm.py should keep NM from falling over. I pushed the fix already.
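
For the record, here is a minimal sketch of what that guard looks like in spirit. The while loop, the interval, and the traceback-based logging are my assumptions, not the real nm.py; only the GetSlivers(plc) call inside run() and the data = plc.GetSlivers() body come from the traceback above.

import time
import traceback

def GetSlivers(plc):
    # From the traceback: nm.py's GetSlivers calls through to the plcapi
    # wrapper, which is where the ProtocolError bubbled up from.
    data = plc.GetSlivers()
    return data

def run(plc, interval=600):
    # Sketch of the guarded loop; the loop shape and interval are assumed.
    while True:
        try:
            GetSlivers(plc)
        except Exception:
            # One failed fetch (e.g. the ProtocolError above) no longer kills
            # the whole NodeManager loop; log it and retry on the next pass.
            traceback.print_exc()
        time.sleep(interval)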