Support Summary

Support Oct 1-7th

  • I pushed the new PCU page and requested that techs update their information. This resulted in a flurry of feedback about the bugs I had introduced into the index.php/pcu.php scripts. More thorough testing before a push like this would be nice.
  • There were several site_admin problems related to stale or mismatched information between the PLC db and the node configuration.
  • Why can't PIs assign all roles to less privileged users such as 'user' and 'tech'? This no longer seems like a task that Admins need to perform.
  • An 'operators' mailing list would be nice for this discussion of myplc administration. This will be especially relevant as myplc is pushed to more users.
  • Feature request for faster notifications of down nodes.

 

Sept 17-23

 Whitelist

A PI wasn't able to view/update node info at their site because some nodes had a whitelist. We had to modify the whitelisting policy so that site members, and slice members on the whitelist, can see whitelisted nodes.
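
The revised policy can be read as a simple visibility check. The following is only a sketch of that reading, not the code that was checked in: the dictionary keys (`site_id`, `slice_whitelist`, `site_ids`, `slice_ids`) are hypothetical, and treating a node with an empty whitelist as visible to everyone is an assumption.

```python
def can_view_node(user, node):
    """Return True if `user` may see `node` under the revised whitelist policy.

    Both arguments are plain dicts with hypothetical keys, used here
    only to illustrate the rule described above.
    """
    whitelist = node.get("slice_whitelist", [])
    if not whitelist:
        # Assumption: a node with no whitelist is not restricted at all.
        return True
    if node["site_id"] in user.get("site_ids", []):
        # Members of the node's own site can always see it.
        return True
    # Otherwise the user must belong to at least one whitelisted slice.
    return any(s in whitelist for s in user.get("slice_ids", []))
```

With this rule, a PI whose site owns a whitelisted node passes the second test even if they belong to none of the whitelisted slices, which is the case that triggered the ticket.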

 MyPLC

A user downloaded and installed a myplc with a broken schema. User was advised to download another version. 

Adding Nodes To A Site

When the guardog myplc was recently updated, we lost the functionality that allowed PIs/techs with multiple sites to choose which site to add a node to. This functionality has been added back and checked into cvs.

Sept 3-9th

Registration with PI role.

Old netflow log requests. What's in, what's out?  How do we get this information for requests?

Support Week Sep 10-16

Bogus PI signups were the main problem this week, but users should no longer be able to sign up as a PI. Also, maybe we don't want to ask people to respond to pl_mom messages, as this spawns a lot of support tickets.

Explaining the whitelisting feature of the current API was bothersome.

Mike and Andy were not aware that they were on support duty so David carried the load himself. Possibly someone should write a script to inform those on support each week that they're up.
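
A reminder script along these lines might look like the sketch below. It is purely illustrative: the rotation list, the addresses, and the idea of choosing by ISO week number are all assumptions, and actually sending the mail (for example from a weekly cron job) is left out.

```python
import datetime


def on_duty(rotation, day):
    """Pick the rotation entry for the ISO week that contains `day`.

    `rotation` is a hypothetical list of lists of email addresses,
    one inner list per support week.
    """
    week = day.isocalendar()[1]
    return rotation[week % len(rotation)]


def reminder(rotation, day):
    """Build the text of a reminder message for this week's support staff."""
    people = on_duty(rotation, day)
    return (
        "To: %s\n"
        "Subject: You are on support duty this week\n"
        "\n"
        "Reminder: you are on support duty for the week of %s.\n"
        % (", ".join(people), day.isoformat())
    )


if __name__ == "__main__":
    # Hypothetical rotation; real names and addresses would go here.
    rotation = [["alice@example.org"], ["bob@example.org", "carol@example.org"]]
    print(reminder(rotation, datetime.date.today()))
```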

bootcd and plnode.txt causing additional confusion.

There is clearly confusion being caused by users still thinking that they need separate BootCD and plnode.txt/floppy files, when they are using the Custom, all-in-one BootCD.

I have made some changes to the myplc/db-config that sends messages, and I will make a few other changes to the GUI to provide in-line hints about what to do or download. I think there may be other sources of information causing confusion.

 Also, the bootmanager should look on the BootCD itself first and ignore any other node configuration files it finds... Or not?

Support Summary for Aug 20-26th

1. Slice mailing list isn't working: The problem was due to a bug in the aliasing script. It is fixed, and in case of a future problem, take a look at /usr/bin/gen_aliases.py on golf. Reid also mentions this below.

2. pl_mom kills a slice: When we get an apology email from the users, it often proves useful to send them CoMon's memory consumption page for the slice as a reference to help them debug the problem. At the very least, they will stop their experiment on the most problematic nodes.

The CoMon URL is http://comon.cs.princeton.edu/status/tabulator.cgi?table=slices/table_{slice}&sort=6

where you need to replace {slice} with the actual slice name.
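
For anyone pasting this into replies often, the substitution can be wrapped in a small helper. This is a minimal sketch; the function name and the example slice name are made up, and the URL template is exactly the one given above.

```python
from urllib.parse import quote

# URL template from the note above; {slice} is the placeholder to fill in.
COMON_SLICE_URL = (
    "http://comon.cs.princeton.edu/status/tabulator.cgi"
    "?table=slices/table_{slice}&sort=6"
)


def comon_slice_url(slice_name):
    """Return the CoMon table URL for the given slice name."""
    # Percent-encode the name so unexpected characters cannot break the query.
    return COMON_SLICE_URL.format(slice=quote(slice_name, safe=""))


if __name__ == "__main__":
    # "example_slice" is a hypothetical slice name.
    print(comon_slice_url("example_slice"))
```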

3. NodeManager gets stuck at boot [#21930]: I believe Faiyaz is working on this problem, but in the meantime, 1) reboot the node; 2) if rebooting doesn't work, reinstall it.

Support Week Review


Problem: Users not receiving emails sent to the alias slice_name@planet-lab.org.

Solution: Bug fix in gen_aliases.py on mail server. Script was only emailing PIs of the slice. Updated script to email users as well. Discussed merging this into the API itself, no decision made. Thoughts?
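
To make the fix concrete, the idea is sketched below. This is not the contents of gen_aliases.py: the member records, their keys, and the `name: addr, addr` aliases-file line format are assumptions for illustration.

```python
def alias_line(slice_name, members):
    """Build one aliases-file line for a slice.

    `members` is a hypothetical list of dicts with 'email' and 'roles'
    keys. Every member is included, not only those holding the 'pi'
    role, which is the behaviour the fix is described as restoring.
    """
    addresses = sorted({m["email"] for m in members})
    return "%s: %s" % (slice_name, ", ".join(addresses))


if __name__ == "__main__":
    members = [
        {"email": "pi@example.org", "roles": ["pi"]},
        {"email": "user@example.org", "roles": ["user"]},
    ]
    print(alias_line("example_slice", members))
```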

 

Problem: Users unable to find up-to-date PCU suggestions for a new site.

Solution: Informed the user of a few different options. Updated PCU documentation is needed. Removal of the current out-of-date suggestions is recommended.

Support Summary for Aug 13-19th

Dominant issues or repeated problems and their causes, if known. Suggestions for solutions would also be nice.

 

  • Privileged operations:
  • Boot manager:
  1. My node doesn't boot with the error 'unable to contact any boot servers' -- This problem has come up often, sometimes because of high load on the db and web servers, which makes them fail requests with a 500 internal server error. Faiyaz and Tony have recently resolved this issue. But in general, /tmp/bm.log should contain the exact error.
  • General networking:
  1. When you traceroute PL nodes, they don't return the port unreachable message that they're supposed to as the last hop -- I tcpdumped such a traceroute and confirmed that they do emit the message, so the assumption is that they get firewalled off somewhere.
  • Sensors:
  1. I would like to implement a sensor for PL, can I get more details about how CoMon, Netflow etc. implement it? -- The sensor API has been dormant for a long time, and Netflow and CoMon do not actually implement it. Andy is the person to contact about this API.
  • Node manager
  • GUI
  1. I can't upload my public key because the GUI rejects it. -- It turns out that the regex we use to validate keys is quite restrictive and rejects anything not generated by ssh-keygen (e.g. Solaris keys). Short answer - use ssh-keygen.
  • Misc
  1. Does PL have a blacklist of IP addresses belonging to administrators who rant? -- No, but Neil Spring @ UMD does.
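
On the public key item in the GUI section above: one less brittle alternative to a strict regex is to check the structure of the key line instead. The sketch below is not the GUI's validator, and whether it would accept the Solaris keys in question is unknown. It assumes the usual one-line public key layout (`type base64-blob [comment]`), where the decoded blob starts with a 4-byte big-endian length followed by the key type string.

```python
import base64
import binascii
import struct


def looks_like_ssh_public_key(line):
    """Structural check for a one-line 'type base64-blob [comment]' key."""
    fields = line.strip().split(None, 2)
    if len(fields) < 2:
        return False
    key_type, blob_b64 = fields[0], fields[1]
    try:
        blob = base64.b64decode(blob_b64, validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64, so it cannot be a key blob.
        return False
    if len(blob) < 4:
        return False
    # The blob begins with a length-prefixed copy of the key type.
    (type_len,) = struct.unpack(">I", blob[:4])
    return blob[4:4 + type_len] == key_type.encode("ascii", "replace")
```

A check like this ignores the comment field and the key length entirely, so it would still need whatever policy limits the GUI wants to enforce on top of it.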

 
