From time to time we run into problems on our cluster with parallel jobs running under PVM. They mostly show up when running GOLD and OpenEye programs such as Omega2 and ROCS.
Jobs simply die with PVM errors such as "/tmp/pvmtmp0113 permission denied". The causes are beyond us and our IT department to resolve, but we do have a very painful workaround: ssh to every affected node on the cluster and delete everything starting with pvm that you own. Of course, if you have successfully started a job on a node, don’t delete the files there; only clean up the nodes on which jobs fail.
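The workaround can be scripted rather than done by hand. What follows is only a rough sketch in Python, not something we have run in production. It assumes the leftover files sit directly in /tmp (as the error message suggests), that you have passwordless ssh to the nodes, and that the nodes have a find that supports -maxdepth and -delete (GNU find does). The function names and the node names in the usage note are made up for illustration.

```python
import getpass
import shlex
import subprocess


def build_cleanup_command(node, user, tmp_dir="/tmp"):
    """Return the ssh argv that deletes the user's pvm* files in tmp_dir on node.

    Assumption: the stale PVM files live directly in tmp_dir, so the search
    is limited to one level and to files owned by `user`.
    """
    remote = "find {} -maxdepth 1 -user {} -name 'pvm*' -delete".format(
        shlex.quote(tmp_dir), shlex.quote(user)
    )
    return ["ssh", node, remote]


def clean_nodes(nodes, user=None, dry_run=True):
    """Build, and optionally run, the cleanup command for each failing node.

    With dry_run=True (the default) the commands are only printed, so you can
    check the node list before anything is deleted.
    """
    user = user or getpass.getuser()
    commands = [build_cleanup_command(node, user) for node in nodes]
    for argv in commands:
        if dry_run:
            print(" ".join(shlex.quote(arg) for arg in argv))
        else:
            # check=False: one unreachable node should not stop the loop.
            subprocess.run(argv, check=False)
    return commands
```

For example, clean_nodes(["node01", "node02"]) prints the two ssh commands without running them, and clean_nodes(["node01", "node02"], dry_run=False) actually runs them. Only list nodes on which your jobs fail, since a node where one of your PVM jobs is running needs its files left alone.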
We are also in the painful situation at the moment where we have commissioned a new cluster and no PVM jobs run at all. They do give errors explicitly stating that PVM processes could not be spawned, but all attempts to modify the way PVM starts jobs have failed.