
"Unknown Job Id Error (15001)" in Torque (PBS)

Today I spent some time struggling with a somewhat obscure error at work.

A job with dependencies on other jobs was impossible to queue in Torque (running in a virtual cluster created just for development and testing).

The damn job:


#PBS -S /bin/bash
#PBS -N metgrid
#PBS -q batch
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=1
#PBS -o metgrid-oe.log
#PBS -j oe
#PBS -W depend=afterok:325.dadm:326.dadm

#bla bla bla
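
The job IDs in the depend line were typed by hand, which is easy to get wrong. As a minimal sketch, the same string can be built from whatever IDs qsub prints at submission time. build_depend is a hypothetical helper (not part of Torque), and the script names in the usage comment are assumptions.

```shell
#!/bin/bash
# Build a Torque dependency spec such as "depend=afterok:325.dadm:326.dadm".
# build_depend is a hypothetical helper, not a Torque command.
build_depend() {
    local type="$1"; shift          # e.g. afterok
    local spec="depend=${type}"
    local id
    for id in "$@"; do              # append every job ID, colon-separated
        spec="${spec}:${id}"
    done
    printf '%s\n' "$spec"
}

# Usage on a machine with Torque installed (script names are assumptions):
#   ungrib_id=$(qsub ungrib.pbs)      # qsub prints the new job ID
#   geogrid_id=$(qsub geogrid.pbs)
#   qsub -W "$(build_depend afterok "$ungrib_id" "$geogrid_id")" metgrid.pbs
```

Capturing the IDs this way means the dependency always uses exactly the form the server handed out, short or fully qualified.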


This error was reported in the server log:

cat /var/spool/torque/server_logs/20140505
05/05/2014 17:55:00;0080;PBS_Server.4427;Req;req_reject;Reject reply code=15041(Job rejected by all possible destinations (check syntax, queue resources, ...)), aux=0, type=Commit, from xxx@dhead
05/05/2014 17:55:29;0010;PBS_Server.4427;Job;327.dadm;Exit_status=0 resources_used.cput=00:00:28 resources_used.mem=43356kb resources_used.vmem=96480kb resources_used.walltime=00:00:50
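
The server log gets noisy quickly, so here is a small sketch for pulling out only the reject codes. reject_codes is a hypothetical helper; it assumes the log lines look like the ones above and takes the log file as its argument.

```shell
#!/bin/bash
# reject_codes: print each distinct "Reject reply code" number found
# in a Torque server log file (hypothetical helper, path given as $1).
reject_codes() {
    grep -o 'Reject reply code=[0-9]*' "$1" | cut -d= -f2 | sort -u
}

# Usage (log path as used in this post):
#   reject_codes /var/spool/torque/server_logs/20140505
```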


The dependency jobs running:

qstat -n:

                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
334.dadm                user        infiniba ungrib            19895     1      1    --   02:00:00 R  00:00:00
335.dadm                user        infiniba geogrid           19901     1      1    --   02:00:00 R  00:00:00


Then this surprised me:

dhead:~/swrf/test/testMetgrid$ qstat -f 340.dadm
qstat: Unknown Job Id Error 340.dadm.mydomain.cl

So, I was using an incomplete job ID that the machine tried to repair by appending the domain, but even so it failed. And it failed again with the full ID:

user@dhead:~/swrf/test/testMetgrid$ qstat -f 340.dadm.mydomain.cl
qstat: Unknown Job Id Error 340.dadm.

Well, the domain was missing from the server's configuration...

This advice pointed to the solution:

"check that the first name given in your hosts file for your server exactly matches the name given
in your server_name file in your torque configuration."

... and yes, it was about basic configuration :-P
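
The check described in that quote can be scripted. This is only a sketch under my own assumptions: first_host_name and check_server_name are hypothetical helpers, and the file paths are passed in as arguments so nothing is hard-coded.

```shell
#!/bin/bash
# Hypothetical helpers; not part of Torque.

# Print the first name on the hosts-file line that mentions the given host.
first_host_name() {
    # $1 = hosts file, $2 = any name or alias of the server
    awk -v h="$2" '
        /^[[:space:]]*#/ { next }      # skip comment lines
        {
            # field 1 is the address, names start at field 2
            for (i = 2; i <= NF; i++)
                if ($i == h) { print $2; exit }
        }' "$1"
}

# Succeed only if that first name equals the content of server_name.
check_server_name() {
    # $1 = hosts file, $2 = server_name file, $3 = name or alias of the server
    local in_hosts in_torque
    in_hosts=$(first_host_name "$1" "$3")
    in_torque=$(tr -d '[:space:]' < "$2")
    if [ -n "$in_hosts" ] && [ "$in_hosts" = "$in_torque" ]; then
        echo "OK: $in_torque"
    else
        echo "MISMATCH: hosts='$in_hosts' server_name='$in_torque'"
        return 1
    fi
}

# Usage with the paths from this post:
#   check_server_name /etc/hosts /var/spool/torque/server_name dadm
```

A non-zero exit status makes it easy to call this from a cluster setup script and stop early when the two names disagree.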

The changes I made:

root@dadm:/var/spool/torque# cat /etc/hosts
    localhost
    dadm.mydomain.cl dadm

root@dadm:/var/spool/torque# cat /var/spool/torque/server_name

Then I restarted the server, and after queueing my job everything went fine.

root@dadm:/var/spool/torque# qstat -n

                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
352.dadm.mydomain.cl     user     infiniba ungrib            20605     1      1    --   02:00:00 R  00:00:04
353.dadm.mydomain.cl     user     infiniba geogrid           20619     1      1    --   02:00:00 R  00:00:04
354.dadm.mydomain.cl     user     infiniba metgrid             --      1      1    --   02:00:00 H       --


Finally, I just need to fix the shell scripts that build and configure the cluster.