El blog del Gato: 2014/05

Today I spent some time struggling with a slightly obscure error at work.

A job with dependencies on other jobs was impossible to queue in Torque (running in a virtual cluster created just for development and testing).

The damn job:

----------------------------------------
#!/bin/bash

#PBS -S /bin/bash
#PBS -N metgrid
#PBS -q batch
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=1
#PBS -o metgrid-oe.log
#PBS -j oe
#PBS -W depend=afterok:325.dadm:326.dadm

#bla bla bla
----------------------------------------

This error was reported in the server log:

----------------------------------------
cat /var/spool/torque/server_logs/20140505
...
05/05/2014 17:55:00;0080;PBS_Server.4427;Req;req_reject;Reject reply code=15041(Job rejected by all possible destinations (check syntax, queue resources, ...)), aux=0, type=Commit, from xxx@dhead
05/05/2014 17:55:29;0010;PBS_Server.4427;Job;327.dadm;Exit_status=0 resources_used.cput=00:00:28 resources_used.mem=43356kb resources_used.vmem=96480kb resources_used.walltime=00:00:50
...
----------------------------------------
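A quick way to pull just the reject codes out of a log like this is sketched below. The function name reject_codes is made up for illustration, and the pattern is based only on the log line shown above, so other Torque versions may format it differently.

```shell
#!/bin/bash
# Hypothetical helper: list the distinct reply codes of rejected requests
# found in a Torque server log. The pattern matches lines shaped like the
# "req_reject;Reject reply code=15041(...)" line shown above.
reject_codes() {
    local log_file="$1"
    # Keep only req_reject lines and print the number after "reply code=".
    sed -n '/req_reject/s/.*Reject reply code=\([0-9][0-9]*\).*/\1/p' "$log_file" | sort -u
}

# Intended use (the path is the one from this post):
#   reject_codes /var/spool/torque/server_logs/20140505
```

The reason text inside the parentheses of the log line is usually more helpful than the number, but the code is handy when searching for it.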

The jobs it depends on were running:

----------------------------------------
qstat -n:

dadm:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
334.dadm                user        infiniba ungrib            19895     1      1    --   02:00:00 R 00:00:00
   dhead/0
335.dadm                user        infiniba geogrid           19901     1      1    --   02:00:00 R 00:00:00
   dhead/1
----------------------------------------

Then this surprised me:

dhead:~/swrf/test/testMetgrid$ qstat -f 340.dadm
qstat: Unknown Job Id Error 340.dadm.mydomain.cl

So I was using an incomplete job id, which the machine tried to repair by adding the domain, but even with the full id it failed again...

user@dhead:~/swrf/test/testMetgrid$ qstat -f 340.dadm.mydomain.cl
qstat: Unknown Job Id Error 340.dadm.mydomain.cl

Well, the domain was missing from the server's configuration...

This explained the solution:

"check that the first name given in your hosts file for your server exactly matches the name given
in your server_name file in your torque configuration."

... and yes, it was about basic configuration :-P

The changes I made:

root@dadm:/var/spool/torque# cat /etc/hosts
127.0.0.1    localhost
192.168.24.38    dadm.mydomain.cl dadm

root@dadm:/var/spool/torque# cat /var/spool/torque/server_name
dadm.mydomain.cl

Then I restarted the server, and after queueing my job again everything went fine.

----------------------------------------
root@dadm:/var/spool/torque# qstat -n

dadm.mydomain.cl:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
352.dadm.mydomain.cl     user     infiniba ungrib            20605     1      1    --   02:00:00 R 00:00:04
   dhead/0
353.dadm.mydomain.cl     user     infiniba geogrid           20619     1      1    --   02:00:00 R 00:00:04
   dhead/1
354.dadm.mydomain.cl     user     infiniba metgrid             --      1      1    --   02:00:00 H       --
----------------------------------------
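The metgrid job now sits in state H (held) until the two jobs it depends on finish. Since the job ids changed form after the fix (352.dadm.mydomain.cl instead of 352.dadm), one option is to stop typing them by hand and take them from the output of qsub, which prints the id of the job it just queued. A minimal sketch, where build_depend and the script names are placeholders of mine and not part of Torque:

```shell
#!/bin/bash
# Hypothetical helper: build the value for "qsub -W" from a dependency type
# and a list of job ids, e.g. depend=afterok:ID1:ID2
build_depend() {
    local type="$1"
    shift
    # At least one job id is needed for a dependency.
    if [ "$#" -eq 0 ]; then
        echo "build_depend: no job ids given" >&2
        return 1
    fi
    # With IFS set to ":", "$*" joins all remaining arguments with colons.
    local IFS=:
    echo "depend=${type}:$*"
}

# Intended use (script names are placeholders):
#   ungrib_id="$(qsub ungrib.pbs)"      # qsub prints the full job id
#   geogrid_id="$(qsub geogrid.pbs)"
#   qsub -W "$(build_depend afterok "$ungrib_id" "$geogrid_id")" metgrid.pbs
```

This way the dependency list always uses the ids exactly as the server reports them.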

Finally, I just need to repair the shell scripts that build and configure the cluster.
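A check along these lines could go into those scripts so the mismatch is caught early. It is only a sketch: check_server_name is a name I made up, and the default paths are the ones from my installation. It compares the name in server_name with the first name on the matching line of the hosts file.

```shell
#!/bin/bash
# Hypothetical helper: verify that the name in Torque's server_name file is
# the FIRST name on its line in the hosts file.
# Both paths are parameters so the check can be tried on copies first.
check_server_name() {
    local server_name_file="${1:-/var/spool/torque/server_name}"
    local hosts_file="${2:-/etc/hosts}"
    local server first

    # Read the configured server name, dropping whitespace and newlines.
    server="$(tr -d '[:space:]' < "$server_name_file")"

    # Find the first hosts line that lists the server name and print the
    # first name on that line (field 2, right after the IP address).
    first="$(awk -v s="$server" '
        /^[[:space:]]*#/ { next }              # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break           # stop at an inline comment
                if ($i == s) { print $2; exit }
            }
        }' "$hosts_file")"

    if [ "$first" = "$server" ]; then
        echo "OK: $server"
    else
        echo "MISMATCH: server_name=$server hosts_first_name=${first:-<not found>}"
        return 1
    fi
}

# Intended use, with the default paths:
#   check_server_name || echo "fix /etc/hosts or server_name before starting pbs_server"
```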

2014/05/05

"Unknown Job Id Error (15001)" in Torque (PBS)