Today I spent some time struggling with a little obscure error at work.
A job with dependencies on other jobs could not be queued in Torque (running on a virtual cluster created just for development and testing).
The damn job:
----------------------------------------
#!/bin/bash
#PBS -S /bin/bash
#PBS -N metgrid
#PBS -q batch
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=1
#PBS -o metgrid-oe.log
#PBS -j oe
#PBS -W depend=afterok:325.dadm:326.dadm
#bla bla bla
----------------------------------------
The error was reported in the server logs:
----------------------------------------
cat /var/spool/torque/server_logs/20140505
...
05/05/2014 17:55:00;0080;PBS_Server.4427;Req;req_reject;Reject reply code=15041(Job rejected by all possible destinations (check syntax, queue resources, ...)), aux=0, type=Commit, from xxx@dhead
05/05/2014 17:55:29;0010;PBS_Server.4427;Job;327.dadm;Exit_status=0 resources_used.cput=00:00:28 resources_used.mem=43356kb resources_used.vmem=96480kb resources_used.walltime=00:00:50
...
----------------------------------------
The dependency jobs were running:
----------------------------------------
qstat -n:
dadm:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
334.dadm user infiniba ungrib 19895 1 1 -- 02:00:00 R 00:00:00
dhead/0
335.dadm user infiniba geogrid 19901 1 1 -- 02:00:00 R 00:00:00
dhead/1
----------------------------------------
Then this surprised me:
dhead:~/swrf/test/testMetgrid$ qstat -f 340.dadm
qstat: Unknown Job Id Error 340.dadm.mydomain.cl
So, I was using an incomplete job id, which the machine tried to repair by appending the domain, but even then it failed again...
user@dhead:~/swrf/test/testMetgrid$ qstat -f 340.dadm.mydomain.cl
qstat: Unknown Job Id Error 340.dadm.mydomain.cl
Well, the domain was missing from the server's configuration...
This suggested the solution:
"check that the first name given in your hosts file for your server exactly matches the name given
in your server_name file in your torque configuration."
... and yes, it was a basic configuration issue :-P
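That advice boils down to a simple check: the first name listed for the server's IP in /etc/hosts must be byte-for-byte identical to the contents of Torque's server_name file. A minimal sketch of that check, using sample files that stand in for the real ones (contents taken from my configuration below; the IP address is this cluster's):

```shell
#!/bin/bash
# Sanity-check sketch: compare the first hostname given for the server's
# IP in the hosts file against the server_name file. Sample files stand
# in for the real /etc/hosts and /var/spool/torque/server_name.
tmp=$(mktemp -d)
cat > "$tmp/hosts" <<'EOF'
127.0.0.1 localhost
192.168.24.38 dadm.mydomain.cl dadm
EOF
echo 'dadm.mydomain.cl' > "$tmp/server_name"

# First name given for the server's IP in the hosts file...
hosts_name=$(awk '$1 == "192.168.24.38" { print $2 }' "$tmp/hosts")
# ...must exactly match the contents of the server_name file.
server_name=$(cat "$tmp/server_name")

if [ "$hosts_name" = "$server_name" ]; then
    echo "OK: $hosts_name matches"
else
    echo "MISMATCH: hosts says '$hosts_name', server_name says '$server_name'"
fi
```

With the files as shown it prints "OK: dadm.mydomain.cl matches"; before my fix, the two names would have differed and the check would have caught it.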
The changes I made:
root@dadm:/var/spool/torque# cat /etc/hosts
127.0.0.1 localhost
192.168.24.38 dadm.mydomain.cl dadm
root@dadm:/var/spool/torque# cat /var/spool/torque/server_name
dadm.mydomain.cl
Then I restarted the server, queued my job again, and everything went fine.
----------------------------------------
root@dadm:/var/spool/torque# qstat -n
dadm.mydomain.cl:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
352.dadm.mydomain.cl user infiniba ungrib 20605 1 1 -- 02:00:00 R 00:00:04
dhead/0
353.dadm.mydomain.cl user infiniba geogrid 20619 1 1 -- 02:00:00 R 00:00:04
dhead/1
354.dadm.mydomain.cl user infiniba metgrid -- 1 1 -- 02:00:00 H --
----------------------------------------
Finally, I just need to repair the shell scripts that build and configure the cluster.
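While I'm at it, the submission side can be made less brittle too: instead of hard-coding truncated job ids in the `#PBS -W depend=` line, capture the full ids that qsub prints and build the dependency string from them. A minimal sketch, with assumed .pbs script names and the actual qsub calls shown as comments so the sketch stands on its own:

```shell
#!/bin/bash
# On the real cluster, the two prerequisite jobs would be submitted and
# their FULL job ids captured from qsub's stdout:
#   ungrib_id=$(qsub ungrib.pbs)
#   geogrid_id=$(qsub geogrid.pbs)
# Sample ids, in the form qsub returns once server_name is fully qualified:
ungrib_id="352.dadm.mydomain.cl"
geogrid_id="353.dadm.mydomain.cl"

# afterok:<id>:<id> -- the dependent job runs only if both exit with status 0.
depend="afterok:${ungrib_id}:${geogrid_id}"
echo "$depend"

# On the real cluster:
#   qsub -W depend="$depend" metgrid.pbs
```

This way the depend list always carries whatever form of id the server hands back, fully qualified or not.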