Differences

Below are the differences between two revisions of the page.

errors_with_samgrid_lcg [2016/12/16 10:15] (Current version)
Line 1: Line 1:
 +Modified by Kurca, 17 Feb 2014\\
 +\\
 +
 +====== Errors with SAMGrid/LCG ======
 +
 +\\
 +\\
 +
 +====== Chapter 1: SAMGrid-LCG tests ======
 +
 +=====  Authentication test  =====
 +
 +<​code>​
 +ccsvli02:​tcsh[76] globusrun -a -r obsauvergridce01.univ-bpclermont.fr
 +GRAM Authentication test successful
 +
 +ccsvli02:​tcsh[77] globusrun -a -r cirigridce01.univ-bpclermont.fr
 +GRAM Authentication test successful
 +
 +ccsvli02:​tcsh[78] globusrun -a -r clrlcgce01.in2p3.fr
 +GRAM Authentication test successful
 +
 +</​code>​
 +Caution: the service to request is "jobmanager", not "jobmanager-lcgpbs":
 +<​code>​
 +[kurca@ccali01 ~]$ globusrun -a -r lyogrid02.in2p3.fr:​2119/​jobmanager-lcgpbs
 +GRAM Authentication test failure: the gatekeeper failed to find the requested service
 +
 +
 +
 +ccsvli02:​tcsh[88] ​ globusrun -a -r lyogrid02.in2p3.fr:​2119/​jobmanager
 +GRAM Authentication test successful
 +
 +
 +
 +</​code>​
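
The per-host checks above can be wrapped in a small loop. This is a minimal bash sketch, not a SAMGrid tool: the function name `auth_test_all` and the `GLOBUSRUN` override variable are hypothetical, introduced only so the loop can be tried with a stub in place of the real `globusrun`. It uses the same `-a -r` flags as the transcript and treats a host as passing only if the command succeeds and its output contains "successful".

```shell
# Run the GRAM authentication test against several contacts and summarize.
# GLOBUSRUN is a hypothetical override (not a Globus feature); by default the
# real `globusrun` command is used.
auth_test_all() {
    local cmd="${GLOBUSRUN:-globusrun}" contact out
    for contact in "$@"; do
        # -a: authentication only, -r: resource contact (flags as used above)
        if out="$("$cmd" -a -r "$contact" 2>&1)" && [[ $out == *successful* ]]; then
            printf '%s: OK\n' "$contact"
        else
            printf '%s: FAILED (%s)\n' "$contact" "$out"
        fi
    done
}
```

Possible use: `auth_test_all clrlcgce01.in2p3.fr lyogrid02.in2p3.fr:2119/jobmanager`.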
 +=====  Job submission ​ =====
 +
 +<​code>​
 +ccsvli02:​tcsh[79] globus-job-run clrlcgce03.in2p3.fr:​2119/​jobmanager -x "​(queue=lcgpbs-dzero)(jobtype=single)"​ /​bin/​hostname
 +clrlcgce03.in2p3.fr
 +
 +ccsvli02:​tcsh[73] globus-job-run cirigridce01.univ-bpclermont.fr:​2119/​jobmanager -x "​(queue=lcgpbs-dzero)(jobtype=single)"​ /​bin/​hostname
 +GRAM Job submission failed because the job manager failed to open stderr (error code 74)
 +
 +globus-job-run cirigridce01.univ-bpclermont.fr:​2119/​jobmanager-lcgpbs -q dzero /​bin/​hostname
 +GRAM Job submission failed because the job manager failed to open stderr (error code 74)
 +
 +[kurca@ccali07 ~]$ globus-job-run cirigridce01.univ-bpclermont.fr:​2119/​jobmanager-lcgpbs -q dzero /​bin/​hostname
 +GRAM Job submission failed because the provided RSL '​queue'​ parameter is invalid (error code 37)
 +
 +[kurca@ccsvli02 ~]$ globus-job-run clrlcgce01.in2p3.fr:​2119/​jobmanager-lcgpbs -q lhcb /​bin/​hostname
 +[kurca@ccsvli02 ~]$ globus-job-run clrlcgce01.in2p3.fr:​2119/​jobmanager -q lhcb /​bin/​hostname
 +clrlcgce01.in2p3.fr
 +
 +# Use "jobmanager", not "jobmanager-lcgpbs".
 +
 +[kurca@ccali07 ~]$ 
 +
 +</​code>​
 +<​code>​
 +ccsvli02:​tcsh[125] globus-job-run obsauvergridce01.univ-bpclermont.fr:​2119/​jobmanager -x "​(queue=dzero)(jobtype=single)"​ /​bin/​hostname ; date
 +obsauvergridce01.univ-bpclermont.fr
 +Wed May 11 15:44:55 CEST 2011
 +ccsvli02:​tcsh[126] globus-job-run obsauvergridce01.univ-bpclermont.fr:​2119/​jobmanager -x "​(queue=dzero)(jobtype=single)"​ hostname ; date
 +GRAM Job failed because the executable does not exist (error code 5)
 +Wed May 11 15:47:27 CEST 2011
 +
 +</​code>​
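
For quick reference, the GRAM error codes that appear on this page can be collected in a small lookup. This is only a sketch: the function name `gram_error_hint` is made up, the messages are copied from the transcripts on this page, and the hints are this page's observations rather than an official Globus reference.

```shell
# Map the GRAM error codes seen on this page to a short hint.
# Returns non-zero for codes that are not covered here.
gram_error_hint() {
    case "$1" in
        5)   echo "executable does not exist: give an absolute path (/bin/hostname worked, hostname did not)" ;;
        37)  echo "RSL 'queue' parameter is invalid: check the queue name for this CE" ;;
        74)  echo "job manager failed to open stderr" ;;
        158) echo "job manager could not lock the state lock file: check how the DN is mapped on the CE" ;;
        *)   echo "no note for GRAM error code $1 on this page"; return 1 ;;
    esac
}
```

For example, `gram_error_hint 5` prints the note about absolute paths.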
 +
 +====== Chapter 2: Missing Input Data for Job ======
 +
 +Error message: done(User executable exited with code 1)\\
 +On grid-lcg.physik.uni-wuppertal.de\\
 +in /​local/​jim/​jim_sandbox/​box6964.1158252356\\
 +\\
 +less std_err_-0TLjAsidEQ4v3H0Mqutkg
 +<​code>​
 +.....
 +Std err:
 +Thread-1: 09/14/06 19:54:04: ERROR: jim_gridftp:​ Thu 09/14/06 19:54:04: INFO  : User Subject: /​DC=org/​DC=doegrids/​
 +OU=People/​CN=Tibor Kurca 761753/​CN=proxy/​CN=limited proxy
 +jim_gridftp:​ Thu 09/14/06 19:54:04: INFO  : Set server type to default: data_server
 +jim_gridftp:​ Thu 09/14/06 19:54:04: INFO  : Will read server subject for host ccsvli02.in2p3.fr from configuration
 +Command is invalid
 +jim_gridftp:​ Thu 09/14/06 19:54:04: WARN  : Guessing server subject as: /​DC=org/​DC=doegrids/​OU=Services/​CN=sam/​ccs
 +vli02.in2p3.fr
 +jim_gridftp:​ Thu 09/14/06 19:54:04: INFO  : Will read port number for host ccsvli02.in2p3.fr from configuration
 +Command is invalid
 +jim_gridftp:​ Thu 09/14/06 19:54:04: INFO  : Set port number to default: 4568
 +error: a system call failed (Connection refused)
 +jim_gridftp:​ Thu 09/14/06 19:54:04: ERROR : Configuration Read:
 +jim_gridftp:​ Thu 09/14/06 19:54:04: ERROR :     ​Port ​          = 4568
 +jim_gridftp:​ Thu 09/14/06 19:54:04: ERROR :      Remote Machine = ccsvli02.in2p3.fr
 +jim_gridftp:​ Thu 09/14/06 19:54:04: ERROR :     ​Server Subject = /​DC=org/​DC=doegrids/​OU=Services/​CN=sam/​ccsvli02.i
 +n2p3.fr
 +jim_gridftp:​ Thu 09/14/06 19:54:04: ERROR :     User Subject ​  = /​DC=org/​DC=doegrids/​OU=People/​CN=Tibor Kurca 7617
 +53/​CN=proxy/​CN=limited proxy
 +jim_gridftp:​ Thu 09/14/06 19:54:04: ERROR : jim_gridftp product Configuration is
 +
 +<?xml version="​1.0"?>​
 +<​jim_gridftp_configuration>​
 +  <​interview_schema version="​2_0"​ />
 +  <host name="​grid-lcg.physik.uni-wuppertal.de">​
 +    <​data_server>​
 +      <port number="​4568"​ />
 +      <​certificate subject="/​DC=org/​DC=doegrids/​OU=Services/​CN=sam/​grid-lcg.physik.uni-wuppertal.de"​ />
 +    </​data_server>​
 +    <​head_server>​
 +      <port number="​4569"​ />
 +      <​certificate subject="/​DC=org/​DC=doegrids/​OU=Services/​CN=sam/​grid-lcg.physik.uni-wuppertal.de"​ />
 +    </​head_server>​
 +  </​host>​
 +</​jim_gridftp_configuration>​
 +.....
 +
 +</​code>​
 +Cause: the jim_gridftp daemon was not running on ccsvli02.in2p3.fr.\\
 +ups run server_run ​
 +====== Chapter 3: Error message: done(User executable exited with code 1) ======
 +
 +Symptom: input files from SAM are not delivered to jobs running on LCG clusters\\
 +- in log file of sam station ccsvli02:\\
 +> ...\\
 +> 01/25/08 23:06:35 ccin2p3-grid1.SM.FileImageMan 30735:\\
 +> 01/25/08 23:06:35 ccin2p3-grid1.SM.FileImageState 30735: 219 224\\
 +> 01/25/08 23:06:35 ccin2p3-grid1.SM.FileImageState 30735: 2 retries left\\
 +> 01/25/08 23:06:35 ccin2p3-grid1.SM.FileManager 30735: 74261476 incRefCount\\
 +> 14\\
 +> 01/25/08 23:06:35 ccin2p3-grid1.SM.FileImageMan 30735:\\
 +> FileImageMan::​transactionComplete with: cIdLMVKiyuGYh8fb22OwxPIbm3Bu9NEE,​\\
 +> failure (failure) code 219 descr - java.lang.RuntimeException:​\\
 +> java.net.NoRouteToHostException:​ No route to host FILE 26026466\\
 +> d0runjob_v07-08-01.tar.gz,​ size 584378B cached_time=29 Dec 16:21:48 at\\
 +> srm:​%%//​%%ac-uk-diskonly-ccin2p3-grid1FileImageMan::​disownTransaction id\\
 +> cIdLMVKiyuGYh8fb22OwxPIbm3Bu9NEE\\
 +> 01/25/08 23:06:35 ccin2p3-grid1.SM.FileManager 30735: 74261476 decRefCount\\
 +> 15\\
 +> 01/25/08 23:06:35 ccin2p3-grid1.SM.FileImageMan 30735:\\
 +> FileImageMan::​scheduleGetRequest 0x9236288 in 10 seconds\\
 +......
 +<​code>​
 +dCache SRM endpoint at gfe02.hep.ph.ic.ac.uk 8443 is not reachable
 +</​code>​
 +Trying 155.198.216.173...\\
 +telnet: Unable to connect to remote host: No route to host\\
 +\\
 +It is in the IC dCache domain. The course of action in this case is\\
 +either to ask Kostas for help or to submit an LCG GOC ticket to (I believe)\\
 +http:%%//%%www.ukiroc.eu/ukiroc/content/view/21/157/\\
 +I did the former.
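
The telnet test above can be scripted when several endpoints need checking. A minimal bash sketch, assuming bash's `/dev/tcp` redirection and the coreutils `timeout` command are available; the function name `port_reachable` and the 5-second default are arbitrary choices. It only reports whether a TCP connection could be opened and does not distinguish "connection refused" (as with port 4568 in Chapter 2) from "no route to host".

```shell
# Try to open a TCP connection to host:port; exit status 0 means reachable.
# Third argument: timeout in seconds (default 5, an arbitrary choice).
port_reachable() {
    local host="$1" port="$2" secs="${3:-5}"
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

Possible use: `port_reachable gfe02.hep.ph.ic.ac.uk 8443 || echo "SRM endpoint not reachable"`.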
 +====== Chapter 4: Globus Errors ======
 +
 +=====  Error 158  =====
 +
 +hi all,\\
 +\\
 +i have a problem with some jobs and i can't find the cause.\\
 +\\
 +the jobs run and do their work (ie write some output file to the storage,\\
 +indicating correct execution) BUT the job status looks strange\\
 +(full output below).\\
 +at the end no job output can be retrieved and if i look on the RB (LCG part of\\
 +glite3), no input or output sandbox seem to exist (anymore).
 +<​code>​
 +- reason ​                 =    Job successfully submitted to Globus
 +...........
 +- reason ​                 =    Got a job held event, reason: Globus error 158: 
 +the job manager could not lock the state lock file
 +......
 +- reason ​                 =    Job got an error while in the CondorG queue.
 +</​code>​
 +That typically happens if the user's DN got mapped differently between\\
 +the time the job was submitted and the time it was cleaned up:\\
 +\\
 +1. The DN may have got added to or removed from the list of "​sgm"​ or "​prd"​\\
 +users\\
 +(pool account <--> sgm/prd account)\\
 +\\
 +2. The user used _different_ proxies for jobs sent to the same RB:\\
 +grid vs. VOMS proxies, or VOMS proxies with vs. without a role, etc.\\
 +The RB cannot handle that. Always use the same proxy, otherwise\\
 +any unfinished jobs may be lost (allow at least for 1 hour for any\\
 +active grid_monitor processes on the CEs to die out).\\
 +\\
 +Stijn De Weirdt wrote:\\
 +\\
 +> hi maarten,\\
 +>\\
 +> i saw that error, but i thought the fact that it submitted the job a second\\
 +> time was the cause of this. (i still don't understand why it tried to do that.)\\
 +\\
 +When a job fails (from the RB/WMS perspective),​ it is always sent back\\
 +to the Workload Manager daemon for resubmission. In this case the WM\\
 +found the max. number of resubmissions to be zero, so the job aborted.\\
 +\\
 +> thanks anyway, (i'll check the proxies with the user)\\
 +\\
 +You can also look on gridce.iihe.ac.be in the "​gram_*.log"​ files in the\\
 +home directory of the account the DN got mapped to: they may show why\\
 +the state lock file could not be locked (e.g. "​Permission denied"​).\\
 +To find the pool account(s) in this case:\\
 +\\
 +--------------------------------------------------------------------------------\\
 +for i in `ls -li /etc/grid-security/gridmapdir/ | awk '/heyninck/ { print $1 }'`\\
 +do\\
 +ls -li /etc/grid-security/gridmapdir/ | awk -v ino="$i" '$1 == ino { print $NF }'\\
 +done\\
 +--------------------------------------------------------------------------------\\
 +\\
 +Note: every different set of VOMS attributes gets its own pool account!\\
 +\\
 +I found these entries in your gridmapdir indeed:\\
 +\\
 +--------------------------------------------------------------------------------\\
 +755331 -rw-r--r-- 2 root root 0 Nov 17 12:12\\
 +%2fc%3dbe%2fo%3dbegrid%2fou%3dvub%2fou%3ddntk%2fcn%3dheyninck\\
 +755341 -rw-r--r-- 2 root root 0 Nov 17 22:56\\
 +%2fc%3dbe%2fo%3dbegrid%2fou%3dvub%2fou%3ddntk%2fcn%3dheyninck%3abecms\\
 +755361 -rw-r--r-- 2 root root 0 Nov 17 11:09\\
 +%2fc%3dbe%2fo%3dbegrid%2fou%3dvub%2fou%3ddntk%2fcn%3dheyninck%3acms\\
 +--------------------------------------------------------------------------------\\
 +\\
 +The user used a grid proxy and two different VOMS proxies on the same day.\\
 +For the RB they are all equivalent, but not for the CE...\\
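
The lookup above can also be written as a reusable function. This is a sketch under assumptions: it presumes the usual gridmapdir layout in which each URL-encoded DN entry is a hard link to a pool account file (which the link count of 2 in the listing suggests), it needs GNU `find`, and the names `pool_accounts_for` and `decode_dn` are made up for illustration.

```shell
# List the gridmapdir files that share an inode with entries matching a
# pattern, i.e. the pool accounts the matching DNs are linked to.
pool_accounts_for() {
    local pattern="$1" dir="${2:-/etc/grid-security/gridmapdir}" entry
    for entry in "$dir"/*"$pattern"*; do
        [ -e "$entry" ] || continue
        # -samefile matches every name that is a hard link to the same inode;
        # the DN entries themselves are excluded by name.
        find "$dir" -maxdepth 1 -samefile "$entry" ! -name "*$pattern*" -printf '%f\n'
    done | sort -u
}

# Decode a URL-encoded gridmapdir file name back into a readable DN.
decode_dn() {
    printf '%b\n' "${1//%/\\x}"
}
```

Possible use: `pool_accounts_for heyninck` on the CE, and `decode_dn` on any of the encoded entry names to read the DN and its VOMS suffix.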
 +\\
 +> hi maarten,\\
 +>\\
 +> but this is the whole problem: the job didn't fail (at first)! the first\\
 +> attempt was successful. status "Done" with exitcode 0.\\
 +\\
 +The exit code is irrelevant from the point of view of the RB/WMS:\\
 +the job could not be successfully monitored or cleaned up, so it\\
 +was considered as having failed.\\
 +\\
 +Note that the exit code was logged by the WN, not the Log Monitor\\
 +daemon on the RB. The LM browses the Condor-G logfiles to discover\\
 +and log the current state of each job. In this case it did find\\
 +that the job became Running (that event is logged twice), but it\\
 +never saw that the job was Done, because the grid_monitor process\\
 +on the CE suddenly could no longer lock its own state file.\\
 +\\
 +It is true that the LM could still have reported a success in this\\
 +particular case, by looking at the events logged by the WN, but it\\
 +was decided to completely ignore such events because they confused\\
 +the job state machine code (and the user).\\
 +\\
 +> only later it was retried. and then failed and then resubmitted and\\
 +> aborted. (i still don't know why ;)\\
 +\\
 +Now you do. 
 +====== Chapter 5: Cannot transfer input sandbox file ======
 +
 +
 +====  samgrid: grid2 - ccd01  ====
 +
 +<​code>​
 +Solution: first check whether the shell run by user sam is listed in /etc/shells.
 +  Fix applied: added the last 3 lines (/usr/local/bin/bash etc.) to the file.
 + 
 +</​code>​
 +ccd01:​tcsh[202] less shells\\
 +/bin/sh\\
 +/bin/bash\\
 +/​sbin/​nologin\\
 +/​bin/​bash2\\
 +/bin/ash\\
 +/bin/bsh\\
 +/bin/ksh\\
 +/bin/tcsh\\
 +/bin/csh\\
 +/bin/zsh\\
 +/​usr/​local/​bin/​bash\\
 +/​usr/​local/​bin/​tcsh\\
 +/​usr/​local/​bin/​zsh
 +<​code>​
 +> Job is submitted, it starts on the worker nodes but job is not
 +> able to get input sandbox file;
 +> See error message:
 +> sam@ccd01:​tcsh[286] less std_err1
 +> error: the server sent an error response: 530 530 Login incorrect.
 +>
 +> error: the server sent an error response: 530 530 Login incorrect.
 +>
 +> error: the server sent an error response: 530 530 Login incorrect.
 +>
 +> transfer_sandbox.sh:​ Fri 11/10/06 14:44:22: WARNING: Cannot download user
 +> input sandbox... will try to proceed
 +>
 +> gzip: stdin: unexpected end of file
 +> tar: Child returned status 1
 +> tar: Error exit delayed from previous errors
 +> error: the server sent an error response: 530 530 Login incorrect.
 +>
 +> error: the server sent an error response: 530 530 Login incorrect.
 +>
 +> error: the server sent an error response: 530 530 Login incorrect.
 +>
 +> transfer_sandbox.sh:​ Fri 11/10/06 14:48:16: ERROR: Cannot transfer input
 +> sandbox file.
 +> tar: sam.tgz: Cannot open: No such file or directory
 +> tar: Error is not recoverable:​ exiting now
 +> tar: Child returned status 2
 +> tar: Error exit delayed from previous errors
 +> sandbox_mgr:​ Fri 11/10/06 14:48:17: ERROR: Can't unpack SAM
 +> Command exited with non-zero status 1
 +> 0.63user 0.39system 6:​38.54elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
 +> 0inputs+0outputs (12172major+5951minor)pagefaults 0swaps
 +>
 +
 +</​code>​
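
The /etc/shells check can be scripted. Many FTP daemons refuse logins for accounts whose shell is not listed in that file; that the same rule produced the "530 Login incorrect" here is inferred from the fix recorded above, not confirmed. A minimal bash sketch, where the function name `login_shell_listed` is hypothetical and the optional second argument exists only so the check can be pointed at a copy of the file:

```shell
# Check whether a user's login shell is listed in the shells file.
# Exit status: 0 listed, 1 not listed, 2 user not found.
login_shell_listed() {
    local user="$1" shells="${2:-/etc/shells}" shell
    shell="$(getent passwd "$user" | cut -d: -f7)"
    [ -n "$shell" ] || { echo "no passwd entry for $user" >&2; return 2; }
    if grep -qxF -- "$shell" "$shells"; then
        echo "$shell is listed in $shells"
    else
        echo "$shell is NOT listed in $shells"
        return 1
    fi
}
```

Possible use on the server: `login_shell_listed sam`.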
 +
 +====  samgrid/​OSG:​ d0mino01 vs samgfarm ​ ====
 +
 +<​code>​
 +Reason: wrong user mapping in the jim_gridftp grid-mapfile; in other words,
 +the user was not listed in the jim_gridftp grid-mapfile.
 +</​code>​
 +Alternatively (unconfirmed): the user was not in the SAM "test" group and so was not allowed to create test files.
 +<​code>​
 +ccali21:​tcsh[225] less std_err4.1740.0
 +error: the server sent an error response: 530 530 Can't set uid.
 +
 +error: the server sent an error response: 530 530 Can't set uid.
 +
 +error: the server sent an error response: 530 530 Can't set uid.
 +
 +transfer_sandbox.sh:​ Tue 12/12/06 19:21:37: WARNING: Cannot download user input sandbox... will tr
 +y to proceed
 +
 +gzip: stdin: unexpected end of file
 +tar: Child returned status 1
 +tar: Error exit delayed from previous errors
 +error: the server sent an error response: 530 530 Can't set uid.
 +
 +error: the server sent an error response: 530 530 Can't set uid.
 +
 +error: the server sent an error response: 530 530 Can't set uid.
 +
 +transfer_sandbox.sh:​ Tue 12/12/06 19:27:05: ERROR: Cannot transfer input sandbox file.
 +tar (child): sam.tgz: Cannot open: No such file or directory
 +tar (child): Error is not recoverable:​ exiting now
 +tar: Child returned status 2
 +tar: Error exit delayed from previous errors
 +sandbox_mgr:​ Tue 12/12/06 19:27:06: ERROR: Can't unpack SAM
 +Command exited with non-zero status 1
 +0.33user 0.25system 18:​01.15elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
 +0inputs+0outputs (12381major+5587minor)pagefaults 0swaps
 +
 +
 +</​code>​
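
To check the mapping directly, a DN can be looked up in the grid-mapfile. This sketch assumes the common grid-mapfile format of one mapping per line, a double-quoted DN followed by the local account. The function name `gridmap_lookup` is made up, and the file path must be supplied because the location of the jim_gridftp grid-mapfile is not given on this page.

```shell
# Print the local account(s) a DN maps to; exit status 1 if the DN is absent.
# Assumed line format: "quoted DN" localuser
gridmap_lookup() {
    local dn="$1" file="$2"
    awk -v dn="\"$dn\"" 'index($0, dn) == 1 { print $NF; found = 1 }
                         END { exit !found }' "$file"
}
```

Possible use: `gridmap_lookup "$(grid-proxy-info -identity)" /path/to/grid-mapfile`, where the path is a placeholder.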
 +
 +====  Error 550  ====
 +
 +<​code>​
 +Ticket: #2463: error response 550 : not a plain file, Topic: D0, Status: ​
 +resolved, Importance: Medium, Classification:​ User Problem
 +Ticket URL: https://​plone3.fnal.gov/​SAMGrid/​tracking/​2463/​
 +
 +12/11/06 18:54:31 ccin2p3-grid2.EWORKER 23051: Worker has been invoked to 
 +transfer all_1_0000224347_077.raw ​
 +refix-router:​d0rsam01.fnal.gov:/​sam/​cache2/​refix-router/​boo->​ccin2p3-grid2:​ccd01.in2p3.fr:/​samgrid/​boo ,
 +number of bytes: 222969308
 +12/11/06 18:54:31 ccin2p3-grid2.EWORKER 23051: Subprocess timeout is set to 3087 seconds.
 +12/11/06 18:54:31 ccin2p3-grid2.EWORKER 23051: Executing: ​ samcp 
 +'​refix-router:​d0rsam01.fnal.gov:/​sam/​cache2
 +/​refix-router/​boo/​all_1_0000224347_077.raw' ​
 +'​ccin2p3-grid2:​ccd01.in2p3.fr:/​samgrid/​boo'​
 +12/11/06 18:54:35 ccin2p3-grid2.EWORKER 23051: Executed: ​ samcp 
 +'​refix-router:​d0rsam01.fnal.gov:/​sam/​cache2/​
 +refix-router/​boo/​all_1_0000224347_077.raw' ​
 +'​ccin2p3-grid2:​ccd01.in2p3.fr:/​samgrid/​boo'​
 +Subprocess Status: 0  ​
 +
 +Exit Status: 256  ​
 +
 +Standard Output:
 +Using gridftp to transfer file.
 +gridftp: Local server subject is: 
 +/​DC=org/​DC=doegrids/​OU=Services/​CN=sam/​ccd01.in2p3.fr
 +gridftp: Resolved remote sam server subject as: 
 +/​DC=org/​DC=doegrids/​OU=Services/​CN=sam/​d0rsam01.fnal.gov
 +gridftp: ​
 +/​d0products/​products/​prd/​vdt/​v1_1_14_13/​Linux/​globus/​bin/​globus-url-copy ​
 +-no-third-party-transfers ​
 +-parallel 10 -tcp-buffer-size 4194304 -block-size 1048576 -s 
 +"/​DC=org/​DC=doegrids/​OU=Services/​CN=sam/​d0rsam01.fnal.gov"​ -nodcau ​
 +gsiftp://​d0rsam01.fnal.gov:​4567/​sam/​cache2/​refix-router/​boo/​all_1_0000224347_077.raw ​
 +file://​localhost/​samgrid/​boo/​all_1_0000224347_077.raw
 +  ​
 +
 +Standard Error:
 +error: the server sent an error response: 550 550 
 +/sam/cache2/refix-router/boo/all_1_0000224347_077.raw: not a plain file.
 +
 +
 +
 +12/11/06 18:54:35 ccin2p3-grid2.EWORKER 23051: File transfer exit status: 256
 +12/11/06 18:54:35 ccin2p3-grid2.EWORKER 23051: Will be looking for:​kinit: ​
 +samcp_destination_failure ​
 +12/11/06 18:54:35 ccin2p3-grid2.EWORKER 23051: Executing unlink on 
 +all_1_0000224347_077.raw in ccin2p3-grid2
 +:​ccd01.in2p3.fr:/​samgrid/​boo
 +12/11/06 18:54:35 ccin2p3-grid2.EWORKER 23051: ^GWarning: Removed file 
 +/​samgrid/​boo/​all_1_0000224347_077.raw
 +12/11/06 18:54:35 ccin2p3-grid2.EWORKER 23051: Worker reported, returning with status 0
 +12/11/06 18:54:35 ccin2p3-grid2.Stager@ccd01.in2p3.fr:​Stager 23125: No rule matched
 +
 +</​code>​
 +---------------------------------------------------------------------------\\
 +Comment:\\
 +gridftp says "not a plain file" if the file doesn'​t exist. This can happen\\
 +if the file expires from the router cache between your station deciding it\\
 +will get it from there, and it actually starting the file transfer. If that\\
 +happens it ought to trigger a new route request to reobtain the file. So\\
 +seeing this error doesn'​t necessarily mean a problem, but if it keeps\\
 +happening and you never receive the files you are trying to get, then there\\
 +could be one.\\
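
To tell an occasional cache expiry from a persistent problem, the errors can be counted per file. A rough bash/awk sketch, assuming each error occupies a single log line of the form `... 550 550 <path>: not a plain file.`; the function name `count_550_by_file` is hypothetical.

```shell
# Count "not a plain file" errors per remote path in a log read from stdin.
# Output: "<count> <path>", most frequent first.
count_550_by_file() {
    awk '/550 550 .*: *not ?a plain file/ {
             sub(/^.*550 550 +/, ""); sub(/: *not ?a plain file.*$/, "")
             n[$0]++
         }
         END { for (p in n) print n[p], p }' | sort -rn
}
```

A path whose count keeps growing while the file never arrives is the case the comment above warns about.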
 +====== Chapter 6: prd stager died at ccin2p3-grid2: status 134 ======
 +
 +<​code>​
 +Solution: this problem was caused by reaching a limit on
 +    the number of IPC semaphores. I'm not sure what caused it to happen here:
 +    none of the other machines running stations and stagers that I've looked at
 +    seem to be leaking semaphores, but if it happens again,
 +    you can list the current semaphores with
 +    'ipcs -s', and remove them with 'ipcrm sem <semid>'
 +    ('ipcrm -s <semid>' in current util-linux).
 +</​code>​
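
The "semget: No space left on device" message in the mail below corresponds to ENOSPC, which semget returns when the system limit on semaphore sets (SEMMNI) or on total semaphores (SEMMNS) would be exceeded. A small bash sketch for comparing current usage against the limit on Linux; the function names are made up, and the optional file argument of `semmni_limit` is there only for testing.

```shell
# Summarize System V semaphore arrays per owner from `ipcs -s` output on stdin.
# Data rows are recognized by a key column starting with 0x.
sem_arrays_by_owner() {
    awk '$1 ~ /^0x/ { n[$3]++; total++ }
         END { for (o in n) print o, n[o]; print "total", total + 0 }'
}

# Print SEMMNI (max number of semaphore arrays), the 4th field of
# /proc/sys/kernel/sem on Linux.
semmni_limit() {
    awk '{ print $4 }' "${1:-/proc/sys/kernel/sem}"
}
```

Possible use: `ipcs -s | sem_arrays_by_owner; semmni_limit`. If the total is close to the limit, the owner column shows which account holds the arrays.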
 +<​code>​
 +Change: description:​ ""​ -> "​stager at station ccin2p3-grid2 is dying with the error message sent by mail:
 +> > Subject: prd stager died at ccin2p3-grid2:​ status 134
 +> >    ​
 +> > New mail will not be send if the server crashes again
 +> > within the same hour.
 +> > Sam Log Server is invalid or busted: Timed out waiting for ACK from server^G:
 +> > 11/02/06 19:11:08 ccin2p3-grid2.Stager@ccd01.in2p3.fr:Stager 2028: In constructor
 +> > Stager constructor:​ semget: No space left on device
 +> > Non-compliant application error detected:
 +> > Object was deleted without an object reference count of zero
 +> > stagerng: ../../include/OB/Basic.h:794: virtual OBRefCount::~OBRefCount(): Assertion `(int)(ref_ == 0)' failed.
 +
 +</​code>​
 +
  
  • errors_with_samgrid_lcg.txt
 • Last modified: 2016/12/16 10:15
 • (external edit)