Last modified: Apr 08, 2008 by Kurca\\
\\

====== Neter MC-production ======

\\
\\

====== Chapter 1: Introduction ======

Log in as user **d0prod** on the **ccd01**.in2p3.fr node!
<code>
neter
neter_v7

cd D0runjob
source config_bash

requestRun
</code>
... gives the list of running requests known to the local DB.
====== Chapter 2: Get a new request ======

Check the current MC requests on the DZero web page\\
http://www-d0.fnal.gov/computing/mcprod/mcc.html .\\
To take a request (make a reservation) for production at CCIN2P3, do:
<code>
cd Production
Queue.new.py
  or
Queue.new.py --requestid=<requestid>
Do you want to run this request (answer yes): yes
</code>
Note the number of events to be produced (add about 10% more as a safety margin).
====== Chapter 3: Prepare a new request locally ======

<code>
cd ~/req
</code>
Select an existing requestID with a process and number of events close to the new requirement.
<code>
newReq_no_queue.sh <old requestid> <requestid>
</code>
The new macros and the differences with respect to <old requestid> are shown. Edit/change all macros accordingly.\\
\\
Attention! The required CardfileVersion (e.g. v00-09-33) must be present in $THRONG_DIR/dist/packages/cardfiles.\\
If it is missing, install it before finishing the newReq_no_queue.sh procedure.\\
\\
Example:
<code>
setup d0cvs
cvs co -r v00-09-33 cardfiles
Enter passphrase for RSA key '/afs/in2p3.fr/home/l/lebrun/.ssh/identity':
mv cardfiles v00-09-33 ?????
</code>

Generally you have to modify **pythia-gen.macro** (cfg pythia). Don't modify:\\
cfg pythia define int RunNumber <runnumber>\\
cfg pythia define int NumRecords <numrecords>\\
cfg pythia define int UseMaxopt 1\\
\\
Remove (really ?):\\
cfg pythia define int RunNumber 0\\
cfg pythia define int UseMaxopt 1\\
cfg pythia define string SamDeclareOutput on\\
\\
Modify **d0simreco.macro**. Pay attention to:\\
attach d0sim\\
cfg d0sim define string MinBiDataset\\
/afs/in2p3.fr/throng/d0/info_data/mcp17/Zerobias_p17MC_avelumi_1_rename2.py\\
\\
Modify **local_request_<requestid>.py**. Ask for about 10% more events than requested in order to have a safety margin\\
for crashing jobs.\\
"numevents/nfirstjobs" (the number of events per job) should be an integer close to 13000 and\\
a multiple of 250.
Example for 200000 events:
<code>
recphys={
        'type':'mcprod',
        'nfirstjobs':'14',
        'numevents':'210000'
        }
</code>
(For 150000 requested events: nfirstjobs 12 and numevents 153000, etc.)\\
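To double-check a candidate (numevents, nfirstjobs) pair by hand, here is a minimal shell sketch (not one of the production scripts; the example numbers are the ones from above):
<code>
# check the bookkeeping rule for a candidate (numevents, nfirstjobs) pair
numevents=210000
nfirstjobs=14
echo $(( numevents / nfirstjobs ))          # events per job, should be close to 13000
echo $(( (numevents / nfirstjobs) % 250 ))  # should print 0 (a multiple of 250)
</code>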
\\
Check:
<code>
adminphysid=[
         ['lebrun','no','ok'],
         ['kurca','no','no'],
         ['d0prod','ok','ok']
         ]
</code>
... and the options for the job types:
<code>
saminfos={
        'gen':'locate',
        'd0gstar':'locate',
        'd0sim':'declare',
        'thumb':'declare',
        'mrg':'store'
        }
</code>
At the end of this procedure do:
<code>
neter setNumberOfJobs --requestid=<requestid> --jobtype=d0gstar --number=0
</code>
This command will set all **d0gstar** jobs into //standby// state, i.e. the jobs will not run until\\
they are freed.

=====  Check Random Number Seeds  =====

Before enabling those jobs you should check that the Pythia generator did not use the same\\
random number seed twice. To do this, wait until all generator jobs have finished and the output files are\\
declared.\\
Then you can do:
<code>
request=<requestid>
neter qjob --requestid=$request
for jobname in `d0select.py --sep=" " "select jobname from
d0_admin.samjobs where requestid in $request and type in
('gen','alpgenpythia')"`
do
cmd="grep RanSeed $GROUP_DIR/prodOracle/$jobname/td_gen-*.conf"
echo $cmd
eval $cmd
done | grep RanSeed | sort
</code>
If, e.g., two files were produced with the same random number //(RanSeed2)//,\\
take one of those files and find out the corresponding job name (with the same command as\\
above, but without the grep pipe; see the sketch below).\\
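For convenience, the same loop without the final pipe (a sketch that only repeats the command quoted above) prints each grep command next to its output, so the job name that produced the duplicated seed can be read off directly:
<code>
request=<requestid>
for jobname in `d0select.py --sep=" " "select jobname from
d0_admin.samjobs where requestid in $request and type in
('gen','alpgenpythia')"`
do
# the echoed command contains the jobname for the RanSeed lines that follow it
cmd="grep RanSeed $GROUP_DIR/prodOracle/$jobname/td_gen-*.conf"
echo $cmd
eval $cmd
done
</code>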
Change the status of the found job in the Oracle DB:
<code>
oraclea
<password>
update jobs set job_status='error' where jobname='<jobname>';
commit;
exit
</code>
<code>
neter cleanErrors --requestid=<requestid>
</code>
In this case you do not need to check the used random number seeds again!
=====  Release d0gstar jobs  =====

This enables the d0gstar jobs to run. They will be submitted to the batch system by a cron job (xxxx ????).
<code>
neter setNumberOfJobs --requestid=<requestid> --jobtype=d0gstar --number=1000
</code>

====== Chapter 4: Prepare a new request with input files (Alpgen) ======

This is a request whose generator input files were prepared in a separate step.\\
\\
Example:\\
dataset=pythia_p17.06.02_qcd_emjet_Pt20_30_mcp17_part6_1\\
request=34462
<code>
neter createProject --action=alpgen --datasetname=$dataset --requestid=$request
</code>
<code>
nohup getDataset.py --action=alpgen --datasetname=$dataset --station=ccin2p3-analysis >> /tmp/${LOGNAME}_getDataset.log &
</code>
<code>
cond="requestid in (34460,34461,34462) and physid=100"
d0select.py "select requestid,filename,status,filesize,numev from
d0_admin.samjobs where $cond order by requestid"
d0select.py "select datasetname,requestid,status from d0_admin.projects
where $cond order by requestid"
</code>
<code>
neter setStatusDataset --status=ready --datasetname=$dataset
neter checkFiles
</code>
Then see: /afs/in2p3.fr/home/d/d0prod/req/Alpgen/JES_5
====== Chapter 5: Cron jobs for automatic submission and checks ======

Jobs can be declared and started on the following worker nodes (they have to be known to SAM!):\\
**ccwall01, ccwall02, ccwl0035, ccwl0099 - ccwl0107**\\
Preferred ones: ccwl0099 - ccwl0107
=====  Start run job  =====

<code>
submit_sam_survey.sh ccwl0100     (e.g.)
</code>
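To start the survey on every preferred node in one go, a minimal sketch (reusing only the script and the node list quoted above):
<code>
# start the SAM survey on each of the preferred worker nodes listed above
for node in ccwl0099 ccwl0100 ccwl0101 ccwl0102 ccwl0103 ccwl0104 ccwl0105 ccwl0106 ccwl0107
do
submit_sam_survey.sh $node
done
</code>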
=====  Start job submission  =====

In order to submit jobs, **d0survey** has to run under an ID account\\
with job submission permission.
<code>
cd ~/production
submit_survey.sh
</code>
=====  Start store jobs  =====

At least 3 of them should be running on **ccd0**!\\
Log files can be found in the /tmp directory.
<code>
cd ~/Production
sam_store_survey.sh
sam_store_survey.sh
sam_store_survey.sh
</code>
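A minimal sketch for launching the three store jobs, assuming each invocation has to be put into the background by hand; the log file name under /tmp is hypothetical (the script may well write its own log there):
<code>
cd ~/Production
for i in 1 2 3
do
# hypothetical log name; adjust if the script writes its own log in /tmp
nohup sam_store_survey.sh >> /tmp/${LOGNAME}_sam_store_survey_$i.log 2>&1 &
done
</code>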
=====  Start check jobs  =====

For the various tasks (???? which ones ????) **chkFiles** has to be running:
<code>
cd ~/Production
submit_checkFiles.sh
submit_checkSpace.sh
</code>
=====  Status of Check jobs  =====

<code>
d0select.py "select to_char(lastupdate,'DD-MM-YY HH24:MI:SS'),
name,status,owner from d0_admin.process"
</code>
Answer, e.g.:\\
['05-07-06 11:12:09', 'checkSpace', 'free', 'lebrun']\\
['09-01-04 09:38:41', 'copyToD0mino', 'free', None]\\
['05-07-06 11:16:04', 'checkFiles', 'free', 'lebrun']\\
['06-12-05 07:24:19', 'startReprocess', 'free', 'kurca']
====  Check Examples  ====

<code>
request=<requestid>
d0select.py "select requestid,status,locationstatus,type,count(*) from
d0_admin.samjobs where requestid=$request group by
requestid,status,locationstatus,type order by requestid"
d0select.py "select requestid,jobtype,job_status,count(*) from
d0_admin.jobs where requestid=$request group by
requestid,jobtype,job_status order by requestid"
</code>
Or, for a list of requests, do e.g.:
<code>
listreq="requestid in (32175,32176,32177,32178,32179,32180,32181,32182)"
d0select.py "select requestid,status,locationstatus,type,count(*) from
d0_admin.samjobs where $listreq group by
requestid,status,locationstatus,type order by requestid"
d0select.py "select requestid,jobtype,job_status,count(*) from
d0_admin.jobs where $listreq group by requestid,jobtype,job_status
order by requestid"
</code>
To check the evolution of the number of produced files (a simple monitoring loop is sketched below):
<code>
d0select.py "select requestid,status,locationstatus,type,count(*) from
d0_admin.samjobs where $listreq group by
requestid,status,locationstatus,type order by requestid" | grep declaring
</code>
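A minimal monitoring sketch (an assumption, not one of the production scripts), reusing the $listreq variable defined above; the 10-minute interval is arbitrary:
<code>
while true
do
date
d0select.py "select requestid,status,locationstatus,type,count(*) from
d0_admin.samjobs where $listreq group by
requestid,status,locationstatus,type order by requestid" | grep declaring
sleep 600    # wait 10 minutes between checks
done
</code>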

===  Check working space  ===

With the command
<code>
qspace prodOracle
</code>
you can check the $GROUP_DIR/prodOracle directory. It is important that it is cleaned up regularly.\\
The result should not come close to 35%.\\
\\
Check that //chkSpace//, run by user //lebrun//, is OK:
<code>
qjob -u lebrun -l chkSpace
</code>
You can see the log file $GROUP_DIR/prodOracle/lebrun_checkspace.log.
<code>
cd prodOracle
ls -ltr *log
</code>
<code>
-rw-r--r-- 1 lebrun d0  3495466 Apr 18 15:31 lebrun_ccwl0100_sam_survey.log
-rw-r--r-- 1 d0prod d0 61803965 Apr 18 16:08 d0prod_d0survey.log
-rw-r--r-- 1 lebrun d0 16977094 Apr 18 16:09 lebrun_d0survey.log
-rw-r--r-- 1 lebrun d0   232336 Apr 18 16:12 lebrun_checkspace.log
-rw-r--r-- 1 lebrun d0  4467215 Apr 18 16:13 lebrun_ccwl0102_sam_survey.log
-rw-r--r-- 1 lebrun d0   310138 Apr 18 16:16 lebrun_checkfiles.log
-rw-r--r-- 1 d0prod d0  3043136 Apr 18 16:20 d0prod_ccwl0101_sam_survey.log
</code>
\\
\\
It happens frequently that files remain on the scratch space.\\
They have to be removed (see also the sketch after the block below)!
<code>
cd $GROUP_DIR/prodOracle
# remove everything listed in the arch* files
for name in `cat arch*`
do
rm -rf $name
done
</code>
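Before deleting anything it can help to see how old the leftovers are; a small sketch (an assumption; the 7-day threshold is arbitrary):
<code>
cd $GROUP_DIR/prodOracle
# list job directories that have not been modified for more than 7 days
find . -maxdepth 1 -type d -mtime +7 | sort
</code>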

====== Chapter 6: Finishing the request ======

If all jobs of the request are done, do:
<code>
neter cleanRequest --requestid=$request
d0select.py "select requestid,status,locationstatus,type,count(*) from
d0_admin.samjobs where requestid=$request group by
requestid,status,locationstatus,type order by requestid"
</code>
=====  Merging  =====

If the number of produced files (events) is sufficient, all jobs are finished\\
and the thumbnail files are declared in SAM,\\
the following command creates the merging jobs:
<code>
neter formatMerger --requestid=$request
</code>
<code>
d0select.py "select requestid,status,locationstatus,type,count(*) from
d0_admin.samjobs where requestid=$request group by
requestid,status,locationstatus,type order by requestid"
d0select.py "select requestid,jobtype,job_status,count(*) from
d0_admin.jobs where requestid=$request group by
requestid,jobtype,job_status order by requestid"
</code>
=====  Missing Events  =====

If the number of produced files (events) is not sufficient, you can again\\
enable running of jobs for that request:
<code>
neter setNumberOfJobs --jobtype=d0reco --number=1000 --requestid=$request
</code>
... and/or ...
<code>
neter setNumberOfJobs --jobtype=d0gstar --number=1000 --requestid=$request
</code>

====== Chapter 7: Troubleshooting ======

=====  SAM Errors  =====

<code>
neter samCleanErrors [--requestid=$request]
</code>
... manual declaration:
<code>
neter declare [--requestid=$request]
</code>
In the case of a general problem with SAM where a lot of files have an error status,\\
do e.g.:
<code>
oraclea
<password>

update samjobs set status='pending' where requestid=<requestid> and
status='error';
</code>

====  Problems with d0declare jobs  ====

If d0declare jobs were stopped or are going to be relaunched, reset them in the Oracle DB (via //oraclea//, as above):
<code>
update samjobs set status='pending' where requestid=<requestid> and
status='declaring';
</code>
Then list the jobs in error and clean them up:
<code>
d0select.py "select jobname from d0_admin.jobs where job_status='error'
and requestid=$request and jobtype='d0gstar'"

neter cleanErrors --requestid=$request
</code>
=====  To change the CPU-time for a request ID  =====

<code>
d0select.py "select requestid,max(bastacputime),count(*) from
d0_admin.statjobs where jobname like 'd0g%' and requestid=31816 group
by requestid order by requestid"
</code>
The following command changes the max CPU-time limit used by **d0gstar** jobs\\
for request 31816.\\
\\
Based on the time found above, modify the database with:
<code>
neter changeJobsdescription --jobtype=d0gstar --cputime=600000 --requestid=31816
</code>
To modify this value for jobs already in the batch queue, e.g.:
<code>
neter qalter --type=d0gstar --requestid=31816 --arg='-u lebrun -l T=600000'
</code>

====== Chapter 8: Finishing the request - All is done ======

If the request is finished (all files are merged and stored at FNAL), do:
<code>
neter setRequestidDone --requestid=<requestid>
</code>
=====  Cleaning and Verification  =====


====  Remove individual TMB files from HPSS  ====

Log in as user **sam** on **ccd0**:
<code>
nohup neter removeThumbs > /tmp/removeThumbs.log &
</code>
<code>
listreq="requestid in
(32175,32177,32179,32181,32182,32236,32266,32185,32186,32187,32188,32189,32190,32191,
3002003,3003000,32362)"
d0select.py "select requestid,status,locationstatus,type,count(*) from
d0_admin.samjobs where $listreq group by
requestid,status,locationstatus,type order by requestid"
d0select.py "select requestid,jobtype,job_status,count(*) from
d0_admin.jobs where $listreq group by requestid,jobtype,job_status
order by requestid"
</code>
<code>
neter setNumberOfJobs --jobtype=d0gstar --number=1000 --requestid=32362
request=32362
d0select.py "select requestid,status,locationstatus,type,count(*) from
d0_admin.samjobs where requestid=$request group by
requestid,status,locationstatus,type order by requestid"
d0select.py "select requestid,jobtype,job_status,count(*) from
d0_admin.jobs where requestid=$request group by
requestid,jobtype,job_status order by requestid"
</code>
<code>
d0select.py "select requestid,max(bastacputime),count(*) from
d0_admin.statjobs where jobname like 'd0g%' and requestid=32362 group
by requestid order by requestid"
</code>
<code>
neter changeJobsdescription --jobtype=d0gstar --requestid=$request
neter changeJobsdescription --jobtype=d0gstar --cputime=600000 --requestid=$request
</code>

====== Annex A: Some useful commands ======

Tibor,\\
\\
could you follow requests 34460, 34461 and 34462; careful, each of them\\
uses one single, identical Pythia input file.\\
\\
I regularly release the jobs for 34460 and 34462 with the command below.\\
\\
\\
\\
Enable running of 50 d0gstar jobs for request ID 34460:
<code>
neter setNumberOfJobs --jobtype=d0gstar --number=50 --requestid=34460
</code>
On the other hand, all jobs of 34461 were submitted by mistake; I put them\\
on hold and I release about a hundred of them with the following command, incrementing\\
the last digit by 1 each time (as d0prod):
<code>
for name in `neter qjob --requestid=34461 | grep HOLD | awk '{print $2}'
| sort | grep 3381 `
do
echo $name
qrls $name
done
</code>

  