Differences

Below are the differences between two revisions of the page.

reprocessing_p20 [2016/12/16 10:15] (current version)
Line 1: Line 1:
 +Modified by Kurca, on 17 Feb 2014\\
 +\\
 +
 +====== Reprocessing p20 ======
 +
 +\\
 +\\
 +
 +====== Chapter 1: GRAM_JOB_MGR_xyzw.log Filling /tmp ======
 +
 +GLOBUS_GRAM_MANAGER fills up the default /tmp directory with log files for running jobs.\\
 +In particular, many parallel merge+recocert jobs run for several days, and their log files remain\\
 +in /tmp until the jobs have finished. Individual files grow to 20 MB and more.\\
 +...\\
 +The following is written every 10 s:
 +<code>
 +[root@ccd01 /tmp]$ less gram_job_mgr_19934.log
 +2/19 09:55:09 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
 +2/19 09:55:09 JMI: testing job manager scripts for type samgrid exist and permissions are ok.
 +2/19 09:55:09 JMI: completed script validation: job manager type is samgrid.
 +2/19 09:55:09 JMI: in globus_gram_job_manager_poll()
 +2/19 09:55:09 JMI: local stdout filename = /samgrid/gass-cache/local/md5/81/fcc6a115a3cc833d6acbcd5df85eb9/md5/ae/a2671ad0369634a4caa825f046b2ee/data.
 +2/19 09:55:09 JMI: local stderr filename = /samgrid/gass-cache/local/md5/81/fcc6a115a3cc833d6acbcd5df85eb9/md5/44/e7cb21a323ef548bf92cd7c7796e6b/data.
 +2/19 09:55:09 JMI: poll: seeking: https://ccd01.in2p3.fr:49651/19934/1171816701/
 +2/19 09:55:09 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
 +2/19 09:55:09 JMI: cmd = poll
 +2/19 09:55:09 JMI: returning with success
 +Mon Feb 19 09:55:09 2007 JM_SCRIPT: New Perl JobManager created.
 +Mon Feb 19 09:55:09 2007 JM_SCRIPT: polling job 19934.1171816701
 +2/19 09:55:09 JMI: while return_buf = GRAM_SCRIPT_JOB_STATE = 2
 +2/19 09:55:09 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
 +2/19 09:55:19 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
 +2/19 09:55:19 JMI: testing job manager scripts for type samgrid exist and permissions are ok.
 +2/19 09:55:19 JMI: completed script validation: job manager type is samgrid.
 +2/19 09:55:19 JMI: in globus_gram_job_manager_poll()
 +2/19 09:55:19 JMI: local stdout filename = /samgrid/gass-cache/local/md5/81/fcc6a115a3cc833d6acbcd5df85eb9/md5/ae/a2671ad0369634a4caa825f046b2ee/data.
 +2/19 09:55:19 JMI: local stderr filename = /samgrid/gass-cache/local/md5/81/fcc6a115a3cc833d6acbcd5df85eb9/md5/44/e7cb21a323ef548bf92cd7c7796e6b/data.
 +2/19 09:55:19 JMI: poll: seeking: https://ccd01.in2p3.fr:49651/19934/1171816701/
 +2/19 09:55:19 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
 +
 +</code>
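To see which job manager logs are taking up space, a small helper along the following lines could be used. This is only a sketch: the function name `list_gram_logs` and the `min_bytes` threshold are illustrative assumptions, and the only thing taken from the page is the file name pattern `gram_job_mgr_*.log`.

```python
from pathlib import Path


def list_gram_logs(directory, min_bytes=0):
    """Return (size_in_bytes, path) pairs for gram_job_mgr_*.log files, largest first."""
    logs = []
    # Only the top level of the directory is scanned (no recursion).
    for path in Path(directory).glob("gram_job_mgr_*.log"):
        size = path.stat().st_size
        if size >= min_bytes:
            logs.append((size, path))
    # Largest files first, so the worst offenders appear at the top.
    logs.sort(key=lambda item: item[0], reverse=True)
    return logs
```

For example, `list_gram_logs("/tmp", min_bytes=1_000_000)` would list every job manager log in /tmp of at least 1 MB.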
 +Set up the VDT environment and check the Globus location:\\
 +\\
 +setup vdt\\
 +echo $GLOBUS_LOCATION\\
 +\\
 +Then, in $GLOBUS_LOCATION/libexec/globus-script-initializer, redefine\\
 +tmpdir=/tmp\\
 +to another directory, e.g. /other/gram_log
 +<code>
 +sam@ccd01:tcsh[216] pwd
 +/d0products/products/prd/vdt/v1_1_14_13/Linux/globus/libexec
 +
 +sam@ccd01:tcsh[217] less globus-script-initializer
 +
 +exec_prefix=${GLOBUS_LOCATION}
 +prefix=$GLOBUS_LOCATION
 +sbindir=${exec_prefix}/sbin
 +bindir=${exec_prefix}/bin
 +libdir=${exec_prefix}/lib
 +libexecdir=${exec_prefix}/libexec
 +includedir=${exec_prefix}/include
 +datadir=${prefix}/share
 +sysconfdir=${prefix}/etc
 +sharedstatedir=${prefix}/com
 +localstatedir=${prefix}/var
 +tmpdir=/tmp
 +local_tmpdir=/samgrid/jim/jim_tmp
 +secure_tmpdir=${local_tmpdir}
 +
 +DELIM=
 +if [ -n "${LD_LIBRARY_PATH}" ]; then
 +    DELIM=:
 +fi
 +LD_LIBRARY_PATH="${GLOBUS_LOCATION}/lib${DELIM}${LD_LIBRARY_PATH}"
 +
 +DELIM=
 +if [ -n "${SHLIB_PATH}" ]; then
 +    DELIM=:
 +fi
 +SHLIB_PATH="${GLOBUS_LOCATION}/lib${DELIM}${SHLIB_PATH}"
 +
 +DELIM=
 +if [ -n "${SASL_PATH}" ]; then
 +    DELIM=:
 +fi
 +SASL_PATH="${GLOBUS_LOCATION}/lib/sasl${DELIM}${SASL_PATH}"
 +
 +export LD_LIBRARY_PATH LIBPATH SHLIB_PATH SASL_PATH
 +
 +if [ -n "${LD_LIBRARYN32_PATH}" ]; then
 +    DELIM=""
 +    if [ "X${LD_LIBRARYN32_PATH}" != "X" ]; then
 +        DELIM=:
 +    fi
 +    LD_LIBRARYN32_PATH="${GLOBUS_LOCATION}/lib${DELIM}${LD_LIBRARYN32_PATH}"
 +    export LD_LIBRARYN32_PATH
 +fi
 +
 +if [ -n "${LD_LIBRARY64_PATH}" ]; then
 +    DELIM=""
 +    if [ "X${LD_LIBRARY64_PATH}" != "X" ]; then
 +        DELIM=:
 +    fi
 +    LD_LIBRARY64_PATH="${GLOBUS_LOCATION}/lib${DELIM}${LD_LIBRARY64_PATH}"
 +    export LD_LIBRARY64_PATH
 +fi
 +
 +globus_source () {
 +
 +  # Check if file exists and source it
 +  if [ ! -f "$1" ] ; then
 +     ${GLOBUS_SH_PRINTF-printf} "$1 not found.
 +" >&2
 +     exit 1
 +  fi
 +
 +  . "$1"
 +}
 +
 +
 +
 +
 +if [ -z "${LIBPATH}" ]; then
 +    LIBPATH="/usr/lib:/lib"
 +fi
 +LIBPATH="${GLOBUS_LOCATION}/lib:${LIBPATH}"
 +
 +</code>
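The redefinition of `tmpdir` described above can be sketched as a text substitution. This is a hedged illustration, not a tool shipped with VDT or Globus: the function name `redefine_tmpdir` is hypothetical, it works on the script text only (it does not write any file), and the target directory `/other/gram_log` is the example value from this page.

```python
import re

# A few lines copied from the globus-script-initializer listing on this page.
SAMPLE_LINES = (
    "localstatedir=${prefix}/var\n"
    "tmpdir=/tmp\n"
    "local_tmpdir=/samgrid/jim/jim_tmp\n"
    "secure_tmpdir=${local_tmpdir}\n"
)


def redefine_tmpdir(script_text, new_tmpdir):
    """Return script_text with its single 'tmpdir=' assignment set to new_tmpdir."""
    # Anchor at the start of a line so local_tmpdir= and secure_tmpdir= are left alone.
    pattern = re.compile(r"^tmpdir=.*$", flags=re.MULTILINE)
    # A function is used as the replacement so new_tmpdir is inserted literally.
    new_text, count = pattern.subn(lambda match: f"tmpdir={new_tmpdir}", script_text)
    if count != 1:
        raise ValueError(f"expected exactly one tmpdir= line, found {count}")
    return new_text


print(redefine_tmpdir(SAMPLE_LINES, "/other/gram_log"))
```

Raising an error when the line is missing or duplicated is a deliberate choice here: silently changing nothing would leave the logs in /tmp without any warning.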
 +Open question: where is the environment variable grami_logfile defined?\\
 +For reference, the contents of\\
 +$GLOBUS_LOCATION/etc/globus-job-manager.conf
 +<code>
 +sam@ccd01:tcsh[210] less globus-job-manager.conf
 +        -home "/d0products/products/prd/vdt/v1_1_14_13/Linux/globus"
 +        -globus-gatekeeper-host ccd01.in2p3.fr
 +        -globus-gatekeeper-port 2119
 +        -globus-gatekeeper-subject "unavailable at time of install"
 +        -scratch-dir-base /samgrid/globus_scratch
 +        -cache-location /samgrid/gass-cache
 +        -globus-host-cputype i686
 +        -globus-host-manufacturer pc
 +        -globus-host-osname Linux
 +        -globus-host-osversion 2.4.21-32.0.1.EL.XFSsmp
 +        -save-logfile on_error
 +        -state-file-dir /d0products/products/prd/vdt/v1_1_14_13/Linux/globus/tmp/gram_job_state
 +        -machine-type unknown
 +sam@ccd01:tcsh[211] pwd
 +/d0products/products/prd/vdt/v1_1_14_13/Linux/globus/etc
 +
 +</code>
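When hunting for log-related settings, it can help to read the option/value pairs of globus-job-manager.conf into a dictionary. The sketch below assumes the simple `-option value` layout seen in the listing above; it is not the parser used by Globus itself, and the name `parse_job_manager_conf` is hypothetical.

```python
import shlex

# A few lines copied from the globus-job-manager.conf listing on this page.
SAMPLE_CONF = """
        -globus-gatekeeper-host ccd01.in2p3.fr
        -globus-gatekeeper-subject "unavailable at time of install"
        -save-logfile on_error
        -machine-type unknown
"""


def parse_job_manager_conf(text):
    """Parse '-option value' pairs into a dict; options without a value map to None."""
    # shlex.split honours the double quotes around multi-word values.
    tokens = shlex.split(text)
    options = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not token.startswith("-"):
            raise ValueError(f"value without a preceding option: {token!r}")
        name = token[1:]
        # The next token is the value unless it is itself an option.
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            options[name] = tokens[i + 1]
            i += 2
        else:
            options[name] = None
            i += 1
    return options


options = parse_job_manager_conf(SAMPLE_CONF)
# Show every option whose name mentions "log".
print({name: value for name, value in options.items() if "log" in name})
```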
 +
 +====== Chapter 2: Minutes from Reprocessing Meetings ======
 +
 +===== 12-Sep-2006 =====
 +
 +<code>
 +- Recap:
 +    - Start date envisioned: 31st October.
 +    - ~300M events.
 +    - Duration: 2000-3000 CPUs for 2-3 months.
 +- Skimming inside SamGrid won't be available.
 +- Option to get rid of bad tick data.
 +    - Writing "dummy events" would save CPU and keep the infrastructure happy.
 +    - Minimum information to be written needs to be defined.
 +      (Brad will take care of that.)
 +- Inclusion of OSG needs investigation of the functionality of the
 +  forwarding mechanism. A time estimate of the work needed is required.
 +  A special meeting on this is to be organised.
 +- The participating sites should be defined asap.
 +
 +Attendees: Daniel, Mike, Brad, Gavin, Frederic, Parag, Laurent, Michel,
 +Tom, Herbert, Joel
 +</code>
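The recap figures (~300M events, 2000-3000 CPUs, 2-3 months) imply a rough CPU-time budget per event. The arithmetic below is only a back-of-the-envelope sketch under loud assumptions: a month is counted as 30 days and every CPU is assumed busy for the whole period. It gives an upper bound on the time available per event, not a measured reconstruction time.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # assumption: a month is counted as 30 days
EVENTS = 300e6       # ~300M events, from the recap


def cpu_seconds_per_event(cpus, months, events):
    """CPU seconds available per event if all CPUs run for the whole period."""
    total_cpu_seconds = cpus * months * DAYS_PER_MONTH * SECONDS_PER_DAY
    return total_cpu_seconds / events


# Low end: 2000 CPUs for 2 months. High end: 3000 CPUs for 3 months.
low = cpu_seconds_per_event(cpus=2000, months=2, events=EVENTS)
high = cpu_seconds_per_event(cpus=3000, months=3, events=EVENTS)
print(f"CPU budget per event: {low:.1f} s to {high:.1f} s")
```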
 +===== 19-Sep-2006 =====
 +
 +<code>
 +- Writing bad ticks as dummy events is accepted.
 +- OSG forwarding mechanism was working for data processing in v5.
 +  No code development on the forwarding mechanism is needed.
 +  Deployment work is needed (to install v7) -> 1 FTE week.
 +- Participating sites: d0farm + 1 big samgrid site + OSG + MSU (~150 CPUs).
 +
 +Attendees:
 +Daniel, Sergio, Gabriele, Peter, Mike, Parag, Tom, Andrew, Brad, Herb,
 +Gavin, Michel, Joel
 +
 +</code>
 +===== 26-Sep-2006 =====
 +
 +<code>
 +Amount of data to be reprocessed (info from Mike D):
 +Currently we have 300M events on tape and, at the current rate of data
 +taking, we will have about 400M events by the end of October.
 +Each day of delayed reprocessing adds another 3M events.
 +
 +Status of basic tools:
 +d0repro tools      - ready (Daniel)
 +samgrid            - ready to test, minor upgrades may be needed
 +d0reco             - waiting for final version and calorimeter calibration
 +d0db-proxy servers - should work, but tests needed!
 +
 +v7 deployment:
 +    The big SAMGrid site participating in reprocessing will be CCIN2P3 Lyon.
 +    A completely new SAM station is being installed; first tests
 +    and certification will follow.
 +
 +Data distribution/prestaging:
 +  - worries about the ability to feed ~2000 CPUs with data
 +  - importance of early dataset definition and prestaging
 +    on the processing sites stressed
 +  - for OSG sites, multiple caches close to the CPUs could be a solution;
 +    this needs more detailed investigation and knowledge of where the CPU
 +    resources are located (Gabriele)
 +
 +Attendees:
 +Sergio, Peter, Tibor, Todd, Tom, Frederic, Laurent, Brad, Gabriele,
 +Daniel, Parag, Gavin, Joel
 +</code>
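The "about 400M events by the end of October" figure follows from the two numbers quoted in these minutes. As a quick check, assuming the 300M count refers to the meeting date (26-Sep-2006) and the rate stays at 3M events per day:

```python
from datetime import date

EVENTS_ON_TAPE_M = 300  # million events on tape at the meeting (from the minutes)
RATE_M_PER_DAY = 3      # million new events per day (from the minutes)


def projected_events_m(meeting_day, target_day):
    """Projected number of events on tape at target_day, in millions."""
    days = (target_day - meeting_day).days
    return EVENTS_ON_TAPE_M + RATE_M_PER_DAY * days


end_of_october = projected_events_m(date(2006, 9, 26), date(2006, 10, 31))
print(end_of_october)  # 405, i.e. 300M + 35 days x 3M/day
```

The result, 405M, is consistent with the "about 400M" quoted above. The same function shows the cost of any later start date, since each extra day adds 3M events.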
 +===== 03-Oct-2006 =====
 +
 +<code>
 +We should have no trouble getting adequate bandwidth from
 +Enstore. The biggest problem will be bandwidth to remote sites.
 +
 +At this point we don't believe we will need to stage at Lyon for
 +LCG usage. We expect to have ~500 processors available.
 +
 +Ruth has been contacted regarding sites. We need to put together a
 +note describing the needed resources and forward it to the OSG council.
 +Tibor is working on this. Gabriele is also working on
 +estimates.
 +
 +The starting date has been moved to December 1st. The plan is to
 +first do a set for the JES people. We need a ~10% set with the same lum
 +profile as the rest of the data. The set has not yet been identified.
 +Everything must be done by sometime in March, preferably the
 +beginning of March.
 +
 +We should proceed with setting up sites with existing versions of
 +reco.
 +Jan Stark will be working on the cal calibration. This is what determines
 +the December 1st start date.
 +
 +Attendees:
 +Laurent, Mike, Todd, Brad, Gabriele, Parag, Andrew, Adam, Tom, Joel
 +
 +</code>
 +
  
  • reprocessing_p20.txt
 • Last modified: 2016/12/16 10:15
 • (external edit)