Workshop on Swiss Tier-2 (11.6.07)
====================================
PK, DF, SH, SG, TG, AU, SM, CG, ZC.
P. Kunszt, D. Feichtinger, Sigve Haug, Szymon Gadomski, Tom Guptill,
Alessandro Usai, Sergio Maffioletto, C. Grab, Zhiling Cheng

See http://twiki.cscs.ch/bin/view/LCGTier2/PhoenixMeeting20070611


I. Reports from T2 and T3
*************************

1) Tier-2
=========
* Phoenix procurement: legal office ok; cooling still missing (700 kCHF),
  needs another round of the ETH-board meeting (20.6.).
* dCache: still a few small problems.
* integration of old Phoenix and SUN nodes: cfengine handles them all.
* one thumper crash happened: patch upgrades are done now.

Staffing: Tom: CSCS sysadmin team; Alessandro: dCache and services;
Sergio: helping to bridge to Alessandro; Peter: management and help.
==> 2 FTE for Phoenix exclusively. [Interviews on 18.6.]

DPM: less scalable, but may be easier for a Tier-3.
dCache: scalable, but heavier on management.
.............................................................

2) Tier-3
=========

Geneva:
-------
2 clusters (new and old) running NorduGrid (NG);
total: 26 TB and 108 cores, workers with 1 GB/core.
- direct link to CERN.
- runs NorduGrid batch; special queues for CH-ATLAS.
- interactive local login (lxplus-like), supporting the ATLAS SW: VITAL!
  (but not desktops or laptops.)
- uses CERN /afs/cern.ch/ for local purposes.
- SE thumper crashed running SLC4 under highest load (I/O through the
  1 Gbit link to CERN); running Solaris instead works much better.
- upgrade until Sep: to 176 cores and 60 GB.
Some 10 power users now.

Bern:
-----
2 clusters:
* Bern ATLAS T3 = BAC (now 17 cores; up to 40 GB and 50 cores in 08), and
* Ubelix (up to 152 cores).
Both have 1 GB/core; the first jobs needed more than 1 GB/core!
The current setup (Torque, ARC, NFS) does not scale.
Some 4 power users now.

Zuerich:
--------
CMS:
----
ZH/Hobg: 3 + 4 servers with dual-core Intels + 6 TB each; 6 cores + 18 TB;
next week: 14 CPUs + 42 TB;
planned: common ZH Tier-3 at PSI.

LHCb:
-----
ZH Phys-Institut: 20 cores + 10 TB by end 2007; openSUSE now...
Matterhorn: Opteron cluster, SUSE, 500 cores; LHCb uses spare cycles.
- Uni would match PHYS buying cores into Matterhorn;
- all running DIRAC;
- some problems with VOMS and certificates (not only locally?).

Lausanne: no information ... *** ACTION *** interrogate them!
.............................................................

II. Data transfer and data handling of the 3 experiments
*********************************************************

CMS data transfer:
------------------
Data transfer and "physics+production" information are dealt with
separately: a) by PhEDEx, and b) by DBS2.

For a), the workflow is:
1. I select a set (containing all files of a "physical set") from masks.
2. I send the request with the filename to the CMS-CH contact = Derek.
3. He initiates the request (through the selection page, if space allows)
   to the central CMS admin, who can approve the transfer.
Trivial File Catalog (TFC): only a simple rule is applied to derive the
physical filename from the logical one.
Check in: https://cmsdoc.cern.ch:8443/cms/aprom/phedex/prod/Info::Main?view=global

For b): more information about the data themselves is found in DBS2,
http://cmsdbs.cern.ch/

Deleting files: only the CH contact (DF) is allowed to do so now.
Q: can we have "backed-up" space for small data sets at CSCS?
For the moment we prefer to keep/copy these at the Tier-3.

ATLAS:
------
Uses Distributed Data Management (DDM), implemented by Don Quijote 2 (DQ2),
written in Python, with scripts at T2 and T1 driving FTS. No web interface yet.
Writing files, i.e. requesting transfers T1 -> T2: everybody can do this now
(see the sketch below).
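
As a rough illustration of what such a request boils down to (DQ2 records a
dataset subscription centrally, and the site scripts at T1/T2 turn it into FTS
transfer jobs, as described above), here is a minimal Python sketch. All names
in it (surl_for, submit_fts_job, the SE endpoints) are hypothetical
placeholders, not the real DQ2 or FTS interfaces.

  # Hypothetical sketch of the subscription model described above: a dataset
  # subscription is recorded centrally, and a script at the destination site
  # turns it into one FTS job per dataset.  None of these names are the real
  # DQ2 or FTS interfaces.

  SOURCE_SE = "srm://t1-storage.example.ch"        # assumed Tier-1 endpoint
  DEST_SE   = "srm://se03-lcg.projects.cscs.ch"    # Phoenix SE from the notes

  def surl_for(se, lfn):
      """Map a logical filename to a storage URL on an SE (placeholder rule)."""
      return se + "/pnfs/data" + lfn

  def submit_fts_job(pairs):
      """Stand-in for an FTS submission on the T1 -> CSCS channel: it only
      prints the (source, destination) pairs it would queue."""
      for src, dst in pairs:
          print("would transfer", src, "->", dst)
      return "fts-job-0001"

  def transfer_dataset(subscription):
      """Build one (source, destination) SURL pair per file in the subscribed
      dataset and hand the whole list to FTS, which schedules the copies."""
      pairs = [(surl_for(SOURCE_SE, lfn), surl_for(DEST_SE, lfn))
               for lfn in subscription["files"]]
      return submit_fts_job(pairs)

  if __name__ == "__main__":
      demo = {"dataset": "some.atlas.dataset",
              "files": ["/atlas/file1.root", "/atlas/file2.root"]}
      print(transfer_dataset(demo))

The point is only the division of labour: the request itself is lightweight,
while FTS does the actual scheduling (cf. section III below).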
Deleting files: it is unclear who is allowed to.

LHCb:
-----
- data distribution is done centrally, and the job tends to go to the data.
- data transfers are initiated by DIRAC jobs, e.g.
  dirac-rm-copy local-filename [path-where-to-go]
- data: one copy of the RAW goes to CERN-T0 *AND* one to a Tier-1;
  reconstruction and stripping are done at the Tier-1.
* data comes to Tier-x when it is produced, NOT when it is requested.
* always keep 2 copies of the data locally (at all Tier-1s and some Tier-2s).
...........................................................................

dCache:
-------
- works with cells (= subservices) grouped into domains (one JVM each).
- everything is Java based and highly configurable.
- two SEs: se03-lcg.projects.cscs.ch and se02-lcg.....
- problems seen: timeouts, authentication (wrong VO tags...), domain
  duplication, obscure logging, ...
See http://storage01-lcg.projects.cscs.ch:2288/

DPM: older code; only one developer.
...........................................................................

III. Data transfer:
===================
* Layers in networking:

    Expt./PhEDEx -> FTS -> SRM -> GridFTP -> TCP
    <----------------->   <--->  <--------------->
       asynchronous       synch.   asynchronous

  FTS controls channels and bandwidth = a "batch system for networking".

  TCP: kernel settings are often wrong; the required buffer size is

      buffer size = RTT (round-trip time) x max. bandwidth

  because a byte in the local buffer is only deleted once it has been
  acknowledged by the receiver, i.e. after one RTT (see the worked
  example below).

* Data to a T2 passes CERN -> T1 -> CERN -> T2.
............................................
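
To make the buffer-size rule above concrete, a small worked example; the RTT
and link speed are assumed values for illustration, not measurements:

  # Bandwidth-delay product: the TCP send buffer must hold everything that has
  # been sent but not yet acknowledged, i.e. one round-trip time's worth of
  # data at full rate.  RTT and link speed below are assumptions.

  rtt_s    = 0.020          # ~20 ms round trip within Europe (assumed)
  link_bps = 1e9            # 1 Gbit/s link, as mentioned for the SEs

  buffer_bytes = rtt_s * link_bps / 8
  print("required TCP buffer: %.1f MB" % (buffer_bytes / 1e6))   # -> 2.5 MB

Default kernel buffers are typically only a few hundred kB, so a single stream
cannot fill such a link unless the settings are tuned; this is presumably what
"kernel settings often wrong" refers to.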