Workshop on Swiss Tier-2 (11.6.07)
====================================
PK, DF, SH, SG, TG, AU, SM, CG, ZC.
P. Kunszt, D. Feichtinger, Sigve Haug, Szymon Gadomski, Tom Guptill,
Alessandro Usai, Sergio Maffioletto, C. Grab, Zhiling Cheng

See http://twiki.cscs.ch/bin/view/LCGTier2/PhoenixMeeting20070611


I. Reports from T2 and T3
*************************

1) Tier-2
=========
* Phoenix procurement: legal office ok; cooling still missing (700 kCHF),
  needs another round of the ETH-board meeting (20.6.).
* dCache: still a few small problems.
* integration of old Phoenix and SUN nodes: cfengine handles them all.
* one thumper crash happened: patch upgrades are done now.

Staffing: Tom: CSCS sysadmin team; Alessandro: dCache and services;
Sergio: helping to bridge to Alessandro; Peter: management and help.
==> 2 FTE for Phoenix exclusively. [Interviews on 18.6.]

DPM: less scalable, but may be easier for a Tier-3.
dCache: scalable, but heavier on management.
.............................................................

2) Tier-3
=========

Geneva:
-------
2 clusters (new and old) running NorduGrid (NG);
total: 26 TB and 108 cores, workers with 1 GB/core.
- direct link to CERN.
- runs NorduGrid batch; special queues for CH-ATLAS.
- interactive local login (lxplus-like), supporting the ATLAS SW: VITAL!
  (but not desktops or laptops.)
- uses CERN /afs/cern.ch/ for local purposes.
- SE thumper crashed running SLC4 under highest load (I/O through the
  1 Gbit link to CERN); running Solaris instead works much better.
- upgrade until Sep: to 176 cores and 60 GB.
Some 10 power users now.

Bern:
-----
2 clusters:
* Bern ATLAS T3 = BAC (now 17 cores; up to 40 GB and 50 cores in 08), and
* Ubelix (up to 152 cores).
Both have 1 GB/core; the first jobs needed more than 1 GB/core!
The current setup (Torque, ARC, NFS) does not scale.
Some 4 power users now.

Zuerich:
--------
CMS:
----
ZH/Hobg: 3 + 4 servers with dual-core Intels + 6 TB each; 6 cores + 18 TB;
next week: 14 CPUs + 42 TB;
planned: common ZH Tier-3 at PSI.

LHCb:
-----
ZH Phys-Institut: 20 cores + 10 TB by end 2007; openSUSE now...
Matterhorn: Opteron cluster, SUSE, 500 cores; LHCb uses spare cycles.
- Uni would match PHYS buying cores into Matterhorn;
- all running DIRAC;
- some problems with VOMS and certificates (not only locally?).

Lausanne: no information ... *** ACTION *** interrogate them!
.............................................................

II. Data transfer and data handling of the 3 experiments
*********************************************************

CMS data transfer:
------------------
Data transfer and "physics+production" information are dealt with
separately: a) by PhEDEx, and b) by DBS2.

For a), the workflow is:
1. I select a set (containing all files of a "physical set") from masks.
2. I send the request with the filename to the CMS-CH contact = Derek.
3. He initiates the request (through the selection page, if space allows)
   to the central CMS admin, who can approve the transfer.
Trivial File Catalog (TFC): only a simple rule is applied to derive the
physical filename from the logical one.
Check in: https://cmsdoc.cern.ch:8443/cms/aprom/phedex/prod/Info::Main?view=global

For b): more information about the data themselves is found in DBS2,
http://cmsdbs.cern.ch/

Deleting files: only the CH contact (DF) is allowed to do so now.
Q: can we have "backed-up" space for small data sets at CSCS?
For the moment we prefer to keep/copy these at the Tier-3.

ATLAS:
------
Uses Distributed Data Management (DDM), implemented by Don Quijote 2 (DQ2),
written in Python, with scripts at T2 and T1 driving FTS. No web interface yet.
Writing files, i.e. requesting transfers T1 -> T2: everybody can do this now
(see the sketch below).
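
As a rough illustration of what such a request boils down to (DQ2 records a
dataset subscription centrally, and the site scripts at T1/T2 turn it into FTS
transfer jobs, as described above), here is a minimal Python sketch. All names
in it (surl_for, submit_fts_job, the SE endpoints) are hypothetical
placeholders, not the real DQ2 or FTS interfaces.

  # Hypothetical sketch of the subscription model described above: a dataset
  # subscription is recorded centrally, and a script at the destination site
  # turns it into one FTS job per dataset.  None of these names are the real
  # DQ2 or FTS interfaces.

  SOURCE_SE = "srm://t1-storage.example.ch"        # assumed Tier-1 endpoint
  DEST_SE   = "srm://se03-lcg.projects.cscs.ch"    # Phoenix SE from the notes

  def surl_for(se, lfn):
      """Map a logical filename to a storage URL on an SE (placeholder rule)."""
      return se + "/pnfs/data" + lfn

  def submit_fts_job(pairs):
      """Stand-in for an FTS submission on the T1 -> CSCS channel: it only
      prints the (source, destination) pairs it would queue."""
      for src, dst in pairs:
          print("would transfer", src, "->", dst)
      return "fts-job-0001"

  def transfer_dataset(subscription):
      """Build one (source, destination) SURL pair per file in the subscribed
      dataset and hand the whole list to FTS, which schedules the copies."""
      pairs = [(surl_for(SOURCE_SE, lfn), surl_for(DEST_SE, lfn))
               for lfn in subscription["files"]]
      return submit_fts_job(pairs)

  if __name__ == "__main__":
      demo = {"dataset": "some.atlas.dataset",
              "files": ["/atlas/file1.root", "/atlas/file2.root"]}
      print(transfer_dataset(demo))

The point is only the division of labour: the request itself is lightweight,
while FTS does the actual scheduling (cf. section III below).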
Deleting files: it is unclear who is allowed to.

LHCb:
-----
- data distribution is done centrally, and the job tends to go to the data.
- data transfers are initiated by DIRAC jobs, e.g.
  dirac-rm-copy local-filename [path-where-to-go]
- data: one copy of the RAW goes to CERN-T0 *AND* one to a Tier-1;
  reconstruction and stripping are done at the Tier-1.
* data comes to Tier-x when it is produced, NOT when it is requested.
* always keep 2 copies of the data locally (at all Tier-1s and some Tier-2s).
...........................................................................

dCache:
-------
- works with cells (= subservices) grouped into domains (one JVM each).
- everything is Java based and highly configurable.
- two SEs: se03-lcg.projects.cscs.ch and se02-lcg.....
- problems seen: timeouts, authentication (wrong VO tags...), domain
  duplication, obscure logging, ...
See http://storage01-lcg.projects.cscs.ch:2288/

DPM: older code; only one developer.
...........................................................................

III. Data transfer:
===================
* Layers in networking:

    Expt./PhEDEx -> FTS -> SRM -> GridFTP -> TCP
    <----------------->   <--->  <--------------->
       asynchronous       synch.   asynchronous

  FTS controls channels and bandwidth = a "batch system for networking".

  TCP: kernel settings are often wrong; the required buffer size is

      buffer size = RTT (round-trip time) x max. bandwidth

  because a byte in the local buffer is only deleted once it has been
  acknowledged by the receiver, i.e. after one RTT (see the worked
  example below).

* Data to a T2 passes CERN -> T1 -> CERN -> T2.
............................................
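
To make the buffer-size rule above concrete, a small worked example; the RTT
and link speed are assumed values for illustration, not measurements:

  # Bandwidth-delay product: the TCP send buffer must hold everything that has
  # been sent but not yet acknowledged, i.e. one round-trip time's worth of
  # data at full rate.  RTT and link speed below are assumptions.

  rtt_s    = 0.020          # ~20 ms round trip within Europe (assumed)
  link_bps = 1e9            # 1 Gbit/s link, as mentioned for the SEs

  buffer_bytes = rtt_s * link_bps / 8
  print("required TCP buffer: %.1f MB" % (buffer_bytes / 1e6))   # -> 2.5 MB

Default kernel buffers are typically only a few hundred kB, so a single stream
cannot fill such a link unless the settings are tuned; this is presumably what
"kernel settings often wrong" refers to.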