Researchers conducting large-scale particle physics experiments at the Large Hadron Collider (LHC) face a number of supercomputing challenges. The European Organization for Nuclear Research (CERN), which houses and manages the LHC, implemented a massive international grid computing project to support the LHC experiments. The combined storage requirement for the data and metadata produced each year runs to tens of petabytes per experiment. While the software evolution within an experiment eventually plateaus, there is an ongoing effort to modify and improve both the acquisition and analysis software throughout the life of the experiments. Upgrades tend to cause relatively major shifts in data structures as well as in software. All of these factors complicate data archival planning.

Recent examples of fruitful secondary analyses conducted on archived data from earlier particle physics experiments have strengthened the push by many funding agencies and governments to insist on deliberate efforts to archive all data taken at great national expense for possible future mining. The costs associated with data production at the LHC experiments are clearly staggering, and that creates a justifiable interest in preserving any untapped information embedded in the data, to the extent possible, in a form that would give future data miners a reasonable opportunity to retrieve it.

To that end, the LHC experiments are engaged in an effort to adopt a harmonized, if not coordinated, policy on how this preservation will be accomplished. Data archival mandates are to some extent unfunded mandates, and it is becoming increasingly clear that substantial manpower must be committed now to ensure their success. Such manpower is rarely available with any priority during the active data-taking periods of the experiments, so compromises are being considered to achieve an appropriate level of future access to the data.


These images are from the ALICE event display and show the particle tracks resulting from the collision of two ultra-relativistic Pb (lead) nuclei where the counter-rotating beams in the LHC cross at the center of the ALICE detector. The tracks have been reconstructed from the data read out from the various detector elements. The maximum event rate is ~4000 Hz because the Time Projection Chamber must clear each time before the next such event occurs. A massive CPU farm reconstructs each event online to determine which ones should be preserved globally. At a bandwidth of 1.5 GB/s to permanent storage, researchers can record the full data for only ~10% of the events under worst-case conditions.
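The relationship between the numbers above can be checked with simple arithmetic. The sketch below uses only the figures quoted in the text (~4000 Hz maximum event rate, 1.5 GB/s to storage, ~10% recorded); the worst-case event size is inferred from those figures, not stated in the source.

```python
# Back-of-envelope check of the recording fraction quoted above.
# Inputs taken from the text; the per-event size is an inference, not a source figure.

max_event_rate_hz = 4000.0   # TPC-limited maximum event rate
bandwidth_gb_s = 1.5         # bandwidth to permanent storage
recorded_fraction = 0.10     # ~10% of events recorded under worst-case conditions

# Implied worst-case event size consistent with those numbers (using 1 GB = 1000 MB):
event_size_mb = (bandwidth_gb_s * 1000.0) / (max_event_rate_hz * recorded_fraction)
print(f"Implied worst-case event size: {event_size_mb:.2f} MB")  # 3.75 MB

def recordable_fraction(event_size_mb, bandwidth_gb_s=1.5, rate_hz=4000.0):
    """Fraction of events whose full data fits in the storage bandwidth."""
    required_gb_s = rate_hz * event_size_mb / 1000.0  # GB/s needed to keep everything
    return min(1.0, bandwidth_gb_s / required_gb_s)

print(f"Fraction recordable at 3.75 MB/event: {recordable_fraction(3.75):.0%}")  # 10%
```

Turning the arithmetic around this way shows why online event selection is unavoidable: at ~4 MB per worst-case event, recording everything would demand an order of magnitude more bandwidth than the 1.5 GB/s available.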