Presentation Open Access
Large-scale simulations easily produce vast amounts of data that cannot always be evaluated in situ. At that point parallel file systems come into play, but their per-node performance is essentially limited to about the speed of a USB 2.0 thumb drive: the Spider file system at OLCF, for example, provides over 1 TB/s of write bandwidth, but with the 18,000+ nodes of Titan writing simultaneously, this is reduced to about 50 MB/s per node. Making the most of such a limited resource requires I/O libraries that actually scale. Such libraries also offer on-the-fly data transformations (e.g. compression) to better utilize the raw I/O bandwidth, although this opens a new can of worms: compression throughput must be traded against compression ratio to maximize overall performance. We will present a detailed study of I/O performance and various compression techniques at OLCF and compare them against smaller local I/O installations, demonstrating the highest achieved I/O performance for real-world applications at OLCF. Furthermore, we demonstrate that the best-performing I/O setup can be determined prior to starting the job, based on hardware characteristics.
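The trade-off between compression throughput and compression ratio can be illustrated with a back-of-the-envelope model. The following is a minimal sketch, not the method used in the study: it assumes compression and writing overlap perfectly, so the slower of the two stages limits the rate. Only the 1 TB/s and 18,000-node figures come from the text above; the function names and the candidate compressor numbers are hypothetical placeholders.

```python
import math


def per_node_bandwidth(aggregate_bytes_per_s, nodes):
    """Share of the file system's aggregate bandwidth available to one node."""
    return aggregate_bytes_per_s / nodes


def effective_write_rate(raw_bw, comp_throughput, comp_ratio):
    """Uncompressed bytes per second one node can get onto disk.

    Assumed pipeline model: the compressor consumes input at comp_throughput,
    while the file system absorbs compressed output at raw_bw, which
    corresponds to raw_bw * comp_ratio bytes of input. The slower stage wins.
    """
    return min(comp_throughput, raw_bw * comp_ratio)


def best_setup(raw_bw, candidates):
    """Pick the candidate with the highest effective write rate."""
    rates = {
        name: effective_write_rate(raw_bw, throughput, ratio)
        for name, (throughput, ratio) in candidates.items()
    }
    return max(rates, key=rates.get), rates


# Figures from the text: over 1 TB/s shared by 18,000+ nodes.
RAW_BW = per_node_bandwidth(1e12, 18000)

# Hypothetical compressors: (throughput in bytes/s, compression ratio).
CANDIDATES = {
    "none": (math.inf, 1.0),
    "fast": (500e6, 1.5),
    "strong": (30e6, 4.0),
}

BEST, RATES = best_setup(RAW_BW, CANDIDATES)
```

With these placeholder numbers the fast, low-ratio compressor comes out ahead: the strong compressor is limited by its own throughput, and writing uncompressed data is limited by the per-node share of the file system.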
Once your data is on disk, the clock starts ticking: you are racing against the deadline at which the data will be purged, since most centers offer high-performance storage only on a temporary basis. Extracting all valuable information from a petabyte-sized data set requires parallel processing as well, which brings wait times until resources become available and, quite naturally, a lot of trial and error during evaluation. The retention limit on temporary data becomes even more troublesome when comparing multiple large simulations, each of which may wait several days before it is scheduled and writes its results. Ideally, analysis could span the data of multiple simulations from a computing campaign that runs for a year but is accounted quarterly. Another challenge for actually making scientific discoveries arises when using multiple compute sites. This seems to be rather common for research groups, as they will use all the compute cycles they can get, wherever that may be. For comparative studies, the data sets then need to be available at the same time for analysis, e.g. via archiving solutions or transfer to one location. In our experience, the achievable transfer bandwidth between data centers is still much lower than expected. The talk will also report on our experiences evaluating petabyte-sized data sets in such a diverse environment.
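To give a feel for why transfer bandwidth matters against a purge deadline, here is a rough estimate in the same spirit. All numbers are illustrative assumptions rather than measurements from the talk, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def transfer_days(size_bytes, bandwidth_bytes_per_s):
    """Days needed to move a data set at a sustained bandwidth."""
    return size_bytes / bandwidth_bytes_per_s / SECONDS_PER_DAY


def fits_before_purge(size_bytes, bandwidth_bytes_per_s, purge_days,
                      queue_wait_days=0.0):
    """True if waiting in the queue plus transferring finishes before the purge."""
    total_days = queue_wait_days + transfer_days(size_bytes, bandwidth_bytes_per_s)
    return total_days <= purge_days


# Assumed example: 1 PB moved at a sustained 1 GB/s between two sites.
DAYS_FOR_ONE_PB = transfer_days(1e15, 1e9)
```

Under these assumptions a single petabyte takes more than eleven days to move, so a few days of queue wait can already push the transfer past a two-week retention window.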