Abstracts
From MegajobBOF
The following abstracts were selected for short presentations. No additional abstracts will be considered.
[edit]
Accepted Abstracts
[edit]
Falkon, a Fast and Light-weight tasK executiON framework for Clusters, Grids, and Supercomputers
- Presentation
- Speaker: Ioan Raicu, University of Chicago, Department of Computer Science
- Abstract: To enable the rapid execution of many tasks on compute clusters, Grids, and supercomputers, we have developed Falkon, a Fast and Light-weight tasK executiON framework. Falkon integrates (1) dynamic resource provisioning – multi-level scheduling techniques to enable separate treatments of resource provisioning and the dispatch of user tasks to those resources; (2) streamlined task dispatching – which is able to achieve orders-of-magnitude higher task dispatch rates than conventional schedulers; and (3) data diffusion – which performs data caching and uses a data-aware scheduler to leverage the co-located computational and storage resources to minimize the use of shared storage. Falkon’s integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system. Falkon has been Globus Incubator projects since 2007. It has been used in a variety of environments from clusters, to multi-site Grids (i.e. TeraGrid), to specialized large machines (SiCortex), and supercomputers (i.e. IBM BlueGene/P). Large-scale applications from many domains (i.e. astronomy, medicine, chemistry, molecular dynamics, economic modeling, and data mining) have been run at scales of millions of jobs on hundreds of thousands of processors. Over the past year, Falkon has executed 164M tasks for a total of 1.4M CPU hours with an average task execution time of 31 seconds. Falkon is able to efficiently handle a wide variety of workloads, with tasks execution times as small as 100s of milliseconds; even at the largest scale of 160K processors we have experience in running Falkon, it can efficiently handle tasks as small as 60 seconds. Micro-benchmarks have shown Falkon to achieve over 15K tasks/sec throughput, scale to millions of queued tasks, and to execute 1 billion tasks in under 18 hours. Data diffusion has also shown to improve applications scalability and performance, with its ability to achieve up to 80Gb/s I/O rates at modest scales. Falkon is actively being developed at University of Chicago with funding from NSF, DOE, and NASA. For more information, please see http://dev.globus.org/wiki/Incubator/Falkon.
[edit]
MyCluster: An Underware for Managing Large-Scale Job Ensembles
- Speaker: Ed Walker, Texas Advanced Computing Center
- Abstract:MyCluter provides a user-space throughput scavenging infrastructure for deploying personal clusters with commodity job managers like Condor and Sun Grid Engine. The infrastructure provisions compute resources for these personal clusters by submitting job proxies to distributed host clusters and adapting the distribution of job proxies over time based on evolving load conditions on the participating host clusters. To date the infrastructure has been used to support large scientific experiments on the TeraGrid, consuming millions of hours of computation time.
[edit]
HTC from the LHC to you: Experience with the compute engine designed for the world's largest and most complex scientific instrument
- Speaker: John McGee, Renaissance Computing Institute, OSG Engagement Coordinator
- Abstract: As the US contribution to the worldwide processing engine for the experiments analyzing data from the Large Hadron Collider, the Open Science Grid (OSG) regularly processes 300k+ production jobs per day. OSG is available to all science domains via the Engagement program, which has served 3 million opportunistic cycles across 750k jobs during the previous year to a diverse range of scientific disciplines including: biology, chemistry, mathematics, genetics, mechanical engineering, economics, and computational finance. A number of workload management systems are employed on OSG to achieve this level of system throughput, including PANDA, the OSG MatchMaker (OSGMM), Condor DAGMan, and pilot/glide-in systems. PANDA alone manages 100k+ OSG jobs per day on a regular basis. All of these jobs are serviced by a distributed set of resources at university and government labs across more than 80 locations and numerous administrative domains. International partners that have integrated workload management systems into OSG with help from the Engagement team include WISDOM, a European drug discovery platform, and Nimrod. OSGMM, a tool used by the Engagement team to bootstrap new users, is a lightweight match maker installed on top of the regular OSG client software stack, with Condor as the underlying job management engine. The Condor SchedD has been proven to submit and run as many as 1 million jobs. OSGMM receives site information from the Resource Selection Service (ReSS), validates the information by submitting probe jobs to the sites, gathers additional information from the environment, and merges these findings into the Condor ClassAdds for the sites. OSGMM provides a number of fault tolerance and performance features, by maintaining recent history and using success/throughput rates for both individual users and sites overall to make future match making decisions. Using OSGMM, the Engagement team can rapidly adapt a scientist's workflow for large scale HTC, and shield scientists from many of the complexities of large scale distributed computing.
[edit]
Gracie: Grid Resource Virtualization and Customization Infrastructure
- Presentation
- Speaker: Li Hui, Peking University , Department of Computer Science
- Abstract: We present a lightweight execution framework for efficiently executing massive independent tasks in parallel on distributed computational resources. It dynamically partitions a set of tasks of different granularities and dispatches tasks onto distributed computational resources concurrently. Three optimization strategies have been devised to improve the performance of Grid system. One strategy is to pack up to thousands of tasks into one request. Another is to share the effort in resource discovery and allocation among requests by separating resource allocations from request submissions. The third strategy is to pack variable numbers of tasks into different requests, where the task number is a function of the destination resource’s computability. This framework has been implemented in Gracie, a computational grid software platform developed by Peking University. We have developed two bioinformatics application (gSVAP and BLAST) with Gracie for CBI (China Biology Information Center). Since Sept.2007, gSVAP has processed more than 152,000 batch jobs and created more than 246,000 results on a small computational Grid that consists of 10 SMP servers provided by six China universities. Any further information about Gracie, please contact lihui(at)net.pku.edu.cn
[edit]
High Throughput Computing for the IBM Blue Gene/P Solution
- Presentation
- Speaker: Ruth Poole, IBM Corporation
- Abstract: The IBM Blue Gene/P High Throughput Computing (HTC) mode provides a complete, integrated solution that gives users a simple, flexible mechanism for submitting single-node jobs on Blue Gene/P. Running HTC makes Blue Gene/P look like a traditional "cluster" from an applications point of view. It enables a new class of workloads that use many single-node jobs. making Blue Gene/P a viable solution for a wider spectrum of customers. Blue Gene/P supports hybrid application environment of both traditional High Performance Computing (MPI) and HTC apps simultaneously. HTC mode has the capability to run in any of three ways, one job on each quad-core processor (SMP Mode), two jobs on each processor (Dual Mode) each using two cores, or four jobs on each processor (Virtual Node Mode) each job using a single core. Test results show throughput stays nearly constant at ~70 jobs/sec and that HTC scales very well even for short duration jobs.
[edit]
A million questions or a few good answers?
- Presentation
- Speaker: David Abramson, Monash University, Monash eScience and Grid Engineering Lab
- Abstract: The Over the past 15 years we have constructed a tool set called Nimrod, that automates the process of finding good solutions to demanding computational experiments. Importantly, Nimrod is more than a job distribution system; it is a high level environment for conducting search across complex spaces. Nimrod includes tools that perform a complete parameter sweep across all possible combinations (Nimrod/G), or search using non-linear optimization algorithms (Nimrod/O) or experimental design techniques (Nimrod/E). A new family member, Nimod/K, combined Kepler workflows with these search abilities. The number of jobs, and thus the parallelism, can be varied at run time, and the Nimrod scheduler places tasks on the available resources at run time. Nimrod had been applied to a wide range of disciplines from public health policy to quantum chemistry. In this talks we will discuss the current achievements and talk about our future directions.
[edit]
Experiences with running large scale applications using the Swift parallel programming system
- Presentation
- Speaker: Ben Clifford, University of Chicago, Computation Institute
- Abstract: There are many components to allow scientists to run a large number of jobs, but application scientists generally want a higher level abstraction for expressing their application. Swift attempts to provide such an abstraction - a loosely-coupled distributed scripting language and a corresponding runtime environment. Swift attempts to provide an abstraction that allows scientists to express their applications at a relatively high level whilst integrating a number of underlying systems (such as various versions of GRAM, Falkon, GridFTP and shared POSIX filesystems). This abstraction is both a blessing and a curse: as a blessing, it can isolate the application user from specifics of underlying platforms, promoting portability; can facilitate useful features such as restart/checkpointing; and can provide users with a model to implement their application that is considerably simpler to understand than the underlying systems. However, providing an abstraction that can always be translated into an efficient strategy for a given runtime situation is challenging, and the abstraction can sometimes be an impediment to the scaling up of applications despite excellent demonstrated scalability in some metrics of the underlying systems. Sometimes, one poorly behaved component will restrict the scalability of an application no matter how well another component behaves. Sometimes the implementation of the abstraction is overly naive or inappropriate, causing degenerate behaviour in an underlying component. This is analogous to situations faced by a compiler when generating code. This presentation will showcase a number of scaling-up problems encountered in application work with Swift as it interacts with distributed execution environments, techniques used to diagnose such problems, and proposed and implemented solutions. Such problems include: the cost of reliability; cases when jobs cannot be run fast enough; and coping with large numbers of sites of varying quality.
