Discussion Questions
From MegajobBOF
- Research Questions
- Hasn't Research Group X already solved this?
- What are the open research problems?
- What are the closed research problems?
- How to best handle interactive applications on batch scheduled systems?
- Where does the relatively high overhead of existing local resource managers (LRMs) come from?
- What architecture should LRMs have so that they scale well as we push to millions of processors?
- Do we have to allocate one Grid resource for each Grid request?
- Technical Questions
- What can the TeraGrid, OSG, etc. do to improve this sort of job submission?
- What are the hard practical bottlenecks that you have overcome (or need to overcome)?
- What information services are useful?
- Things that Users Want to Know When They Want to Run Millions of Trivially Parallel Jobs Associated With a Single Project
- Can users submit millions of jobs simply by using a template from which jobs are created and submitted automatically (as expected from parameter sweeps and the like in Nimrod)? Does the system also provide a facility for arranging input files for a set of a million jobs?
- What facility does the system provide for identifying failed jobs in a set of a million jobs?
- What facility does the system provide for accessing the return codes of the user program that was run in a set of a million jobs? Or, does the user have to craft jobs to store return codes and then write scripts to pick through the return codes for failures in user space?
- More technical (and TeraGrid-centric) questions
- How can we learn the disk and file quotas of the TeraGrid accounts on various sites?
- How can we get information about the job queue at each site (maximum number of jobs, behavior of the queues when a large number of jobs is submitted, etc.)?
- Notification mechanism: Is the current Yes/No option enough?
- Error reporting mechanism: A unified error report from sites would help. Not all sites treat job failures the same way, which makes it hard to build a consistent system with GRAM, for example.
- Data management: How do people handle their data when submitting a large number of jobs? HTTP, FTP, GridFTP, or SRB?
- Monitoring the sites.
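
To make the user-facing questions above concrete (template-driven submission, finding failed jobs, and access to return codes), here is a minimal sketch of what users often end up writing in user space today. It is an illustration only: the template, the parameter names `alpha` and `beta`, and the helper functions are hypothetical, and jobs are run as local subprocesses as a stand-in for a real scheduler or Grid submission interface.

```python
import itertools
import string
import subprocess
import sys

# Hypothetical job template: a tiny Python one-liner standing in for the
# user program. It exits with status 1 when alpha * beta is odd, so that
# some jobs in the sweep fail.
TEMPLATE = string.Template("import sys; sys.exit(($alpha * $beta) % 2)")


def expand_jobs(template, sweep):
    """Yield (job_id, params, script) for every point in the parameter sweep."""
    names = sorted(sweep)
    points = itertools.product(*(sweep[name] for name in names))
    for job_id, values in enumerate(points):
        params = dict(zip(names, values))
        yield job_id, params, template.substitute(params)


def run_jobs(jobs):
    """Run each job locally and record (params, return code) keyed by job id.

    A real system would submit to a batch queue instead; this local loop is
    an assumption made to keep the sketch self-contained.
    """
    results = {}
    for job_id, params, script in jobs:
        completed = subprocess.run([sys.executable, "-c", script])
        results[job_id] = (params, completed.returncode)
    return results


def failed_jobs(results):
    """Return only the jobs whose user program exited with a nonzero code."""
    return {job_id: rec for job_id, rec in results.items() if rec[1] != 0}


sweep = {"alpha": [1, 2, 3], "beta": [1, 2]}
results = run_jobs(expand_jobs(TEMPLATE, sweep))
failures = failed_jobs(results)

for job_id, (params, code) in sorted(failures.items()):
    print(f"job {job_id} failed with return code {code}: {params}")
```

The point of the sketch is the bookkeeping: job ids, parameters, and return codes are tracked by the user's own script. The questions above ask whether the submission system can provide this tracking itself when the set grows to a million jobs.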