Discussion Questions

From MegajobBOF

  1. Research Questions
    1. Hasn't Research Group X already solved this?
    2. What are the open research problems?
    3. What are the closed research problems?
    4. How can interactive applications best be handled on batch-scheduled systems?
    5. Where does the relatively high overhead of existing local resource managers come from?
    6. What architecture should LRMs have so that they scale well as we push to millions of processors?
    7. Do we have to allocate one Grid resource for each Grid request?
  2. Technical Questions
    1. What can the TeraGrid, OSG, etc. do to improve this sort of job submission?
    2. What are the hard practical bottlenecks that you have overcome (or still need to overcome)?
    3. What information services are useful?
  3. Things that Users Want to Know When They Want to Run Millions of Trivially Parallel Jobs Associated With a Single Project
    1. Can users submit millions of jobs simply by using a template from which jobs are created and submitted automatically (as expected for parameter sweeps and the like in Nimrod)? Does the system also provide a facility for arranging input files for a set of a million jobs?
    2. What facility does the system provide for identifying failed jobs in a set of a million jobs?
    3. What facility does the system provide for accessing the return codes of the user program run in a set of a million jobs? Or does the user have to craft jobs that store their return codes and then write user-space scripts to pick through those codes for failures?
  4. More Technical (and TeraGrid-Centric) Questions
    1. How can we learn the disk and file quotas of TeraGrid accounts at the various sites?
    2. How can we get information about the job queue at each site (maximum number of jobs, behavior of the queues when a large number of jobs is submitted, etc.)?
    3. Notification mechanism: Is the current Yes/No option enough?
    4. Error reporting mechanism: Sites should provide some unified error report. Not all sites treat job failures the same way, which makes it hard to build a consistent system with GRAM, for example.
    5. Data management: How do people handle their data when submitting a large number of jobs? HTTP, FTP, GridFTP, or SRB?
    6. Monitoring the sites.
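
As a concrete reference point for the user questions in section 3 (template-based submission, finding failed jobs, collecting return codes), here is a minimal Python sketch of what users might otherwise have to build in user space. It is not taken from any of the systems named above. The command line "./simulate", its flags, and the helper names expand_jobs and failed_jobs are all hypothetical, and the return codes are assumed to have already been gathered into a dictionary keyed by job ID.

```python
import itertools
from string import Template


def expand_jobs(template, params):
    """Yield one command line per combination of parameter values."""
    names = sorted(params)  # fixed order so job numbering is reproducible
    for values in itertools.product(*(params[n] for n in names)):
        yield Template(template).substitute(dict(zip(names, values)))


def failed_jobs(return_codes):
    """Return the sorted IDs of jobs whose recorded exit code is nonzero."""
    return sorted(job for job, code in return_codes.items() if code != 0)


# Hypothetical template: the executable and its flags are made up.
TEMPLATE = "./simulate --alpha $alpha --seed $seed"

# A tiny 2 x 3 sweep; a real sweep would only enlarge the value lists.
jobs = list(expand_jobs(TEMPLATE, {"alpha": [0.1, 0.2], "seed": range(3)}))

# Assumed input: exit codes collected per job ID by some other mechanism.
failures = failed_jobs({0: 0, 1: 2, 2: 0, 3: 1, 4: 0, 5: 0})
```

The sketch only generates command lines and filters exit codes. Submission, input file staging, and the collection of return codes from remote sites are exactly the parts the questions above ask the system to provide.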