1. What are the target applications and what is the expected sophistication of the target users (naive, knowledgeable, expert)?

Ideally the target application has the potential for lightweight parallelism and exhibits dynamic and adaptive behavior resulting in load imbalance. Current users should be knowledgeable.

2. List the set of features/concepts of your programming model and divide them into two sets: What is the minimum set of things a new user needs to learn to become productive? What are the more advanced features a user can potentially take advantage of?

Minimal: lightweight threads, continuation passing, LCO programming, global memory allocation.

Advanced: parallel-for constructs, dynamic process collectives, unconstrained continuations, manual global memory mapping.

3. How do you decide whether a new application is a good fit? What metrics do you use to evaluate whether an application is implemented well in your model?

Applications with dynamic and adaptive behavior resulting in load imbalance are a good fit.

4. What is the plan for interoperability with MPI, OpenMP, Kokkos, etc.? If you could add requirements to MPI, what would those be?

HPX-5 already interoperates with MPI in the MPI+HPX style, either through iterative hpx_run collective epochs or during mixed operation (a hedged sketch of the epoch pattern appears after item 9). HPX-5 lightweight threads can run code generated from Kokkos, CUDA, OpenCL, etc. It isn't clear that adding something to MPI would help us; part of our observation is that some things cannot be managed effectively through an API but instead require deeper runtime integration.

5. What is the plan for performance portability?

Given that work moves around by posting parcels, performance portability should fall out naturally. GPUs, for instance, just look like a network card: we send parcels to them and look for completions. That said, we currently have to write HPX actions in CUDA, though we are working toward using OpenCL.

6. What is the plan for fault tolerance?

Microcheckpointing.

7. What static analyses and transformations could you do? What do you do today?

Nothing today. At the lowest level, there is not much we could do.

8. Questions about task graphs:

* When is the task graph generated (compile-time, load-time, run-time)?
* How do you manage task graph generation vs. task graph execution?
* What is the value of non-ready tasks in the DAG?
* Do you exploit the repetitiveness of iterative applications that repeatedly execute the same task graph?

Dataflow is a subset of the execution model and is encoded dynamically at runtime by the application as collections of LCOs with parcel continuations (the second sketch after item 9 shows one edge of such a graph). These graphs can evolve at any point, either as they execute or under external thread control. LCOs are persistent, and thus networks can be reused if desired. The runtime can monitor such networks and use that information to distribute them dynamically, but does not do so by default. The runtime does not exploit knowledge about graphs during scheduling.

9. Questions about tasks:

* How is task granularity managed?
* What is the life-cycle of a task?

Tasks and task granularity are under user control. The life-cycle of a task is:

* Parcel allocation & send
* (optional, at any time) migrate globally to the target address
* Stack bind & begin
* (optional, at any time) block for long-latency ops or not-ready dynamic dependencies
* (optional, at any time) yield
* (optional, post-context-switch) migrate within the node
* Continue a result
* Deallocation

The two sketches below illustrate, respectively, the MPI+HPX epoch pattern from item 4 and one complete pass through this life-cycle.
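As a concrete illustration of the epoch pattern from item 4, the fragment below alternates MPI phases with collective hpx_run epochs. This is a minimal sketch, not code from the HPX-5 distribution: the hpx_* names follow the HPX-5 manual, but exact signatures (notably hpx_run, hpx_exit, and the HPX_ACTION registration macro) vary across releases, so treat them as assumptions.

```c
/* Sketch only: MPI+HPX-5 interoperation via iterative hpx_run epochs.
 * Hypothetical example; hpx_* signatures vary by HPX-5 release. */
#include <mpi.h>
#include <hpx/hpx.h>

/* One epoch's worth of dynamic, load-imbalanced work, expressed as
 * HPX-5 lightweight threads spawned from this entry action. */
static int epoch_handler(void *args, size_t size) {
  /* ... spawn lightweight threads, wait on LCOs ... */
  hpx_exit(HPX_SUCCESS);           /* collectively ends this epoch */
}
static HPX_ACTION(HPX_DEFAULT, HPX_MARSHALLED, epoch, epoch_handler,
                  HPX_POINTER, HPX_SIZE_T);

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  hpx_init(&argc, &argv);          /* HPX-5 built with an MPI transport */
  for (int step = 0; step < 10; ++step) {
    MPI_Barrier(MPI_COMM_WORLD);   /* MPI phase: exchanges, reductions */
    hpx_run(&epoch, NULL, 0);      /* HPX phase: one collective epoch */
  }
  hpx_finalize();
  MPI_Finalize();
  return 0;
}
```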
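The second sketch ties together the LCO-with-parcel-continuation encoding from item 8 and the task life-cycle from item 9: a parcel is allocated and sent to a global address, its continuation targets a future LCO, and the sender blocks on that LCO until the action continues its result. Again hypothetical: names such as `compute`, `where`, and `spawn_one` are placeholders, and the hpx_* signatures follow the HPX-5 manual only approximately.

```c
/* Sketch only: one dataflow edge encoded as an LCO plus a parcel
 * continuation. Placeholder identifiers; approximate HPX-5 signatures. */
#include <hpx/hpx.h>

/* Runs as a lightweight thread wherever the parcel lands; continuing
 * a result fires the parcel's continuation (here, setting a future). */
static int compute_handler(double *in, size_t size) {
  double out = 2.0 * *in;                         /* placeholder work */
  return hpx_thread_continue(&out, sizeof(out));  /* "continue a result" */
}
static HPX_ACTION(HPX_DEFAULT, HPX_MARSHALLED, compute, compute_handler,
                  HPX_POINTER, HPX_SIZE_T);

static double spawn_one(hpx_addr_t where, double input) {
  hpx_addr_t done = hpx_lco_future_new(sizeof(double));

  /* Life-cycle steps: parcel allocation & send, then (optional)
   * global migration to the target address. */
  hpx_parcel_t *p = hpx_parcel_acquire(&input, sizeof(input));
  hpx_parcel_set_target(p, where);
  hpx_parcel_set_action(p, compute);
  hpx_parcel_set_cont_target(p, done);            /* the dataflow edge */
  hpx_parcel_set_cont_action(p, hpx_lco_set_action);
  hpx_parcel_send(p, HPX_NULL);

  /* Blocking on a not-ready LCO suspends only this lightweight thread. */
  double result;
  hpx_lco_get(done, sizeof(result), &result);
  hpx_lco_delete(done, HPX_NULL);
  return result;
}
```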
10. What is the relationship between task and data parallelism --- can one be invoked from the other arbitrarily or are there restrictions?

In HPX-5, SIMD data parallelism is a property of the compiled lightweight-thread code, e.g., vectorized C/Fortran, CUDA, etc. Local and hierarchical parallel-for operations are encoded at runtime through task parallelism, and can invoke their own tasks as necessary.

11. Where exactly is concurrency (meaning the ability to have races and deadlocks) exposed to the programmer, if at all?

Lightweight threads are explicitly concurrent. Synchronization through LCOs is data-race-free (see the sketch following this item). Access to the global address space via put/get or local-pointer translation exposes the same data-race potential as the user's chosen programming language (C/C++/Fortran/Haskell/etc.).
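To make the race-freedom claim in item 11 concrete, here is a hedged fan-out/fan-in sketch: N tasks are spawned with an and-gate LCO as their continuation, and the parent joins on the gate without touching any shared mutable memory. The worker body and the count are invented for illustration, and the hpx_* signatures are again approximate.

```c
/* Sketch only: data-race-free fan-out/fan-in via an and-gate LCO.
 * Hypothetical worker; approximate HPX-5 signatures. */
#include <hpx/hpx.h>

enum { N = 8 };

static int worker_handler(void) {
  /* ... work on this thread's own data; no shared mutable state ... */
  return HPX_SUCCESS;        /* completion triggers the call's continuation */
}
static HPX_ACTION(HPX_DEFAULT, 0, worker, worker_handler);

static void fan_out_fan_in(void) {
  hpx_addr_t gate = hpx_lco_and_new(N);   /* gate waiting for N inputs */
  for (int i = 0; i < N; ++i)
    hpx_call(HPX_HERE, worker, gate);     /* each completion counts down */
  hpx_lco_wait(gate);                     /* race-free join */
  hpx_lco_delete(gate, HPX_NULL);
}
```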