From 3c97ea9b28a0b93052e6fa79d6b5038be6950d4b Mon Sep 17 00:00:00 2001
From: Michael Voss
Date: Wed, 23 Oct 2024 09:56:30 -0500
Subject: [PATCH 1/9] Added numa_support rfc

---
 .../simplified_numa_support/README.md | 179 ++++++++++++++++++
 1 file changed, 179 insertions(+)
 create mode 100755 rfcs/proposed/simplified_numa_support/README.md

diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md
new file mode 100755
index 0000000000..fbac6efd62
--- /dev/null
+++ b/rfcs/proposed/simplified_numa_support/README.md
@@ -0,0 +1,179 @@
# Simplified NUMA support in oneTBB

## Introduction

In Non-Uniform Memory Access (NUMA) systems, the cost of memory accesses depends on the
*nearness* of the processor to the memory resource on which the accessed data resides.
While oneTBB has core support that enables developers to tune for Non-Uniform Memory
Access (NUMA) systems, we believe this support can be simplified and improved to provide
an improved user experience.

This early proposal recommends addressing for areas for improvement:

1. improved reliability of HWLOC-dependent topology and pinning support in,
2. addition of a NUMA-aware allocation,
3. simplified approaches to associate task distribution with data placement and
4. where possible, improved out-of-the-box performance for high-level oneTBB features.

We expect that this draft proposal may be broken into smaller proposals based on feedback
and prioritization of the suggested features.

The features for NUMA tuning already available in the oneTBB 1.3 specification include:

- Functions in the `tbb::info` namespace **[info_namespace]**
  - `std::vector<numa_node_id> numa_nodes()`
  - `int default_concurrency(numa_node_id id = oneapi::tbb::task_arena::automatic)`
- `tbb::task_arena::constraints` in **[scheduler.task_arena]**

Below is the example that demonstrates the use of these APIs to pin threads to different
arenas to each of the NUMA nodes available on a system, submit work across those `task_arena`
objects and into associated `task_group` objects, and then wait for work again using both
the `task_arena` and `task_group` objects.

    #include "oneapi/tbb/task_group.h"
    #include "oneapi/tbb/task_arena.h"

    #include <vector>

    int main() {
        std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();
        std::vector<oneapi::tbb::task_arena> arenas(numa_nodes.size());
        std::vector<oneapi::tbb::task_group> task_groups(numa_nodes.size());

        // Initialize the arenas and place memory
        for (int i = 0; i < numa_nodes.size(); i++) {
            arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i]));
            arenas[i].execute([i] {
                // allocate/place memory on NUMA node i
            });
        }

        for (int j = 0; j < NUM_STEPS; ++j) {

            // Distribute work across the arenas / NUMA nodes
            for (int i = 0; i < numa_nodes.size(); i++) {
                arenas[i].execute([&task_groups, i] {
                    task_groups[i].run([] {
                        /* executed by the thread pinned to specified NUMA node */
                    });
                });
            }

            // Wait for the work in each arena / NUMA node to complete
            for (int i = 0; i < numa_nodes.size(); i++) {
                arenas[i].execute([&task_groups, i] {
                    task_groups[i].wait();
                });
            }
        }

        return 0;
    }

### The need for application-specific knowledge

In general when tuning a parallel application for NUMA systems, the goal is to expose sufficient
parallelism while minimizing (or at least controlling) data access and communication costs. The
tradeoffs involved in this tuning often rely on application-specific knowledge.
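The "allocate/place memory on NUMA node i" step in the example above is left to the
application because only the application knows which containers the work on node i will
touch. A common way to fill it in is first-touch placement: the data is initialized from
inside the NUMA-constrained arena so that the operating system backs the touched pages
with memory local to that node. The following sketch illustrates the idea, assuming the
default first-touch page-placement policy of the OS; the `data` container and the size
`N` are illustrative names, not part of this proposal:

    // One block of data per NUMA node; pages are expected to land on node i
    // because they are first touched by a thread working in the node-i arena.
    std::vector<std::vector<double>> data(numa_nodes.size());

    for (int i = 0; i < numa_nodes.size(); i++) {
        arenas[i].execute([&data, i] {
            data[i].assign(N, 0.0); // zero-initialization first touches the pages here
        });
    }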
In particular, NUMA tuning typically involves:

1. Understanding the overall application problem and its use of algorithms and data containers
2. Placement of data container objects onto memory resources
3. Distribution of tasks to hardware resources that optimize for data placement

As shown in the previous example, the oneTBB 1.3 specification only provides low-level
support for NUMA optimization. The `tbb::info` namespace provides topology discovery. And the
combination of `task_arena`, `task_arena::constraints` and `task_group` provides a mechanism for
placing tasks onto specific processors. There is no high-level support for memory allocation
or placement, or for guiding the task distribution of algorithms.

### Issues that should be resolved in the oneTBB library

**The behavior of existing features is not always predictable.** There is a note in
section **[info_namespace]** of the oneTBB specification that describes
the function `std::vector<numa_node_id> numa_nodes()`, "If error occurs during system topology
parsing, returns vector containing single element that equals to `task_arena::automatic`."

In practice, the error often occurs because HWLOC is not detected on the system. While the
oneTBB documentation states in several places that HWLOC is required for NUMA support and
even provides guidance on
[how to check for HWLOC](https://www.intel.com/content/www/us/en/docs/onetbb/get-started-guide/2021-12/next-steps.html),
the failure to resolve HWLOC at runtime silently returns a default of `task_arena::automatic`. This
default does not pin threads to NUMA nodes. It is too easy to write code similar to the preceding
example and be unaware that a HWLOC installation error (or lack of HWLOC) has undone all your effort.

**Getting good performance using these tools requres notable manual coding effort by users.** As we
can see in the preceding example, if we want to spread work across the NUMA nodes in
a system we need to query the topology using functions in the `tbb::info` namespace, create
one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an
extra loop that iterates overs these `task_arena` and `task_group` objects to execute the
work on the desired NUMA nodes. We also need to handle all container allocations using OS-specific
APIs (or behaviors, such as first-touch) to allocate or place them on the appropriate NUMA nodes.

**The out-of-the-box performance of the generic TBB APIs on NUMA systems is not good enough.**
Should the oneTBB library do anything special be default if the system is a NUMA system? Or should
regular random stealing distribute the work across all of the cores, regardless of which NUMA node
first touched the data?

Is it reasonable for a developer to expect that a series of loops, such as the ones that follow, will
try to create a NUMA-friendly distribution of tasks so that accesses to the same elements of `b` and `c`
in the two loops are from the same NUMA nodes? Or is this too much to expect without providing hints?

    tbb::parallel_for(0, N,
      [](int i) {
        b[i] = f(i);
        c[i] = g(i);
      });

    tbb::parallel_for(0, N,
      [](int i) {
        a[i] = b[i] + c[i];
      });

## Proposal

### Increased availability of NUMA support

The oneTBB 1.3 specification states for `tbb::info::numa_nodes`, "If error occurs during system
topology parsing, returns vector containing single element that equals to task_arena::automatic."
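Today, the only way for an application to notice this failure mode is to check the returned
vector explicitly against that special value. A minimal sketch of such a guard is shown below;
the function name and the idea of warning the user are illustrative and not part of this
proposal:

    #include "oneapi/tbb/info.h"
    #include "oneapi/tbb/task_arena.h"

    #include <vector>

    bool numa_topology_detected() {
        std::vector<oneapi::tbb::numa_node_id> nodes = oneapi::tbb::info::numa_nodes();
        // Per the specification wording quoted above, a single element equal to
        // task_arena::automatic means topology parsing failed (for example, HWLOC
        // could not be loaded), so no real NUMA pinning will happen.
        return !(nodes.size() == 1 && nodes[0] == oneapi::tbb::task_arena::automatic);
    }

An application could call such a helper once at startup and emit a warning, so that a broken
HWLOC installation does not silently disable all NUMA tuning.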
Since the oneTBB library dynamically loads the HWLOC library, a misconfiguration can cause HWLOC
to fail to be found. In that case, a call like:

    std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();

will return a vector with a single element of `task_arena::automatic`. This behavior, as we have noticed
through user questions, can lead to unexpected performance from NUMA optimizations. When running
on a NUMA system, a developer that has not fully read the documentation may expect that `numa_nodes()`
will give a proper accounting of the NUMA nodes. When the code, without raising any alarm, returns only
a single, valid element due to the environmental configuation (such as lack of HWLOCK), it is too easy
for developers to not notice that the code is acting in a valid, but unexpected way.

We propose that the oneTBB library implementation include, wherever possible, a statically-linked fallback
to decrease the likelihood of such failures. The oneTBB specification will remain unchanged.

### NUMA-aware allocation

We will define allocators of other features that simplify the process of allocating or places data onto
specific NUMA nodes.

### Simplified approaches to associate task distribution with data placement

As discussed earlier, NUMA-aware allocation is just the first step in optimizing for NUMA architectures.
We also need to deliver mechanisms to guide task distribution so that tasks are executed on execution
resources that are near to the data they access. oneTBB already provides low-level support through
`tbb::info` and `tbb::task_arena`, but we should up-level this support into the high-level algorithms,
flow graph and containers where appropriate.

### Improved out-of-the-box performance for high-level oneTBB features

For high-level oneTBB features that are modified to provide improved NUMA support, we should try to
align default behaviors for those features with user expectations when used on NUMA systems.

## Open Questions

1. Do we need simplified support, or are users that want NUMA support in oneTBB
willing, or perhaps even prefer, to manage the details manually?
2. Is it reasonable to expect good out-of-the-box performance on NUMA systems
without user hints or guidance?

From de552dfc9da577f6f8efad205fa9e90ef5242289 Mon Sep 17 00:00:00 2001
From: Mike Voss
Date: Wed, 13 Nov 2024 08:07:44 -0600
Subject: [PATCH 2/9] Update rfcs/proposed/simplified_numa_support/README.md

Co-authored-by: Aleksei Fedotov
---
 rfcs/proposed/simplified_numa_support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md
index fbac6efd62..a45d77c611 100755
--- a/rfcs/proposed/simplified_numa_support/README.md
+++ b/rfcs/proposed/simplified_numa_support/README.md
@@ -8,7 +8,7 @@ While oneTBB has core support that enables developers to tune for Non-Uniform Me
 Access (NUMA) systems, we believe this support can be simplified and improved to provide
 an improved user experience.
 
-This early proposal recommends addressing for areas for improvement:
+This early proposal recommends addressing four areas for improvement:
 
 1. improved reliability of HWLOC-dependent topology and pinning support in,
 2.
addition of a NUMA-aware allocation, From 54ae854675ea9ac5da078520545c152624834c55 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 08:08:28 -0600 Subject: [PATCH 3/9] Update rfcs/proposed/simplified_numa_support/README.md Co-authored-by: Aleksei Fedotov --- rfcs/proposed/simplified_numa_support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md index a45d77c611..b4bfc0742b 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -103,7 +103,7 @@ the failure to resolve HWLOC at runtime silently returns a default of `task_aren default does not pin threads to NUMA nodes. It is too easy to write code similar to the preceding example and be unaware that a HWLOC installation error (or lack of HWLOC) has undone all your effort. -**Getting good performance using these tools requres notable manual coding effort by users.** As we +**Getting good performance using these tools requires notable manual coding effort by users.** As we can see in the preceding example, if we want to spread work across the NUMA nodes in a system we need to query the topology using functions in the `tbb::info` namespace, create one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an From 87cf469767469d068a2fa9c848819e6ff0cc55b5 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 08:08:43 -0600 Subject: [PATCH 4/9] Update rfcs/proposed/simplified_numa_support/README.md Co-authored-by: Aleksei Fedotov --- rfcs/proposed/simplified_numa_support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md index b4bfc0742b..806e53ccba 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -107,7 +107,7 @@ example and be unaware that a HWLOC installation error (or lack of HWLOC) has un can see in the preceding example, if we want to spread work across the NUMA nodes in a system we need to query the topology using functions in the `tbb::info` namespace, create one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an -extra loop that iterates overs these `task_arena` and `task_group` objects to execute the +extra loop that iterates over these `task_arena` and `task_group` objects to execute the work on the desired NUMA nodes. We also need to handle all container allocations using OS-specific APIs (or behaviors, such as first-touch) to allocator or place them on the appropriate NUMA nodes. From 94d0d357a8e89d045b3f3db3de92b1b0317fcb7f Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 08:08:54 -0600 Subject: [PATCH 5/9] Update rfcs/proposed/simplified_numa_support/README.md Co-authored-by: Aleksei Fedotov --- rfcs/proposed/simplified_numa_support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md index 806e53ccba..b297a68722 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -112,7 +112,7 @@ work on the desired NUMA nodes. We also need to handle all container allocations APIs (or behaviors, such as first-touch) to allocator or place them on the appropriate NUMA nodes. 
**The out-of-the-box performance of the generic TBB APIs on NUMA systems is not good enough.** -Should the oneTBB library do anything special be default if the system is a NUMA system? Or should +Should the oneTBB library do anything special by default if the system is a NUMA system? Or should regular random stealing distribute the work across all of the cores, regardless of which NUMA first touched the data? From aa14760efb40ef87cb089a79058f04dfa6164c08 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 08:09:05 -0600 Subject: [PATCH 6/9] Update rfcs/proposed/simplified_numa_support/README.md Co-authored-by: Aleksei Fedotov --- rfcs/proposed/simplified_numa_support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md index b297a68722..ca36b262db 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -147,7 +147,7 @@ will return a vector with a single element of `task_arena::automatic`. This beha through user questions, can lead to unexpected performance from NUMA optimizations. When running on a NUMA system, a developer that has not fully read the documentation may expect that `numa_nodes()` will give a proper accounting of the NUMA nodes. When the code, without raising any alarm, returns only -a single, valid element due to the environmental configuation (such as lack of HWLOCK), it is too easy +a single, valid element due to the environmental configuation (such as lack of HWLOC), it is too easy for developers to not notice that the code is acting in a valid, but unexpected way. We propose that the oneTBB library implementation include, wherever possibly, a statically-linked fallback From 6a57193b81eb372f89cd3027d5680c375e6285f7 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 09:28:52 -0600 Subject: [PATCH 7/9] Renamed numa_support RFC --- .../README.md | 23 ++++++++++--------- 1 file changed, 12 insertions(+), 11 deletions(-) rename rfcs/proposed/{simplified_numa_support => numa_support}/README.md (90%) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/numa_support/README.md similarity index 90% rename from rfcs/proposed/simplified_numa_support/README.md rename to rfcs/proposed/numa_support/README.md index ca36b262db..0a0b822830 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/numa_support/README.md @@ -1,4 +1,4 @@ -# Simplified NUMA support in oneTBB +# NUMA support ## Introduction @@ -8,15 +8,15 @@ While oneTBB has core support that enables developers to tune for Non-Uniform Me Access (NUMA) systems, we believe this support can be simplified and improved to provide an improved user experience. -This early proposal recommends addressing four areas for improvement: +This RFC acts as an umbrella for sub-proposals that address four areas for improvement: 1. improved reliability of HWLOC-dependent topology and pinning support in, 2. addition of a NUMA-aware allocation, 3. simplified approaches to associate task distribution with data placement and 4. where possible, improved out-of-the-box performance for high-level oneTBB features. -We expect that this draft proposal may be broken into smaller proposals based on feedback -and prioritization of the suggested features. 
+We expect that this draft proposal will spawn sub-proposals that will progress +independently based on feedback and prioritization of the suggested features. The features for NUMA tuning already available in the oneTBB 1.3 specification include: @@ -25,10 +25,11 @@ The features for NUMA tuning already available in the oneTBB 1.3 specification i - `int default_concurrency(numa_node_id id = oneapi::tbb::task_arena::automatic)` - `tbb::task_arena::constraints` in **[scheduler.task_arena]** -Below is the example that demonstrates the use of these APIs to pin threads to different -arenas to each of the NUMA nodes available on a system, submit work across those `task_arena` -objects and into associated `task_group`` objects, and then wait for work again using both -the `task_arena` and `task_group` objects. +Below is the example based on existing oneTBB documentation that demonstrates the use +of these APIs to pin threads to different arenas to each of the NUMA nodes available +on a system, submit work across those `task_arena` objects and into associated +`task_group`` objects, and then wait for work again using both the `task_arena` +and `task_group` objects. #include "oneapi/tbb/task_group.h" #include "oneapi/tbb/task_arena.h" @@ -42,7 +43,7 @@ the `task_arena` and `task_group` objects. // Initialize the arenas and place memory for (int i = 0; i < numa_nodes.size(); i++) { - arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i])); + arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i]),0); arenas[i].execute([i] { // allocate/place memory on NUMA node i }); @@ -79,7 +80,7 @@ tradeoffs involved in this tuning often rely on application-specific knowledge. In particular, NUMA tuning typically involves: 1. Understanding the overall application problem and its use of algorithms and data containers -2. Placement of data container objects onto memory resources +2. Placement/allocation of data container objects onto memory resources 3. Distribution of tasks to hardware resources that optimize for data placement As shown in the previous example, the oneTBB 1.3 specification only provides low-level @@ -155,7 +156,7 @@ to decrease that likelihood of such failures. The oneTBB specification will rema ### NUMA-aware allocation -We will define allocators of other features that simplify the process of allocating or places data onto +We will define allocators or other features that simplify the process of allocating or placing data onto specific NUMA nodes. ### Simplified approaches to associate task distribution with data placement From 3a2f55b85d2635c21760425b9bb9a6eb904a063b Mon Sep 17 00:00:00 2001 From: "Fedotov, Aleksei" Date: Tue, 12 Nov 2024 13:06:57 +0100 Subject: [PATCH 8/9] Add RFC for creation and use of NUMA arenas --- .../numa-arenas-creation-and-use.org | 147 ++++++++++++++++++ 1 file changed, 147 insertions(+) create mode 100644 rfcs/proposed/numa_support/numa-arenas-creation-and-use.org diff --git a/rfcs/proposed/numa_support/numa-arenas-creation-and-use.org b/rfcs/proposed/numa_support/numa-arenas-creation-and-use.org new file mode 100644 index 0000000000..78fa560a14 --- /dev/null +++ b/rfcs/proposed/numa_support/numa-arenas-creation-and-use.org @@ -0,0 +1,147 @@ +#+title: API to Facilitate Instantiation and Use of oneTBB's Task Arenas Constrained to NUMA Nodes + +*Note:* This is a sub-RFC of the https://github.com/oneapi-src/oneTBB/pull/1535. 
* Introduction
Let's consider the example from the "Setting the preferred NUMA node" section of the
[[https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Guiding_Task_Scheduler_Execution.html][Guiding Task Scheduler Execution]] page of the oneTBB Developer Guide.

** Motivating example
#+begin_src C++
std::vector<tbb::numa_node_id> numa_indexes = tbb::info::numa_nodes(); // [0]
std::vector<tbb::task_arena> arenas(numa_indexes.size()); // [1]
std::vector<tbb::task_group> task_groups(numa_indexes.size()); // [2]

for(unsigned j = 0; j < numa_indexes.size(); j++) {
    arenas[j].initialize(tbb::task_arena::constraints(numa_indexes[j])); // [3]
    arenas[j].execute([&task_groups, &j](){ // [4]
        task_groups[j].run([](){/*some parallel stuff*/});
    });
}

for(unsigned j = 0; j < numa_indexes.size(); j++) {
    arenas[j].execute([&task_groups, &j](){ task_groups[j].wait(); }); // [5]
}
#+end_src

Usually the users of oneTBB employ this technique to tie oneTBB worker threads
up within NUMA nodes and yet have all the parallelism of a platform utilized.
The pattern allows to find out how many NUMA nodes are on the system. With that
number user creates that many ~tbb::task_arena~ objects, constraining each to a
dedicated NUMA node. Along with ~tbb::task_arena~ objects user instantiates the
same number of ~tbb::task_group~ objects, with which the oneTBB tasks are going
to be associated. The ~tbb::task_group~ objects are needed because they allow
waiting for the work completion as the ~tbb::task_arena~ class does not provide
synchronization semantics on its own. Then the work gets submitted in each of
the arena objects and is waited upon at the end.

** Interface issues and inconveniences:
- [0] - Getting the number of NUMA nodes is not a goal by itself, but rather a
  necessity to know how many objects to initialize further.
- [1] - Explicit step for creating the number of ~tbb::task_arena~ objects per
  each NUMA node. Note that by default the arena objects are constructed with a
  slot reserved for the master thread, which in this particular example usually
  results in an undersubscription issue as the master thread can join only one
  arena at a time to help with work processing.
- [2] - Separate step for instantiating the same number of ~tbb::task_group~
  objects, in which the actual work is going to be submitted. Note that user
  also needs to make sure the size of ~arenas~ matches the size of
  ~task_groups~.
- [3] - Actual tying of ~tbb::task_arena~ instances with corresponding NUMA
  nodes. Note that user needs to make sure the indices of ~tbb::task_arena~
  objects match corresponding indices of NUMA nodes.
- [4] - Actual work submission point. It is relatively easy to make a mistake
  here by using the ~tbb::task_arena::enqueue~ method instead. In this case not
  only might the work submission be done after the synchronization point [5],
  but the loop counter ~j~ can also be mistakenly captured by reference, which
  at best results in submission of the work into an incorrect ~tbb::task_group~,
  and at worst a segmentation fault, since the loop counter might not exist by
  the time the functor starts its execution (see the code sketch after this
  list).
- [5] - Synchronization point, where user needs to again make sure corresponding
  indices are used. Otherwise, the waiting might be done in an unrelated
  ~tbb::task_arena~. It is also possible to mistakenly use the
  ~tbb::task_arena::enqueue~ method with the same consequences as were outlined
  in the previous bullet, but since it is a synchronization point, usually the
  blocking call is used.
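To make the pitfall described in [4] concrete, the following sketch (an illustration only,
not part of the proposal) contrasts the two submission paths. With ~execute~ the call
returns only after the functor has run, so capturing ~j~ by reference is safe; with
~enqueue~ the functor may run after the loop has moved on or finished, so the very same
capture becomes a bug:

#+begin_src C++
for(unsigned j = 0; j < numa_indexes.size(); j++) {
    // OK: execute() blocks until the functor returns, so the reference to j
    // cannot outlive this loop iteration.
    arenas[j].execute([&task_groups, &j](){
        task_groups[j].run([](){/*some parallel stuff*/});
    });

    // Bug: enqueue() returns immediately. By the time the functor runs, j may
    // hold a different value or may no longer exist, so the work can land in
    // the wrong task_group or the program can crash.
    // arenas[j].enqueue([&task_groups, &j](){
    //     task_groups[j].run([](){/*some parallel stuff*/});
    // });
}
#+end_src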
The proposal below addresses these issues.

* Proposal
Introduce a simplified interface to:
- Constrain a task arena to a specific NUMA node,
- Submit work into constrained task arenas, and
- Wait for completion of the submitted work.

Since the new interface represents a constrained ~tbb::task_arena~, the
proposed name is ~tbb::constrained_task_arena~. Not including the word "numa"
in the name leaves room for extending it in the future to other types of
constraints.

** Usage Example
#+begin_src C++
std::vector<tbb::constrained_task_arena> numa_arenas =
    tbb::initialize_numa_constrained_arenas();

for(unsigned j = 0; j < numa_arenas.size(); j++) {
    numa_arenas[j].enqueue( [](){/*some parallel stuff*/} );
}

for(unsigned j = 0; j < numa_arenas.size(); j++) {
    numa_arenas[j].wait();
}
#+end_src

** New arena interface
The example above requires a new class named ~tbb::constrained_task_arena~. On one
hand, it is a ~tbb::task_arena~ class that isolates the work execution from
other parallel stuff executed by oneTBB. On the other hand, it is a constrained
arena that represents an arena associated with a certain NUMA node and allows
efficient and less error-prone work submission in this particular usage scenario.

#+begin_src C++
namespace tbb {

class constrained_task_arena : protected task_arena {
public:
    using task_arena::is_active;
    using task_arena::terminate;

    using task_arena::max_concurrency;

    using task_arena::enqueue;

    void wait();
private:
    constrained_task_arena(tbb::task_arena::constraints);
    friend std::vector<constrained_task_arena> initialize_numa_constrained_arenas();
};

}
#+end_src

The interface exposes only the methods necessary to allow submission of parallel
work and waiting for its completion. Most of the exposed function members are
taken from the base ~tbb::task_arena~ class. Implementation-wise, the new task
arena would include an associated ~tbb::task_group~ instance, with which enqueued
work will be implicitly associated (a possible realization is sketched after the
open questions below).

The ~tbb::constrained_task_arena::wait~ method waits for the work in the associated
~tbb::task_group~ to finish, if any was submitted using the
~tbb::constrained_task_arena::enqueue~ method.

The instance of the ~tbb::constrained_task_arena~ class can be created only by the
~tbb::initialize_numa_constrained_arenas~ function, whose sole purpose is to
instantiate a ~std::vector~ of initialized ~tbb::constrained_task_arena~
instances, each constrained to its own NUMA node of the platform and created
without reserved slots, and to return this vector back to the caller.

* Open Questions
1. Should the interface for creation of constrained task arenas support other
   construction parameters (e.g., max_concurrency, number of reserved slots,
   priority, other constraints) from the very beginning, or is it enough as a
   first iteration, with these parameters added in the future when the need
   arises?
2. Should the new task arena allow re-initialization with different
   parameters after its creation?
3. Should the new task arena interface allow copying of its settings by exposing
   its copy-constructor similarly to what ~tbb::task_arena~ does?
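For illustration only, and as possible input to open question 1, one way to realize the
~enqueue~/~wait~ semantics described above is to bundle the ~tbb::task_arena~ plus
~tbb::task_group~ pattern of the motivating example into a single object. The member name
~m_group~ and the exact signatures below are assumptions of this sketch, not a committed
design:

#+begin_src C++
#include "oneapi/tbb/task_arena.h"
#include "oneapi/tbb/task_group.h"
#include <utility>

class constrained_task_arena : protected tbb::task_arena {
public:
    template <typename F>
    void enqueue(F&& f) {
        // Mirror the execute()+task_group::run() pattern of the motivating example:
        // the work is tracked by the internal task_group and executed by the
        // threads of this (NUMA-constrained) arena.
        execute([&] { m_group.run(std::forward<F>(f)); });
    }

    void wait() {
        // Wait from inside the arena so the calling thread can help process the
        // remaining work instead of blocking idly.
        execute([&] { m_group.wait(); });
    }

private:
    tbb::task_group m_group;
};
#+end_src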
From a96d1b414dedc66340840624d0d1c9a9016fbebc Mon Sep 17 00:00:00 2001 From: "Fedotov, Aleksei" Date: Thu, 14 Nov 2024 12:01:39 +0100 Subject: [PATCH 9/9] Address Mike's remarks --- rfcs/proposed/numa_support/numa-arenas-creation-and-use.org | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfcs/proposed/numa_support/numa-arenas-creation-and-use.org b/rfcs/proposed/numa_support/numa-arenas-creation-and-use.org index 78fa560a14..038a9c5718 100644 --- a/rfcs/proposed/numa_support/numa-arenas-creation-and-use.org +++ b/rfcs/proposed/numa_support/numa-arenas-creation-and-use.org @@ -26,8 +26,8 @@ for(unsigned j = 0; j < numa_indexes.size(); j++) { Usually the users of oneTBB employ this technique to tie oneTBB worker threads up within NUMA nodes and yet have all the parallelism of a platform utilized. -The pattern allows to find out how many NUMA nodes are on the system. With that -number user creates that many ~tbb::task_arena~ objects, constraining each to a +The pattern starts by finding the number of NUMA nodes on the system. With that +number, user creates that many ~tbb::task_arena~ objects, constraining each to a dedicated NUMA node. Along with ~tbb::task_arena~ objects user instantiates the same number of ~tbb::task_group~ objects, with which the oneTBB tasks are going to be associated. The ~tbb::task_group~ objects are needed because they allow