Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve SYCL backend __parallel_for performance for large input sizes #1870

Closed
wants to merge 26 commits into from

Conversation

mmichel11
Copy link
Contributor

@mmichel11 mmichel11 commented Sep 20, 2024

Summary

This PR improves __parallel_for performance for large input sizes by switching to an nd-range kernel to process multiple inputs per work item which enables us to use the full hardware bandwidth.

Details

On some target architectures, we are currently not hitting roofline memory bandwidth performance in our __parallel_for pattern. The cause is that our SYCL basic kernel implementation only processes a single element per item. This is insufficient to fully utilize memory bandwidth on some target architectures. Processing multiple inputs per work item enables us to perform enough loads / stores to saturate the hardware bandwidth. Explicitly using a coalesced pattern through either a sub-group or work-group stride ensures that a good access pattern is achieved.

A nd-range kernel has been added for large input sizes that uses a heuristic based upon the smallest sized type in the set of provided ranges to determine the number of iterations to process per input item. This drastically improves performance on target architectures for large inputs across nearly all for-based algorithms.

A second kernel has been added as opposed to merging both paths within a single kernel to prevent extra runtime dispatch within the kernel which hurt performance for small inputs. There is a smaller runtime overhead for selecting the best path from the host and compiling two kernels. For small-to-medium inputs, the SYCL basic kernel performs the best.

@SergeyKopienko
Copy link
Contributor

@mmichel11 I have take a look to the history of this branch, probably make sense to rebase your branch from the current main state or merge main branch into your PR: we have a lot of new commits in the main branch now.

@mmichel11 mmichel11 force-pushed the dev/mmichel11/parallel_for_sub_group_stride branch from e4b40a5 to 8b9c3c9 Compare September 25, 2024 13:22
@mmichel11 mmichel11 added this to the 2022.8.0 milestone Sep 25, 2024
Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some things not directly related to the main part of the PR, which looks good to me.

const bool __is_full_sub_group =
__sub_group_start_idx + __iters_per_work_item * __sub_group_size <= __count;
const std::size_t __work_item_idx = __sub_group_start_idx + __sub_group_local_id;
return std::make_tuple(__work_item_idx, __sub_group_size, __is_full_sub_group);
Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you include <tuple>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view, the usage of std::make_tuple for primitive types doesn't make sense at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just std::tuple is now used in the new PR.

Comment on lines 788 to 808
template <typename _Tuple>
class __min_tuple_type_size;

template <typename _T>
class __min_tuple_type_size<std::tuple<_T>>
{
public:
static constexpr std::size_t value = sizeof(_T);
};

template <typename _T, typename... _Ts>
class __min_tuple_type_size<std::tuple<_T, _Ts...>>
{
static constexpr std::size_t __min_type_value_ts = __min_tuple_type_size<std::tuple<_Ts...>>::value;

public:
static constexpr std::size_t value = std::min(sizeof(_T), __min_type_value_ts);
};

template <typename _Tuple>
inline constexpr std::size_t __min_tuple_type_size_v = __min_tuple_type_size<_Tuple>::value;
Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified:

template<typename _Tuple>
struct __min_tuple_type_size;

template<typename... Ts>
struct __min_tuple_type_size<std::tuple<Ts...>> {
    static constexpr std::size_t value = std::min({sizeof(Ts)...});
};

_v alias is not-necessary as it is used only once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have replaced this __min_tuple_type_size with __min_nested_type_size which avoids having to flatten the tuple first and applied these ideas there.

@@ -793,6 +793,25 @@ struct __decay_with_tuple_specialization<::std::tuple<_Args...>>
template <typename... _Args>
using __decay_with_tuple_specialization_t = typename __decay_with_tuple_specialization<_Args...>::type;

// Flatten nested std::tuple or oneapi::dpl::__internal::tuple types into a single std::tuple.
template <typename _T>
struct __flatten_std_or_internal_tuple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional suggestion: __flatten_std_or_internal_tuple -> __flatten_tuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This utility has been removed

@@ -793,6 +793,25 @@ struct __decay_with_tuple_specialization<::std::tuple<_Args...>>
template <typename... _Args>
using __decay_with_tuple_specialization_t = typename __decay_with_tuple_specialization<_Args...>::type;

// Flatten nested std::tuple or oneapi::dpl::__internal::tuple types into a single std::tuple.
template <typename _T>
struct __flatten_std_or_internal_tuple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend moving __flatten_std_or_internal_tuple into utils.h. It is a niche utility not related to the core part of the class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally done since tuple_impl.h includes utils.h, so we would have to forward declare our internal tuple otherwise to avoid a circular dependency.

The new utility supports on arbitrary type and doesn't require any specializations for our internal tuple, so it can be easily placed in utils.h

__stride_recommender(const sycl::nd_item<1>& __item, std::size_t __count, std::size_t __iters_per_work_item,
std::size_t __work_group_size)
{
if constexpr (oneapi::dpl::__internal::__is_spirv_target_v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you include utils.h where __is_spirv_target_v is defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
128 byte memory operations are performed instead of 512 after inspecting
the assembly. Processing 512 bytes per sub-group still seems to be the
best value after experimentation.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
* Move __stride_recommender into __parallel_for_large_submitter
* Use {} to invoke constructor
* Simplify if-else statements in for dispatch

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
@mmichel11 mmichel11 force-pushed the dev/mmichel11/parallel_for_sub_group_stride branch from 8a95b24 to 33337f8 Compare November 6, 2024 19:46
@mmichel11 mmichel11 marked this pull request as draft November 6, 2024 19:52
@mmichel11
Copy link
Contributor Author

Thanks for the reviews everyone. I have addressed all current comments.

As discussed offline, we will use the current state of the PR as a starting point to introduce vectorized load / store paths where it is performant by rewriting our bricks for parallel for. Likely, I will have a second PR into this branch with these changes once they are complete.

// Compile two kernels: one for small-to-medium inputs and a second for large. This avoids runtime checks within a
// single kernel that worsen performance for small cases. If the number of iterations of the large submitter is 1,
// then only compile the basic kernel as the two versions are effectively the same.
if constexpr (__large_submitter::__iters_per_work_item > 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we combine this if constexpr together with the next if?
Will we have some real profit from these two conditions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combining the two would make it a runtime if conditional. Even if __iters_per_work_item is known at compile-time and the compiler can optimize it out, there may still be a chance for the kernel to be unnecessarily compiled. I think it is best to keep the if constexpr, so we can be sure to avoid compiling the large submitter if possible.

{
assert(oneapi::dpl::__ranges::__get_first_range_size(__rngs...) > 0);
_PRINT_INFO_IN_DEBUG_MODE(__exec);
auto __event = __exec.queue().submit([&__rngs..., &__brick, &__exec, __count](sycl::handler& __cgh) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we haven't any other cases where we capture policy into submit call.
May be better to eval

            const std::size_t __work_group_size =
                oneapi::dpl::__internal::__max_work_group_size(__exec, __max_work_group_size);

outside of submit and capture __work_group_size by value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, applied to new PR (this one will be closed soon).

const std::size_t __work_item_idx = __work_group_start_idx + __item.get_local_linear_id();
const bool __is_full_work_group =
__work_group_start_idx + __iters_per_work_item * __work_group_size <= __count;
return std::make_tuple(__work_item_idx, __work_group_size, __is_full_work_group);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just std::tuple is now used in the new PR.

Comment on lines +334 to +336
const std::size_t __num_items = __num_groups * __work_group_size;
__cgh.parallel_for<_Name...>(
sycl::nd_range(sycl::range<1>(__num_items), sycl::range<1>(__work_group_size)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const std::size_t __num_items = __num_groups * __work_group_size;
__cgh.parallel_for<_Name...>(
sycl::nd_range(sycl::range<1>(__num_items), sycl::range<1>(__work_group_size)),
__cgh.parallel_for<_Name...>(
sycl::nd_range(sycl::range<1>(__num_groups * __work_group_size), sycl::range<1>(__work_group_size)),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more readable from my point of view.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it is in the new PR.

if constexpr (oneapi::dpl::__internal::__is_spirv_target_v)
{
const __dpl_sycl::__sub_group __sub_group = __item.get_sub_group();
const std::uint32_t __sub_group_size = __sub_group.get_local_linear_range();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these functions returns std::size_t. Could you please explain why you are using std::uint32_t instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature for the sub-group member function is uint32_t get_local_linear_range() const and related functions also return uint32_t. Were you thinking about the group class maybe?

{
const std::uint8_t __adjusted_iters_per_work_item =
oneapi::dpl::__internal::__dpl_ceiling_div(__count - __idx, __stride);
for (std::uint8_t __i = 0; __i < __adjusted_iters_per_work_item; ++__i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use _ONEDPL_PRAGMA_UNROLL for this for-loop too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this case (there is a similar case in the strided loop utility in the new PR). Because the loop end variable is computed at run-time, the loop cannot be unrolled.

This path is called for the last sub-group / work-group, so the performance impact is negligible.

@mmichel11 mmichel11 modified the milestone: 2022.8.0 Dec 17, 2024
mmichel11 added a commit that referenced this pull request Dec 20, 2024
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
@mmichel11
Copy link
Contributor Author

Closing this as it is replaced by: #1976

@mmichel11 mmichel11 closed this Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants