Fix layout of non-power-of-two length vectors #422

Merged
merged 8 commits into from
Aug 13, 2024

calebzulawski
Member

@calebzulawski calebzulawski commented Jun 3, 2024

Fixes #63, fixes #319

Member

@programmerjake programmerjake left a comment

nice! once the test failures are fixed, feel free to merge

@calebzulawski
Member Author

Is this failure a codegen problem? A handful of architectures work (unfortunately I can't replicate right now, I'm on mac where it passes)

@programmerjake
Member

looks like aarch64 cross failed due to OOM...

@calebzulawski
Member Author

I guess testing every length is excessive, I'll reduce it

@programmerjake
Member

> looks like aarch64 cross failed due to OOM...

it could be an aarch64 backend bug in llvm, since so far only aarch64 linux/mac has OOM-ed.

@programmerjake
Member

> I guess testing every length is excessive, I'll reduce it

iirc I tried to pick lengths that are around powers of 2 and around 3 * powers of 2...so like 15,16,17,23,24,25,31,32,33...
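
A minimal sketch of that selection rule, for illustration only: it collects every length within one of a power of two or of three times a power of two. The function name `interesting_lengths` and the `max` cutoff are hypothetical and do not come from the test suite.

```rust
/// Collect lane counts within +/-1 of `2^k` and `3 * 2^k`, up to `max`.
/// Hypothetical helper: the real tests list their lengths explicitly.
/// Intended for small `max`; very large values could overflow `3 * p`.
fn interesting_lengths(max: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut p = 1usize;
    while p <= max {
        for base in [p, 3 * p] {
            for n in [base - 1, base, base + 1] {
                // Skip length 0 and anything beyond the cutoff.
                if n >= 1 && n <= max {
                    out.push(n);
                }
            }
        }
        p *= 2;
    }
    out.sort_unstable();
    out.dedup();
    out
}
```

With `max = 33`, the tail of the result is the sequence quoted above (15, 16, 17, 23, 24, 25, 31, 32, 33), while lengths such as 10, 14 and 20 are skipped.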

@programmerjake
Member

well, looks like powerpc-unknown-linux-gnu has a non-trivial failure (not OOM):
https://github.com/rust-lang/portable-simd/actions/runs/9359687687/job/25763803238?pr=422

@programmerjake
Member

if you can minimize the bugs you encountered with aarch64 and powerpc, I think submitting a bug report to LLVM would be good!

@calebzulawski calebzulawski force-pushed the non-power-of-two-layout branch from dbe18a4 to e3dabf5 Compare June 6, 2024 01:23
@workingjubilee
Member

The PowerPC errors are genuine.

@workingjubilee
Member

I don't know if they're actually incorrect, however, as they are likely to be an endianness problem.

@calebzulawski
Member Author

calebzulawski commented Jun 6, 2024

There are basically 3 different classes of errors here:

  • random crashes, I think this is OOM etc due to too many tests, reduced by the second commit
  • failures on most (but not all) architectures for bitmask vectors for non-powers-of-two. I'm not sure if this is llvm or rustc, but I worked around it by extending to powers of two
  • the powerpc bitmask vector error. this is with the workaround, looks like endianness, but the code does account for endianness.

All of these should be "fixed" now, since we've removed the bitmask vectors. I am curious what was causing the second error but not sure I'll get the chance to look into it yet

@programmerjake
Member

(I attempted to quote reply, but accidentally edited your comment instead, sorry. replied this time)

> There are basically 3 different classes of errors here:
>
>   • random crashes, I think this is OOM etc due to too many tests, reduced by the second commit

we're still getting SIGKILL on aarch64 -- it could be too many tests, it could also be an excessive memory usage bug for non-excessive input code with weird vector lengths in the aarch64 llvm backend.

@calebzulawski
Member Author

calebzulawski commented Jun 6, 2024

I'm seeing that too. I can replicate it if I build for aarch64. It seems to be an infinite loop. Even building with --emit=llvm-ir I can't get it to complete (the tests that fail are cast, u8_ops, and i8_ops).
I do have a stack trace I was able to extract:

314.44 Gc  100.0%	-	 	rustc (22655)
314.44 Gc  100.0%	-	 	 thread_start
314.44 Gc  100.0%	-	 	  _pthread_start
314.44 Gc  100.0%	-	 	   std::sys::pal::unix::thread::Thread::new::thread_start::h3d442a96f4a94842
314.44 Gc  100.0%	-	 	    _RNSNvYNCINvMNtCsemj25UseQJj_3std6threadNtBa_7Builder16spawn_unchecked_NCINvXs0_CsfrAtrMCWRw_18rustc_codegen_llvmNtB1f_18LlvmCodegenBackendNtNtNtCsdjrb6H688DD_17rustc_codegen_ssa6traits7backend19ExtraBackendMethods18spawn_named_threadNCINvNtNtB2i_4back5write10spawn_workB1M_E0uE0uEs0_0INtNtNtCs9mSAhCB19GO_4core3ops8function6FnOnceuE9call_once6vtableB1f_
314.44 Gc  100.0%	-	 	     _RINvNtNtCsemj25UseQJj_3std10sys_common9backtrace28___rust_begin_short_backtraceNCINvXs0_CsfrAtrMCWRw_18rustc_codegen_llvmNtB1o_18LlvmCodegenBackendNtNtNtCsdjrb6H688DD_17rustc_codegen_ssa6traits7backend19ExtraBackendMethods18spawn_named_threadNCINvNtNtB2r_4back5write10spawn_workB1V_E0uE0uEB1o_
314.44 Gc  100.0%	-	 	      _RINvNtNtCsdjrb6H688DD_17rustc_codegen_ssa4back5write24finish_intra_module_workNtCsfrAtrMCWRw_18rustc_codegen_llvm18LlvmCodegenBackendEB1g_
314.44 Gc  100.0%	-	 	       _RNvNtNtCsfrAtrMCWRw_18rustc_codegen_llvm4back5write7codegen
314.44 Gc  100.0%	-	 	        _RNvNtNtCsfrAtrMCWRw_18rustc_codegen_llvm4back5write17write_output_file
314.44 Gc  100.0%	-	 	         LLVMRustWriteOutputFile
314.44 Gc  100.0%	-	 	          llvm::legacy::PassManagerImpl::run(llvm::Module&)
314.44 Gc  100.0%	-	 	           llvm::FPPassManager::runOnModule(llvm::Module&)
314.44 Gc  100.0%	-	 	            llvm::FPPassManager::runOnFunction(llvm::Function&)
314.44 Gc  100.0%	-	 	             llvm::MachineFunctionPass::runOnFunction(llvm::Function&)
314.44 Gc  100.0%	-	 	              llvm::Legalizer::runOnMachineFunction(llvm::MachineFunction&)
314.19 Gc   99.9%	7.12 Gc	 	               llvm::Legalizer::legalizeMachineFunction(llvm::MachineFunction&, llvm::LegalizerInfo const&, llvm::ArrayRef<llvm::GISelChangeObserver*>, llvm::LostDebugLocObserver&, llvm::MachineIRBuilder&, llvm::GISelKnownBits*)
114.05 Gc   36.2%	3.68 Gc	 	                llvm::LegalizerHelper::moreElementsVector(llvm::MachineInstr&, unsigned int, llvm::LLT)
107.85 Gc   34.2%	9.40 Gc	 	                llvm::LegalizationArtifactCombiner::tryCombineInstruction(llvm::MachineInstr&, llvm::SmallVectorImpl<llvm::MachineInstr*>&, llvm::GISelObserverWrapper&)
50.57 Gc   16.0%	2.71 Gc	 	                llvm::eraseInstrs(llvm::ArrayRef<llvm::MachineInstr*>, llvm::MachineRegisterInfo&, llvm::LostDebugLocObserver*)
17.07 Gc    5.4%	1.52 Gc	 	                llvm::LegalizerHelper::legalizeInstrStep(llvm::MachineInstr&, llvm::LostDebugLocObserver&)
13.21 Gc    4.2%	7.18 Gc	 	                llvm::isTriviallyDead(llvm::MachineInstr const&, llvm::MachineRegisterInfo const&)
1.82 Gc    0.5%	1.09 Gc	 	                llvm::LostDebugLocObserver::checkpoint(bool)
1.01 Gc    0.3%	1.01 Gc	 	                llvm::detail::DenseMapPair<llvm::MachineInstr*, unsigned int>* llvm::DenseMapBase<llvm::DenseMap<llvm::MachineInstr*, unsigned int, llvm::DenseMapInfo<llvm::MachineInstr*, void>, llvm::detail::DenseMapPair<llvm::MachineInstr*, unsigned int> >, llvm::MachineInstr*, unsigned int, llvm::DenseMapInfo<llvm::MachineInstr*, void>, llvm::detail::DenseMapPair<llvm::MachineInstr*, unsigned int> >::InsertIntoBucket<llvm::MachineInstr* const&, unsigned long>(llvm::detail::DenseMapPair<llvm::MachineInstr*, unsigned int>*, llvm::MachineInstr* const&, unsigned long&&)
537.93 Mc    0.1%	537.93 Mc	 	                free
408.48 Mc    0.1%	-	 	                0xfffffffffffffffe
363.53 Mc    0.1%	-	 	                llvm::SmallVectorBase<unsigned int>::grow_pod(void*, unsigned long, unsigned long)
124.90 Mc    0.0%	124.90 Mc	 	                default_zone_free_definite_size
31.51 Mc    0.0%	31.51 Mc	 	                llvm::saveUsesAndErase(llvm::MachineInstr&, llvm::MachineRegisterInfo&, llvm::LostDebugLocObserver*, llvm::GISelWorkList<4u>&)
10.62 Mc    0.0%	10.62 Mc	 	                llvm::allocate_buffer(unsigned long, unsigned long)
6.45 Mc    0.0%	6.45 Mc	 	                DYLD-STUB$$operator new(unsigned long)
6.20 Mc    0.0%	6.20 Mc	 	                std::__1::__tree<llvm::DebugLoc, std::__1::less<llvm::DebugLoc>, std::__1::allocator<llvm::DebugLoc> >::destroy(std::__1::__tree_node<llvm::DebugLoc, void*>*)
2.02 Mc    0.0%	2.02 Mc	 	                operator delete(void*)
1.07 Mc    0.0%	1.07 Mc	 	                DYLD-STUB$$free
31.51 Kc    0.0%	31.51 Kc	 	                DYLD-STUB$$free
15.15 Kc    0.0%	15.15 Kc	 	                DYLD-STUB$$operator delete(void*)
223.93 Mc    0.0%	-	 	               0xfffffffffffffffe
30.24 Mc    0.0%	30.24 Mc	 	               llvm::LegalizerHelper::legalizeInstrStep(llvm::MachineInstr&, llvm::LostDebugLocObserver&)

@programmerjake
Member

(apparently I can't click in the right spot today, since I edited your comment again)

> I'm seeing that too. I can replicate it if I build for aarch64. It seems to be an infinite loop. Even building with --emit=llvm-ir I can't get it to complete (the tests that fail are cast, u8_ops, and i8_ops).

can you get llvm-ir with --emit=llvm-ir -O -C no-prepopulate-passes -C codegen-units=1?
since once you have llvm ir, it may be easier to try to reduce it to a minimal llvm test case. rustc may even be generating invalid llvm ir.

@calebzulawski
Member Author

Looks like only -0 was necessary to get it to emit LLVM. I played around with using the pass arguments to opt but it doesn't seem to accept the flags rust is emitting from -Zprint-llvm-passes, I'm probably doing something wrong

@programmerjake
Member

> Looks like only -0 was necessary to get it to emit LLVM.

I think you meant -O (letter O, not zero)

> I played around with using the pass arguments to opt but it doesn't seem to accept the flags rust is emitting from -Zprint-llvm-passes, I'm probably doing something wrong

try running opt with just the input file and --verify, this will run LLVM's module verification pass which will tell you if you gave it invalid LLVM IR. if that passes, you can also try running opt with -O2 --verify-each, which runs opt's default optimization pipeline and verifies the LLVM IR after every pass.

if the llvm-ir is small enough, it would be great if you would put it in llvm.godbolt.org and share it here, which would let us try and figure it out without having to compile everything locally, plus Compiler Explorer has nice features for showing what changed in which passes (technically you can do that with the command line, but the website is much more friendly).

@calebzulawski
Member Author

calebzulawski commented Jun 6, 2024

So it looks like the problem is only with opt-level=0, 1+ works fine, so the problem is probably more related to lowering/isel than optimizations. This is the closest I've come to replicating it, though I'm not 100% sure it's the same cause: https://llvm.godbolt.org/z/c8xGsaqdn

@programmerjake
Member

ok, I reduced the problem: https://llvm.godbolt.org/z/1Ehsh97nP

@programmerjake
Member

so, maybe add a workaround for fp to int only on aarch64 that expands element count to the next power of two? idk if that will fix it, the backend bug may also occur for other ops.
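
A rough sketch of the idea behind such a workaround, written over plain slices rather than real SIMD types: pad the input to the next power-of-two element count, convert every element, then drop the padding. The name `cast_padded` is hypothetical, and this is not the code that was added to the crate.

```rust
/// Convert f32 lanes to i32 through a buffer padded to a power-of-two length.
/// Illustrative only: models "expand the element count, then truncate".
fn cast_padded(input: &[f32]) -> Vec<i32> {
    // Round the element count up to the next power of two (3 -> 4, 5 -> 8).
    let padded_len = input.len().next_power_of_two();

    // Extra lanes are filled with zero; their results are discarded below.
    let mut padded = vec![0.0f32; padded_len];
    padded[..input.len()].copy_from_slice(input);

    // Element-wise fp-to-int conversion (`as` truncates toward zero).
    let converted: Vec<i32> = padded.iter().map(|&x| x as i32).collect();

    // Keep only the lanes that correspond to the original input.
    converted[..input.len()].to_vec()
}
```

The padding lanes never influence the returned values, so the result matches a direct element-wise cast of the original lanes.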

@calebzulawski
Member Author

That did fix it, there was also another codegen bug--Rem has a failure only for i8 and u8, non-powers-of-two, when the second argument is all 0s (it works fine for not all zeros, only the rem_zero_panic test fails)
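
For reference, a scalar model of what a lane-wise remainder is expected to do, assuming plain arrays instead of `Simd` types. The helper `lanewise_rem` is hypothetical; it reports a zero divisor (or an overflowing lane) as `None`, where the real operator panics.

```rust
/// Lane-wise `lhs % rhs` over plain arrays.
/// Returns `None` if any lane divides by zero or overflows (`i8::MIN % -1`).
/// Reference model only, not the crate's implementation.
fn lanewise_rem<const N: usize>(lhs: [i8; N], rhs: [i8; N]) -> Option<[i8; N]> {
    let mut out = [0i8; N];
    for i in 0..N {
        // `checked_rem` is `None` for a zero divisor or on overflow.
        out[i] = lhs[i].checked_rem(rhs[i])?;
    }
    Some(out)
}
```

An all-zero `rhs` must be rejected for every lane count, including non-power-of-two ones such as 3; that is the case the `rem_zero_panic` test exercises.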

@calebzulawski calebzulawski force-pushed the non-power-of-two-layout branch from d513647 to e8a56e4 Compare June 7, 2024 01:15
@@ -99,7 +99,7 @@ use crate::simd::{
 // directly constructing an instance of the type (i.e. `let vector = Simd(array)`) should be
 // avoided, as it will likely become illegal on `#[repr(simd)]` structs in the future. It also
 // causes rustc to emit illegal LLVM IR in some cases.
-#[repr(simd)]
+#[repr(simd, packed)]
Member

@RalfJung RalfJung Jun 8, 2024

What is the plan for simd without packed? Miri currently ICEs when such a type is used with a simd intrinsic and the size is not a power of 2. If portable-simd doesn't need support for that then do we need to have it at all? Can we just make simd itself have the behavior that simd, packed now has?

Member

simd without packed is used by stdarch. portable-simd might use it in the future too, though imo probably won't.

Member

AFAIK stdarch only uses power-of-2 vectors, where packed makes no difference?

@@ -639,43 +627,30 @@ macro_rules! test_lanes_panic {
core_simd::simd::LaneCount<$lanes>: core_simd::simd::SupportedLaneCount,
$body

// test some odd and even non-power-of-2 lengths on miri
Member

This doesn't have any even non-power-of-2. Maybe replace 5 by 6?

(Though I am also not sure why odd vs even would be an interesting difference here.)

Member

if you look down further it tests length 3 on miri, the idea is we want to catch bugs caused by repr(simd, packed) having alignment smaller than repr(simd) which only happens for non-power-of-2 sizes. even non-power-of-2 sizes cover where the alignment is in between the element alignment and the non-packed alignment. @calebzulawski can you add back in length 6 since that's the smallest length where that occurs?
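
The alignment argument can be made concrete with a small arithmetic model. It assumes the packed alignment is the largest power of two dividing the vector's byte size and the non-packed alignment is the size rounded up to a power of two; real targets may cap or otherwise adjust these values, and both function names are hypothetical.

```rust
/// Model of `repr(simd, packed)` alignment: the largest power of two
/// that divides the total size in bytes. Assumption, not rustc's code.
fn packed_align(elem_size: usize, len: usize) -> usize {
    let size = elem_size * len;
    assert!(size > 0, "zero-sized vectors are not modelled");
    1 << size.trailing_zeros()
}

/// Model of plain `repr(simd)` alignment: the size rounded up to the
/// next power of two. Targets may cap this; treat it as an assumption.
fn padded_align(elem_size: usize, len: usize) -> usize {
    (elem_size * len).next_power_of_two()
}
```

Under this model a 3-lane `u8` vector has packed alignment 1 (the element alignment) against 4 non-packed, while a 6-lane `u8` vector has packed alignment 2, strictly between the element alignment of 1 and the non-packed alignment of 8. That in-between case is why length 6 is worth testing alongside length 3.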

Member

Yeah 3 and 6 would probably be reasonable then. To cut down on CI times I'd remove 5.

@RalfJung
Member

RalfJung commented Jun 8, 2024

> That did fix it, there was also another codegen bug--Rem has a failure only for i8 and u8, non-powers-of-two, when the second argument is all 0s (it works fine for not all zeros, only the rem_zero_panic test fails)

Is there an issue for that?

> the powerpc bitmask vector error. this is with the workaround, looks like endianness, but the code does account for endianness.

I think the code did account for endianness in the wrong way, see rust-lang/rust#126171.

@calebzulawski calebzulawski force-pushed the non-power-of-two-layout branch 2 times, most recently from 6e03d63 to ce73c96 Compare June 23, 2024 19:23
@calebzulawski calebzulawski force-pushed the non-power-of-two-layout branch from 98f923e to a49f77e Compare August 7, 2024 05:24
unsafe { core::intrinsics::simd::$simd_call($lhs, rhs) }

// aarch64 div fails for arbitrary `v % 0`, mod fails when rhs is MIN, for non-powers-of-two
// these operations aren't vectorized on aarch64 anyway
Member

These are LLVM backend bugs, right? simd_div/simd_rem still should work the same on all targets?

That seems worth tracking somewhere, having subtly buggy intrinsics is no good.

Member

@programmerjake programmerjake Aug 7, 2024

also, theoretically LLVM should be able to generate SIMD code for division/remainder by a constant, by using the exact same fancy math as it would use for scalars (which it unfortunately currently does after scalarization of div ops for non-power-of-2 vectors), so once LLVM's bugs are fixed, I think we should switch back to generating SIMD ops.

https://clang.godbolt.org/z/MxK47TWGs
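
The "fancy math" here is presumably the usual multiply-and-shift replacement for division by a constant. A small sketch for `u8` lanes divided by 3 follows; the constants 171 and 9 are an illustration chosen for this one case, not taken from LLVM's output, and the function name is hypothetical.

```rust
/// Divide every `u8` lane by 3 without a division instruction:
/// x / 3 == (x * 171) >> 9 for all x in 0..=255, since 171 / 512 is
/// slightly above 1/3 and the error stays below one third of a step.
/// Illustrative sketch only.
fn div3_lanes<const N: usize>(v: [u8; N]) -> [u8; N] {
    // Widen to u16 so the multiplication cannot overflow (255 * 171 < 65536).
    v.map(|x| ((x as u16 * 171) >> 9) as u8)
}
```

Each lane needs only a widening multiply and a shift, both of which map naturally onto vector instructions, which is the reason constant divisors do not inherently require scalarization.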

Member Author

Yes, these are definitely backend bugs

Co-authored-by: Ralf Jung <post@ralfj.de>
@calebzulawski calebzulawski force-pushed the non-power-of-two-layout branch from 400e6e8 to 2a3b8ad Compare August 9, 2024 01:14
@calebzulawski
Member Author

Any ideas why the proptest variable doesn't seem to make it into cross? Or maybe it is, but it's not the number of cases that make the tests slow?

@programmerjake
Member

using the github actions feature that shows a timestamp for each line of output (get to it by clicking the settings gear on that actions job page), it looks like running the debug tests is taking almost all of the time...

@calebzulawski
Member Author

That gave me an idea, turns out you can set the optimization level of all dependencies outside of the workspace. I suspected maybe proptest itself was the slow part. If we're okay with this change, it dramatically improves test times
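
For readers unfamiliar with the feature, a sketch of what such a Cargo override can look like. The profile name and the optimization level are assumptions, since the thread does not show the values actually committed.

```toml
# Cargo.toml at the workspace root.
# Assumption: optimize every package outside the workspace (for example
# proptest) while workspace members keep the profile's default level.
# The `test` profile normally inherits from `dev`.
[profile.dev.package."*"]
opt-level = 2
```

The `"*"` package spec matches all non-workspace dependencies, so the crate's own test code stays quick to compile and debuggable while the dependency code runs optimized.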

Development

Successfully merging this pull request may close these issues.

non-power-of-2 Simd types have wrong size
Support non-power-of-two vector lengths.
4 participants