Skip to content

Commit

Permalink
Fixed broken triangle rendering with some Intel iGPUs (driver bug wor…
Browse files Browse the repository at this point in the history
…karound in marching_cubes), added new GPU voxelization, added Quadro GV100 benchmarks, added tool to print current camera position (key_H), refactoring in setups
  • Loading branch information
ProjectPhysX committed Sep 29, 2022
1 parent 10a16c7 commit 4f3c5d4
Show file tree
Hide file tree
Showing 11 changed files with 317 additions and 196 deletions.
5 changes: 3 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

The fastest and most memory efficient lattice Boltzmann CFD software, running on any GPU via [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper "OpenCL-Wrapper").

<a href="https://youtu.be/5AzxwQpng0M"><img src="https://img.youtube.com/vi/5AzxwQpng0M/maxresdefault.jpg" alt="10 billion voxel Space Shuttle simulation" width="50%"></img></a><a href="https://youtu.be/NJnMQwp3wBI"><img src="https://img.youtube.com/vi/NJnMQwp3wBI/maxresdefault.jpg" alt="700 million voxel raindrop simulation" width="50%"></img></a><br>
<a href="https://youtu.be/5AzxwQpng0M"><img src="https://img.youtube.com/vi/5AzxwQpng0M/maxresdefault.jpg" alt="10 billion voxel Space Shuttle simulation" width="50%"></img></a><a href="https://youtu.be/VadLwt9OqMo"><img src="https://img.youtube.com/vi/VadLwt9OqMo/maxresdefault.jpg" alt="1 billion voxel raindrop simulation" width="50%"></img></a><br>
<a href="https://youtu.be/NQPgumd3Ei8"><img src="https://img.youtube.com/vi/NQPgumd3Ei8/maxresdefault.jpg" alt="Hydraulic jump simulation" width="50%"></img></a><a href="https://youtu.be/3JNVBQyetMA"><img src="https://img.youtube.com/vi/3JNVBQyetMA/maxresdefault.jpg" alt="Star Wars X-wing simulation" width="50%"></img></a>


Expand Down Expand Up @@ -175,6 +175,7 @@ In consequence, the arithmetic intensity of this implementation is 2.13 (FP32/FP
| Nvidia A100 SXM4 40GB | 19.49 | 40 | 1555 | 8522 (84%) | 16013 (79%) | 11251 (56%) |
| Nvidia A100 PCIe 40GB | 19.49 | 40 | 1555 | 8526 (84%) | 16035 (79%) | 11088 (55%) |
| Nvidia Tesla V100 16GB | 14.13 | 16 | 900 | 5128 (87%) | 10325 (88%) | 7683 (66%) |
| Nvidia Quadro GV100 | 16.66 | 32 | 870 | 3442 (61%) | 6641 (59%) | 5863 (52%) |
| Nvidia Tesla P100 16GB | 9.52 | 16 | 732 | 3295 (69%) | 5950 (63%) | 4176 (44%) |
| Nvidia Tesla P100 12GB | 9.52 | 12 | 549 | 2427 (68%) | 4141 (58%) | 3999 (56%) |
| Nvidia Tesla K40m | 4.29 | 12 | 288 | 1131 (60%) | 1868 (50%) | 912 (24%) |
Expand All @@ -198,7 +199,7 @@ In consequence, the arithmetic intensity of this implementation is 2.13 (FP32/FP
| Nvidia GeForce GTX 960M | 1.51 | 4 | 80 | 442 (84%) | 872 (84%) | 627 (60%) |
| Nvidia Quadro K2000 | 0.73 | 2 | 64 | 312 (75%) | 444 (53%) | 171 (21%) |
| Nvidia GeForce GT 630 (OEM) | 0.46 | 2 | 29 | 151 (81%) | 185 (50%) | 78 (21%) |
| Apple M1 Pro 16-Core 16GB | 4.10 | 11 | 200 | 1204 (92%) | 2329 (90%) | 1855 (71%) |
| Apple M1 Pro GPU 16C 16GB | 4.10 | 11 | 200 | 1204 (92%) | 2329 (90%) | 1855 (71%) |
| AMD Radeon Vega 8 Graphics | 1.23 | 7 | 38 | 157 (63%) | 282 (57%) | 288 (58%) |
| Intel UHD Graphics 630 | 0.46 | 7 | 51 | 151 (45%) | 301 (45%) | 187 (28%) |
| Intel HD Graphics 5500 | 0.35 | 3 | 26 | 75 (45%) | 192 (58%) | 108 (32%) |
Expand Down
14 changes: 7 additions & 7 deletions src/defines.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,11 @@
#define SRT // choose single-relaxation-time LBM collision operator; (default)
//#define TRT // choose two-relaxation-time LBM collision operator

//#define FP16S // compress LBM DDFs to range-shifted IEEE-754 FP16; number conversion is done in hardware; all arithmetic is still done in FP32
//#define FP16C // compress LBM DDFs to more accurate custom FP16C format; number conversion is emulated in software; all arithmetic is still done in FP32

#define BENCHMARK // disable all extensions and setups and run benchmark setup instead

//#define VOLUME_FORCE // enables global force per volume in one direction, specified in the LBM class constructor; the force can be changed on-the-fly between time steps at no performance cost
//#define FORCE_FIELD // enables a force per volume for each lattice point independently; allocates an extra 12 Bytes/node; enables computing the forces from the fluid on solid boundaries with lbm.calculate_force_on_boundaries();
//#define MOVING_BOUNDARIES // enables moving solids: set solid nodes to TYPE_S and set their velocity u unequal to zero
Expand All @@ -18,17 +23,12 @@
//#define TEMPERATURE // enables temperature extension; set fixed-temperature nodes with TYPE_T (similar to EQUILIBRIUM_BOUNDARIES); allocates an extra 32 (FP32) or 18 (FP16) Bytes/node
//#define SUBGRID // enables Smagorinsky-Lilly subgrid turbulence model to keep simulations with very large Reynolds number stable

//#define FP16S // compress LBM DDFs to range-shifted IEEE-754 FP16; number conversion is done in hardware; all arithmetic is still done in FP32
//#define FP16C // compress LBM DDFs to more accurate custom FP16C format; number conversion is emulated in software; all arithmetic is still done in FP32

//#define WINDOWS_GRAPHICS // enable interactive graphics in Windows; start/pause the simulation by pressing P
//#define CONSOLE_GRAPHICS // enable interactive graphics in the console; start/pause the simulation by pressing P
//#define GRAPHICS // run FluidX3D in the console, but still enable graphics functionality for writing rendered frames to the hard drive

#define BENCHMARK // disable all extensions and setups and run benchmark setup instead

#define GRAPGICS_FRAME_WIDTH 15360 // set frame width if only GRAPHICS is enabled
#define GRAPGICS_FRAME_HEIGHT 8640 // set frame height if only GRAPHICS is enabled
#define GRAPHICS_FRAME_WIDTH 3840 // set frame width if only GRAPHICS is enabled
#define GRAPHICS_FRAME_HEIGHT 2160 // set frame height if only GRAPHICS is enabled
#define GRAPHICS_BACKGROUND_COLOR 0x000000 // set background color; black background (default) = 0x000000, white background = 0xFFFFFF
#define GRAPHICS_U_MAX 0.15f // maximum velocity for velocity coloring in units of LBM lattice speed of sound (c=1/sqrt(3)) (default: 0.15f)
#define GRAPHICS_Q_CRITERION 0.0001f // Q-criterion value for Q-criterion isosurface visualization (default: 0.0001f)
Expand Down
4 changes: 2 additions & 2 deletions src/graphics.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -901,8 +901,8 @@ void draw_label(const Color& c, const string& s, const int x, const int y) {}
int main(int argc, char* argv[]) {
main_arguments = get_main_arguments(argc, argv);
camera.fps_limit = 60u; // find out screen refresh rate
camera.width = GRAPGICS_FRAME_WIDTH; // must be divisible by 8
camera.height = GRAPGICS_FRAME_HEIGHT; // must be divisible by 8
camera.width = GRAPHICS_FRAME_WIDTH; // must be divisible by 8
camera.height = GRAPHICS_FRAME_HEIGHT; // must be divisible by 8
camera.fov = 100.0f;
set_zoom(1.0f); // set initial zoom
camera.update_matrix();
Expand Down
10 changes: 10 additions & 0 deletions src/info.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -119,6 +119,16 @@ void Info::print_update() const {
(steps==max_ulong ? alignr(17, lbm->get_t()) : alignr(12, lbm->get_t())+" "+print_percentage((double)(lbm->get_t()-steps_last)/(double)steps))+" | "+ // current step
alignr(19, print_time(time()))+" |" // either elapsed time or remaining time
);
#ifdef GRAPHICS
if(key_H) { // print camera settings
const string camera_position = "float3("+to_string(camera.pos.x/(float)lbm->get_Nx(), 6u)+"f*(float)Nx, "+to_string(camera.pos.y/(float)lbm->get_Ny(), 6u)+"f*(float)Ny, "+to_string(camera.pos.z/(float)lbm->get_Nz(), 6u)+"f*(float)Nz)";
const string camera_rx_ry_fov = to_string(degrees(camera.rx)-90.0f, 1u)+"f, "+to_string(180.0f-degrees(camera.ry), 1u)+"f, "+to_string(camera.fov, 1u)+"f";
const string camera_zoom = to_string(camera.zoom*(float)fmax(fmax(lbm->get_Nx(), lbm->get_Ny()), lbm->get_Nz())/(float)min(camera.width, camera.height), 6u)+"f";
if(camera.free) print_info("lbm.graphics.set_camera_free("+camera_position+", "+camera_rx_ry_fov+");");
else print_info("lbm.graphics.set_camera_centered("+camera_rx_ry_fov+", "+camera_zoom+");");
key_H = false;
}
#endif // GRAPHICS
}
void Info::print_finalize() {
allow_rendering = false;
Expand Down
50 changes: 48 additions & 2 deletions src/kernel.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -471,9 +471,9 @@ string opencl_c_container() { return R( // ########################## begin of O
cube *= 15u;
uint i; // number of triangle vertices
for(i=0u; i<15u&&triangle_table(cube+i)!=15u; i+=3u) { // create the triangles
triangles[i+2u] = vertex[triangle_table(cube+i+2u)];
triangles[i+1u] = vertex[triangle_table(cube+i+1u)];
triangles[i ] = vertex[triangle_table(cube+i )];
triangles[i+1u] = vertex[triangle_table(cube+i+1u)];
triangles[i+2u] = vertex[triangle_table(cube+i+2u)];
}
return i/3u; // return number of triangles
}
Expand Down Expand Up @@ -1864,6 +1864,52 @@ string opencl_c_container() { return R( // ########################## begin of O



)+R(kernel void voxelize_mesh(global uchar* flags, const uchar flag, const global float* p0, const global float* p1, const global float* p2, const uint triangle_number, float x0, float y0, float z0, float x1, float y1, float z1) { // voxelize triangle mesh
const uint n = get_global_id(0); // n = x+(y+z*Ny)*Nx
const float3 p = position(coordinates(n))+(float3)(0.5f*(float)def_Nx-0.5f, 0.5f*(float)def_Ny-0.5f, 0.5f*(float)def_Nz-0.5f);
const bool condition = p.x<x0||p.y<y0||p.z<z0||p.x>x1||p.y>y1||p.z>z1;
volatile local uint workgroup_condition;
workgroup_condition = 1u;
barrier(CLK_LOCAL_MEM_FENCE);
atomic_and(&workgroup_condition, (uint)condition);
barrier(CLK_LOCAL_MEM_FENCE);
const bool workgroup_all = (bool)workgroup_condition;
if(workgroup_all) return; // return straight away if grid point is outside the bounds of the mesh (~4x faster)
const float3 r0_origin = p;
const float3 r1_origin = p;
const float3 r0_direction = (float3)(+0.01f, +0.04f, +1.03f); // from each grid point, shoot an outward ray and count how often it intersects the mesh, odd number -> grid point is inside mesh
const float3 r1_direction = (float3)(-0.05f, -0.06f, -1.07f); // to eliminate errors, repeat with a second ray in a different random direction
uint intersections_0=0u, intersections_1=0u;
const uint lid = get_local_id(0);
local float3 cache_p0[def_workgroup_size]; // use local memory (~25% faster)
local float3 cache_p1[def_workgroup_size];
local float3 cache_p2[def_workgroup_size];
for(uint i=0u; i<triangle_number; i+=def_workgroup_size) {
const uint tx=3u*(i+lid), ty=tx+1u, tz=ty+1u;
cache_p0[lid] = (float3)(p0[tx], p0[ty], p0[tz]);
cache_p1[lid] = (float3)(p1[tx], p1[ty], p1[tz]);
cache_p2[lid] = (float3)(p2[tx], p2[ty], p2[tz]);
barrier(CLK_LOCAL_MEM_FENCE);
for(int j=0; j<def_workgroup_size&&i+j<triangle_number; j++) {
const float3 p0i=cache_p0[j], p1i=cache_p1[j], p2i=cache_p2[j];
const float3 u=p1i-p0i, v=p2i-p0i;
{
const float3 w=r0_origin-p0i, h=cross(r0_direction, v), q=cross(w, u);
const float f=1.0f/dot(u, h), s=f*dot(w, h), t=f*dot(r0_direction, q);
intersections_0 += (uint)(s>=0.0f&&s<=1.0f&&t>=0.0f&&s+t<=1.0f&&f*dot(v, q)>0.0f);
} {
const float3 w=r1_origin-p0i, h=cross(r1_direction, v), q=cross(w, u);
const float f=1.0f/dot(u, h), s=f*dot(w, h), t=f*dot(r1_direction, q);
intersections_1 += (uint)(s>=0.0f&&s<=1.0f&&t>=0.0f&&s+t<=1.0f&&f*dot(v, q)>0.0f);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(intersections_0%2u&&intersections_1%2u) flags[n] = flag;
} // voxelize_mesh()



// ################################################## graphics code ##################################################

)+"#ifdef GRAPHICS"+R(
Expand Down
48 changes: 38 additions & 10 deletions src/lbm.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,8 @@ LBM::LBM(const uint Nx, const uint Ny, const uint Nz, const float nu, const floa
#endif // GRAPHICS
int select_device = -1;
if((int)main_arguments.size()>0) select_device = to_int(main_arguments[0]);
Device device(select_device<0 ? select_device_with_most_flops() : select_device_with_id((uint)select_device), opencl_c_code);
const vector<Device_Info>& devices = get_devices();
this->device = Device(select_device<0 ? select_device_with_most_flops(devices) : select_device_with_id((uint)select_device, devices), opencl_c_code);
allocate(device); // lbm first
#ifdef GRAPHICS
graphics.allocate(device); // graphics after lbm
Expand Down Expand Up @@ -383,33 +384,60 @@ void LBM::T_write_device_to_vtk(const string& path) {
}
#endif // TEMPERATURE

void LBM::voxelize_mesh(const Mesh* mesh, const uchar flag) { // voxelize triangle mesh
print_info("Voxelizing mesh. This may take a few minutes.");
Memory<float3> p0(device, mesh->triangle_number, 1u, mesh->p0);
Memory<float3> p1(device, mesh->triangle_number, 1u, mesh->p1);
Memory<float3> p2(device, mesh->triangle_number, 1u, mesh->p2);
const float x0=mesh->pmin.x, y0=mesh->pmin.y, z0=mesh->pmin.z, x1=mesh->pmax.x, y1=mesh->pmax.y, z1=mesh->pmax.z; // use bounding box of mesh to speed up voxelization
Kernel kernel_voxelize_mesh(device, get_N(), "voxelize_mesh", flags, flag, p0, p1, p2, mesh->triangle_number, x0, y0, z0, x1, y1, z1);
p0.write_to_device();
p1.write_to_device();
p2.write_to_device();
flags.write_to_device();
kernel_voxelize_mesh.run();
flags.read_from_device();
}
void LBM::voxelize_stl(const string& path, const float3& center, const float3x3& rotation, const float size, const uchar flag) { // voxelize triangle mesh
const Mesh* mesh = read_stl(path, float3(get_Nx(), get_Ny(), get_Nz()), center, rotation, size);
voxelize_mesh(mesh, flag);
delete mesh;
}
void LBM::voxelize_stl(const string& path, const float3x3& rotation, const float size, const uchar flag) { // read and voxelize binary .stl file (place in box center)
voxelize_stl(path, center(), rotation, size, flag);
}
void LBM::voxelize_stl(const string& path, const float3& center, const float size, const uchar flag) { // read and voxelize binary .stl file (no rotation)
voxelize_stl(path, center, float3x3(1.0f), size, flag);
}
void LBM::voxelize_stl(const string& path, const float size, const uchar flag) { // read and voxelize binary .stl file (place in box center, no rotation)
voxelize_stl(path, center(), float3x3(1.0f), size, flag);
}

string LBM::device_defines() const { return
"\n #define def_Nx "+to_string(Nx)+"u"
"\n #define def_Ny "+to_string(Ny)+"u"
"\n #define def_Nz "+to_string(Nz)+"u"
"\n #define def_N " +to_string(get_N())+"ul"

"\n #define def_velocity_set "+to_string(velocity_set)+"u" // D2Q9/D3Q15/D3Q19/D3Q27
"\n #define def_c 0.57735027f" // lattice speed of sound c = 1/sqrt(3)*dt
"\n #define def_w " +to_string(1.0f/(3.0f*nu+0.5f))+"f" // relaxation rate w = dt/tau = dt/(nu/c^2+dt/2) = 1/(3*nu+1/2)
"\n #define D"+to_string(dimensions)+"Q"+to_string(velocity_set)+"" // D2Q9/D3Q15/D3Q19/D3Q27
"\n #define def_velocity_set "+to_string(velocity_set)+"u" // LBM velocity set (D2Q9/D3Q15/D3Q19/D3Q27)
"\n #define def_dimensions "+to_string(dimensions)+"u" // number spatial dimensions (2D or 3D)

"\n #define def_c 0.57735027f" // lattice speed of sound c = 1/sqrt(3)*dt
"\n #define def_w " +to_string(1.0f/get_tau())+"f" // relaxation rate w = dt/tau = dt/(nu/c^2+dt/2) = 1/(3*nu+1/2)
#if defined(D2Q9)
"\n #define D2Q9"
"\n #define def_w0 (1.0f/2.25f)" // center (0)
"\n #define def_ws (1.0f/9.0f)" // straight (1-4)
"\n #define def_we (1.0f/36.0f)" // edge (5-8)
#elif defined(D3Q15)
"\n #define D3Q15"
"\n #define def_w0 (1.0f/4.5f)" // center (0)
"\n #define def_ws (1.0f/9.0f)" // straight (1-6)
"\n #define def_wc (1.0f/72.0f)" // corner (7-14)
#elif defined(D3Q19)
"\n #define D3Q19"
"\n #define def_w0 (1.0f/3.0f)" // center (0)
"\n #define def_ws (1.0f/18.0f)" // straight (1-6)
"\n #define def_we (1.0f/36.0f)" // edge (7-18)
#elif defined(D3Q27)
"\n #define D3Q27"
"\n #define def_w0 (1.0f/3.375f)" // center (0)
"\n #define def_ws (1.0f/13.5f)" // straight (1-6)
"\n #define def_we (1.0f/54.0f)" // edge (7-18)
Expand Down Expand Up @@ -479,9 +507,9 @@ string LBM::device_defines() const { return

#ifdef TEMPERATURE
"\n #define TEMPERATURE"
"\n #define def_T_avg "+to_string(T_avg)+"f" // average temperature
"\n #define def_beta "+to_string(beta)+"f" // thermal expansion coefficient
"\n #define def_w_T "+to_string(1.0f/(2.0f*alpha+0.5f))+"f" // wT = dt/tauT = 1/(2*alpha+1/2), alpha = thermal diffusion coefficient
"\n #define def_beta "+to_string(beta)+"f" // thermal expansion coefficient
"\n #define def_T_avg "+to_string(T_avg)+"f" // average temperature
#endif // TEMPERATURE

#ifdef SUBGRID
Expand Down
14 changes: 14 additions & 0 deletions src/lbm.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@

class LBM {
private:
Device device; // OpenCL device associated with this LBM object
Kernel kernel_initialize; // initialization kernel
Kernel kernel_stream_collide; // main LBM kernel
Kernel kernel_update_fields; // reads DDFs and updates (rho, u, T) in device memory
Expand All @@ -14,12 +15,16 @@ class LBM {

#if defined(D2Q9)
const uint velocity_set = 9u;
const uint dimensions = 2u;
#elif defined(D3Q15)
const uint velocity_set = 15u;
const uint dimensions = 3u;
#elif defined(D3Q19)
const uint velocity_set = 19u;
const uint dimensions = 3u;
#elif defined(D3Q27)
const uint velocity_set = 27u;
const uint dimensions = 3u;
#endif // D3Q27

#ifdef FORCE_FIELD
Expand Down Expand Up @@ -116,6 +121,9 @@ class LBM {
coordinates(n, x, y, z);
return position(x, y, z);
}
float3 size() const { // returns size of box
return float3((float)Nx, (float)Ny, (float)Nz);
}
float3 center() const { // returns center of box
return float3(0.5f*(float)Nx-0.5f, 0.5f*(float)Ny-0.5f, 0.5f*(float)Nz-0.5f);
}
Expand Down Expand Up @@ -170,6 +178,12 @@ class LBM {
void T_write_device_to_vtk(const string& path=""); // write binary .vtk file
#endif // TEMPERATURE

void voxelize_mesh(const Mesh* mesh, const uchar flag=TYPE_S); // voxelize mesh
void voxelize_stl(const string& path, const float3& center, const float3x3& rotation, const float size=-1.0f, const uchar flag=TYPE_S); // read and voxelize binary .stl file
void voxelize_stl(const string& path, const float3x3& rotation, const float size=-1.0f, const uchar flag=TYPE_S); // read and voxelize binary .stl file (place in box center)
void voxelize_stl(const string& path, const float3& center, const float size=-1.0f, const uchar flag=TYPE_S); // read and voxelize binary .stl file (no rotation)
void voxelize_stl(const string& path, const float size=-1.0f, const uchar flag=TYPE_S); // read and voxelize binary .stl file (place in box center, no rotation)

#ifdef GRAPHICS
class Graphics {
private:
Expand Down
Loading

0 comments on commit 4f3c5d4

Please sign in to comment.