The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
(click on images to show videos on YouTube)
v1.0 (04.08.2022) changes (public release)
public release
v1.1 (29.09.2022) changes (GPU voxelization)
added solid voxelization on GPU (slow algorithm)
added tool to print current camera position (key G)
minor bug fix (workaround for Intel iGPU driver bug with triangle rendering)
v1.2 (24.10.2022) changes (force/torque computation)
added functions to compute force/torque on objects
added function to translate Mesh
added Stokes drag validation setup
v1.3 (10.11.2022) changes (minor bug fixes)
added unit conversion functions for torque
`FORCE_FIELD` and `VOLUME_FORCE` can now be used independently
minor bug fix (workaround for AMD legacy driver bug with binary number literals)
v1.4 (14.12.2022) changes (Linux graphics)
complete rewrite of C++ graphics library to minimize API dependencies
added interactive graphics mode on Linux with X11
fixed streamline visualization bug in 2D
v2.0 (09.01.2023) changes (multi-GPU upgrade)
added (cross-vendor) multi-GPU support on a single node (PC/laptop/server)
v2.1 (15.01.2023) changes (fast voxelization)
made solid voxelization on GPU lightning fast (new algorithm, from minutes to milliseconds)
v2.2 (20.01.2023) changes (velocity voxelization)
added option to voxelize moving/rotating geometry on GPU, with automatic velocity initialization for each grid point based on center of rotation, linear velocity and rotational velocity
cells that are converted from solid->fluid during re-voxelization now have their DDFs properly initialized
added option to not auto-scale mesh during `read_stl(...)`, with negative `size` parameter
added kernel for solid boundary rendering with marching-cubes
v2.3 (30.01.2023) changes (particles)
added particles with immersed-boundary method (either passive or 2-way-coupled, only supported with single-GPU)
minor optimization to GPU voxelization algorithm (workgroup threads outside mesh bounding-box return after ray-mesh intersections have been found)
displayed GPU memory allocation size is now fully accurate
fixed bug in `write_line()` function in `src/utilities.hpp`
removed `.exe` file extension for Linux/macOS
v2.4 (11.03.2023) changes (UI improvements)
added a help menu with key H that shows keyboard/mouse controls, visualization settings and simulation stats
improvements to keyboard/mouse control (+/- for zoom, mouse click frees/locks cursor)
added suggestion of largest possible grid resolution if resolution is set larger than memory allows
minor optimizations in multi-GPU communication (insignificant performance difference)
fixed bug in temperature equilibrium function for temperature extension
fixed erroneous double literal for Intel iGPUs in skybox color functions
fixed bug in make.sh where multi-GPU device IDs would not get forwarded to the executable
minor bug fixes in graphics engine (free cursor not centered during rotation, labels in VR mode)
fixed bug in `LBM::voxelize_stl()` size parameter default initialization
v2.5 (11.04.2023) changes (raytracing overhaul)
implemented light absorption in fluid for raytracing graphics (no performance impact)
improved raytracing framerate when camera is inside fluid
fixed skybox pole flickering artifacts
fixed bug where moving objects during re-voxelization would leave an erroneous trail of solid grid cells behind
v2.6 (16.04.2023) changes (Intel Arc patch)
patched OpenCL issues of Intel Arc GPUs: now VRAM allocations >4GB are possible and correct VRAM capacity is reported
v2.7 (29.05.2023) changes (visualization upgrade)
added slice visualization (key 2 / key 3 modes, then switch through slice modes with key T, move slice with keys Q/E)
made flag wireframe / solid surface visualization kernels toggleable with key 1
added surface pressure visualization (key 1 when `FORCE_FIELD` is enabled and `lbm.calculate_force_on_boundaries();` is called)
added binary `.vtk` export function for meshes with `lbm.write_mesh_to_vtk(Mesh* mesh);`
added `time_step_multiplicator` for `integrate_particles()` function in PARTICLES extension
made correction of wrong memory reporting on Intel Arc more robust
fixed bug in `write_file()` template functions
reverted back to separate `cl::Context` for each OpenCL device, as a shared context would otherwise allocate extra VRAM on all other, unused Nvidia GPUs
removed Debug and x86 configurations from Visual Studio solution file (one less complication for compiling)
fixed bug that particles could get too close to walls and get stuck, or leave the fluid phase (added boundary force)
v2.8 (24.06.2023) changes (documentation + polish)
finally added more documentation
cleaned up all sample setups in `setup.cpp` for more beginner-friendliness, and added required extensions in `defines.hpp` as comments to all setups
improved loading of composite `.stl` geometries by adding an option to omit automatic mesh repositioning, and added more functionality to the `Mesh` struct in `utilities.hpp`
added `uint3 resolution(float3 box_aspect_ratio, uint memory)` function to compute the simulation box resolution based on the box aspect ratio and VRAM occupation in MB
added `bool lbm.graphics.next_frame(...)` function to export images for a specified video length in the `main_setup` compute loop
added `VIS_...` macros to ease setting visualization modes in headless graphics mode via `lbm.graphics.visualization_modes`
simulation box dimensions are now automatically made equally divisible by domains for multi-GPU simulations
fixed Info/Warning/Error message formatting for loading files and made Info/Warning/Error message labels colored
added Ahmed body setup as an example on how body forces and drag coefficient are computed
added Cessna 172 and Bell 222 setups to showcase loading composite .stl geometries and revoxelization of moving parts
added optional semi-transparent rendering mode (`#define GRAPHICS_TRANSPARENCY 0.7f` in `defines.hpp`)
fixed flickering of streamline visualization in interactive graphics
improved smooth positioning of streamlines in slice mode
fixed bug where `mass` and `massex` in `SURFACE` extension were also allocated in CPU RAM (not required)
fixed bug in Q-criterion rendering of halo data in multi-GPU mode, reduced gap width between domains
removed shared memory optimization from mesh voxelization kernel, as it crashes on Nvidia GPUs with new GPU drivers and is incompatible with old OpenCL 1.0 GPUs
fixed raytracing attenuation color when no surface is at the simulation box walls with periodic boundaries
v2.9 (31.07.2023) changes (multithreading)
added cross-platform `parallel_for` implementation in `utilities.hpp` using `std::thread`
significantly (>4x) faster simulation startup with multithreaded geometry initialization and sanity checks
faster `calculate_force_on_object()` and `calculate_torque_on_object()` functions with multithreading
added total runtime and LBM runtime to `lbm.write_status()`
fixed bug in voxelization ray direction for re-voxelizing rotating objects
fixed bug in `Mesh::get_bounding_box_size()`
fixed bug in `print_message()` function in `utilities.hpp`
v2.10 (05.11.2023) changes (frustum culling)
improved rasterization performance via frustum culling when only part of the simulation box is visible
improved switching between centered/free camera mode
refactored OpenCL rendering library
unit conversion factors are now automatically printed in console when `units.set_m_kg_s(...)` is used
faster startup time for FluidX3D benchmark
minor bug fix in `voxelize_mesh(...)` kernel
fixed bug in `shading(...)`
replaced `std::rand()` function (slow in multithreading) with a standard C99 LCG
more robust correction of wrong VRAM capacity reporting on Intel Arc GPUs
fixed some minor compiler warnings
v2.11 (07.12.2023) changes (improved Linux graphics)
interactive graphics on Linux are now in fullscreen mode too, fully matching Windows
made CPU/GPU buffer initialization significantly faster with `std::fill` and `enqueueFillBuffer` (overall ~8% faster simulation startup)
added operating system info to OpenCL device driver version printout
fixed flickering with frustum culling at very small field of view
fixed bug where the rendered/exported frame was not updated when `visualization_modes` changed
v2.12 (18.01.2024) changes (faster startup)
~3x faster source code compiling on Linux using multiple CPU cores if `make` is installed
significantly faster simulation initialization (~40% single-GPU, ~15% multi-GPU)
minor bug fix in `Memory_Container::reset()` function
v2.13 (11.02.2024) changes (improved .vtk export)
data in exported `.vtk` files is now automatically converted to SI units
~2x faster `.vtk` export with multithreading
added unit conversion functions for `TEMPERATURE` extension
fixed graphical artifacts with axis-aligned camera in raytracing
fixed `get_exe_path()` for macOS
fixed X11 multi-monitor issues on Linux
workaround for Nvidia driver bug: `enqueueFillBuffer` is broken for large buffers on Nvidia GPUs
fixed slow numeric drift issues caused by `-cl-fast-relaxed-math`
fixed wrong Maximum Allocation Size reporting in `LBM::write_status()`
fixed missing scaling of coordinates to SI units in `LBM::write_mesh_to_vtk()`
v2.14 (03.03.2024) changes (visualization upgrade)
coloring can now be switched between velocity/density/temperature with key Z
improved, uniform color palettes for velocity/density/temperature visualization
color scale with automatic unit conversion can now be shown with key H
slice mode for field visualization now draws fully filled-in slices instead of only lines for velocity vectors
shading in `VIS_FLAG_SURFACE` and `VIS_PHI_RASTERIZE` modes is smoother now
`make.sh` now automatically detects the operating system and X11 support on Linux, and only runs FluidX3D if the last compilation was successful
fixed compiler warnings on Android
fixed `make.sh` failing on some systems due to a nonstandard interpreter path
fixed that `make` would not compile with multiple cores on some systems
v2.15 (09.04.2024) changes (framerate boost)
eliminated one frame memory copy and one clear frame operation in rendering chain, for 20-70% higher framerate on both Windows and Linux
enabled `g++` compiler optimizations for faster startup and higher rendering framerate
fixed bug in multithreaded sanity checks
fixed wrong unit conversion for thermal expansion coefficient
fixed density to pressure conversion in LBM units
fixed bug that raytracing kernel could lock up simulation
fixed minor visual artifacts with raytracing
fixed that the console sometimes was not cleared before `INTERACTIVE_GRAPHICS_ASCII` rendering starts
v2.16 (02.05.2024) changes (bug fixes)
simplified, 10% faster marching-cubes implementation with 1D interpolation on edges instead of 3D interpolation, which eliminates the edge table
added faster, simplified marching-cubes variant for solid surface rendering where edges are always halfway between grid cells
refactoring in OpenCL rendering kernels
fixed that voxelization failed in Intel OpenCL CPU Runtime due to array out-of-bounds access
fixed that voxelization did not always produce binary identical results in multi-GPU compared to single-GPU
fixed that velocity voxelization failed for free surface simulations
fixed terrible performance on ARM GPUs by macro-replacing fused-multiply-add (`fma`) with `a*b+c`
fixed that Y/Z keys were incorrect for QWERTY keyboard layout in Linux
fixed that free camera movement speed in help overlay was not updated in stationary image when scrolling
fixed that cursor would sometimes flicker when scrolling on trackpads with Linux-X11 interactive graphics
fixed flickering of interactive rendering with multi-GPU when camera is not moved
fixed missing `XInitThreads()` call that could crash Linux interactive graphics on some systems
fixed z-fighting between `graphics_rasterize_phi()` and `graphics_flags_mc()` kernels
v2.17 (05.06.2024) changes (unlimited domain resolution)
domains are no longer limited to 4.29 billion (2³², 1624³) grid cells or 225 GB memory; if more are used, the OpenCL code will automatically compile with 64-bit indexing
new, faster raytracing-based field visualization for single-GPU simulations
added GPU Driver and OpenCL Runtime installation instructions to documentation
refactored `INTERACTIVE_GRAPHICS_ASCII`
fixed memory leak in destructors of `floatN`, `floatNxN`, `doubleN`, `doubleNxN` (all unused)
made camera movement/rotation/zoom behavior independent of framerate
fixed that `smart_device_selection()` would print a wrong warning if a device reports 0 MHz clock speed
v2.18 (21.07.2024) changes (more bug fixes)
added support for high refresh rate monitors on Linux
more compact OpenCL Runtime installation scripts in Documentation
driver/runtime installation instructions will now be printed to console if no OpenCL devices are available
added domain information to `LBM::write_status()`
added `LBM::index` function for `uint3` input parameter
fixed that very large simulations sometimes wouldn't render properly by increasing maximum render distance from 10k to 2.1M
fixed mouse input stuttering at high screen refresh rate on Linux
fixed graphical artifacts in free surface raytracing on Intel CPU Runtime for OpenCL
fixed runtime estimation printed in console for setups with multiple `lbm.run(...)` calls
fixed density oscillations in sample setups (too large `lbm_u`)
fixed minor graphical artifacts in `raytrace_phi()`
fixed minor graphical artifacts in `ray_grid_traverse_sum()`
fixed wrong printed time step count on raindrop sample setup
v2.19 (07.09.2024) changes (camera splines)
the camera can now fly along a smooth path through a list of provided keyframe camera placements, using Catmull-Rom splines
more accurate remaining runtime estimation that includes time spent on rendering
enabled FP16S memory compression by default
camera placement printed with key G is now formatted for easier copy/paste
added benchmark chart in Readme using mermaid gantt chart
placed memory allocation info during simulation startup at better location
fixed threading conflict between `INTERACTIVE_GRAPHICS` and `lbm.graphics.write_frame();`
fixed maximum buffer allocation size limit for AMD GPUs and in Intel CPU Runtime for OpenCL
fixed wrong `Re<Re_max` info printout for 2D simulations
minor fix in `bandwidth_bytes_per_cell_device()`
Read the FluidX3D Documentation!
streaming (part 2/2):
f<sub>0</sub><sup>temp</sup>(x,t) = f<sub>0</sub>(x,t)
f<sub>i</sub><sup>temp</sup>(x,t) = f<sub>(t%2 ? i : (i%2 ? i+1 : i-1))</sub>(i%2 ? x : x-e<sub>i</sub>, t)   for   i ∈ [1, q-1]
collision:
ρ(x,t) = (Σ<sub>i</sub> f<sub>i</sub><sup>temp</sup>(x,t)) + 1
u(x,t) = 1∕ρ(x,t) Σ<sub>i</sub> c<sub>i</sub> f<sub>i</sub><sup>temp</sup>(x,t)
f<sub>i</sub><sup>eq-shifted</sup>(x,t) = w<sub>i</sub> ρ · ((u∘c<sub>i</sub>)²∕(2c⁴) - (u∘u)∕(2c²) + (u∘c<sub>i</sub>)∕c²) + w<sub>i</sub> (ρ-1)
f<sub>i</sub><sup>temp</sup>(x, t+Δt) = f<sub>i</sub><sup>temp</sup>(x,t) + Ω<sub>i</sub>(f<sub>i</sub><sup>temp</sup>(x,t), f<sub>i</sub><sup>eq-shifted</sup>(x,t), τ)
streaming (part 1/2):
f<sub>0</sub>(x, t+Δt) = f<sub>0</sub><sup>temp</sup>(x, t+Δt)
f<sub>(t%2 ? (i%2 ? i+1 : i-1) : i)</sub>(i%2 ? x+e<sub>i</sub> : x, t+Δt) = f<sub>i</sub><sup>temp</sup>(x, t+Δt)   for   i ∈ [1, q-1]
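The even/odd index selection in the streaming steps above reduces to a one-liner. Here is a minimal C++ sketch of that selection (a hypothetical helper for illustration; the actual FluidX3D kernels are OpenCL C):

```cpp
#include <cstdint>

// Which DDF index direction i is loaded from at time step t, following
// f_(t%2 ? i : (i%2 ? i+1 : i-1)) from the streaming equations above:
// on odd time steps, direction i reads its own index; on even time steps,
// it reads the opposite direction i±1.
uint32_t esoteric_pull_index(const uint32_t i, const uint64_t t) {
	return (t%2ull) ? i : ((i%2u) ? i+1u : i-1u);
}
```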
variable | SI units | defining equation | description |
---|---|---|---|
x | m | x = (x,y,z)<sup>T</sup> | 3D position in Cartesian coordinates |
t | s | - | time |
ρ | kg∕m³ | ρ = (Σ<sub>i</sub> f<sub>i</sub>)+1 | mass density of fluid |
p | kg∕(m·s²) | p = c² ρ | pressure of fluid |
u | m∕s | u = 1∕ρ Σ<sub>i</sub> c<sub>i</sub> f<sub>i</sub> | velocity of fluid |
ν | m²∕s | ν = μ∕ρ | kinematic shear viscosity of fluid |
μ | kg∕(m·s) | μ = ρ ν | dynamic viscosity of fluid |
f<sub>i</sub> | kg∕m³ | - | shifted density distribution functions (DDFs) |
Δx | m | Δx = 1 | lattice constant (in LBM units) |
Δt | s | Δt = 1 | simulation time step (in LBM units) |
c | m∕s | c = 1∕√3 Δx∕Δt | lattice speed of sound (in LBM units) |
i | 1 | 0 ≤ i < q | LBM streaming direction index |
q | 1 | q ∈ { 9,15,19,27 } | number of LBM streaming directions |
e<sub>i</sub> | m | D2Q9 / D3Q15/19/27 | LBM streaming directions |
c<sub>i</sub> | m∕s | c<sub>i</sub> = e<sub>i</sub>∕Δt | LBM streaming velocities |
w<sub>i</sub> | 1 | Σ<sub>i</sub> w<sub>i</sub> = 1 | LBM velocity set weights |
Ω<sub>i</sub> | kg∕m³ | SRT or TRT | LBM collision operator |
τ | s | τ = ν∕c² + Δt∕2 | LBM relaxation time |
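As a worked example of the last table row, the relaxation time follows directly from a chosen viscosity in LBM units (a minimal sketch, assuming Δx = Δt = 1 as in the table):

```cpp
#include <cmath>
#include <cstdio>

int main() {
	const float nu  = 0.01f;            // example kinematic shear viscosity in LBM units
	const float c   = 1.0f/sqrtf(3.0f); // lattice speed of sound, c = 1/sqrt(3) Δx/Δt
	const float tau = nu/(c*c) + 0.5f;  // τ = ν/c² + Δt/2 = 3ν + 1/2
	printf("tau = %g\n", tau);          // prints "tau = 0.53"
	return 0;
}
```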
velocity sets: D2Q9, D3Q15, D3Q19 (default), D3Q27
collision operators: single-relaxation-time (SRT/BGK) (default), two-relaxation-time (TRT)
DDF-shifting and other algebraic optimizations to minimize round-off error
(memory layout per cell: density, velocity, flags, DDFs; each colored square = 1 Byte)
allows for 19 Million cells per 1 GB VRAM
in-place streaming with Esoteric-Pull: eliminates redundant copy of density distribution functions (DDFs) in memory; almost cuts memory demand in half and slightly increases performance due to implicit bounce-back boundaries; offers optimal memory access patterns for single-cell in-place streaming
decoupled arithmetic precision (FP32) and memory precision (FP32 or FP16S or FP16C): all arithmetic is done in FP32 for compatibility on all hardware, but DDFs in memory can be compressed to FP16S or FP16C: almost cuts memory demand in half again and almost doubles performance, without impacting overall accuracy for most setups
`TYPE_S`: (stationary or moving) solid boundaries
`TYPE_E`: equilibrium boundaries (inflow/outflow)
`TYPE_T`: temperature boundaries
`TYPE_F`: free surface (fluid)
`TYPE_I`: free surface (interface)
`TYPE_G`: free surface (gas)
`TYPE_X`: remaining for custom use or further extensions
`TYPE_Y`: remaining for custom use or further extensions
(memory layout per cell: density, velocity, flags, 2 copies of DDFs; each colored square = 1 Byte)
allows for 3 Million cells per 1 GB VRAM
traditional LBM (D3Q19) with FP64 requires ~344 Bytes/cell
FluidX3D (D3Q19) requires only 55 Bytes/cell with Esoteric-Pull+FP16
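These per-cell numbers can be reproduced from the field layout. The breakdown below is an assumption inferred from the D3Q19 description (2 FP64 DDF copies plus FP64 ρ/u/flags for traditional codes; 1 FP16 DDF copy plus FP32 ρ/u and 1-Byte flags for FluidX3D), but it matches the stated totals:

```cpp
#include <cstdio>

int main() {
	const int q = 19;                            // D3Q19 velocity set
	const int traditional = 2*q*8 + 8 + 3*8 + 8; // 2 DDF copies (FP64) + ρ + u + flags = 344 Bytes/cell
	const int fluidx3d    = q*2 + 4 + 3*4 + 1;   // 1 DDF copy (FP16) + ρ (FP32) + u (FP32) + flags = 55 Bytes/cell
	printf("%d vs. %d Bytes/cell\n", traditional, fluidx3d);
	return 0;
}
```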
large cost saving: comparison of maximum single-GPU grid resolution for D3Q19 LBM
GPU VRAM capacity | 1 GB | 2 GB | 3 GB | 4 GB | 6 GB | 8 GB | 10 GB | 11 GB | 12 GB | 16 GB | 20 GB | 24 GB | 32 GB | 40 GB | 48 GB | 64 GB | 80 GB | 94 GB | 128 GB | 192 GB | 256 GB |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
approximate GPU price | $25 GT 210 | $25 GTX 950 | $12 GTX 1060 | $50 GT 730 | $35 GTX 1060 | $70 RX 470 | $500 RTX 3080 | $240 GTX 1080 Ti | $75 Tesla M40 | $75 Instinct MI25 | $900 RX 7900 XT | $205 Tesla P40 | $600 Instinct MI60 | $5500 A100 | $2400 RTX 8000 | $10k Instinct MI210 | $11k A100 | >$40k H100 NVL | ? GPU Max 1550 | ~$10k MI300X | - |
traditional LBM (FP64) | 144³ | 182³ | 208³ | 230³ | 262³ | 288³ | 312³ | 322³ | 330³ | 364³ | 392³ | 418³ | 460³ | 494³ | 526³ | 578³ | 624³ | 658³ | 730³ | 836³ | 920³ |
FluidX3D (FP32/FP32) | 224³ | 282³ | 322³ | 354³ | 406³ | 448³ | 482³ | 498³ | 512³ | 564³ | 608³ | 646³ | 710³ | 766³ | 814³ | 896³ | 966³ | 1018³ | 1130³ | 1292³ | 1422³ |
FluidX3D (FP32/FP16) | 266³ | 336³ | 384³ | 424³ | 484³ | 534³ | 574³ | 594³ | 610³ | 672³ | 724³ | 770³ | 848³ | 912³ | 970³ | 1068³ | 1150³ | 1214³ | 1346³ | 1540³ | 1624³ |
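The table rows follow from dividing the VRAM capacity by the Bytes/cell figures above and taking the cube root. This sketch ignores the small VRAM reserve of the driver and renderer, so it lands slightly above the tabulated values:

```cpp
#include <cmath>
#include <cstdio>

int main() {
	const double vram = 8.0*1024.0*1024.0*1024.0;         // example: 8 GB VRAM
	const double bytes_per_cell[3] = {344.0, 93.0, 55.0}; // FP64 traditional, FP32/FP32, FP32/FP16
	for(int i=0; i<3; i++) {
		printf("%3.0f Bytes/cell -> %4.0f^3 grid cells\n", bytes_per_cell[i], std::cbrt(vram/bytes_per_cell[i]));
	} // prints 292^3 / 452^3 / 538^3, close to the 288^3 / 448^3 / 534^3 in the 8 GB column
	return 0;
}
```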
domain decomposition allows pooling VRAM from multiple GPUs for much larger grid resolution
GPUs don't have to be identical (not even from the same vendor), but similar VRAM capacity/bandwidth is recommended
domain communication architecture (simplified)
    ++   .-----------------------------------------------------------------.   ++
    ++   |                              GPU 0                              |   ++
    ++   |                           LBM Domain 0                          |   ++
    ++   '-----------------------------------------------------------------'   ++
    ++        |  selective                                             /|       ++
    ++        |/ in-VRAM copy                                           |       ++
    ++        .-------------------------------------------------------.        ++
    ++        |               GPU 0 - Transfer Buffer 0               |        ++
    ++        '-------------------------------------------------------'        ++
    !!           |  PCIe                                            /|          !!
    !!           |/ copy                                             |          !!
    @@        .-------------------------.   .-------------------------.        @@
    @@        | CPU - Transfer Buffer 0 |   | CPU - Transfer Buffer 1 |        @@
    @@        '-------------------------'  /'-------------------------'        @@
    @@                          pointer  X  swap                               @@
    @@        .-------------------------./  .-------------------------.        @@
    @@        | CPU - Transfer Buffer 1 |   | CPU - Transfer Buffer 0 |        @@
    @@        '-------------------------'   '-------------------------'        @@
    !!          /|  PCIe                                             |          !!
    !!           |  copy                                            |/          !!
    ++        .-------------------------------------------------------.        ++
    ++        |               GPU 1 - Transfer Buffer 1               |        ++
    ++        '-------------------------------------------------------'        ++
    ++       /|  selective                                            |        ++
    ++        |  in-VRAM copy                                        |/        ++
    ++   .-----------------------------------------------------------------.   ++
    ++   |                              GPU 1                              |   ++
    ++   |                           LBM Domain 1                          |   ++
    ++   '-----------------------------------------------------------------'   ++
    ##                                    |                                     ##
    ##                     domain synchronization barrier                       ##
    ##                                    |                                     ##
    ||  -------------------------------------------------------------> time    ||
domain communication architecture (detailed)
    ++   .-----------------------------------------------------------------.   ++
    ++   |                              GPU 0                              |   ++
    ++   |                           LBM Domain 0                          |   ++
    ++   '-----------------------------------------------------------------'   ++
    ++     |  selective in-  /|    |  selective in-  /|    |  selective in-  /| ++
    ++     |/ VRAM copy (X)   |    |/ VRAM copy (Y)   |    |/ VRAM copy (Z)   | ++
    ++   .---------------------.---------------------.---------------------.   ++
    ++   |   GPU 0 - TB 0X+    |   GPU 0 - TB 0Y+    |   GPU 0 - TB 0Z+    |   ++
    ++   |   GPU 0 - TB 0X-    |   GPU 0 - TB 0Y-    |   GPU 0 - TB 0Z-    |   ++
    ++   '---------------------'---------------------'---------------------'   ++
    !!      | PCIe /|              | PCIe /|              | PCIe /|             !!
    !!      |/ copy |              |/ copy |              |/ copy |             !!
    @@   .---------. .---------.---------. .---------.---------. .---------.   @@
    @@   | CPU 0X+ | | CPU 1X- | CPU 0Y+ | | CPU 3Y- | CPU 0Z+ | | CPU 5Z- |   @@
    @@   | CPU 0X- | | CPU 2X+ | CPU 0Y- | | CPU 4Y+ | CPU 0Z- | | CPU 6Z+ |   @@
    @@   '--------- /---------'---------  /---------'---------  /---------'    @@
    @@    pointer  X  swap (X)  pointer  X  swap (Y)  pointer  X  swap (Z)     @@
    @@   .---------/ ---------.---------/  ---------.---------/  ---------.    @@
    @@   | CPU 1X- | | CPU 0X+ | CPU 3Y- | | CPU 0Y+ | CPU 5Z- | | CPU 0Z+ |   @@
    @@   | CPU 2X+ | | CPU 0X- | CPU 4Y+ | | CPU 0Y- | CPU 6Z+ | | CPU 0Z- |   @@
    @@   '---------' '---------'---------' '---------'---------' '---------'   @@
    !!     /| PCIe |              /| PCIe |              /| PCIe |              !!
    !!      | copy |/              | copy |/              | copy |/             !!
    ++   .--------------------..---------------------..--------------------.   ++
    ++   |   GPU 1 - TB 1X-   ||   GPU 3 - TB 3Y-    ||   GPU 5 - TB 5Z-   |   ++
    ++   :====================::=====================::====================:   ++
    ++   |   GPU 2 - TB 2X+   ||   GPU 4 - TB 4Y+    ||   GPU 6 - TB 6Z+   |   ++
    ++   '--------------------''---------------------''--------------------'   ++
    ++    /| selective in-  |    /| selective in-  |    /| selective in-  |     ++
    ++     | VRAM copy (X)  |/    | VRAM copy (Y)  |/    | VRAM copy (Z)  |/    ++
    ++   .--------------------..---------------------..--------------------.   ++
    ++   |        GPU 1       ||        GPU 3        ||        GPU 5       |   ++
    ++   |    LBM Domain 1    ||    LBM Domain 3     ||    LBM Domain 5    |   ++
    ++   :====================::=====================::====================:   ++
    ++   |        GPU 2       ||        GPU 4        ||        GPU 6       |   ++
    ++   |    LBM Domain 2    ||    LBM Domain 4     ||    LBM Domain 6    |   ++
    ++   '--------------------''---------------------''--------------------'   ++
    ##              |                     |                     |               ##
    ##              |   domain synchronization barriers         |               ##
    ##              |                     |                     |               ##
    ||  -------------------------------------------------------------> time    ||
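In words, each time step runs one barrier-separated communication phase per axis: halo cells are gathered into transfer buffers in VRAM, copied over PCIe to the CPU, exchanged between neighboring domains by a pointer swap (no memcpy), copied back over PCIe and inserted into the neighbor domains. A runnable stub sketch of that sequence (all function names here are illustrative, not FluidX3D's actual API):

```cpp
#include <cstdio>
#include <initializer_list>

enum Axis { X, Y, Z };

// illustrative stubs for the steps shown in the diagrams above
void selective_in_vram_copy(Axis a)   { printf("selective in-VRAM copy (%c)\n", "XYZ"[a]); }
void pcie_copy_to_host(Axis a)        { printf("PCIe copy to CPU transfer buffers (%c)\n", "XYZ"[a]); }
void host_pointer_swap(Axis a)        { printf("pointer swap with neighbor domain (%c)\n", "XYZ"[a]); }
void pcie_copy_to_device(Axis a)      { printf("PCIe copy to neighbor GPUs (%c)\n", "XYZ"[a]); }
void selective_in_vram_insert(Axis a) { printf("selective in-VRAM insert (%c)\n", "XYZ"[a]); }
void synchronization_barrier()        { printf("domain synchronization barrier\n"); }

int main() {
	for(Axis a : {X, Y, Z}) { // one communication phase per axis, as in the detailed diagram
		selective_in_vram_copy(a);
		pcie_copy_to_host(a);
		host_pointer_swap(a);
		pcie_copy_to_device(a);
		selective_in_vram_insert(a);
		synchronization_barrier();
	}
	return 0;
}
```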
single-GPU/CPU benchmarks
multi-GPU benchmarks
- thermal LBM to simulate thermal convection
  - D3Q7 subgrid for thermal DDFs
  - in-place streaming with Esoteric-Pull for thermal DDFs
  - optional FP16S or FP16C compression for thermal DDFs with DDF-shifting
- state-of-the-art free surface LBM (FSLBM) implementation:
  - volume-of-fluid model
  - fully analytic PLIC for efficient curvature calculation
  - improved mass conservation
  - ultra efficient implementation with only 4 kernels in addition to the `stream_collide()` kernel
- boundary types:
  - stationary mid-grid bounce-back boundaries (stationary solid boundaries)
  - moving mid-grid bounce-back boundaries (moving solid boundaries)
  - equilibrium boundaries (non-reflective inflow/outflow)
  - temperature boundaries (fixed temperature)
- optional computation of forces from the fluid on solid boundaries
- global force per volume (Guo forcing), can be modified on-the-fly
- local force per volume (force field)
- Smagorinsky-Lilly subgrid turbulence LES model to keep simulations with very large Reynolds number stable
Π<sub>αβ</sub> = Σ<sub>i</sub> e<sub>iα</sub> e<sub>iβ</sub> (f<sub>i</sub> - f<sub>i</sub><sup>eq-shifted</sup>)
Q = Σ<sub>αβ</sub> Π<sub>αβ</sub>²
τ = ½ (τ<sub>0</sub> + √(τ<sub>0</sub>² + (16√2)∕(3π²) · √Q∕ρ))
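A minimal numeric sketch of the τ adjustment above (the Π<sub>αβ</sub> sum is omitted here; Q is taken as given, all quantities in LBM units):

```cpp
#include <cmath>

// adjusted relaxation time from the Smagorinsky-Lilly equation above
float tau_les(const float tau_0, const float Q, const float rho) {
	const float pi = 3.14159265358979f;
	const float c  = 16.0f*sqrtf(2.0f)/(3.0f*pi*pi); // the constant (16√2)/(3π²)
	return 0.5f*(tau_0 + sqrtf(tau_0*tau_0 + c*sqrtf(Q)/rho));
}
```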
particles with immersed-boundary method (either passive or 2-way-coupled, single-GPU only)
FluidX3D can do simulations so large that storing the volumetric data for later rendering becomes unmanageable (like 120 GB for a single frame, hundreds of terabytes for a video)
instead, FluidX3D allows rendering raw simulation data directly in VRAM, so no large volumetric files have to be exported to the hard disk (see my technical talk)
the rendering is so fast that it works interactively in real time for both rasterization and raytracing
rasterization and raytracing are done in OpenCL and work on all GPUs, even the ones without RTX/DXR raytracing cores or without any rendering hardware at all (like A100, MI200, ...)
if no monitor is available (like on a remote Linux server), there is an ASCII rendering mode to interactively visualize the simulation in the terminal (even in WSL and/or through SSH)
rendering is fully multi-GPU-parallelized via seamless domain decomposition rasterization
with interactive graphics mode disabled, image resolution can be as large as VRAM allows for (4K/8K/16K and above)
(interactive) visualization modes:
flag wireframe / solid surface (and force vectors on solid cells or surface pressure if the extension is used)
velocity field (with slice mode)
streamlines (with slice mode)
velocity-colored Q-criterion isosurface
rasterized free surface with marching-cubes
raytraced free surface with fast ray-grid traversal and marching-cubes, either 1-4 rays/pixel or 1-10 rays/pixel
FluidX3D is written in OpenCL 1.2, so it runs on all hardware from all vendors (Nvidia, AMD, Intel, ...):
world's fastest datacenter GPUs: MI300X, H100 (NVL), A100, MI200, MI100, V100(S), GPU Max 1100, ...
gaming GPUs (desktop/laptop): Nvidia GeForce, AMD Radeon, Intel Arc
professional/workstation GPUs: Nvidia Quadro, AMD Radeon Pro / FirePro, Intel Arc Pro
integrated GPUs
CPUs (requires installation of Intel CPU Runtime for OpenCL)
Intel Xeon Phi (requires installation of Intel CPU Runtime for OpenCL)
smartphone ARM GPUs
native cross-vendor multi-GPU implementation
uses PCIe communication, so no SLI/Crossfire/NVLink/InfinityFabric required
single-node parallelization, so no MPI installation required
GPUs don't even have to be from the same vendor, but similar memory capacity and bandwidth are recommended
works on Windows and Linux with C++17, with limited support also for macOS and Android
supports importing and voxelizing triangle meshes from binary `.stl` files, with fast GPU voxelization
supports exporting volumetric data as binary `.vtk` files
supports exporting triangle meshes as binary `.vtk` files
supports exporting rendered images as `.png`/`.qoi`/`.bmp` files; encoding runs in parallel on the CPU while the simulation on GPU can continue without delay
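To tie the export features together, here is a hedged sketch of how a `main_setup()` compute loop might render and export frames headlessly, using the function names mentioned in this README (exact signatures and `VIS_...` macro names are assumptions; see the documentation for the real API):

```cpp
void main_setup() { // skeleton of a user setup (illustrative)
	LBM lbm(256u, 256u, 256u, 0.02f); // grid resolution and kinematic viscosity in LBM units
	// ... voxelize geometry and set flags here ...
	lbm.graphics.visualization_modes = VIS_FLAG_SURFACE; // headless visualization mode
	while(lbm.graphics.next_frame(10000u, 30.0f)) { // render frames for a 30 s video over 10000 steps
		lbm.graphics.write_frame(); // .png encoding runs on CPU threads while the GPU keeps computing
		lbm.run(28u); // advance the simulation a few time steps between frames
	}
}
```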
Here are performance benchmarks on various hardware in MLUPs/s, or how many million lattice cells are updated per second. The settings used for the benchmark are D3Q19 SRT with no extensions enabled (only LBM with implicit mid-grid bounce-back boundaries) and the setup consists of an empty cubic box with sufficient size (typically 256³). Without extensions, a single lattice cell requires:
a memory capacity of 93 (FP32/FP32) or 55 (FP32/FP16) Bytes
a memory bandwidth of 153 (FP32/FP32) or 77 (FP32/FP16) Bytes per time step
363 (FP32/FP32) or 406 (FP32/FP16S) or 1275 (FP32/FP16C) FLOPs per time step (FP32+INT32 operations counted combined)
In consequence, the arithmetic intensity of this implementation is 2.37 (FP32/FP32) or 5.27 (FP32/FP16S) or 16.56 (FP32/FP16C) FLOPs/Byte; since modern GPUs provide far more compute per Byte of bandwidth than this, performance is limited by memory bandwidth rather than compute. The left 3 columns of the table show the hardware specs as found in the data sheets (theoretical peak FP32 compute performance, memory capacity, theoretical peak memory bandwidth). The right 3 columns show the measured FluidX3D performance for the FP32/FP32, FP32/FP16S and FP32/FP16C floating-point precision settings, with the roofline model efficiency in round brackets, indicating what percentage of the theoretical peak memory bandwidth is being used.
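Because the code is bandwidth-bound, expected performance can be estimated directly from the roofline model: divide memory bandwidth by the Bytes moved per cell and time step. A worked example for a hypothetical GPU with 1000 GB/s:

```cpp
#include <cstdio>

int main() {
	const double bandwidth = 1000.0e9; // hypothetical GPU: 1000 GB/s memory bandwidth
	const double bytes_fp32 = 153.0, bytes_fp16 = 77.0; // Bytes/cell per time step (from above)
	printf("FP32/FP32: %5.0f MLUPs/s\n", 1.0e-6*bandwidth/bytes_fp32); // ≈  6536 MLUPs/s
	printf("FP32/FP16: %5.0f MLUPs/s\n", 1.0e-6*bandwidth/bytes_fp16); // ≈ 12987 MLUPs/s
	return 0;
}
```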
If your GPU/CPU is not on the list yet, you can report your benchmarks here.