The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
(click on images to show videos on YouTube)
v1.0 (04.08.2022) changes (public release)
public release
v1.1 (29.09.2022) changes (GPU voxelization)
added solid voxelization on GPU (slow algorithm)
added tool to print current camera position (key G)
minor bug fix (workaround for Intel iGPU driver bug with triangle rendering)
v1.2 (24.10.2022) changes (force/torque computation)
added functions to compute force/torque on objects
added function to translate Mesh
added Stokes drag validation setup
v1.3 (10.11.2022) changes (minor bug fixes)
added unit conversion functions for torque
`FORCE_FIELD` and `VOLUME_FORCE` can now be used independently
minor bug fix (workaround for AMD legacy driver bug with binary number literals)
v1.4 (14.12.2022) changes (Linux graphics)
complete rewrite of C++ graphics library to minimize API dependencies
added interactive graphics mode on Linux with X11
fixed streamline visualization bug in 2D
v2.0 (09.01.2023) changes (multi-GPU upgrade)
added (cross-vendor) multi-GPU support on a single node (PC/laptop/server)
v2.1 (15.01.2023) changes (fast voxelization)
made solid voxelization on GPU lightning fast (new algorithm, from minutes to milliseconds)
v2.2 (20.01.2023) changes (velocity voxelization)
added option to voxelize moving/rotating geometry on GPU, with automatic velocity initialization for each grid point based on center of rotation, linear velocity and rotational velocity
cells that are converted from solid->fluid during re-voxelization now have their DDFs properly initialized
added option to not auto-scale mesh during `read_stl(...)`, with negative `size` parameter
added kernel for solid boundary rendering with marching-cubes
v2.3 (30.01.2023) changes (particles)
added particles with immersed-boundary method (either passive or 2-way-coupled, only supported with single-GPU)
minor optimization to GPU voxelization algorithm (workgroup threads outside mesh bounding-box return after ray-mesh intersections have been found)
displayed GPU memory allocation size is now fully accurate
fixed bug in `write_line()` function in `src/utilities.hpp`
removed `.exe` file extension for Linux/macOS
v2.4 (11.03.2023) changes (UI improvements)
added a help menu with key H that shows keyboard/mouse controls, visualization settings and simulation stats
improvements to keyboard/mouse control (+/- for zoom, mouse click frees/locks cursor)
added suggestion of largest possible grid resolution if resolution is set larger than memory allows
minor optimizations in multi-GPU communication (insignificant performance difference)
fixed bug in temperature equilibrium function for temperature extension
fixed erroneous double literal for Intel iGPUs in skybox color functions
fixed bug in make.sh where multi-GPU device IDs would not get forwarded to the executable
minor bug fixes in graphics engine (free cursor not centered during rotation, labels in VR mode)
fixed bug in `LBM::voxelize_stl()` size parameter default initialization
v2.5 (11.04.2023) changes (raytracing overhaul)
implemented light absorption in fluid for raytracing graphics (no performance impact)
improved raytracing framerate when camera is inside fluid
fixed skybox pole flickering artifacts
fixed bug where moving objects during re-voxelization would leave an erroneous trail of solid grid cells behind
v2.6 (16.04.2023) changes (Intel Arc patch)
patched OpenCL issues of Intel Arc GPUs: now VRAM allocations >4GB are possible and correct VRAM capacity is reported
v2.7 (29.05.2023) changes (visualization upgrade)
added slice visualization (key 2 / key 3 modes, then switch through slice modes with key T, move slice with keys Q/E)
made flag wireframe / solid surface visualization kernels toggleable with key 1
added surface pressure visualization (key 1 when `FORCE_FIELD` is enabled and `lbm.calculate_force_on_boundaries();` is called)
added binary `.vtk` export function for meshes with `lbm.write_mesh_to_vtk(Mesh* mesh);`
added `time_step_multiplicator` for `integrate_particles()` function in PARTICLES extension
made correction of wrong memory reporting on Intel Arc more robust
fixed bug in `write_file()` template functions
reverted back to separate `cl::Context` for each OpenCL device, as a shared context would otherwise allocate extra VRAM on all other, unused Nvidia GPUs
removed Debug and x86 configurations from Visual Studio solution file (one less complication for compiling)
fixed bug that particles could get too close to walls and get stuck, or leave the fluid phase (added boundary force)
v2.8 (24.06.2023) changes (documentation + polish)
finally added more documentation
cleaned up all sample setups in `setup.cpp` for more beginner-friendliness, and added required extensions in `defines.hpp` as comments to all setups
improved loading of composite `.stl` geometries by adding an option to omit automatic mesh repositioning, and added more functionality to the `Mesh` struct in `utilities.hpp`
added `uint3 resolution(float3 box_aspect_ratio, uint memory)` function to compute the simulation box resolution based on the box aspect ratio and VRAM occupation in MB
added `bool lbm.graphics.next_frame(...)` function to export images for a specified video length in the `main_setup` compute loop
added `VIS_...` macros to ease setting visualization modes in headless graphics mode via `lbm.graphics.visualization_modes`
simulation box dimensions are now automatically made equally divisible by domains for multi-GPU simulations
fixed Info/Warning/Error message formatting for loading files and made Info/Warning/Error message labels colored
added Ahmed body setup as an example on how body forces and drag coefficient are computed
added Cessna 172 and Bell 222 setups to showcase loading composite .stl geometries and revoxelization of moving parts
added optional semi-transparent rendering mode (`#define GRAPHICS_TRANSPARENCY 0.7f` in `defines.hpp`)
fixed flickering of streamline visualization in interactive graphics
improved smooth positioning of streamlines in slice mode
fixed bug where `mass` and `massex` in `SURFACE` extension were also allocated in CPU RAM (not required)
fixed bug in Q-criterion rendering of halo data in multi-GPU mode, reduced gap width between domains
removed shared memory optimization from mesh voxelization kernel, as it crashes on Nvidia GPUs with new GPU drivers and is incompatible with old OpenCL 1.0 GPUs
fixed raytracing attenuation color when no surface is at the simulation box walls with periodic boundaries
v2.9 (31.07.2023) changes (multithreading)
added cross-platform `parallel_for` implementation in `utilities.hpp` using `std::thread`
significantly (>4x) faster simulation startup with multithreaded geometry initialization and sanity checks
faster `calculate_force_on_object()` and `calculate_torque_on_object()` functions with multithreading
added total runtime and LBM runtime to `lbm.write_status()`
fixed bug in voxelization ray direction for re-voxelizing rotating objects
fixed bug in `Mesh::get_bounding_box_size()`
fixed bug in `print_message()` function in `utilities.hpp`
v2.10 (05.11.2023) changes (frustum culling)
improved rasterization performance via frustum culling when only part of the simulation box is visible
improved switching between centered/free camera mode
refactored OpenCL rendering library
unit conversion factors are now automatically printed in console when `units.set_m_kg_s(...)` is used
faster startup time for FluidX3D benchmark
minor bug fix in `voxelize_mesh(...)` kernel
fixed bug in `shading(...)`
replaced `std::rand()` function (slow in multithreading) with a standard C99 LCG
more robust correction of wrong VRAM capacity reporting on Intel Arc GPUs
fixed some minor compiler warnings
v2.11 (07.12.2023) changes (improved Linux graphics)
interactive graphics on Linux are now in fullscreen mode too, fully matching Windows
made CPU/GPU buffer initialization significantly faster with `std::fill` and `enqueueFillBuffer` (overall ~8% faster simulation startup)
added operating system info to OpenCL device driver version printout
fixed flickering with frustum culling at very small field of view
fixed bug where the rendered/exported frame was not updated when `visualization_modes` changed
v2.12 (18.01.2024) changes (faster startup)
~3x faster source code compiling on Linux using multiple CPU cores if `make` is installed
significantly faster simulation initialization (~40% single-GPU, ~15% multi-GPU)
minor bug fix in `Memory_Container::reset()` function
v2.13 (11.02.2024) changes (improved .vtk export)
data in exported `.vtk` files is now automatically converted to SI units
~2x faster `.vtk` export with multithreading
added unit conversion functions for `TEMPERATURE` extension
fixed graphical artifacts with axis-aligned camera in raytracing
fixed `get_exe_path()` for macOS
fixed X11 multi-monitor issues on Linux
workaround for Nvidia driver bug: `enqueueFillBuffer` is broken for large buffers on Nvidia GPUs
fixed slow numeric drift issues caused by `-cl-fast-relaxed-math`
fixed wrong Maximum Allocation Size reporting in `LBM::write_status()`
fixed missing scaling of coordinates to SI units in `LBM::write_mesh_to_vtk()`
v2.14 (03.03.2024) changes (visualization upgrade)
coloring can now be switched between velocity/density/temperature with key Z
improved, uniform color palettes for velocity/density/temperature visualization
color scale with automatic unit conversion can now be shown with key H
slice mode for field visualization now draws fully filled-in slices instead of only lines for velocity vectors
shading in `VIS_FLAG_SURFACE` and `VIS_PHI_RASTERIZE` modes is smoother now
`make.sh` now automatically detects the operating system and X11 support on Linux, and only runs FluidX3D if the last compilation was successful
fixed compiler warnings on Android
fixed `make.sh` failing on some systems due to a nonstandard interpreter path
fixed that `make` would not compile with multiple cores on some systems
v2.15 (09.04.2024) changes (framerate boost)
eliminated one frame memory copy and one clear frame operation in rendering chain, for 20-70% higher framerate on both Windows and Linux
enabled `g++` compiler optimizations for faster startup and higher rendering framerate
fixed bug in multithreaded sanity checks
fixed wrong unit conversion for thermal expansion coefficient
fixed density to pressure conversion in LBM units
fixed bug that raytracing kernel could lock up simulation
fixed minor visual artifacts with raytracing
fixed that the console sometimes was not cleared before `INTERACTIVE_GRAPHICS_ASCII` rendering starts
v2.16 (02.05.2024) changes (bug fixes)
simplified, 10% faster marching-cubes implementation with 1D interpolation on edges instead of 3D interpolation, which eliminates the edge table
added faster, simplified marching-cubes variant for solid surface rendering where edges are always halfway between grid cells
refactoring in OpenCL rendering kernels
fixed that voxelization failed in Intel OpenCL CPU Runtime due to array out-of-bounds access
fixed that voxelization did not always produce binary identical results in multi-GPU compared to single-GPU
fixed that velocity voxelization failed for free surface simulations
fixed terrible performance on ARM GPUs by macro-replacing fused-multiply-add (`fma`) with `a*b+c`
fixed that Y/Z keys were incorrect for QWERTY keyboard layout in Linux
fixed that free camera movement speed in help overlay was not updated in stationary image when scrolling
fixed that cursor would sometimes flicker when scrolling on trackpads with Linux-X11 interactive graphics
fixed flickering of interactive rendering with multi-GPU when camera is not moved
fixed missing `XInitThreads()` call that could crash Linux interactive graphics on some systems
fixed z-fighting between `graphics_rasterize_phi()` and `graphics_flags_mc()` kernels
v2.17 (05.06.2024) changes (unlimited domain resolution)
domains are no longer limited to 4.29 billion (2³², 1624³) grid cells or 225 GB memory; if more are used, the OpenCL code will automatically compile with 64-bit indexing
new, faster raytracing-based field visualization for single-GPU simulations
added GPU Driver and OpenCL Runtime installation instructions to documentation
refactored `INTERACTIVE_GRAPHICS_ASCII`
fixed memory leak in destructors of `floatN`, `floatNxN`, `doubleN`, `doubleNxN` (all unused)
made camera movement/rotation/zoom behavior independent of framerate
fixed that `smart_device_selection()` would print a wrong warning if a device reports 0 MHz clock speed
v2.18 (21.07.2024) changes (more bug fixes)
added support for high refresh rate monitors on Linux
more compact OpenCL Runtime installation scripts in Documentation
driver/runtime installation instructions will now be printed to console if no OpenCL devices are available
added domain information to `LBM::write_status()`
added `LBM::index` function for `uint3` input parameter
fixed that very large simulations sometimes wouldn't render properly by increasing maximum render distance from 10k to 2.1M
fixed mouse input stuttering at high screen refresh rate on Linux
fixed graphical artifacts in free surface raytracing on Intel CPU Runtime for OpenCL
fixed runtime estimation printed in console for setups with multiple `lbm.run(...)` calls
fixed density oscillations in sample setups (too large `lbm_u`)
fixed minor graphical artifacts in `raytrace_phi()`
fixed minor graphical artifacts in `ray_grid_traverse_sum()`
fixed wrong printed time step count on raindrop sample setup
v2.19 (07.09.2024) changes (camera splines)
the camera can now fly along a smooth path through a list of provided keyframe camera placements, using Catmull-Rom splines
more accurate remaining runtime estimation that includes time spent on rendering
enabled FP16S memory compression by default
camera placement printed with key G is now formatted for easier copy/paste
added benchmark chart in Readme using mermaid gantt chart
placed memory allocation info during simulation startup at better location
fixed threading conflict between `INTERACTIVE_GRAPHICS` and `lbm.graphics.write_frame();`
fixed maximum buffer allocation size limit for AMD GPUs and in Intel CPU Runtime for OpenCL
fixed wrong `Re<Re_max` info printout for 2D simulations
minor fix in `bandwidth_bytes_per_cell_device()`
Read the FluidX3D Documentation!
streaming (part 2/2):
f<sub>0</sub><sup>temp</sup>(x,t) = f<sub>0</sub>(x,t)
f<sub>i</sub><sup>temp</sup>(x,t) = f<sub>(t%2 ? i : (i%2 ? i+1 : i-1))</sub>(i%2 ? x : x-e<sub>i</sub>, t)   for   i ∈ [1, q-1]
collision:
ρ(x,t) = (Σ<sub>i</sub> f<sub>i</sub><sup>temp</sup>(x,t)) + 1
u(x,t) = 1∕ρ(x,t) Σ<sub>i</sub> c<sub>i</sub> f<sub>i</sub><sup>temp</sup>(x,t)
f<sub>i</sub><sup>eq-shifted</sup>(x,t) = w<sub>i</sub> ρ · ((u∘c<sub>i</sub>)²∕(2c⁴) - (u∘u)∕(2c²) + (u∘c<sub>i</sub>)∕c²) + w<sub>i</sub> (ρ-1)
f<sub>i</sub><sup>temp</sup>(x, t+Δt) = f<sub>i</sub><sup>temp</sup>(x,t) + Ω<sub>i</sub>(f<sub>i</sub><sup>temp</sup>(x,t), f<sub>i</sub><sup>eq-shifted</sup>(x,t), τ)
streaming (part 1/2):
f<sub>0</sub>(x, t+Δt) = f<sub>0</sub><sup>temp</sup>(x, t+Δt)
f<sub>(t%2 ? (i%2 ? i+1 : i-1) : i)</sub>(i%2 ? x+e<sub>i</sub> : x, t+Δt) = f<sub>i</sub><sup>temp</sup>(x, t+Δt)   for   i ∈ [1, q-1]
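The even/odd index selection in the streaming steps above reduces to a one-liner. Here is a minimal C++ sketch of that selection (a hypothetical helper for illustration; the actual FluidX3D kernels are OpenCL C):

```cpp
#include <cstdint>

// Which DDF index direction i is loaded from at time step t, following
// f_(t%2 ? i : (i%2 ? i+1 : i-1)) from the streaming equations above:
// on odd time steps, direction i reads its own index; on even time steps,
// it reads the opposite direction i±1.
uint32_t esoteric_pull_index(const uint32_t i, const uint64_t t) {
	return (t%2ull) ? i : ((i%2u) ? i+1u : i-1u);
}
```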
variable | SI units | defining equation | description |
---|---|---|---|
x | m | x = (x,y,z)<sup>T</sup> | 3D position in Cartesian coordinates |
t | s | - | time |
ρ | kg∕m³ | ρ = (Σ<sub>i</sub> f<sub>i</sub>)+1 | mass density of fluid |
p | kg∕(m·s²) | p = c² ρ | pressure of fluid |
u | m∕s | u = 1∕ρ Σ<sub>i</sub> c<sub>i</sub> f<sub>i</sub> | velocity of fluid |
ν | m²∕s | ν = μ∕ρ | kinematic shear viscosity of fluid |
μ | kg∕(m·s) | μ = ρ ν | dynamic viscosity of fluid |
f<sub>i</sub> | kg∕m³ | - | shifted density distribution functions (DDFs) |
Δx | m | Δx = 1 | lattice constant (in LBM units) |
Δt | s | Δt = 1 | simulation time step (in LBM units) |
c | m∕s | c = 1∕√3 Δx∕Δt | lattice speed of sound (in LBM units) |
i | 1 | 0 ≤ i < q | LBM streaming direction index |
q | 1 | q ∈ { 9,15,19,27 } | number of LBM streaming directions |
e<sub>i</sub> | m | D2Q9 / D3Q15/19/27 | LBM streaming directions |
c<sub>i</sub> | m∕s | c<sub>i</sub> = e<sub>i</sub>∕Δt | LBM streaming velocities |
w<sub>i</sub> | 1 | Σ<sub>i</sub> w<sub>i</sub> = 1 | LBM velocity set weights |
Ω<sub>i</sub> | kg∕m³ | SRT or TRT | LBM collision operator |
τ | s | τ = ν∕c² + Δt∕2 | LBM relaxation time |
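As a worked example of the last table row, the relaxation time follows directly from a chosen viscosity in LBM units (a minimal sketch, assuming Δx = Δt = 1 as in the table):

```cpp
#include <cmath>
#include <cstdio>

int main() {
	const float nu  = 0.01f;            // example kinematic shear viscosity in LBM units
	const float c   = 1.0f/sqrtf(3.0f); // lattice speed of sound, c = 1/sqrt(3) Δx/Δt
	const float tau = nu/(c*c) + 0.5f;  // τ = ν/c² + Δt/2 = 3ν + 1/2
	printf("tau = %g\n", tau);          // prints "tau = 0.53"
	return 0;
}
```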
velocity sets: D2Q9, D3Q15, D3Q19 (default), D3Q27
collision operators: single-relaxation-time (SRT/BGK) (default), two-relaxation-time (TRT)
DDF-shifting and other algebraic optimizations to minimize round-off error
(memory layout per cell: density, velocity, flags, DDFs; each colored square = 1 Byte)
allows for 19 Million cells per 1 GB VRAM
in-place streaming with Esoteric-Pull: eliminates redundant copy of density distribution functions (DDFs) in memory; almost cuts memory demand in half and slightly increases performance due to implicit bounce-back boundaries; offers optimal memory access patterns for single-cell in-place streaming
decoupled arithmetic precision (FP32) and memory precision (FP32 or FP16S or FP16C): all arithmetic is done in FP32 for compatibility on all hardware, but DDFs in memory can be compressed to FP16S or FP16C: almost cuts memory demand in half again and almost doubles performance, without impacting overall accuracy for most setups
`TYPE_S`: (stationary or moving) solid boundaries
`TYPE_E`: equilibrium boundaries (inflow/outflow)
`TYPE_T`: temperature boundaries
`TYPE_F`: free surface (fluid)
`TYPE_I`: free surface (interface)
`TYPE_G`: free surface (gas)
`TYPE_X`: remaining for custom use or further extensions
`TYPE_Y`: remaining for custom use or further extensions
(memory layout per cell: density, velocity, flags, 2 copies of DDFs; each colored square = 1 Byte)
allows for 3 Million cells per 1 GB VRAM
traditional LBM (D3Q19) with FP64 requires ~344 Bytes/cell
FluidX3D (D3Q19) requires only 55 Bytes/cell with Esoteric-Pull+FP16
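These per-cell numbers can be reproduced from the field layout. The breakdown below is an assumption inferred from the D3Q19 description (2 FP64 DDF copies plus FP64 ρ/u/flags for traditional codes; 1 FP16 DDF copy plus FP32 ρ/u and 1-Byte flags for FluidX3D), but it matches the stated totals:

```cpp
#include <cstdio>

int main() {
	const int q = 19;                            // D3Q19 velocity set
	const int traditional = 2*q*8 + 8 + 3*8 + 8; // 2 DDF copies (FP64) + ρ + u + flags = 344 Bytes/cell
	const int fluidx3d    = q*2 + 4 + 3*4 + 1;   // 1 DDF copy (FP16) + ρ (FP32) + u (FP32) + flags = 55 Bytes/cell
	printf("%d vs. %d Bytes/cell\n", traditional, fluidx3d);
	return 0;
}
```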
large cost saving: comparison of maximum single-GPU grid resolution for D3Q19 LBM
GPU VRAM capacity | 1 GB | 2 GB | 3 GB | 4 GB | 6 GB | 8 GB | 10 GB | 11 GB | 12 GB | 16 GB | 20 GB | 24 GB | 32 GB | 40 GB | 48 GB | 64 GB | 80 GB | 94 GB | 128 GB | 192 GB | 256 GB |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
approximate GPU price | $25 GT 210 | $25 GTX 950 | $12 GTX 1060 | $50 GT 730 | $35 GTX 1060 | $70 RX 470 | $500 RTX 3080 | $240 GTX 1080 Ti | $75 Tesla M40 | $75 Instinct MI25 | $900 RX 7900 XT | $205 Tesla P40 | $600 Instinct MI60 | $5500 A100 | $2400 RTX 8000 | $10k Instinct MI210 | $11k A100 | >$40k H100 NVL | ? GPU Max 1550 | ~$10k MI300X | - |
traditional LBM (FP64) | 144³ | 182³ | 208³ | 230³ | 262³ | 288³ | 312³ | 322³ | 330³ | 364³ | 392³ | 418³ | 460³ | 494³ | 526³ | 578³ | 624³ | 658³ | 730³ | 836³ | 920³ |
FluidX3D (FP32/FP32) | 224³ | 282³ | 322³ | 354³ | 406³ | 448³ | 482³ | 498³ | 512³ | 564³ | 608³ | 646³ | 710³ | 766³ | 814³ | 896³ | 966³ | 1018³ | 1130³ | 1292³ | 1422³ |
FluidX3D (FP32/FP16) | 266³ | 336³ | 384³ | 424³ | 484³ | 534³ | 574³ | 594³ | 610³ | 672³ | 724³ | 770³ | 848³ | 912³ | 970³ | 1068³ | 1150³ | 1214³ | 1346³ | 1540³ | 1624³ |
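The table rows follow from dividing the VRAM capacity by the Bytes/cell figures above and taking the cube root. This sketch ignores the small VRAM reserve of the driver and renderer, so it lands slightly above the tabulated values:

```cpp
#include <cmath>
#include <cstdio>

int main() {
	const double vram = 8.0*1024.0*1024.0*1024.0;         // example: 8 GB VRAM
	const double bytes_per_cell[3] = {344.0, 93.0, 55.0}; // FP64 traditional, FP32/FP32, FP32/FP16
	for(int i=0; i<3; i++) {
		printf("%3.0f Bytes/cell -> %4.0f^3 grid cells\n", bytes_per_cell[i], std::cbrt(vram/bytes_per_cell[i]));
	} // prints 292^3 / 452^3 / 538^3, close to the 288^3 / 448^3 / 534^3 in the 8 GB column
	return 0;
}
```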
domain decomposition allows pooling VRAM from multiple GPUs for much larger grid resolution
GPUs don't have to be identical (not even from the same vendor), but similar VRAM capacity/bandwidth is recommended
domain communication architecture (simplified)
    ++   .-----------------------------------------------------------------.   ++
    ++   |                              GPU 0                              |   ++
    ++   |                           LBM Domain 0                          |   ++
    ++   '-----------------------------------------------------------------'   ++
    ++        |  selective                                             /|       ++
    ++        |/ in-VRAM copy                                           |       ++
    ++        .-------------------------------------------------------.        ++
    ++        |               GPU 0 - Transfer Buffer 0               |        ++
    ++        '-------------------------------------------------------'        ++
    !!           |  PCIe                                            /|          !!
    !!           |/ copy                                             |          !!
    @@        .-------------------------.   .-------------------------.        @@
    @@        | CPU - Transfer Buffer 0 |   | CPU - Transfer Buffer 1 |        @@
    @@        '-------------------------'  /'-------------------------'        @@
    @@                          pointer  X  swap                               @@
    @@        .-------------------------./  .-------------------------.        @@
    @@        | CPU - Transfer Buffer 1 |   | CPU - Transfer Buffer 0 |        @@
    @@        '-------------------------'   '-------------------------'        @@
    !!          /|  PCIe                                             |          !!
    !!           |  copy                                            |/          !!
    ++        .-------------------------------------------------------.        ++
    ++        |               GPU 1 - Transfer Buffer 1               |        ++
    ++        '-------------------------------------------------------'        ++
    ++       /|  selective                                            |        ++
    ++        |  in-VRAM copy                                        |/        ++
    ++   .-----------------------------------------------------------------.   ++
    ++   |                              GPU 1                              |   ++
    ++   |                           LBM Domain 1                          |   ++
    ++   '-----------------------------------------------------------------'   ++
    ##                                    |                                     ##
    ##                     domain synchronization barrier                       ##
    ##                                    |                                     ##
    ||  -------------------------------------------------------------> time    ||
domain communication architecture (detailed)
    ++   .-----------------------------------------------------------------.   ++
    ++   |                              GPU 0                              |   ++
    ++   |                           LBM Domain 0                          |   ++
    ++   '-----------------------------------------------------------------'   ++
    ++     |  selective in-  /|    |  selective in-  /|    |  selective in-  /| ++
    ++     |/ VRAM copy (X)   |    |/ VRAM copy (Y)   |    |/ VRAM copy (Z)   | ++
    ++   .---------------------.---------------------.---------------------.   ++
    ++   |   GPU 0 - TB 0X+    |   GPU 0 - TB 0Y+    |   GPU 0 - TB 0Z+    |   ++
    ++   |   GPU 0 - TB 0X-    |   GPU 0 - TB 0Y-    |   GPU 0 - TB 0Z-    |   ++
    ++   '---------------------'---------------------'---------------------'   ++
    !!      | PCIe /|              | PCIe /|              | PCIe /|             !!
    !!      |/ copy |              |/ copy |              |/ copy |             !!
    @@   .---------. .---------.---------. .---------.---------. .---------.   @@
    @@   | CPU 0X+ | | CPU 1X- | CPU 0Y+ | | CPU 3Y- | CPU 0Z+ | | CPU 5Z- |   @@
    @@   | CPU 0X- | | CPU 2X+ | CPU 0Y- | | CPU 4Y+ | CPU 0Z- | | CPU 6Z+ |   @@
    @@   '--------- /---------'---------  /---------'---------  /---------'    @@
    @@    pointer  X  swap (X)  pointer  X  swap (Y)  pointer  X  swap (Z)     @@
    @@   .---------/ ---------.---------/  ---------.---------/  ---------.    @@
    @@   | CPU 1X- | | CPU 0X+ | CPU 3Y- | | CPU 0Y+ | CPU 5Z- | | CPU 0Z+ |   @@
    @@   | CPU 2X+ | | CPU 0X- | CPU 4Y+ | | CPU 0Y- | CPU 6Z+ | | CPU 0Z- |   @@
    @@   '---------' '---------'---------' '---------'---------' '---------'   @@
    !!     /| PCIe |              /| PCIe |              /| PCIe |              !!
    !!      | copy |/              | copy |/              | copy |/             !!
    ++   .--------------------..---------------------..--------------------.   ++
    ++   |   GPU 1 - TB 1X-   ||   GPU 3 - TB 3Y-    ||   GPU 5 - TB 5Z-   |   ++
    ++   :====================::=====================::====================:   ++
    ++   |   GPU 2 - TB 2X+   ||   GPU 4 - TB 4Y+    ||   GPU 6 - TB 6Z+   |   ++
    ++   '--------------------''---------------------''--------------------'   ++
    ++    /| selective in-  |    /| selective in-  |    /| selective in-  |     ++
    ++     | VRAM copy (X)  |/    | VRAM copy (Y)  |/    | VRAM copy (Z)  |/    ++
    ++   .--------------------..---------------------..--------------------.   ++
    ++   |        GPU 1       ||        GPU 3        ||        GPU 5       |   ++
    ++   |    LBM Domain 1    ||    LBM Domain 3     ||    LBM Domain 5    |   ++
    ++   :====================::=====================::====================:   ++
    ++   |        GPU 2       ||        GPU 4        ||        GPU 6       |   ++
    ++   |    LBM Domain 2    ||    LBM Domain 4     ||    LBM Domain 6    |   ++
    ++   '--------------------''---------------------''--------------------'   ++
    ##              |                     |                     |               ##
    ##              |   domain synchronization barriers         |               ##
    ##              |                     |                     |               ##
    ||  -------------------------------------------------------------> time    ||
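In words, each time step runs one barrier-separated communication phase per axis: halo cells are gathered into transfer buffers in VRAM, copied over PCIe to the CPU, exchanged between neighboring domains by a pointer swap (no memcpy), copied back over PCIe and inserted into the neighbor domains. A runnable stub sketch of that sequence (all function names here are illustrative, not FluidX3D's actual API):

```cpp
#include <cstdio>
#include <initializer_list>

enum Axis { X, Y, Z };

// illustrative stubs for the steps shown in the diagrams above
void selective_in_vram_copy(Axis a)   { printf("selective in-VRAM copy (%c)\n", "XYZ"[a]); }
void pcie_copy_to_host(Axis a)        { printf("PCIe copy to CPU transfer buffers (%c)\n", "XYZ"[a]); }
void host_pointer_swap(Axis a)        { printf("pointer swap with neighbor domain (%c)\n", "XYZ"[a]); }
void pcie_copy_to_device(Axis a)      { printf("PCIe copy to neighbor GPUs (%c)\n", "XYZ"[a]); }
void selective_in_vram_insert(Axis a) { printf("selective in-VRAM insert (%c)\n", "XYZ"[a]); }
void synchronization_barrier()        { printf("domain synchronization barrier\n"); }

int main() {
	for(Axis a : {X, Y, Z}) { // one communication phase per axis, as in the detailed diagram
		selective_in_vram_copy(a);
		pcie_copy_to_host(a);
		host_pointer_swap(a);
		pcie_copy_to_device(a);
		selective_in_vram_insert(a);
		synchronization_barrier();
	}
	return 0;
}
```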
single-GPU/CPU benchmarks
multi-GPU benchmarks
- thermal LBM to simulate thermal convection
  - D3Q7 subgrid for thermal DDFs
  - in-place streaming with Esoteric-Pull for thermal DDFs
  - optional FP16S or FP16C compression for thermal DDFs with DDF-shifting
- state-of-the-art free surface LBM (FSLBM) implementation:
  - volume-of-fluid model
  - fully analytic PLIC for efficient curvature calculation
  - improved mass conservation
  - ultra efficient implementation with only 4 kernels in addition to the `stream_collide()` kernel
- boundary types:
  - stationary mid-grid bounce-back boundaries (stationary solid boundaries)
  - moving mid-grid bounce-back boundaries (moving solid boundaries)
  - equilibrium boundaries (non-reflective inflow/outflow)
  - temperature boundaries (fixed temperature)
- optional computation of forces from the fluid on solid boundaries
- global force per volume (Guo forcing), can be modified on-the-fly
- local force per volume (force field)
- Smagorinsky-Lilly subgrid turbulence LES model to keep simulations with very large Reynolds number stable
Π<sub>αβ</sub> = Σ<sub>i</sub> e<sub>iα</sub> e<sub>iβ</sub> (f<sub>i</sub> - f<sub>i</sub><sup>eq-shifted</sup>)
Q = Σ<sub>αβ</sub> Π<sub>αβ</sub>²
τ = ½ (τ<sub>0</sub> + √(τ<sub>0</sub>² + (16√2)∕(3π²) · √Q∕ρ))
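A minimal numeric sketch of the τ adjustment above (the Π<sub>αβ</sub> sum is omitted here; Q is taken as given, all quantities in LBM units):

```cpp
#include <cmath>

// adjusted relaxation time from the Smagorinsky-Lilly equation above
float tau_les(const float tau_0, const float Q, const float rho) {
	const float pi = 3.14159265358979f;
	const float c  = 16.0f*sqrtf(2.0f)/(3.0f*pi*pi); // the constant (16√2)/(3π²)
	return 0.5f*(tau_0 + sqrtf(tau_0*tau_0 + c*sqrtf(Q)/rho));
}
```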
particles with immersed-boundary method (either passive or 2-way-coupled, single-GPU only)
FluidX3D can do simulations so large that storing the volumetric data for later rendering becomes unmanageable (like 120 GB for a single frame, hundreds of terabytes for a video)
instead, FluidX3D allows rendering raw simulation data directly in VRAM, so no large volumetric files have to be exported to the hard disk (see my technical talk)
the rendering is so fast that it works interactively in real time for both rasterization and raytracing
rasterization and raytracing are done in OpenCL and work on all GPUs, even the ones without RTX/DXR raytracing cores or without any rendering hardware at all (like A100, MI200, ...)
if no monitor is available (like on a remote Linux server), there is an ASCII rendering mode to interactively visualize the simulation in the terminal (even in WSL and/or through SSH)
rendering is fully multi-GPU-parallelized via seamless domain decomposition rasterization
with interactive graphics mode disabled, image resolution can be as large as VRAM allows for (4K/8K/16K and above)
(interactive) visualization modes:
flag wireframe / solid surface (and force vectors on solid cells or surface pressure if the extension is used)
velocity field (with slice mode)
streamlines (with slice mode)
velocity-colored Q-criterion isosurface
rasterized free surface with marching-cubes
raytraced free surface with fast ray-grid traversal and marching-cubes, either 1-4 rays/pixel or 1-10 rays/pixel
FluidX3D is written in OpenCL 1.2, so it runs on all hardware from all vendors (Nvidia, AMD, Intel, ...):
world's fastest datacenter GPUs: MI300X, H100 (NVL), A100, MI200, MI100, V100(S), GPU Max 1100, ...
gaming GPUs (desktop/laptop): Nvidia GeForce, AMD Radeon, Intel Arc
professional/workstation GPUs: Nvidia Quadro, AMD Radeon Pro / FirePro, Intel Arc Pro
integrated GPUs
CPUs (requires installation of Intel CPU Runtime for OpenCL)
Intel Xeon Phi (requires installation of Intel CPU Runtime for OpenCL)
smartphone ARM GPUs
native cross-vendor multi-GPU implementation
uses PCIe communication, so no SLI/Crossfire/NVLink/InfinityFabric required
single-node parallelization, so no MPI installation required
GPUs don't even have to be from the same vendor, but similar memory capacity and bandwidth are recommended
works on Windows and Linux with C++17, with limited support also for macOS and Android
supports importing and voxelizing triangle meshes from binary `.stl` files, with fast GPU voxelization
supports exporting volumetric data as binary `.vtk` files
supports exporting triangle meshes as binary `.vtk` files
supports exporting rendered images as `.png`/`.qoi`/`.bmp` files; encoding runs in parallel on the CPU while the simulation on GPU can continue without delay
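To tie the export features together, here is a hedged sketch of how a `main_setup()` compute loop might render and export frames headlessly, using the function names mentioned in this README (exact signatures and `VIS_...` macro names are assumptions; see the documentation for the real API):

```cpp
void main_setup() { // skeleton of a user setup (illustrative)
	LBM lbm(256u, 256u, 256u, 0.02f); // grid resolution and kinematic viscosity in LBM units
	// ... voxelize geometry and set flags here ...
	lbm.graphics.visualization_modes = VIS_FLAG_SURFACE; // headless visualization mode
	while(lbm.graphics.next_frame(10000u, 30.0f)) { // render frames for a 30 s video over 10000 steps
		lbm.graphics.write_frame(); // .png encoding runs on CPU threads while the GPU keeps computing
		lbm.run(28u); // advance the simulation a few time steps between frames
	}
}
```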
Here are performance benchmarks on various hardware in MLUPs/s, or how many million lattice cells are updated per second. The settings used for the benchmark are D3Q19 SRT with no extensions enabled (only LBM with implicit mid-grid bounce-back boundaries) and the setup consists of an empty cubic box with sufficient size (typically 256³). Without extensions, a single lattice cell requires:
a memory capacity of 93 (FP32/FP32) or 55 (FP32/FP16) Bytes
a memory bandwidth of 153 (FP32/FP32) or 77 (FP32/FP16) Bytes per time step
363 (FP32/FP32) or 406 (FP32/FP16S) or 1275 (FP32/FP16C) FLOPs per time step (FP32+INT32 operations counted combined)
In consequence, the arithmetic intensity of this implementation is 2.37 (FP32/FP32) or 5.27 (FP32/FP16S) or 16.56 (FP32/FP16C) FLOPs/Byte; since modern GPUs provide far more compute per Byte of bandwidth than this, performance is limited by memory bandwidth rather than compute. The left 3 columns of the table show the hardware specs as found in the data sheets (theoretical peak FP32 compute performance, memory capacity, theoretical peak memory bandwidth). The right 3 columns show the measured FluidX3D performance for the FP32/FP32, FP32/FP16S and FP32/FP16C floating-point precision settings, with the roofline model efficiency in round brackets, indicating what percentage of the theoretical peak memory bandwidth is being used.
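Because the code is bandwidth-bound, expected performance can be estimated directly from the roofline model: divide memory bandwidth by the Bytes moved per cell and time step. A worked example for a hypothetical GPU with 1000 GB/s:

```cpp
#include <cstdio>

int main() {
	const double bandwidth = 1000.0e9; // hypothetical GPU: 1000 GB/s memory bandwidth
	const double bytes_fp32 = 153.0, bytes_fp16 = 77.0; // Bytes/cell per time step (from above)
	printf("FP32/FP32: %5.0f MLUPs/s\n", 1.0e-6*bandwidth/bytes_fp32); // ≈  6536 MLUPs/s
	printf("FP32/FP16: %5.0f MLUPs/s\n", 1.0e-6*bandwidth/bytes_fp16); // ≈ 12987 MLUPs/s
	return 0;
}
```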
If your GPU/CPU is not on the list yet, you can report your benchmarks here.