STLBM
The open-source STLBM code uses parallel algorithms, which have been part of the C++ standard library since C++17, to run lattice Boltzmann simulations on multi-core CPUs and on GPUs. The same code runs on both platforms, requires no OpenACC-style language extensions, is hardware agnostic, and is simple (just 600 lines of code for the cavity flow, including I/O).
Further information:
Implemented test case: lid-driven cavity
The 3D lid-driven cavity serves as one of the test cases for the accuracy and the performance of the STLBM code. The best performance measured so far is 3'700 MLUPS (million lattice updates per second) on an NVIDIA A100 GPU. This performance is reached at a resolution of 200x200x200 or more, in double precision, with either the BGK or the RR (recursive-regularized) collision model; the choice of collision model does not matter because the problem is limited by memory bandwidth. Further performance measurements are shown in the GitLab project.
The images below are obtained at a Reynolds number of 10'000, simulated with the recursive-regularized model (omega-bulk = 1) and no subgrid-scale model. Left: vorticity on a logarithmic scale for an instantaneous snapshot (check out the video). Right: averaged velocity profiles after 3 million iterations (simulated on an A100 GPU in just 14 hours).